Combination of expert knowledge and a genetic fuzzy ... - IEEE Xplore

Abstract—Objective: In this paper, the genetic fuzzy inference system based on expert knowledge for automatic sleep staging was developed. Methods: Eight ...
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TBME.2015.2510365, IEEE Transactions on Biomedical Engineering


Combination of expert knowledge and a genetic fuzzy inference system for automatic sleep staging Sheng-Fu Liang, Chih-En Kuo, Fu-Zen Shaw, Ying-Huang Chen, Chia-Hu Hsu, Jyun-Yu Chen 

Sleep is one of the most important human activities. Sleep diseases, such as insomnia and obstructive sleep apnea, may cause daytime sleepiness, irritability, depression, anxiety or even death [1, 2]. Sleep analysis is helpful in disease diagnosis and in several psychophysiological analyses. For the diagnosis of sleep issues, all-night polysomnographic (PSG) recordings, including electroencephalogram (EEG), electrooculogram (EOG) and electromyogram (EMG), are usually taken from the patients and the recordings are scored by a well-trained expert according to the Rechtschaffen & Kales (R&K) rules codified in 1968 [3]. According to the R&K rules, each epoch (i.e., 30 s of data) is classified into one of the sleep stages, including wakefulness (Wake), non-rapid eye movement (stages 1-4,

from light to deep sleep) and rapid eye movement (REM). Stages 3 and 4 were recently combined into the slow wave sleep (SWS) stage [4]. The expert manually scores the all-night PSG recordings according to the characteristics of each sleep stage described by the R&K rules [3]. The Wake stage consists of alpha activity (8-13 Hz) or low-voltage mixed-frequency activity. During light sleep, the S1 stage consists of relatively low-voltage mixed-frequency activity (3-7 Hz); the S2 stage is characterized by the appearance of sleep spindles and/or K-complexes. During deep sleep, the SWS stage consists of high-voltage (>75 μV), low-frequency ( 30 min) and 75.25 min (> 15 min), respectively. These measurements were approved by the internal review board of National Cheng Kung University. Subjects were recruited from the public by online advertisements and announcements on notice boards at National Cheng Kung University. Participants had to refrain from any drug/medication and limit caffeine use (no caffeine intake for at least 5-6 h prior to sleep laboratory visits). The all-night PSGs were recorded in the sleep laboratory at the cognitive institute of National Cheng Kung University. There was no outside interference during data collection, and no medications were used to induce sleep. The recordings (Siesta 802 PSG, Compumedics, Inc.) included six EEG channels (F3-A2, F4-A1, C3-A2, C4-A1, P3-A2, and P4-A1, according to the

international 10-20 standard system), two EOG channels (positioned 1 cm lateral to the left and right outer canthi), and a chin EMG channel. The sampling rate was 256 Hz with 16-bit resolution. All 48 PSG sleep recordings were visually scored by a sleep specialist using the R&K rules with a 30-s interval (named an epoch).

B. Feature extraction

The automatic sleep staging system analyzes the data from two recording channels: the central EEG (C3-A2) and the chin EMG. According to the R&K criteria [3], the EEG data were filtered with an eighth-order Butterworth band-pass filter with a pass band of 0.5–30 Hz, and the EMG data were filtered with an eighth-order Butterworth band-pass filter with a pass band of 5–100 Hz. The continuous-time signals were segmented into non-overlapping 30-s epochs. Because the characteristic patterns of sleep recordings are short, spectral information may be lost if the processing window is too long. For example, the characteristics of Stage 2, i.e., K-complexes and spindles, are often less than 2 seconds in duration. An SWS epoch has > 6 s of delta activity, whereas Stage 2 may contain < 6 s of delta activity. Therefore, we segmented each epoch into 15 non-overlapping intervals of two seconds for feature extraction. Each 2-s interval is called a window, i.e., there are 15 2-s windows in a 30-s epoch. A 512-point (256 Hz * 2 s) FFT was applied to each 2-s window.

Table I lists the eight features used in this study. The feature types include power spectrum (PS), power ratio (PR), spectral frequency (SF), duration ratio (DR), and EMG energy. PS is calculated by averaging the power of a specific frequency band over the 15 windows. PR is the ratio of the power of two frequency bands. SF is the mean frequency of the spectral power. DR is the ratio between the number of windows in which the energy of a specific frequency band is higher than a threshold and the total of 15 windows in an epoch.
The feature Amp M is the mean value of the absolute amplitude of all EMG data points in an epoch (for the fuzzy rules) or in a 2-s window (for movement epoch detection).

TABLE I The features for automatic sleep scoring.
No. | Source | Type | Feature | Label | Link to R&K
1 | EEG | PS | Total power of 0-30 Hz | 0-30 E | NC
2 | EEG | PR | 0-4 Hz/0-30 Hz | 0-4 E | SWS
3 | EEG | SF | Mean frequency of 0-30 Hz | Mean(fre.) E | NC
4 | EEG | DR | Alpha ratio | Alpha E | Wake
5 | EEG | DR | Spindle ratio | Spindle E | S2
6 | EEG | DR | SWS ratio | SWS E | SWS
7 | EMG | PS | Total power of 0-30 Hz | 0-30 M | NC
8 | EMG | EMG energy | Mean amplitude | Amp M | REM
* PS (= power spectrum), PR (= power ratio), SF (= spectral frequency), DR (= duration ratio), NC (= no correspondence).
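The windowing and spectral features described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, assumes the epoch has already been band-pass filtered, and covers four of the EEG features. The function names, the dictionary keys, and the `alpha_share` criterion used for the duration ratio are hypothetical, since the text does not state the threshold that decides whether a window counts toward a DR feature.

```python
import cmath
import math

FS = 256          # EEG sampling rate in Hz (from the text)
WIN = 2 * FS      # 512 samples per 2-s window
N_WIN = 15        # windows per 30-s epoch


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def band_power(spec, lo, hi):
    """Sum of |X[k]|^2 over the bins whose frequency lies in [lo, hi] Hz."""
    res = FS / len(spec)  # 0.5 Hz per bin for a 512-point FFT
    return sum(abs(spec[k]) ** 2
               for k in range(len(spec) // 2 + 1) if lo <= k * res <= hi)


def epoch_features(eeg, alpha_share=0.5):
    """Features of one 30-s EEG epoch (7680 samples, assumed pre-filtered)."""
    specs = [fft(eeg[i * WIN:(i + 1) * WIN]) for i in range(N_WIN)]
    total = [band_power(s, 0.0, 30.0) for s in specs]
    delta = [band_power(s, 0.0, 4.0) for s in specs]
    alpha = [band_power(s, 8.0, 13.0) for s in specs]
    res = FS / WIN
    # Power-weighted mean frequency over 0-30 Hz, pooled across windows.
    weighted = sum(k * res * abs(s[k]) ** 2
                   for s in specs for k in range(int(30 / res) + 1))
    return {
        "ps_0_30": sum(total) / N_WIN,      # PS: band power averaged over windows
        "pr_0_4": sum(delta) / sum(total),  # PR: 0-4 Hz power over 0-30 Hz power
        "mean_freq": weighted / sum(total), # SF: mean frequency of 0-30 Hz
        # DR: share of windows dominated by alpha (hypothetical criterion).
        "alpha_ratio": sum(a > alpha_share * t
                           for a, t in zip(alpha, total)) / N_WIN,
    }
```

For a pure 10 Hz sine sampled at 256 Hz, every window is alpha-dominated, so `alpha_ratio` is 1 and `mean_freq` is about 10 Hz. An all-zero epoch would divide by zero; a real pipeline would need to guard against that and to normalize the features to the 0 to 1 range used by the fuzzy sets.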

The relations between the features and the R&K rules are also given in Table I. The feature “Alpha E” corresponds to the duration of alpha (8-13 Hz) activity in the epoch for

0018-9294 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


stage Wake. The “Spindle E” corresponds to the duration of spindle activity for stage S2. The “0-4 E” and “SWS E” correspond to the magnitude and duration of delta activity for stage SWS. The “Amp M” corresponds to the magnitude of the EMG for stage REM. More details of these features can be found in reference [8]. Most of the features we used have a physiological meaning corresponding to the characteristics of sleep signals described in the R&K criteria [3]. For example, the feature “Alpha E” is a quantified value of the alpha ratio for each epoch. According to the R&K criteria, the signature of the Wake stage is that >50% of the epoch consists of alpha (8-13 Hz) activity or low-voltage mixed-frequency activity. Therefore, the feature “Alpha E” can be used to determine Wake stages. Similarly, the feature “SWS E” is a quantified value of the SWS ratio for each epoch. According to the R&K criteria, the identifying feature of SWS is that >20% of the epoch consists of high-voltage (>75 μV) and low-frequency (75 μV) and low-frequency (20 Hz) with higher muscle tone, is observed during the Wake stages. So, the logic of Rule 1 is that, when alpha EEG is high (Alpha E is high) and EMG is high (0-30 M is high) and low-frequency (delta, 0-4 Hz) EEG is low (0-4 E is low), the rule concludes that the stage is Wake. Similarly, for Rule 2, 50% of the page (epoch) consists of relatively low-voltage mixed-frequency (3-7 Hz) activity during the S1 stage according to the R&K rules. Thus, the logic of Rule 2 is that, when alpha EEG is high (Alpha E is high) and EMG is high (0-30 M is high) and low-frequency EEG is middle or high (0-4 E is mid OR high) and spindle EEG is middle or low (Spindle E is low OR Spindle E is mid), the rule concludes that the stage is S1.


Fig. 1. The flow chart of the proposed automatic sleep staging system.

Fig. 2. Histogram of each feature in Table I for the Wake, S1, S2, SWS, and REM stages. The X-axis represents the normalized feature values and the Y-axis represents the number of epochs.


Moreover, the main characteristic of the proposed fuzzy inference system is that multiple features are considered concurrently to determine a specific sleep stage. For example, we primarily use the feature Spindle E to detect spindles for classifying sleep stage 2, following the R&K rules. However, other features such as Alpha E, 0-30 M, 0-4 E, SWS E, and Amp M are also used in fuzzy rules no. 4-7 to distinguish S2 from S1, SWS, and REM. These features were chosen by observing the feature distribution in each stage, as shown in Fig. 2. Similarly, fuzzy rule 9 is used to detect the REM stage. In addition to the primary feature Amp M, which is used to distinguish REM from wakefulness or light sleep, four other features, Alpha E, 0-30 M, Spindle E, and Mean(fre.) E, are also used in rule no. 9.

TABLE II The description of the nine rules for automatic sleep scoring.
Rule No. | Target sleep stage | Rule description
1 | Wake | IF Alpha E is high AND 0-30 M is high AND 0-4 E is low THEN out is WAKE.
2 | S1 | IF Alpha E is high AND 0-30 M is high AND (0-4 E is mid OR 0-4 E is high) AND (Spindle E is low OR Spindle E is mid) THEN out is S1.
3 | S1 | IF (Alpha E is low OR 0-30 M is low) AND Mean(fre.) E is high AND Amp M is high AND (Spindle E is low OR Spindle E is mid) THEN out is S1.
4 | S2 | IF Alpha E is high AND 0-30 M is high AND (0-4 E is mid OR 0-4 E is high) AND Spindle E is high THEN out is S2.
5 | S2 | IF (Alpha E is low OR 0-30 M is low) AND Mean(fre.) E is low AND SWS E is low THEN out is S2.
6 | S2 | IF (Alpha E is low OR 0-30 M is low) AND Mean(fre.) E is high AND Amp M is low AND (Spindle E is high OR (Spindle E is mid AND 0-30 E is high)) THEN out is S2.
7 | S2 | IF (Alpha E is low OR 0-30 M is low) AND Mean(fre.) E is high AND Amp M is high AND Spindle E is high THEN out is S2.
8 | SWS | IF (Alpha E is low OR 0-30 M is low) AND Mean(fre.) E is low AND SWS E is high THEN out is SWS.
9 | REM | IF (Alpha E is low OR 0-30 M is low) AND Mean(fre.) E is high AND Amp M is low AND (Spindle E is low OR (Spindle E is mid AND 0-30 E is low)) THEN out is REM.
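The nine rules of Table II can be evaluated with a short sketch such as the one below. Several parts are assumptions rather than details taken from the paper: AND is modeled as `min` and OR as `max`, the membership functions are simple piecewise-linear shapes with made-up breakpoints shared by all features (in the paper each fuzzy set has its own GA-tuned parameters), and the function names are hypothetical. The final decision, taking the stage of the rule with the largest activation, follows the description in the text.

```python
def low(x, a=0.3, b=0.6):
    """Membership 1 below a, falling linearly to 0 at b (hypothetical shape)."""
    return 1.0 if x <= a else 0.0 if x >= b else (b - x) / (b - a)


def high(x, a=0.4, b=0.7):
    """Membership 0 below a, rising linearly to 1 at b (hypothetical shape)."""
    return 0.0 if x <= a else 1.0 if x >= b else (x - a) / (b - a)


def mid(x, a=0.2, b=0.5, c=0.8):
    """Triangular membership peaking at b (hypothetical shape)."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))


def classify(f):
    """Return the stage of the most strongly activated rule of Table II.

    f maps the feature labels of Table I to values normalized to [0, 1].
    """
    A, M, D = f["Alpha E"], f["0-30 M"], f["0-4 E"]
    S, F, W = f["Spindle E"], f["Mean(fre.) E"], f["SWS E"]
    E, P = f["0-30 E"], f["Amp M"]
    active = min(high(A), high(M))   # Alpha E is high AND 0-30 M is high
    quiet = max(low(A), low(M))      # Alpha E is low OR 0-30 M is low
    delta_up = max(mid(D), high(D))  # 0-4 E is mid OR high
    few_spindles = max(low(S), mid(S))
    rules = [
        ("Wake", min(active, low(D))),                       # rule 1
        ("S1", min(active, delta_up, few_spindles)),         # rule 2
        ("S1", min(quiet, high(F), high(P), few_spindles)),  # rule 3
        ("S2", min(active, delta_up, high(S))),              # rule 4
        ("S2", min(quiet, low(F), low(W))),                  # rule 5
        ("S2", min(quiet, high(F), low(P),
                   max(high(S), min(mid(S), high(E))))),     # rule 6
        ("S2", min(quiet, high(F), high(P), high(S))),       # rule 7
        ("SWS", min(quiet, low(F), high(W))),                # rule 8
        ("REM", min(quiet, high(F), low(P),
                    max(low(S), min(mid(S), low(E))))),      # rule 9
    ]
    return max(rules, key=lambda r: r[1])[0]
```

With these placeholder memberships, an epoch with high Alpha E, high 0-30 M and low 0-4 E activates only rule 1 and is labeled Wake. When two rules tie, `max` returns the first one in the list; the paper does not say how ties are resolved.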

After calculating the activation values of all fuzzy rules, the Takagi-Sugeno fuzzy system [28] was applied in our method to make the decision. For each epoch, the inputs of the fuzzy inference system are the eight feature values and the output is the stage of the rule that has the maximum activation value among the nine rules. For example, the epoch is classified as Wake if fuzzy Rule 1 has the maximum activation value.

2.2) Genetic algorithm

After constructing the fuzzy rules and the fuzzy sets based on human knowledge and the distributions of feature values, the genetic algorithm (GA) was utilized to fine-tune the membership functions of the fuzzy sets to improve the recognition rate of the developed fuzzy inference system. The genetic algorithm is often applied to solve multi-parameter problems and it has also been utilized to construct optimal membership functions of fuzzy sets [29, 30]. Fig. 3 shows how the genetic algorithm [30] is applied in our system for membership function optimization.

Each chromosome is used to build the fuzzy sets of the fuzzy inference system. First, the GA randomly generated values (ranging from 0 to 1) for the initial population and assessed the fitness of each chromosome, defined as the overall accuracy on the training data. After generating the initial population or producing a new population, the chromosomes were sorted according to their fitness. The 50 best chromosomes were selected for the next population, and the remaining new chromosomes were produced by crossover and mutation from the best 50 chromosomes. Each chromosome X consisted of 40 real numbers, the population size was 100, and the number of generations was 200. In our proposed method, two-point crossover was used and the crossover rate was 0.95. The two-point crossover scheme randomly chose two crossover segments of the same length and exchanged these segments between the strings. The mutation rate was set to 0.01. After crossover, if a random number was lower than the mutation rate, we randomly produced a number (ranging from 0 to 1) and a gene index of the chromosome, and replaced the value of the indexed gene with the random number.

After the initialization, crossover and mutation processes, the genetic algorithm produced the best parameter set for the fuzzy inference system. The resultant fuzzy sets and the parameters of the chromosome after training are shown in Fig. 4 and Table III, respectively. Note that the x-axis and y-axis in Fig. 4 represent the values of the features (from 0 to 1) and the fuzzy variables, respectively.

[Fig. 3 flow: initialization of parameters (population size, chromosome size, crossover rate, mutation rate); generation of the initial population; fitness evaluation with the fuzzy inference system; if the last generation has been reached, finish; otherwise produce a new population and evaluate it again.]

Fig. 3. The flow chart of the genetic algorithm for optimization of our system.
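The loop in Fig. 3 can be sketched as follows, using the settings reported in the text (40 genes in the range 0 to 1, population of 100, 200 generations, the 50 best chromosomes carried over, crossover rate 0.95, mutation rate 0.01). This is only an illustration under assumptions: `run_ga` and `two_point_crossover` are hypothetical names, parents are drawn uniformly from the 50 survivors, and the crossover is the standard two-cut-point variant, which is one reading of the description above. In the paper the fitness is the overall training accuracy of the fuzzy inference system built from the chromosome; here it is any callable supplied by the user.

```python
import random


def two_point_crossover(p1, p2, rng):
    """Swap the segment between two random cut points of the parents."""
    i, j = sorted(rng.sample(range(len(p1) + 1), 2))
    return p1[:i] + p2[i:j] + p1[j:], p2[:i] + p1[i:j] + p2[j:]


def run_ga(fitness, n_genes=40, pop_size=100, n_keep=50, generations=200,
           cx_rate=0.95, mut_rate=0.01, seed=0):
    """Maximize fitness(chromosome) over real-valued genes in [0, 1)."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[:n_keep]                 # best chromosomes survive
        children = []
        while len(children) < pop_size - n_keep:
            a, b = rng.sample(elite, 2)      # assumed parent selection
            if rng.random() < cx_rate:
                a, b = two_point_crossover(a, b, rng)
            for child in (a, b):
                child = list(child)
                if rng.random() < mut_rate:  # replace one random gene
                    child[rng.randrange(n_genes)] = rng.random()
                children.append(child)
        pop = elite + children[:pop_size - n_keep]
    return max(pop, key=fitness)
```

Because the best chromosomes are always carried over, the best fitness in the population never decreases from one generation to the next. A toy fitness such as the negative squared distance to a target vector is enough to exercise the loop before plugging in a real classifier.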

3) Contextual rule smoothing

Sleep staging has periodicity and continuity from light to deep [3]. General classifiers may not consider temporal contextual information, but the expert may refer to the neighboring epochs in addition to the current epoch to make decisions. Therefore, after classifying the sleep stage using the GA-fuzzy inference system, a smoothing process, considering the temporal contextual information, was applied for continuity [3].


[Fig. 4 panels: (a) 0-30 E (EEG), (b) 0-30 M (EMG), (c) 0-4 E (EEG), (d) Mean(fre.) E (EEG), (e) Alpha E (EEG), (f) Spindle E (EEG), (g) SWS E (EEG), (h) Amp M (EMG). Each panel plots the fuzzy sets over feature values from 0 to 1, with the gene labels X1 to X40 marking the membership-function parameters.]

Fig. 4. The final membership functions of the fuzzy sets determined by the genetic algorithm. The symbols “L”, “M”, and “H” represent the sets of low, mid, and high, respectively. The x-axis represents the value of the feature (from 0 to 1) and the y-axis represents the possibility.

TABLE III The best parameters of the chromosome after training by GA (total 40 genes).
Gene | Value | Gene | Value | Gene | Value | Gene | Value
x1 | 0.001495 | x11 | 0.454787 | x21 | 0.969481 | x31 | 0.059114
x2 | 0.629536 | x12 | 0.674123 | x22 | 0.984466 | x32 | 0.112125
x3 | 0.542222 | x13 | 0.707175 | x23 | 0.014130 | x33 | 0.028291
x4 | 0.959105 | x14 | 0.84582 | x24 | 0.71633 | x34 | 0.618549
x5 | 0.301614 | x15 | 0.639485 | x25 | 0.111911 | x35 | 0.205512
x6 | 0.841639 | x16 | 0.777276 | x26 | 0.897671 | x36 | 0.237342
x7 | 0.027497 | x17 | 0.027772 | x27 | 0.033814 | x37 | 0.358867
x8 | 0.086337 | x18 | 0.196844 | x28 | 0.360363 | x38 | 0.889248
x9 | 0.052431 | x19 | 0.92819 | x29 | 0.76516 | x39 | 0.047670
x10 | 0.35008 | x20 | 0.960021 | x30 | 0.919828 | x40 | 0.180578

Table IV shows the rules for smoothing in the present method. These rules refer to the relationship between the epochs prior to and after the current epoch. For example, three consecutive epochs of S2, REM, and S2 were replaced with the sequence S2, S2, and S2. Similarly, consecutive epochs of REM, S1, and REM were replaced with the sequence REM, REM, and REM.

TABLE IV Lists of the smoothing rules.
Rule No. | Modification
1 | Any REM epochs before the very first appearance of S2 are replaced with S1 epochs
2 | Wake, REM, S2 → Wake, S1, S2
3 | S1, REM, S2 → S1, S1, S2
4 | S2, S1, S2 → S2, S2, S2
5 | S2, SWS, S2 → S2, S2, S2
6 | S2, REM, S2 → S2, S2, S2
7 | SWS, S2, SWS → SWS, SWS, SWS
8 | REM, Wake, REM → REM, REM, REM
9 | REM, S1, REM → REM, REM, REM
10 | REM, S2, REM → REM, REM, REM
11 | Mov, REM, S2 → Mov, S1, S2
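The smoothing rules of Table IV translate naturally into a lookup over three-epoch patterns, as in the sketch below. The order of application is an assumption (rule 1 first, then a single left-to-right pass that modifies the sequence in place), as is the handling of a recording that contains no S2 at all, where rule 1 is read literally and every REM epoch is relabeled S1. The names `TRIPLET_RULES` and `smooth` are hypothetical.

```python
# Rules 2-11 of Table IV: (previous, current, next) -> new label for current.
TRIPLET_RULES = {
    ("Wake", "REM", "S2"): "S1",    # rule 2
    ("S1", "REM", "S2"): "S1",      # rule 3
    ("S2", "S1", "S2"): "S2",       # rule 4
    ("S2", "SWS", "S2"): "S2",      # rule 5
    ("S2", "REM", "S2"): "S2",      # rule 6
    ("SWS", "S2", "SWS"): "SWS",    # rule 7
    ("REM", "Wake", "REM"): "REM",  # rule 8
    ("REM", "S1", "REM"): "REM",    # rule 9
    ("REM", "S2", "REM"): "REM",    # rule 10
    ("Mov", "REM", "S2"): "S1",     # rule 11
}


def smooth(stages):
    """Return a smoothed copy of a list of per-epoch stage labels."""
    out = list(stages)
    # Rule 1: REM epochs before the first S2 become S1 (all of them if the
    # recording has no S2, which is a literal reading of the rule).
    first_s2 = out.index("S2") if "S2" in out else len(out)
    for i in range(first_s2):
        if out[i] == "REM":
            out[i] = "S1"
    # Rules 2-11: one left-to-right pass, updating in place (assumed order).
    for i in range(1, len(out) - 1):
        key = (out[i - 1], out[i], out[i + 1])
        if key in TRIPLET_RULES:
            out[i] = TRIPLET_RULES[key]
    return out
```

For example, `smooth(["S2", "REM", "S2"])` yields three S2 epochs, matching the example in the text. Because the pass updates labels in place, an earlier replacement can change which pattern a later epoch sees; a two-buffer variant would behave differently on overlapping patterns.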

4) Movement epochs elimination

After smoothing, an elimination procedure was applied to the MT epochs following the AASM scoring method [4]. The final result of staging (hypnogram) was still characterized by five stages (Wake, S1, S2, SWS, and REM).

D. Performance evaluation

The performance of the proposed method was evaluated with a confusion matrix, which is the typical evaluation method for multi-class classification problems. The overall agreement (accuracy, AC), sensitivity (SE), and specificity (SP) of each sleep stage were also calculated. They are defined as:


$$\mathrm{AC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\mathrm{SE} = \frac{TP}{TP + FN} \tag{2}$$

$$\mathrm{SP} = \frac{TN}{TN + FP} \tag{3}$$

where TP is the total number of correct positive classifications, TN is the total number of correct negative classifications, FP is the total number of erroneous positive classifications, and FN is the total number of erroneous negative classifications.

In addition, Cohen’s kappa coefficient [31] was calculated for each subject to assess the robustness of our system. Cohen’s kappa coefficient (κ) is a statistical measure of inter-rater agreement between two raters. It is generally thought to be a more robust measure than simple percent agreement because κ takes into account the agreement that occurs by chance. The interpretation of kappa coefficients [31] is as follows: values less than 0.00 indicate no agreement; 0.00 to 0.20 indicate slight agreement; 0.21 to 0.40 indicate fair agreement; 0.41 to 0.60 indicate moderate agreement; 0.61 to 0.80 indicate substantial agreement; and values above 0.80 indicate excellent agreement.

III. EXPERIMENTAL RESULTS

Several experiments on multiple datasets were performed to evaluate the proposed method: (1) performance comparison between the proposed method and the rule-based method [8] on dataset PDB-1; (2) subject-by-subject performance evaluation of our method on PDB-1; (3) performance of our proposed method on PDB-2; (4) performance of an integrated genetic fuzzy inference system on data from subjects with good and poor sleep efficiency as well as subjects with insomnia from PDB-1 and PDB-2; (5) application of the proposed method to a publicly available sleep database [32] and comparison with two existing methods [33, 34] using the same database.

A. Performance of our method on PDB-1

All-night PSG sleep recordings obtained from 32 subjects were used. In PDB-1, half of the subjects had sleep efficiencies equal to or higher than 85% and the other half had sleep efficiencies lower than 85%. In order to effectively construct and evaluate our method, the PSG data were sorted according to sleep efficiency. From the sorted list, the data from the odd-numbered subjects (16 subjects) were used to train the system, and the data from the other 16 subjects were used for testing. Tables V (a) and (b) show the confusion matrices of the five-stage epoch classification obtained with the rule-based method [8] and our method, respectively. The identical smoothing process was applied to both methods. The rows and columns are the results staged by the expert and by the automatic sleep scoring method, respectively. Unidentified signals and movement epochs are not taken into account here. The overall agreement, sensitivity, specificity, and kappa coefficient of each sleep stage for the rule-based method [8] and our method are shown in Table V.

As shown in Table V (a), the overall agreement between the expert and the rule-based system was 85.85%, which is higher than the range of inter-scorer agreement [20]. The kappa value is 0.79, indicating substantial agreement. Most of the misclassifications occur at stage transitions. The sleep process is continuous and epochs during stage transitions are not typical, so these epochs are more likely to be erroneously classified by the hard thresholds of the rule-based system. Therefore, we propose the genetic fuzzy logic system in this paper to enhance the scoring performance.

TABLE V Confusion matrix between two computer scoring methods and the visual scorings on 16 subjects from PDB-1.
(A) Method in [8]

Expert \ Computer | Wake | S1 | S2 | SWS | REM | Total | SE(%)
Wake | 1189 | 78 | 98 | 0 | 14 | 1379 | 86.22
S1 | 72 | 232 | 65 | 1 | 95 | 465 | 49.89
S2 | 115 | 235 | 5734 | 165 | 219 | 6468 | 88.65
SWS | 4 | 6 | 337 | 1861 | 13 | 2221 | 83.79
REM | 49 | 165 | 123 | 2 | 2249 | 2588 | 86.90
Specificity | 0.98 | 0.96 | 0.88 | 0.99 | 0.97 | |
Overall | | | | | | 13121 | 85.85
kappa | 0.79

(B) The proposed method
Expert \ Computer | Wake | S1 | S2 | SWS | REM | Total | SE(%)
Wake | 1258 | 21 | 92 | 0 | 8 | 1379 | 91.23
S1 | 85 | 159 | 100 | 1 | 120 | 465 | 34.19
S2 | 110 | 80 | 5856 | 243 | 179 | 6468 | 90.54
SWS | 4 | 4 | 227 | 1970 | 16 | 2221 | 88.70
REM | 73 | 54 | 162 | 5 | 2294 | 2588 | 88.64
Specificity | 0.98 | 0.99 | 0.91 | 0.98 | 0.97 | |
Overall | | | | | | 13121 | 87.93
kappa | 0.82
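As a worked check of Eqs. (2) and (3) and of Cohen's kappa, the short sketch below recomputes the summary figures of Table V (B) from its confusion matrix. The function name `stage_metrics` is hypothetical; each stage is treated one-versus-rest to obtain TP, FN, FP and TN, and kappa is computed from the observed agreement and the chance agreement implied by the row and column totals.

```python
def stage_metrics(cm, labels):
    """Overall agreement, Cohen's kappa and per-stage SE/SP.

    cm[i][j] is the number of epochs scored as stage i by the expert
    (rows) and as stage j by the computer (columns).
    """
    n = sum(map(sum, cm))
    row = [sum(r) for r in cm]
    col = [sum(c) for c in zip(*cm)]
    agreement = sum(cm[i][i] for i in range(len(cm))) / n
    chance = sum(r * c for r, c in zip(row, col)) / n ** 2
    kappa = (agreement - chance) / (1 - chance)
    per_stage = {}
    for i, name in enumerate(labels):
        tp = cm[i][i]
        fn = row[i] - tp
        fp = col[i] - tp
        tn = n - tp - fn - fp
        per_stage[name] = {"SE": tp / (tp + fn), "SP": tn / (tn + fp)}
    return agreement, kappa, per_stage


LABELS = ["Wake", "S1", "S2", "SWS", "REM"]
TABLE_V_B = [
    [1258, 21, 92, 0, 8],
    [85, 159, 100, 1, 120],
    [110, 80, 5856, 243, 179],
    [4, 4, 227, 1970, 16],
    [73, 54, 162, 5, 2294],
]
agreement, kappa, per_stage = stage_metrics(TABLE_V_B, LABELS)
```

Running this on the matrix above gives an overall agreement of about 87.93%, a kappa of about 0.82, and a Wake sensitivity of about 91.23%, in line with the values reported in Table V (B).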

Table V (b) shows the performance of the proposed genetic fuzzy inference system. The overall agreement between the expert and the proposed system was 87.93% and the sensitivities for all stages, except for S1, were higher than 88%. In addition, the sensitivities for Wake and S2 were higher than 90%. The specificity for all stages is at least 91%. Compared to Table V (a), the results for Wake, S2, SWS, REM, and overall increased by 5.01, 1.89, 4.91, 1.74 and 2.08 percentage points, respectively. It was also observed that the kappa of our system shows excellent agreement (0.82). In addition, the performance difference between the proposed method and the method in [8] for subject-by-subject five-stage sleep scoring was analyzed statistically with the paired t-test. The sensitivities (mean ± sd (%)) with respect to the five sleep stages and the overall agreement for the proposed method and the method in [8] are Wake: 93.92 ± 5.48 and 87.37 ± 14.23, S1: 33.75 ± 19.39 and 44.83 ± 15.83, S2: 90.69 ± 4.37 and 90.48 ± 5.63, SWS: 88.1 ± 10.3 and 74.49 ± 25.4, REM: 87.16 ± 8.73 and 87.68 ± 11.44, overall agreement: 87.76 ± 3.65 and 85.43 ±


6.56, respectively. The accuracies of the proposed method on Wake (p=0.027), SWS (p=0.004), and overall agreement (p=0.024) are statistically superior to those of the method in [8]. The method in [8] performs better only on S1 (p=0.005). The results demonstrate that the performance of the automatic sleep scoring method can be further improved by using GA and fuzzy logic techniques.

B. Subject-by-subject performance evaluation on PDB-1

Table VI shows the subject-by-subject sleep efficiencies, overall agreements, and kappa coefficients of the manual scoring and our automatic scoring on the first private database. The differences between these two scorings are also given. The average kappa value of the 16 testing subjects was κ = 0.81±0.04, and the individual kappa ranged from 0.7 to 0.87. The individual kappa values indicated at least substantial agreement (>0.7), and the kappa values of 12 subjects were excellent (> 0.8).

In addition, Tables VI (a) and (b) present the results of the good and poor sleep efficiency groups, respectively. The average difference between the sleep efficiency obtained by manual scoring and by automatic sleep staging in the good sleep efficiency group is 0.86% (S.D. = 1.41%). The average kappa value of the proposed method is 0.83 (S.D. = 0.02). The average difference in the poor sleep efficiency group is -2.42% (S.D. = 3.91%). The average kappa value of the proposed method is 0.80 (S.D. = 0.06). This analysis demonstrates that the proposed method is robust and effective for scoring the sleep stages of subjects with both good and poor sleep efficiencies. The accuracies of both groups are in the inter-scorer agreement range for clinical applications (>82.6%) [20] with excellent agreement (kappa>0.8). Fig. 5 shows the hypnograms of subject no. 7 (from the good sleep efficiency group) and no. 11 (from the poor sleep efficiency group), including the manual scoring by the expert and the automatic staging. The automatically scored hypnograms are close to the hypnograms scored by the expert.

TABLE VI Subject-by-subject sleep efficiencies, overall agreements, and kappa coefficients of the automatic scorings compared with the manual scorings.
(A) The subjects’ sleep efficiency ≥ 85%
Subject number | Manual | Auto | Diff | Agreement | Kappa
1 | 97.26% | 95.47% | 1.79% | 91.27% | 0.85
2 | 95.71% | 94.07% | 1.65% | 87.07% | 0.80
3 | 94.09% | 90.42% | 3.67% | 89.73% | 0.83
4 | 93.70% | 93.80% | -0.11% | 89.80% | 0.83
5 | 93.51% | 92.28% | 1.24% | 88.11% | 0.82
6 | 90.78% | 90.88% | -0.10% | 90.34% | 0.83
7 | 90.16% | 90.98% | -0.83% | 90.26% | 0.83
8 | 87.28% | 87.70% | -0.42% | 89.47% | 0.85
avg. | 92.81% | 91.95% | 0.86% | 89.51% | 0.83
std. | 3.02% | 2.32% | 1.41% | 1.33% | 0.02

(B) The subjects’ sleep efficiency < 85%
Subject number | Manual | Auto | Diff | Agreement | Kappa
9 | 84.95% | 82.77% | 2.18% | 89.81% | 0.85
10 | 84.66% | 86.04% | -1.38% | 86.55% | 0.80
11 | 79.56% | 82.40% | -2.84% | 90.04% | 0.85
12 | 78.15% | 80.54% | -2.39% | 78.66% | 0.70
13 | 78.02% | 79.57% | -1.55% | 83.24% | 0.78
14 | 72.29% | 72.29% | 0.00% | 85.76% | 0.79
15 | 64.02% | 66.19% | -2.16% | 91.77% | 0.87
16 | 53.13% | 64.37% | -11.24% | 82.32% | 0.75
avg. | 74.35% | 76.77% | -2.42% | 86.02% | 0.80
std. | 10.92% | 8.12% | 3.91% | 4.46% | 0.06

Fig. 5. The hypnogram of subject no. 7 and no. 11: (a) the expert scored hypnogram of subject no. 7, (b) the automatic staging hypnogram of subject no. 7, and (c) the expert scored hypnogram of subject no. 11, (d) the automatic staging hypnogram of subject no. 11.

C. Performance of our method on PDB-2

All-night PSG sleep recordings obtained from 16 subjects with insomnia were used to confirm the robustness and clinical applicability of the proposed method. The average sleep efficiency of PDB-2 was 71.07% (< 85%). The PSG data were also sorted according to sleep efficiency; the data from the odd-numbered subjects (8 subjects) were used to train the system, and the data from the other 8 subjects were used for testing. Table VII shows the performance of the proposed genetic fuzzy inference system on PDB-2. The overall agreement between the expert and the proposed system was 81.77% and the sensitivities for all stages, except for S1, were higher than 83%. In addition, the sensitivities for S2 and SWS were higher than 85%. The specificity for all stages is at least 92%. It was also observed that the kappa of our system shows substantial agreement (0.75). The result demonstrates that the agreement between the expert and our proposed method is still higher than 80% even for subjects with insomnia.


TABLE VII Confusion matrix between the computer scorings and the visual scorings on 8 subjects with insomnia from PDB-2.
Expert \ Computer | Wake | S1 | S2 | SWS | REM | Total | SE(%)
Wake | 1577 | 125 | 75 | 12 | 110 | 1899 | 83.04
S1 | 87 | 122 | 61 | 4 | 84 | 358 | 34.08
S2 | 46 | 93 | 2113 | 101 | 128 | 2481 | 85.17
SWS | 4 | 0 | 105 | 782 | 7 | 898 | 87.08
REM | 32 | 26 | 61 | 13 | 671 | 803 | 83.56
Specificity | 0.96 | 0.96 | 0.92 | 0.98 | 0.94 | |
Overall | | | | | | 6439 | 81.77
kappa | 0.75

D. Performance of the integrated system on PDB-1 and PDB-2

Experiments A and C demonstrate that the knowledge of experts in sleep scoring and the flexibility of fuzzy systems in reasoning can be integrated to develop automatic sleep staging systems for healthy subjects and for subjects with insomnia. Finally, an integrated sleep scoring system that can be applied to the data from subjects with (1) good and (2) poor sleep efficiency as well as (3) subjects with insomnia was designed. Because the sleep patterns of these subject groups may be different [24], the two fuzzy inference systems developed on database PDB-1 (denoted as GA-fuzzy model-1) and database PDB-2 (denoted as GA-fuzzy model-2) were integrated as shown in Fig. 6.

When a PSG recording arrives, the extracted features are first classified by GA-fuzzy model-1, and the sleep efficiency of the subject is then estimated from the results of GA-fuzzy model-1. If the estimated sleep efficiency is lower than 75%, the sleep quality of the subject is considered poor and the subject has a high possibility of suffering from insomnia; in that case, GA-fuzzy model-2 is used to classify the sleep stages. Otherwise, the subject is considered a healthy individual and the results of GA-fuzzy model-1 are adopted directly.

Fig. 6. The flow chart of the integrated automatic sleep staging system.

The data from 24 subjects (8 with good sleep efficiency from PDB-1, 8 with poor sleep efficiency from PDB-1, and 8 with insomnia from PDB-2) that were not used to train GA-fuzzy model-1 and GA-fuzzy model-2 were used to test this system. The experimental results are shown in Table VIII. The overall agreement between the expert and the proposed system was 86.44%, and the sensitivities for all stages except S1 were higher than 86%. The specificity for all stages was at least 92%. The kappa coefficient of the integrated system (0.81) indicates excellent agreement. Applying the rule-based method in [8] to the same data, the sensitivities for Wake, S1, S2, SWS, REM, and overall were 79.19%, 34.51%, 83.22%, 84.42%, 84.49%, and 80.9%, respectively, with a kappa coefficient of 0.73. These results show that the overall agreement of our integrated system (86.44%) remains within the inter-scorer agreement range for clinical applications (>82.6%) [20], with excellent agreement (kappa > 0.8) for automatic scoring of data from multiple subject groups. In contrast, the agreement of the rule-based method [8] falls below the inter-scorer agreement range when data from subjects with insomnia are included.

TABLE VIII
Confusion matrix between the computer scorings and the visual scorings on 24 subjects from PDB-1 and PDB-2.

                 Computer
Expert           Wake     S1      S2      SWS     REM     Total    SE(%)
Wake             2845     146     157     12      118     3278     86.79
S1               170      290     160     5       198     823      35.24
S2               124      173     8018    344     290     8949     89.60
SWS              8        4       320     2764    23      3119     88.62
REM              105      80      198     18      2990    3391     88.17
Overall                                                   19560    86.44
Specificity      0.97     0.98    0.92    0.98    0.96
kappa            0.81
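As a check on how the statistics in Table VIII follow from the raw counts, the sketch below recomputes the per-stage sensitivity and specificity, the overall agreement, and Cohen's kappa [31] from the confusion matrix. The function and variable names are illustrative assumptions; the counts are copied from Table VIII, with rows as expert scorings and columns as computer scorings.

```python
STAGES = ["Wake", "S1", "S2", "SWS", "REM"]

# Counts from Table VIII: rows = expert scoring, columns = computer scoring.
TABLE_VIII = [
    [2845, 146, 157, 12, 118],
    [170, 290, 160, 5, 198],
    [124, 173, 8018, 344, 290],
    [8, 4, 320, 2764, 23],
    [105, 80, 198, 18, 2990],
]


def agreement_stats(cm):
    """Return (sensitivity, specificity, overall agreement, Cohen's kappa)."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(row) for row in cm]                              # expert totals
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]   # computer totals
    diag = [cm[i][i] for i in range(k)]

    # Sensitivity: correctly scored epochs of a stage / expert epochs of that stage.
    sensitivity = [diag[i] / row_tot[i] for i in range(k)]
    # Specificity: true negatives / all epochs the expert did not assign to the stage.
    specificity = [
        (n - row_tot[i] - (col_tot[i] - diag[i])) / (n - row_tot[i])
        for i in range(k)
    ]
    p_observed = sum(diag) / n
    p_chance = sum(r * c for r, c in zip(row_tot, col_tot)) / n ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return sensitivity, specificity, p_observed, kappa


if __name__ == "__main__":
    sens, spec, overall, kappa = agreement_stats(TABLE_VIII)
    for name, se, sp in zip(STAGES, sens, spec):
        print(f"{name}: SE = {se * 100:.2f}%, specificity = {sp:.4f}")
    print(f"overall agreement = {overall * 100:.2f}%, kappa = {kappa:.2f}")
```

With the counts above, the overall agreement and kappa from this pooled computation come out at 86.44% and 0.81, matching the values reported in the table.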

E. Performance of the method on public dataset

To further assess and demonstrate the abilities of our method, we also applied it to a publicly available sleep database (International Database PhysioNet Sleep Recordings, http://www.physionet.org) [32] that provides sleep recordings and the corresponding hypnograms in European Data Format. The EEG recording site is Pz-Oz, and the sampling rate is 100 Hz for EEG and EOG in both data sets. The sampling rates for EMG are 1 Hz and 100 Hz in the sc* and st* recordings, respectively. Because the EMG sampling rate in the sc* recordings is too low compared with the recommendation of our method, the st* recordings (4 subjects) were used in our experiment. The sampling rate and the recording site of the EEG in the st* recordings (100 Hz, Pz-Oz) still differ from those of the recordings used to develop our method (256 Hz, C3-A2); therefore, we went through a training phase again to fine-tune the system parameters. We also sorted the subject list according to sleep efficiency. From the sorted list, the odd-numbered subjects were used to train the system, and the even-numbered subjects were used for testing.


Table IX shows the confusion matrix of the five-stage epoch classification by the proposed sleep staging method. The testing dataset for performance estimation comprised 30-s epochs. As shown in Table IX, the overall accuracy of the proposed system was 81.34%. The sensitivities of S2, SWS and REM were over 85%. The specificity for all stages was higher than 93%. The overall kappa value was 0.72. The methods proposed in [33-37] have been applied to the same database. The method in [33] combined spectral features with hidden Markov models (HMM) for sleep staging. In [34], spectral and temporal features were integrated with an adaptive fuzzy logic iterative system. The kappa coefficients of the methods in [33] and [34] were 0.61 and 0.52, respectively. This means that the proposed automatic sleep staging system is in substantial agreement with the expert and that its results are superior to those of the existing methods in [33] and [34]. Because the methods in [35] and [36] combined stages S1 and REM into a single stage, and the method in [37] removed epochs in which the quality of the PSG signals did not satisfy its requirements, the results of these methods were not included, to avoid an unfair comparison.

TABLE IX
Confusion matrix between our proposed method and the visual scoring for the public dataset in [32].

                 Computer
Expert           Wake     S1      S2      SWS     REM     Total    SE(%)
Wake             137      45      16      5       13      216      63.43
S1               53       57      21      3       10      144      39.58
S2               22       22      633     41      20      738      85.77
SWS              3        0       23      332     0       358      92.74
REM              7        9       19      18      367     420      87.38
Overall                                                   1876     81.34
Specificity      0.95     0.96    0.93    0.96    0.97
kappa            0.72

IV. DISCUSSION

The three main ideas of our method are as follows. First, decision rules that realize and implement human knowledge in sleep scoring were utilized. In observing the PSG signals of each stage, we found that many types of signal features can be present within one stage. The scoring rules of our method were therefore designed based on both expert knowledge and data distributions. As a result, the developed multi-rule-based staging method, which considers various situations according to frequency-domain and time-domain features, performs better than methods that refer only to the R&K rules or rely on numerical classifiers such as neural networks [12, 15] or SVM [38].

Second, we utilized the fuzzy inference system to transform hard thresholds into soft thresholds, and the GA was applied to fine-tune the membership functions for optimization. The fuzzy system has the advantages of modeling human or approximate reasoning and of dealing with uncertainty within the data. These characteristics are especially suitable for sleep scoring, where manual scoring is inherently fuzzy because of confusing PSG data at sleep-stage transitions or in atypical epochs [20]. Therefore, the GA fuzzy inference system brings the automatic scoring method closer to manual scoring by an expert.
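The difference between a hard and a soft threshold can be illustrated with a small sketch. The piecewise-linear membership shape, the parameter names, and the numeric values below are assumptions chosen for illustration; they are not the membership functions or thresholds used in this study. In a GA-tuned system, parameters such as `threshold` and `width` would be the quantities encoded in each chromosome and adjusted to maximize agreement with the expert.

```python
def hard_rule(value, threshold):
    """Crisp decision: the rule either fires fully (1.0) or not at all (0.0)."""
    return 1.0 if value >= threshold else 0.0


def soft_rule(value, threshold, width):
    """Piecewise-linear membership centered on the threshold.

    The degree rises linearly from 0 at (threshold - width) to 1 at
    (threshold + width), so a feature value near the threshold fires the
    rule only partially instead of flipping between 0 and 1.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    degree = (value - (threshold - width)) / (2.0 * width)
    return min(1.0, max(0.0, degree))


if __name__ == "__main__":
    # Hypothetical feature values around a hypothetical threshold of 0.5.
    for value in (0.2, 0.45, 0.5, 0.55, 0.8):
        print(value, hard_rule(value, 0.5), soft_rule(value, 0.5, 0.25))
```

For an epoch whose feature value lies just below the threshold, the hard rule contributes nothing, whereas the soft rule still contributes a partial degree that can be weighed against the other rules; this is the behavior that suits ambiguous epochs at stage transitions.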

Finally, most previous studies focused only on either healthy subjects or patients being treated for sleep disorders. In other words, their testing subjects may not have included both subjects with good sleep efficiency and subjects with poor sleep efficiency, whose sleep patterns may differ [24]. To evaluate the applicability of an automatic sleep scoring method, it is essential that the data include both groups of subjects. Our experimental results show that the overall agreement of our method applied to both subject groups is higher than 86%. Our proposed method also maintains good accuracy (>80%) for subjects with insomnia. These results demonstrate the robustness and reliability of the proposed method.

In previous studies, several methods, including rule-based methods, the back-propagation neural network (BPNN), the hidden Markov model (HMM), the support vector machine (SVM) and the fuzzy classifier, have been applied to automated sleep staging [8, 12, 17, 21, 39]. Fraiwan et al. [24] reviewed recent sleep staging works, and the overall agreement of these studies was reported to be in the range of 70% to 87.5%. Although grouping the data from all subjects together may achieve higher classification accuracy [25], most automatic sleep scoring methods performed subject-independent evaluation to simulate practical conditions.

In biomedical and clinical applications, expert knowledge plays an important role because clinical diagnoses are mainly based on the judgment of clinical staff and experts. The results of automatic interpretation by a machine or computer are also evaluated by experts. In this study, the fuzzy rules were designed based on the knowledge of experts, and the results of our method are superior to those of the existing numerical methods [33, 34] applied to the same public dataset [32].

V. CONCLUSION

This paper studies the feasibility of integrating the knowledge of experts in scoring PSG data with the elasticity of fuzzy systems in reasoning and decision making to develop an automatic sleep staging system. The overall agreement and kappa coefficient of the proposed genetic fuzzy inference system applied to all-night PSGs of 16 subjects with good and poor sleep efficiencies (PDB-1) were 87.93% and 0.82, respectively. The overall agreement and kappa coefficient of our system on 8 subjects with insomnia (PDB-2) were 81.77% and 0.75, respectively. A sleep scoring system that integrates the two fuzzy inference models and performs robustly on various subject groups was also developed. The overall agreement and kappa coefficient of this integrated system applied to PSG data from 8 subjects with good sleep efficiency, 8 subjects with poor sleep efficiency and 8 subjects with insomnia were 86.44% and 0.81, respectively. Because home-based PSG is associated with better sleep efficiency [40], the proposed method can be combined with a portable PSG system [41] for sleep monitoring in clinical or homecare applications in the future.

REFERENCES
[1]

M. Ciolek, M. Niedzwiecki, S. Sieklicki, J. Drozdowski, and J. Siebert, "Automated Detection of Sleep Apnea and Hypopnea Events Based on Robust Airflow Envelope Tracking in the Presence of Breathing Artifacts," IEEE J. Biomed. Health Inform., DOI: 10.1109/JBHI.2014.2325997, 2014.
[2] G. Sannino, I. D. Falco, and G. D. Pietro, "An automatic rules extraction approach to support OSA events detection in a mHealth system," IEEE J. Biomed. Health Inform., DOI: 10.1109/JBHI.2014.2311325, 2014.
[3] A manual of standardized terminology, techniques and scoring system for sleep stages of human subjects. Allan Rechtschaffen and Anthony Kales, editors. Bethesda, Md: U. S. National Institute of Neurological Diseases and Blindness, Neurological Information Network, 1968.
[4] D. Moser, P. Anderer, G. Gruber, S. Parapatics, E. Loretz, M. Boeck, et al., "Sleep classification according to AASM and Rechtschaffen & Kales: effects on sleep scoring parameters," Sleep, vol. 32, p. 139, 2009.
[5] H. Danker-Hopfe, D. Kunz, G. Gruber, G. Klösch, J. Lorenzo, S. Himanen, et al., "Interrater reliability between scorers from eight European sleep laboratories in subjects with different sleep disorders," J. Sleep Res., vol. 13, pp. 63-69, 2004.
[6] H. Danker-Hopfe, P. Anderer, J. Zeitlhofer, M. Boeck, H. Dorn, G. Gruber, et al., "Interrater reliability for sleep scoring according to the Rechtschaffen & Kales and the new AASM standard," J. Sleep Res., vol. 18, pp. 74-84, 2009.
[7] P. Anderer, G. Gruber, S. Parapatics, M. Woertz, T. Miazhynskaia, G. Klösch, et al., "An E-health solution for automatic sleep classification according to Rechtschaffen and Kales: validation study of the Somnolyzer 24×7 utilizing the Siesta database," Neuropsychobiology, vol. 51, pp. 115-133, 2005.
[8] S.-F. Liang, C.-E. Kuo, Y.-H. Hu, and Y.-S. Cheng, "A rule-based automatic sleep staging method," J. Neurosci. Methods, vol. 205, pp. 169-176, 2012.
[9] H. Griessenberger, D. Heib, A. Kunz, K. Hoedlmoser, and M. Schabus, "Assessment of a wireless headband for automatic sleep scoring," Sleep Breath., pp. 1-6, 2013.
[10] C. Stepnowsky, D. Levendowski, D. Popovic, I. Ayappa, and D. M. Rapoport, "Scoring accuracy of automated sleep staging from a bipolar electroocular recording compared to manual scoring by multiple raters," Sleep Med., vol. 14, pp. 1199-1207, 2013.
[11] J. Virkkala, J. Hasan, A. Värri, S.-L. Himanen, and K. Müller, "Automatic sleep stage classification using two-channel electrooculography," J. Neurosci. Methods, vol. 166, pp. 109-115, 2007.
[12] N. Schaltenbrand, R. Lengelle, M. Toussaint, R. Luthringer, G. Carelli, A. Jacqmin, et al., "Sleep stage scoring using the neural network model: comparison between visual and automatic analysis in normal subjects and patients," Sleep, vol. 19, p. 26, 1996.
[13] E. Tagliazucchi, F. von Wegner, A. Morzelewski, S. Borisov, K. Jahnke, and H. Laufs, "Automatic sleep staging using fMRI functional connectivity data," Neuroimage, 2012.
[14] S.-T. Pan, C.-E. Kuo, J.-H. Zeng, and S.-F. Liang, "A transition-constrained discrete hidden Markov model for automatic sleep staging," Biomed. Eng. Online, vol. 11, pp. 1-19, 2012.
[15] M. Ronzhina, O. Janoušek, J. Kolářová, M. Nováková, P. Honzík, and I. Provazník, "Sleep scoring using artificial neural networks," Sleep Med. Rev., vol. 16, pp. 251-263, 2012.
[16] S.-F. Liang, C.-E. Kuo, Y.-H. Hu, Y.-H. Pan, and Y.-H. Wang, "Automatic Stage Scoring of Single-Channel Sleep EEG by Using Multiscale Entropy and Autoregressive Models," IEEE Trans. Instrum. Meas., vol. 61, pp. 1649-1657, 2012.
[17] G. Zhu, Y. Li, and P. Wen, "Analysis and Classification of Sleep Stages Based on Difference Visibility Graphs from a Single Channel EEG Signal," IEEE J. Biomed. Health Inform., DOI: 10.1109/JBHI.2014.2303991, 2014.
[18] H. Park, K. Park, and D.-U. Jeong, "Hybrid neural-network and rule-based expert system for automatic sleep stage scoring," in Proc. 22nd Annu. EMBS Int. Conf., 2000, pp. 1316-1319.
[19] F. Chapotot and G. Becq, "Automated sleep–wake staging combining robust feature extraction, artificial neural network classification, and flexible decision rules," Int. J. Adapt. Control Signal Process., vol. 24, pp. 409-423, 2010.
[20] R. S. Rosenberg and S. Van Hout, "The American Academy of Sleep Medicine inter-scorer reliability program: sleep stage scoring," J. Clin. Sleep Med., vol. 9, pp. 81-87, 2013.
[21] S.-F. Liang, Y.-H. Chen, C.-E. Kuo, J.-Y. Chen, and S.-C. Hsu, "A fuzzy inference system for sleep staging," in Fuzzy Systems (FUZZ), 2011 IEEE Int. Conf., 2011, pp. 2104-2107.
[22] E. Zhou and A. Khotanzad, "Fuzzy classifier design using genetic algorithms," Pattern Recognit., vol. 40, pp. 3401-3414, 2007.
[23] H. Ishibuchi, K. Nozaki, N. Yamamoto, and H. Tanaka, "Selecting fuzzy if-then rules for classification problems using genetic algorithms," IEEE Trans. Fuzzy Syst., vol. 3, pp. 260-270, 1995.
[24] L. Fraiwan, K. Lweesy, N. Khasawneh, M. Fraiwan, H. Wenz, and H. Dickhaus, "Classification of sleep stages using multi-wavelet time frequency entropy and LDA," Methods Inf. Med., vol. 49, p. 230, 2010.
[25] J. F. Gao, Y. Yang, P. Lin, P. Wang, and C. X. Zheng, "Automatic removal of eye-movement and blink artifacts from EEG signals," Brain Topogr., vol. 23, pp. 105-114, 2010.
[26] L. A. Zadeh, "Fuzzy sets," Inf. Control, vol. 8, pp. 338-353, 1965.
[27] A. D. Kulkarni, Computer vision and fuzzy-neural systems: Prentice Hall PTR, 2001.
[28] T. Takagi and M. Sugeno, "Fuzzy identification of systems and its applications to modeling and control," IEEE Trans. Syst. Man Cybern., pp. 116-132, 1985.
[29] J. H. Holland, "Genetic algorithms and the optimal allocation of trials," SIAM J. Comput., vol. 2, pp. 88-105, 1973.
[30] D. E. Goldberg and J. H. Holland, "Genetic algorithms and machine learning," Mach. Learn., vol. 3, pp. 95-99, 1988.
[31] J. Cohen, "A coefficient of agreement for nominal scales," Educ. Psychol. Meas., vol. 20, pp. 37-46, 1960.
[32] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, et al., "Physiobank, physiotoolkit, and physionet components of a new research resource for complex physiologic signals," Circulation, vol. 101, pp. e215-e220, 2000.
[33] L. Doroshenkov, V. Konyshev, and S. Selishchev, "Classification of human sleep stages based on EEG processing using hidden Markov models," Biomed Eng, vol. 41, pp. 25-28, 2007.
[34] C. Berthomier, X. Drouot, M. Herman-Stoïca, P. Berthomier, J. Prado, D. Bokar-Thire, et al., "Automatic analysis of single-channel sleep EEG: validation in healthy individuals," Sleep, vol. 30, p. 1587, 2007.
[35] F. Ebrahimi, M. Mikaeili, E. Estrada, and H. Nazeran, "Automatic sleep stage classification based on EEG signals by using neural networks and wavelet packet coefficients," in Proc. 30th Ann. Int. IEEE EMBS Conf., 2008, pp. 1151-1154.
[36] Y. Liu, L. Yan, B. Zeng, and W. Wang, "Automatic sleep stage scoring using Hilbert-Huang transform with BP neural network," in Proc. of Int. Conf. on the Bioinformatics and Biomedical Engineering (iCBBE), 2010, pp. 1-4.
[37] Y.-L. Hsu, Y.-T. Yang, J.-S. Wang, and C.-Y. Hsu, "Automatic sleep stage recurrent neural classifier using energy features of EEG signals," Neurocomputing, vol. 104, pp. 105-114, 2013.
[38] J. Hedner, D. P. White, A. Malhotra, S. Herscovici, S. D. Pittman, D. Zou, et al., "Sleep staging based on autonomic signals: a multi-center validation study," J. Clin. Sleep Med., vol. 7, p. 301, 2011.
[39] A. Flexer, G. Gruber, and G. Dorffner, "A reliable probabilistic sleep stager based on a single EEG signal," Artif. Intell. Med., vol. 33, pp. 199-207, 2005.
[40] M. Bruyneel, C. Sanida, G. Art, W. Libert, L. Cuvelier, M. Paesmans, et al., "Sleep efficiency during sleep studies: results of a prospective study comparing home-based and in-hospital polysomnography," J. Sleep Res., vol. 20, pp. 201-206, 2011.
[41] D.-W. Chang, Y.-D. Liu, C.-P. Young, J.-J. Chen, Y.-H. Chen, C.-Y. Chen, et al., "Design and Implementation of a modularized polysomnography system," IEEE Trans. Instrum. Meas., vol. 61, pp. 1933-1944, 2012.
