APPLICATION OF DEEP LEARNING METHODS IN BRAIN-COMPUTER INTERFACE SYSTEMS

SIAVASH SAKHAVI

NATIONAL UNIVERSITY OF SINGAPORE 2017

APPLICATION OF DEEP LEARNING METHODS IN BRAIN-COMPUTER INTERFACE SYSTEMS

Siavash Sakhavi (M.Sc., Sharif University of Technology; B.Sc., Amirkabir University of Technology)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING NATIONAL UNIVERSITY OF SINGAPORE

2018

Supervisors:
Assistant Professor Feng Jiashi, Main Supervisor
Professor Guan Cuntai, Co-Supervisor
Associate Professor Yan Shuicheng, Co-Supervisor

Examiners:
Associate Professor Ong Sim Heng, NUS
Professor Li Haizhou, NUS
Professor Damien Coyle, University of Ulster

DECLARATION

I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in this thesis. This thesis has also not been submitted for any degree in any university previously.

Siavash Sakhavi 12th January 2018

Acknowledgments

I would like to thank everyone who has contributed in some way to the work described in this thesis. First and foremost, I would like to thank my academic advisors, Professor Cuntai Guan, Associate Professor Shuicheng Yan, and Assistant Professor Jiashi Feng, for accepting me into their groups. During my time attached to Prof. Guan's group at the BCI Lab at A*STAR, he contributed to my graduate school experience by giving me intellectual freedom in my work, engaging me in new ideas, challenging my thinking, and demanding high-quality work in all my endeavors. During my short time at Dr. Yan's and Dr. Feng's Learning and Vision lab at NUS, the change in environment and topic challenged and pushed me to develop my skills and increase my domain knowledge.

Every result described in this thesis was accomplished with the help and support of fellow labmates and collaborators. From the BCI Lab, I would like to thank my labmates Dr. Atieh Bamdadian, Dr. Xinyang Li, Mr. Vikram Shenoy, Ms. Rooyi Feng, and Ms. Parastoo Fahimi for their input and discussions on various topics during our meetings or within the lab space. I would also like to thank many of the NBT department's staff who collaborated with me in more ways than one and made my experience at A*STAR a pleasant one: Dr. Kai Keng Ang, Dr. Huijuan Yang, Mr. Aung Aung Phyo Wei, Dr. Haihong Zhang, and many others. I would like to especially thank Dr. Ang for the code he provided to catalyze the progress of my work and for his advice and discussions. In high regards, I also thank Dr. Yang for giving me the opportunity to collaborate with her. From the Learning and Vision lab, I would like to thank the staff and students who heavily contributed to my work and provided a friendly environment for research and development: Dr. Yunchao Wei, Dr. Zequn Jie, Dr. Tam Nguyen, Ms. Quanhong Fu, Mr. Xiaojie Jin, and the other labmates. I would especially like to thank Zequn for letting me collaborate with him on his paper, and also Yunchao for his patience in answering my questions and his advice during my times of difficulty.

I would like to acknowledge the Department of Electrical and Computer Engineering (ECE) at the National University of Singapore and also the Department of Neural & Biomedical Technology (NBT) at the Institute for Infocomm Research (I2R) at the Agency for Science, Technology and Research (A*STAR). ECE provided many interesting and insightful courses which further enhanced my knowledge. A*STAR provided a high-end workstation and a very calm working environment that helped me focus on my work. I am grateful for the funding sources that allowed me to pursue my graduate school studies: the SINGA scholarship and residential assistance provided by A*STAR, and the Lee Foundation for their generous financial assistance with my tuition fee. Without these financial sources, I would not have been able to survive.

In the path to accomplishing my thesis, I have had many ups and downs; sometimes, more downs than ups. The moral support of my father and mother from afar, and of my uncles' families in Singapore, has helped tremendously, and I would like to thank them from the bottom of my heart. My dear wife, Nazlyna, and her amazing and kind family joined me closer towards the end of my PhD journey, but they have given me love and comfort as if they had always been there from the beginning. Friends and religious communities have constantly given me emotional and spiritual support throughout the journey and made it more tolerable. I would like to thank all of them: Sajjad M., Sajjad S., Hossein N., Hossein T., Sadegh, Behrooz, Amin, Reza K., Baqir, Qatsier, as well as everyone else, regarding whom if I were to write, it would be an entire thesis by itself. And finally, I would like to thank the one personality whom I have not met, yet I know, who has never failed to help and support me: Thank You.


Contents

Summary
List of Tables
List of Figures

1 A Review of Brain-Computer Interfaces (BCI)
  1.1 The Brain
  1.2 Acquisition and Processing of Brain Data
    1.2.1 EEG Signal Processing
  1.3 A Brief Introduction to BCI
  1.4 Motor Imagery BCI
    1.4.1 Classification of Motor Imagery Data
    1.4.2 Transferring Knowledge in BCI
  1.5 Challenges & Motivations

2 Machine Learning and Deep Learning for BCI
  2.1 A Brief Introduction to Machine Learning
    2.1.1 Machine Learning for BCI
  2.2 From Neural Networks to Deep Learning
    2.2.1 Modern Deep Learning
  2.3 Deep Learning for BCI
  2.4 Motivation & Thesis Organization

3 EEG Representation
  3.1 Preprocessing Methods Used for EEG Data
    3.1.1 Temporal Filtering
    3.1.2 Spatial Filtering
  3.2 BCI Competition IV-2a Dataset
  3.3 A Brief Review of EEG Representations
    3.3.1 Energy & Relative Energy
    3.3.2 Covariance Matrix
    3.3.3 Other Representations
    3.3.4 EEG Representations Used in Deep Learning
  3.4 Novel Temporal Features
    3.4.1 Filter-Bank Common Spatial Patterns (FBCSP)
    3.4.2 Temporal Representation from FBCSP
  3.5 A Quantitative Analysis of the Proposed Representation
  3.6 Conclusion

4 Learning Temporal Information for BCI using Convolutional Neural Networks (CNN)
  4.1 First Design: A Parallel Convolutional-Linear Neural Network
    4.1.1 A Mathematical Review of CNNs
    4.1.2 Architecture Design Details
    4.1.3 Classification Results
    4.1.4 Discussion
  4.2 Second Design: Channel-Wise Convolutions with a Channel Mixing Layer
    4.2.1 Parameter Selection via Cross-Validation
    4.2.2 Classification Results
  4.3 Analysis of the Trained Network
    4.3.1 Importance of the Learned Kernel Morphology
    4.3.2 Qualitative Analysis of Kernel Shapes
  4.4 Discussion & Conclusion
  4.5 Guideline

5 Deep Transfer Learning for Subject-to-Subject Transfer
  5.1 Introduction to Transfer Learning
    5.1.1 Transfer Learning in BCI Systems
  5.2 Motivation & Proposed Method
  5.3 Data Preparation
  5.4 Network Architecture for Deep Transfer Learning
  5.5 Results
    5.5.1 Subject Pool to Small Sample Transfer
    5.5.2 Subject Pool to Large Sample Transfer
    5.5.3 Single Subject Pool to Subject Transfer
  5.6 Conclusion & Guideline

6 Contributions & Future Work
  6.1 Contributions
  6.2 Publications
  6.3 Future Work
    6.3.1 Neural Network Integration of the Preprocessing
    6.3.2 Beyond Handcrafted Representations
    6.3.3 Regularization
    6.3.4 New Learning Methods and Architectures

Bibliography

Summary

Deep learning, as a branch of machine learning, has produced many successful methods and architectures, some of which are currently considered state-of-the-art in the areas of image classification, language processing, and text analysis. However, applications in new areas such as brain-computer interfaces and EEG classification have been limited. This thesis focuses on developing deep learning methods, optimizing deep architectures, and applying them to the area of EEG classification, more specifically motor imagery-based brain-computer interfaces.

First, we propose a classification framework for motor imagery data by introducing a new representation of the EEG data, obtained by extending the formulation of the well-recognized FBCSP method to include temporal information, and by systematically utilizing a convolutional neural network architecture for motor imagery BCI. The convolutional neural network has been designed to learn temporal information from the input signals. Our framework outperforms the best classification method in the literature on a four-class motor-imagery dataset by a significant seven percent increase in average classification accuracy. We have also analyzed and visualized the learned parameters of the network for a more in-depth understanding.

Finally, we extend the application of deep learning to transfer learning in brain-computer interfaces by training a convolutional auto-encoder on multiple subjects. The subject-independent model is successfully transferred to unseen subject data with different numbers of training samples. The classification accuracy results produced in this thesis are markedly higher relative to simple machine learning algorithms, even with a low number of training samples. The transferability of the network shows the potential for pre-trained networks to be used in EEG research and commercial BCI devices.


List of Tables

3.1 Cross-validation accuracy results using SVM for different representations
4.1 Classification results for parallel convolutional-linear network with comparison to baseline
4.3 Average confusion matrix for BCI competition IV-2a data over all subjects
4.4 A sample of the CW-CNN architecture
4.5 Parameters used for cross-validation of CNN architectures and their values to be selected
4.6 Cross-validation results for feature representation given the CW-CNN architecture in Table 4.4
4.7 Cross-validation results for different classifiers given the R1 feature
4.8 Cross-validation accuracy and corresponding test accuracy results for the convolution kernel in the C2CM architecture
4.9 Cross-validation accuracy and corresponding test accuracy results for convolutional hidden nodes in the C2CM architecture
4.10 Classification accuracies for CW-CNN using a 50-model ensemble with the selected parameters
4.11 Classification accuracies for C2CM using a 50-model ensemble with the selected parameters
4.12 Table of accuracy and kappa for baseline methods and our methods
4.13 Analysis of the effect of modifying the kernels of the first and second layer of the trained network
5.1 The parameters of the architecture used for Transfer Learning in this chapter
5.2 Average classification accuracy results for all subjects using 3 classifiers and 4 different sample sizes
5.3 Classification results for transferring multi-subject transfer with full use of the new subject's data
5.4 Accuracy results on the new subject's train data without re-training the model trained on the single subject pool
5.5 Accuracy results on the new subject's test data with re-training the last layer of the model trained on the single subject pool

List of Figures

1.1 The cortex and its components
1.2 The spatial and temporal resolution of some brain recording methods
1.3 Sample of EEG signals
1.4 Electrode positions on the scalp for 21 electrodes
1.5 Decomposition of the EEG signal into different frequency bands
1.6 General depiction of a Brain-Computer Interface
1.7 A typical SSVEP system
1.8 A sample of the P300 signal and the BCI speller
1.9 Sample ERD/ERS for a four-class motor imagery task
1.10 State-of-the-art algorithms for MI classification
2.1 The McCulloch-Pitts (MCP) neuron model
2.2 The perceptron model
2.3 The LeNet-5 model
2.4 The Restricted Boltzmann Machine
2.5 The AlexNet network
2.6 The different types of neural networks
3.1 Sample of differences between broadband and pass-band filtered signal
3.2 Time scheme in the BCI Competition IV-2a data set
3.3 A representation of EEG based on the STFT
3.4 Representation of EEG as a series of images
3.5 The novel representation proposed in this thesis
4.1 Our designed network for EEG classification
4.2 The three types of convolution possible to be implemented on any feature map
4.3 Visualization of sample architecture CW-CNN and C2CM
4.4 Sample signal visualization for network perceived input for significantly correlated channels from subject 9 and class 1
4.5 Sample signal visualization for network perceived input for significantly correlated channels from subject 9 and class 2
4.6 Sample signal visualization for network perceived input for significantly correlated channels from subject 9 and class 3
4.7 Sample signal visualization for network perceived input for significantly correlated channels from subject 9 and class 4
4.8 Correlation between perceived signal and average signal in class 1 for subject 9
4.9 Sample visualization for average class signals and network perceived input for all channels sorted based on p-value for subject 9
5.1 The different categories of transfer learning
5.2 The transfer learning pipeline proposed in this thesis
5.3 Different transfer learning scenarios
5.4 Results of the average classification accuracy for different classifiers and sample sizes

List of Symbols


List of Abbreviations

MI      Motor Imagery
EEG     Electroencephalogram
BCI     Brain-Computer Interfaces
CSP     Common Spatial Patterns
ERD     Event-Related Desynchronization
ERS     Event-Related Synchronization
FBCSP   Filter-Bank Common Spatial Patterns
SSVEP   Steady-State Visual Evoked Potential
ML      Machine Learning
DL      Deep Learning
CNN     Convolutional Neural Network
MLP     Multi-Layer Perceptron
SVM     Support Vector Machine
RNN     Recurrent Neural Network
LSTM    Long Short-Term Memory
FCN     Fully Connected Network
RBM     Restricted Boltzmann Machine
CV      Computer Vision
NLP     Natural Language Processing
C2CM    Channel-Wise Convolution with Channel Mixing
CW-CNN  Channel-Wise Convolutional Neural Network

Chapter 1
A Review of Brain-Computer Interfaces (BCI)

The primary objective of this chapter is to introduce the reader to the world of brain-computer interfaces (BCI). Because this thesis combines BCI and deep learning algorithms, we understand that some readers might not be familiar with both fields. Therefore, in this chapter, we will give a very brief introduction to BCI systems, and in the next chapter, we will provide an introduction to machine learning algorithms and deep learning methods.

1.1 The Brain

The brain is truly one of the most fascinating organs of the body. When we are born, this bundle of connected cells has an initial structure. It starts learning from the environment, changes its structure to adapt, and develops itself along the way. The development of an infant's brain from birth until the age of five is by itself amazing: a child's brain starts by processing audio and then proceeds to learning how to speak and understand multiple languages [1, 2]; the visual system develops, starting from recognizing faces, to sensing colour and light, to perceiving depth, to understanding social interactions, and to object permanence (tracking objects even when they disappear from view) [2]; motor skills develop from movement of the trunk and upper body, to gross motor skills (movement of the arms and legs), and to fine motor skills (e.g. wrist and finger movement). These feats of human development, by themselves, make the brain a unique organ.

Even now, we do not have a detailed grasp of how the brain functions as a whole. Although neuroscientists and neurophysiologists carry out experiments to pinpoint a particular function of the brain or to analyze the structure and connectivity between different parts of the brain, it is hard to come up with a model that puts all of this research together. This difficulty is due to the different conditions under which the experiments are carried out and to some of the more complex interactions going on in the brain. Nevertheless, this complexity has not resulted in a standstill of research on the brain and its functions. We must know how the brain functions and how it is structured to understand the difference between a normal and an abnormal brain. To do so, we have to choose the right tools to view the brain.

1.2 Acquisition and Processing of Brain Data

To understand what information can be recorded from the brain, we first have to understand its anatomical and physiological structure. Concisely, the brain is a bundle of interconnected cells. These cells, contrary to popular belief, are not only the electrically active neurons but also include glial cells (around 80 percent of brain cells), such as astrocytes and oligodendrocytes. These cells, like those of other organs of the body, need blood to function correctly. Depending on the location of activity in the brain, the blood flow in that area will also change.


The electrical activity in the brain arises from the firing of its interconnected neurons. The brain's neural tissue can be divided into two main types: gray matter and white matter. The gray matter, the outermost layer of the brain, is the main component of the cortex. The cortex is where most processing of sensory input is carried out. The cortex can further be divided into different lobes: temporal, frontal, occipital, and parietal. Each of these lobes has been associated with various functions of the human body. For example, the occipital lobe is well known to be the visual processing center of the brain [3], whereas the frontal lobe is in charge of attention [4]. Some of the lobes can be further divided to pinpoint functions of the brain. For example, the somatosensory cortex (part of the parietal lobe) is in charge of processing the sense of touch and the sense of position and movement. A visualization of the different parts of the brain can be seen in Figure 1.1.

Fig. 1.1. The cortex and its components (Wikipedia).

When it comes to the nature of the data collected from the brain, there are two major parameters: the spatial and temporal resolution of the data. Spatial resolution refers to the spatial proximity of data recording points, while temporal resolution refers to the sampling frequency. A view of different data recording modalities and their spatial and temporal resolution can be seen in Figure 1.2.

Fig. 1.2. The spatial and temporal resolution of some brain recording methods [5].

Here, we will give examples of two different data recording modalities. The first modality is functional Magnetic Resonance Imaging (fMRI). This recording technique captures “snapshots” of the structure of the brain at certain time intervals, with a focus on the change of blood flow in different areas. MRI, as an imaging technique, manages to capture structural information from the brain based on differences in tissue properties, but at the cost of being very time-consuming. When it comes to capturing activity, however, long acquisition times are not an option. Therefore, in fMRI, the image quality is not as high as in structural MRI, but it is still sufficient for identifying changes of blood flow in the brain. This modality is usually used in research where a relatively good spatial resolution for a slow task is needed.

The second modality is electroencephalography (EEG). The EEG is a multivariate time signal recorded from electrodes placed on the scalp. These electrodes capture the electrical activity of the brain which has propagated from the cortex to the scalp. This modality can have very high temporal resolution, depending on the sampling rate used for data collection, but has limited spatial resolution. The low spatial resolution of EEG is mainly due to the summation of activity from all sources inside the brain.

The reason we have given these two examples is to show how recording modalities focus on physiological or physical phenomena (or both) taking place in the brain. Newer imaging modalities and techniques have given insights into the brain's information unlike before. For example, Diffusion Tensor Imaging (DTI) has equipped researchers with a tool to explore the brain's connectivity. Also, increases in fMRI resolution and speed have assisted in capturing more detailed information regarding the spatial and temporal functions of the brain. Data storage and management have changed rapidly in the past 10 years, leading to the storage of more data. Machine learning algorithms and tools have been developed and are used to assist healthcare experts with visualization and diagnosis tools.

Other than spatial and temporal resolution, invasiveness and cost are also two important factors that must be taken into consideration when selecting a modality for brain research or tools. Amongst the different modalities, the EEG signal is one of the conventional choices. The hardware needed to record EEG is much more cost-effective, and because it can have a high sampling frequency, it can be used in scenarios with real-time tasks. From this point onward, we will mostly be discussing applications of EEG signals in research, medical, and commercial settings.

1.2.1 EEG Signal Processing

The electroencephalogram (EEG) is a multi-channel time series which captures the electrical activity of the brain recorded from specific locations (electrodes) on the scalp. A sample of EEG from multiple electrodes can be seen in Figure 1.3, and the electrode positions based on the 10-20 standard for 21 electrodes are shown in Figure 1.4. EEG signals are recorded during resting state, during the performance of a task by the subject, or when an external stimulus is presented to the subject.

Fig. 1.3. Sample of a multi-channel EEG recording (PhysioNet [6]).

Fig. 1.4. Electrode positions on the scalp for 21 electrodes based on the 10-20 system.

In addition to the spatial location of the electrodes and the number of time samples of the EEG signal, another important characteristic of the EEG signal is its frequency content. Figure 1.5 shows several of the relevant frequency bands used in EEG processing.

Fig. 1.5. Decomposition of the EEG signal into different frequency bands [8]. Each frequency band has been shown to correlate with multiple functions of the brain.

The reason the frequency bands are important is that during specific tasks, it is not the amplitude of the whole signal which changes but rather the amplitude of a particular frequency band. For example, when we close our eyes, the amplitude of the alpha component of the EEG increases [7]. The increase or decrease of the amplitude at a particular frequency is, in signal processing, denoted as amplitude modulation. Amplitude modulation has two elements: the carrier, c(t), and the envelope, e(t). The carrier is a sinusoidal signal at a particular frequency, c(t) = sin(2πft + φ), and the envelope is the information which is reflected in the carrier's amplitude and usually has a much lower frequency than the carrier. Amplitude modulation is carried out by multiplying the envelope by the carrier, e(t)c(t). Demodulation is the act of extracting the envelope from the modulated signal and can be carried out in various ways. We will be using demodulation, or envelope extraction, in this thesis as a way of extracting meaningful features from the EEG signal. In the next section, we describe the use of EEG signals in brain-computer interfaces (BCI).
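As a brief aside, the envelope extraction described above can be sketched in a few lines. The following is a minimal illustration, assuming a single EEG channel sampled at a rate fs; the band edges, the choice of a Butterworth filter plus Hilbert transform, and all variable names are illustrative rather than the specific pipeline used later in this thesis.

```python
# A minimal sketch of band-limited envelope extraction (demodulation) for one
# EEG channel. The band edges and the Butterworth + Hilbert combination are
# illustrative choices, not the exact method used in this thesis.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def band_envelope(x, fs, low=8.0, high=13.0, order=4):
    """Band-pass filter a 1-D signal and return its amplitude envelope."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    x_band = filtfilt(b, a, x)        # e.g. the alpha-band component of the EEG
    return np.abs(hilbert(x_band))    # envelope via the analytic signal

# Toy usage: a 10 Hz carrier modulated by a slow envelope, as in the text.
fs = 250.0
t = np.arange(0, 4, 1 / fs)
envelope = 1 + 0.5 * np.sin(2 * np.pi * 0.5 * t)   # e(t), slow
carrier = np.sin(2 * np.pi * 10 * t)               # c(t), alpha-range carrier
recovered = band_envelope(envelope * carrier, fs)  # approximates e(t)
```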


1.3 A Brief Introduction to BCI

Brain-computer interfaces (BCI) have had many definitions throughout the years. In a more recent definition, a BCI system is defined as “a combination of hardware and software that allows brain activities to control external devices or even computers” [9]. A sample of the structure of a BCI system with its different elements can be seen in Figure 1.6. First, the task in which the subject participates must be designed accordingly. Although it might seem trivial, task design is one of the most rigorous processes in developing a good BCI system. The task must be designed so that it isolates the intended outcome from the subject. At the same time, careful instructions must be given to the subject to guide his physical and mental conditions and to ask him to perform the task in a certain way. Second, the recording modality, or, in other words, the type of data, must be specified. The next stage is the cleaning and pre-processing of the data. For example, in fMRI, sequences of images of the brain may not be aligned because of movement, and therefore motion correction algorithms must be applied to align the data. In EEG, artifacts caused by blinking, eye movement, the electrocardiogram (ECG), electrical noise, ground noise, muscle movement, etc., should all be removed or corrected before using the data. In addition to artifact removal, spectral and spatial filtering are also carried out to pinpoint the information content of the signal and remove correlation between the electrode channels (this will be discussed in detail in Chapter 3). Eventually, after preparing the data, a learning algorithm is used to learn from the data and produce an output which can be used in tasks or used as feedback to the subject.

Fig. 1.6. A sample of how a BCI system works [10]. This figure shows a closed-loop system in which a task is presented to a user. Data is recorded from the user during the task, and after pre-processing and feature extraction, a machine learning algorithm is used for classification/regression. The output of the machine learning algorithm can then be used for control and/or for changing the task presented to the user.

In the context of EEG, signals can be divided into two types: exogenous and endogenous. An exogenous signal is a signal recorded from the brain when a subject is presented with external stimuli. An endogenous signal, however, refers to a signal whose source or cause of generation is not an external stimulus but rather the intention of the subject. Regardless of whether the signal is exogenous or endogenous, a BCI system tries to interpret the recording, understand the intention or the state of the subject, and then take action based on what it has understood.

An example of an exogenous signal produced by external stimuli is the Steady-State Evoked Potential (SSEP). When a subject is stimulated with an external stimulus at a particular frequency, a peak can be seen in the frequency content of the EEG signal at that frequency and its harmonics. For example, if a blinking light is presented to a person and the person looks at the light continuously, a peak can be seen in the frequency content of the EEG signal in the occipital electrode channels at the given frequency. This frequency response in the visual domain is called the Steady-State Visual Evoked Potential (SSVEP). Figure 1.7 graphically shows how SSVEPs can be seen in the EEG frequency response. SSVEPs have been used for BCI spellers [11] and also in the field of vision analysis in glaucoma patients [12]. Other types of SSEPs can also be produced, such as auditory (SSAEP) and somatosensory (SSSEP).
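To make the frequency-peak idea concrete, the sketch below scores a single occipital channel by the power at an assumed 7 Hz stimulation frequency and its harmonics (as in Figure 1.7). This is only a minimal illustration of the reasoning above; the function names, frequencies, and the simple FFT-based power estimate are assumptions, not a method proposed in this thesis.

```python
# A minimal sketch of SSVEP detection for one EEG channel sampled at fs Hz.
import numpy as np

def band_power(x, fs, f_target, half_width=0.5):
    """Power of the signal in a narrow band around f_target (Hz)."""
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2
    mask = np.abs(freqs - f_target) <= half_width
    return psd[mask].sum()

def ssvep_score(x, fs, stim_freq=7.0, n_harmonics=2):
    """Sum of power at the stimulation frequency and its harmonics."""
    return sum(band_power(x, fs, stim_freq * k) for k in range(1, n_harmonics + 1))

# Comparing ssvep_score across candidate stimulation frequencies gives a very
# simple frequency classifier of the kind used in SSVEP spellers.
```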

Fig. 1.7. A typical SSVEP system. A stimulus is shown to a subject, and the frequency of stimulation and its harmonics will peak in the EEG frequency response [13]. In this image, the stimulus is at 7 Hz.

There is also an endogenous signal that can be used for the BCI speller, which is called the P300 signal. The P300 is a positive peak in the EEG signal that can be seen 300 milliseconds after a stimulus is presented. Instead of showing the subject a stimulus with a specific frequency, a random stream of stimuli is presented to the subject. In the mind of the subject, some of these stimuli are relevant, and some of them are irrelevant. For example, in a BCI speller, all rows and columns containing letters light up randomly. However, the P300 response of the brain occurs when the row or column of the subject's intended letter lights up. It is interesting that this is considered an endogenous signal rather than an exogenous one, because it is not sensitive to the type of the stimulus but rather to what the subject is attending. A sample of the signal morphology of the P300 and the BCI-speller paradigm can be seen in Figure 1.8.

Fig. 1.8. A sample of the P300 signal (left) and the BCI speller (right) [15]. Each column and row of the speller is lit randomly several times and, based on the focus of the user, the P300 can be seen in the recorded EEG.

The P300 signal has other applications as well, such as the P300-based lie detection paradigm [14]. For more detailed information regarding different types of exogenous and endogenous signals and the tasks related to them, we refer the readers to [16, 17]. In the next section, we will describe in detail one of the widely used endogenous signals in BCI systems: motor imagery.

1.4 Motor Imagery BCI

Motor Imagery (MI) is defined as the “mental rehearsal of movements without the movements being executed” [18]. An example of MI would be the imagination of hand movement versus its actual execution. When movement is imagined, the activity in the brain is similar to that of real movement (i.e. activation of the motor cortex). BCI systems have utilized MI to control the action of devices such as wheelchairs, or objects on a monitor (e.g. a cursor or avatar), for patients that have Amyotrophic Lateral Sclerosis (ALS) or have limited physical interaction with the outside world. These patients have benefited from BCI systems by gaining a higher quality of life [10]. These systems have also been used for improving the rehabilitation process of patients that have been impacted by stroke [19, 20, 21, 22].

In MI-BCI systems, there are two main aspects: what the subject is doing, and how the computer is interpreting the input signals. In a typical MI task, a subject focuses on a screen where a cue for MI is shown to him. The subject is then required to perform motor imagery based on the shown direction. When a subject is told to do MI of the left or right hand, nobody other than the subject knows how he is performing the act of imagination. Although it is possible to control this action using a questionnaire [18] or to come up with smart ways of conducting the task [23], the pacing of MI is subject-dependent. This fact makes interpreting the data recorded from a subject a challenging problem: not only may one trial of a subject be different from another trial in the same setting, but some subjects might not be able to perform the MI task at all, which is termed BCI illiteracy [24, 25].

Nevertheless, there have been many attempts at understanding the underlying mechanisms of MI for application in BCI systems. Event-related desynchronization/synchronization (ERD/ERS) in the motor cortex is the temporal activity associated with MI which can be seen in the EEG of a subject when performing an MI task. ERD is defined as the relative difference of energy before and after the cue for MI execution [26] and is equivalent to the subject modulating the amplitude of the recorded EEG signal during an MI task. A sample ERD/ERS map can be seen in Figure 1.9 for four different MI tasks. Based on the figure, in left/right-hand MI a consistent pattern of energy change can be seen, whereas in feet/tongue MI a dynamic pattern of energy can be seen (see [27] for a detailed analysis). This difference in characteristics must be taken into consideration for feature extraction and classifier design.

1.4.1 Classification of Motor Imagery Data

Fig. 1.9. Sample of ERD for a subject performing MI of the right hand, left hand, feet, and tongue on channels C3, Cz, and C4. The red colour shows high ERD levels relative to the EEG before the cue [28]. The vertical axis is the frequency of the signal and the horizontal axis is time starting from the cue.

The phenomenon of ERD/ERS observed during MI has inspired computer scientists and BCI researchers to propose classification methods such as the Common Spatial Patterns (CSP) algorithm [29]. The CSP algorithm finds a set of linear transformations on the electrode (channel) space (i.e. spatial filters) that maximizes the “distance” between two classes of data recorded during an MI task. It does so by maximizing the following equation:

$$ w^{*} = \arg\max_{w} \; \frac{w^{T}\,\Sigma_{c_1}\,w}{w^{T}\,(\Sigma_{c_1} + \Sigma_{c_2})\,w} \qquad (1.1) $$

where Σ_c1 and Σ_c2 correspond to the channel (electrode) covariance matrices for classes c1 and c2, respectively, in a specified time segment, and w is the spatial filter. This objective function, also called the Rayleigh quotient, has an analytical solution which is equivalent to solving the Generalized Eigenvalue Decomposition (GEVD) problem Σ_c1 w = λ(Σ_c1 + Σ_c2)w. The solution of the GEVD is a set of eigenvectors whose number corresponds to the dimension of the covariance matrix. The eigenvector corresponding to the highest eigenvalue maximizes Equation 1.1. In practice, more than one eigenvector is selected, and eigenvectors with high eigenvalues are paired with eigenvectors with low eigenvalues to boost the performance of the algorithm for classification purposes. After estimating the spatial filters, the energy of each spatially filtered channel is computed, divided by the energy of all the channels, and taken as the representation of the data. The dimension-reduced EEG data can easily be fed into a linear classifier such as Naive Bayes or SVM and has been shown to have good performance [30]. The caveat of such a representation of EEG is that it reduces the signal from a time series into a single value, which results in losing the temporal information.

The Filter-Bank CSP algorithm (FBCSP) [31, 32] is an extension of the CSP algorithm which, in addition to optimizing a linear combination of the channels, also looks at the frequency content of the signals containing task-related and discriminative information. By passing the signal through multiple pass-band frequency filters, also known as a filter bank, the CSP-based energy features are computed for each of the filter-bank outputs. The features are then selected via a mutual information-based algorithm and classified using SVM. The extra step of performing CSP on each filtered input helps boost performance and shows the benefit of signal decomposition before spatial filter estimation as an important preprocessing stage for classification. FBCSP is not the only successful attempt to improve the CSP algorithm: Sparse CSP [33] adds a regularization factor to the spatial filter estimation, imposing sparsity on the spatial filter values. Stationary CSP [34], divergence-CSP [35], probabilistic CSP [36], and Bayesian machine learning [37] each try to solve the CSP problem either by changing the objective function (i.e. using KL or Beta divergence) or by defining a more generalized computational framework around the problem (i.e. variational Bayes). Some algorithms have also tried to optimize frequency and spatial filters simultaneously, such as [38, 39]. Although they have individually improved the CSP algorithm and increased the classification accuracy, they still suffer from the same caveat as the original CSP method: the negation of temporal dynamics.

Although it is the dominant method, the CSP algorithm is not the only algorithm used for MI classification. The Riemannian geometry (RG) approach, which has gained popularity in recent years, also tackles the problem of MI-EEG classification by classifying the data in the channel covariance space rather than solely using the energy of spatially filtered channels [40, 41, 42, 43]. RG can be viewed as a generalization of the CSP algorithm but, yet again, does not use temporal information. Prior to the work in this thesis, the state-of-the-art algorithm reported on the BCI Competition IV-2a data, a benchmark dataset for motor imagery classification, was the algorithm proposed in [43], which utilizes the RG method with subspace optimization. A summary of its results and other competing results can be seen in Figure 1.10.

Fig. 1.10. State-of-the-art algorithms for MI classification reported in [43].
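As a rough illustration of the CSP pipeline described above, the following sketch estimates class covariance matrices, solves the GEVD corresponding to Equation 1.1, and forms normalized log-energy features from the spatially filtered trials. The helper names, the trace normalization, and the number of selected filter pairs are illustrative assumptions rather than the exact configuration used in this thesis; in FBCSP the same steps would simply be repeated for each filter-bank output.

```python
# A minimal CSP sketch: class covariances, GEVD of Eq. 1.1, log-energy features.
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_c1, trials_c2, n_pairs=3):
    """trials_cX: arrays of shape (n_trials, n_channels, n_samples)."""
    cov = lambda trials: np.mean(
        [t @ t.T / np.trace(t @ t.T) for t in trials], axis=0)
    S1, S2 = cov(trials_c1), cov(trials_c2)
    # GEVD: S1 w = lambda (S1 + S2) w; eigh returns eigenvalues in ascending order.
    vals, vecs = eigh(S1, S1 + S2)
    # Pair the filters with the largest and smallest eigenvalues.
    idx = np.concatenate([np.arange(n_pairs),
                          np.arange(len(vals) - n_pairs, len(vals))])
    return vecs[:, idx].T                     # shape (2 * n_pairs, n_channels)

def csp_features(trial, W):
    """Normalized log-variance of the spatially filtered channels of one trial."""
    z = W @ trial                             # spatially filtered signals
    var = z.var(axis=1)
    return np.log(var / var.sum())            # fed to a linear classifier (LDA/SVM)
```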

1.4.2 Transferring Knowledge in BCI

In much BCI research, data is recorded from a subject and then used to train a classifier. Due to the duration of the experiments, the number of samples recorded can be considerably lower than the dimension of the data. Multiple sessions of data are recorded at different times to compensate for the lack of data. Additionally, to prove the robustness of a method/task, data is also recorded from several subjects. These facts yield two main challenges for BCI systems: session-to-session transferability and subject-to-subject transferability.

In the session-to-session transferability problem, there are usually two scenarios. In the first, we have data from one or multiple sessions of a task from one subject, and we wish to transfer the knowledge gained from all sessions to a newly recorded session. In the second scenario, we have multiple sessions of a certain simple task, and we need to use this data for a second, more complicated, related task instead of recording data for this second task. In subject-to-subject transferability, multiple sessions from various subjects have previously been recorded, and knowledge must be transferred from the data of multiple subjects to one another.

Knowledge transfer is crucial in the context of BCI due to the limitation of time. Recording one session of a task is time-consuming and also tiring for the subject. These two factors mean that only a limited number of trials per session can be recorded. As an example, the BCI Competition IV-2a dataset, which will be used in this thesis, records 288 samples of data, each in a 7 s interval. This means every session takes around 30 minutes of continuous imagery, which can be very tiring for the subject.

Researchers have tackled the problem of transferability by adapting methodologies from a field of machine learning called transfer learning. As an example of subject-to-subject transfer, sensor space transfer [44] has been proposed, which focuses on transferring the data space of multiple subjects from the electrodes on the scalp to the sources in the brain by estimating their relationship. Composite CSP [45] considers a weighted sum of covariance matrices estimated from multiple subjects as a way of transferring knowledge to a new subject. Adaptation techniques have also been used to reduce the calibration time [46]. Session-to-session variability problems are usually tackled either by estimating a better spatial filter using a better loss function [45, 35] and regularization [30], or by using a small set of the new session to adapt the methodology trained on a previous session [47, 48, 49].

1.5 Challenges & Motivations

As discussed in the previous sections (and also stated in [17]), when designing a BCI system based on EEG, several challenging properties of the EEG signal should be taken into consideration:

• The signal-to-noise ratio of EEG signals is low, and in many cases the data is contaminated by artifacts such as movement or blinking.

• As stated in Section 1.4.2, only a small number of samples is available for BCI research, due to time-consuming preparation and the intensive tasks the subjects are asked to perform.

• EEG is high dimensional, since it is recorded from multiple electrodes at a certain sampling frequency. This high dimensionality, accompanied by the small sample size, results in poor generalization and overfitting of machine learning models. For example, considering even 5 channels of EEG recorded for a period of 2 s at a sampling rate of 100 Hz, 1000 time points are already generated (the raw representation of the EEG signal in this interval), a number which is higher than most sample sizes of EEG datasets, thereby posing a problem when designing a well-performing classifier.

• The activity of the brain is a function of time, and it varies from subject to subject, session to session, and trial to trial. This variability makes it challenging for the elements of the classification methodology to be robust [50, 51].

• In some experiments, such as experiments that invoke the P300 signal, a change in the signal can be seen in a particular time interval after the stimulus is presented. However, in experiments where subjects control their activity at their own pace, as in motor imagery, the exact time at which a subject performs the task is unknown. It is therefore incumbent to design a system that can accept temporal inputs and is invariant to the timing of the subject's activity.

These characteristics of the EEG make two main elements of the machine learning systems used for BCI critical: feature extraction and the classification algorithm. Feature extraction mainly focuses on finding a suitable representation of the data that is lower in dimension than the raw EEG but at the same time preserves the information required to discriminate different classes of data. The classification algorithm then receives this representation and, based on its design, learns to infer the label of unseen data from the training data. It should be noted that these two aspects of the system (feature extraction and the classification algorithm) are intertwined. For example, if the classification algorithm can slide over the data contents and find patterns (as in template matching), the data representation must be such that it provides this option to the classifier. Therefore, when designing the representation of the data, the classification algorithm must be taken into account.

The primary motivation of this thesis is to improve classification accuracy in motor-imagery BCI systems by providing a low-dimensional representation of the EEG data which retains the time information, and by designing a suitable classification algorithm which is robust to the variability of sessions and also invariant to the onset of task execution by the user. Considering the high dimensionality of the EEG signal, if the raw representation of the EEG signal is not reduced and is directly fed into a classifier, without enough data samples available the machine learning algorithms will surely overfit. Current motor imagery classification methodologies, although performing well, represent the EEG channels as a single value (e.g. most CSP-based methods) and/or neglect the temporal information of the signal when extracting the features (e.g. Riemannian features [40, 43]). This negligence towards the change of energy, in other words the dynamicity of the temporal characteristics, was a motivation for this thesis to adopt an algorithm that takes this aspect of EEG signals into account. With this in mind, using a more complex representation of the signal also motivates us to use more complex classifiers, giving us the reason to approach “Deep Learning” methods and, more precisely, “Convolutional Neural Networks”, which will be described in the next chapter (Chapter 2).

As an extension of the above solution, to tackle the challenge of small samples, we propose transfer learning frameworks based on deep learning methodology. The design of complex classifiers with the ability to capture more information from the input can also be extended to learn from multiple subjects simultaneously. The acquired knowledge can then be transferred to a new subject, reducing the time and data needed to train the entire classifier on the new subject from scratch. This will be discussed in depth in Chapter 5.


Chapter 2
Machine Learning and Deep Learning for BCI

When children are born, the amount of knowledge and information that they have about the world is close to zero. They have not seen anything, smelt anything, listened to anything (except the sounds close to the womb), or uttered a single word. But the amazing thing about children is that they develop and learn. They learn to recognize their parents' faces, learn to respond to their names when called and to identify the speaker, learn to speak in multiple languages, and discover many more skills in the first few years of their life. Observing this astonishing accomplishment of the human brain, there is a question that has continually been asked by scientists in various fields of study: How does the brain learn? and, consequently, what does it learn?

Mimicking the abilities of the brain has applications in areas where there is a need for such skills but a lack of human resources. For example, in computer vision, the principal focus is teaching a computer how to identify objects, people, textures, etc. in situations where not enough manpower is available to do so. In natural language processing (NLP), a computer has to understand speech content from millions of users, identify the context of a text containing billions of words, extract sentiment and emotion from billions of pieces of content, and, if needed, reply to a question posed by multiple users or generate news and information by viewing just a few samples. In both of these applications, an input is given to a computer and, after analysis, an output is produced which can then be used for the application at hand.

The visual and auditory sensory inputs of the brain, the eyes and the ears, and also the auditory and visual processing locations in the brain, are different from each other, and both send information to the cognitive parts of the brain for understanding. This difference in input and processing is a crucial clue in designing systems for computer vision and NLP: understanding the data is as important as processing it. If a system does not consider the underlying properties of the input, even if the processing is high-end, the output is useless; in other words, the eyes cannot “hear” an object, and the ears cannot “see” a sound. In the body's visual system, an object first passes through various structures of the eye (cornea, lens, vitreous humour) and is then mapped onto the retina. The retina, having receptors sensitive to light and color, generates neural activations which are sent to the brain via the optic nerve. These neural activities pass through different locations in the visual pathway and are decomposed based on various aspects of the object that has been seen: color, aspect ratio, boundaries, shades, etc. Each level of the visual pathway is different from the one before, and the information is derived hierarchically. After information has been extracted via the optical pathway, it is passed to the other parts of the brain for understanding and cognition. The auditory system has its own processing pathway, which is entirely different from the visual pathway. If the underlying characteristics of an image are not extracted via the visual processing system, the brain cannot comprehend the image.


By observing how the brain processes information, several approaches can be taken to mimic its abilities. One approach is to understand what kind of information the brain extracts and then try to extract similar information from an input. In the terminology of machine learning and computer science, this approach can be described as extracting hand-crafted features from an input and then using these features for various tasks such as classification. This approach was, for many years, the go-to method for computer vision and NLP. Another way of mimicking the brain is to design a system which is comprised of the same or similar elements and architectures as the brain and to let the system learn by itself via a learning algorithm. This approach has become the leading method in computer vision, NLP, and other machine learning research, and has produced state-of-the-art results which have led to exciting applications. In the following sections, we will delve deeper into machine learning and its now-popular sub-field, deep learning.

2.1 A Brief Introduction to Machine Learning

Machine learning has been defined as “a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty” [52]. In other words, methods that can learn from data and, after the process of learning, detect information regarding the data can be categorized as machine learning algorithms. Depending on how data is fed to an algorithm, machine learning can be categorized into two main sub-categories: supervised and unsupervised learning.

In supervised learning, the dataset D has the structure D = {(x_i, y_i)}_{i=1}^N, where x is the input, y is information regarding the input, which is usually a label (or multiple labels), and N is the number of samples in the dataset. The algorithm learns a model or decision function, F, that maps between the input and the label, y = F(x). This function is later used to predict the labels of unknown inputs or to give the probability of an input belonging to a certain label, p(y = c|x), which is called the class posterior in machine learning nomenclature. Decision making in classification settings is usually carried out using the maximum a posteriori (MAP) estimate: ŷ = arg max_c p(y = c|x). Supervised algorithms, based on the way they estimate the class posterior, can also be categorized into discriminative and generative models. Generative models focus on modelling the distribution of the data given the label of the data, p(x|y = c), called the class-conditional density, and, by using the class prior, p(y), predict the class posterior using Bayes' rule:

$$ p(y = c \mid x) = \frac{p(y = c)\, p(x \mid y = c)}{\sum_{c'} p(y = c')\, p(x \mid y = c')} \qquad (2.1) $$
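A toy numeric reading of Equation 2.1 and the MAP rule is sketched below, assuming a simple generative model with known class priors and one-dimensional Gaussian class-conditional densities; all numbers and names are made up for illustration.

```python
# A toy illustration of Equation 2.1 (class posterior) and the MAP decision.
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.5])                        # p(y = c)
means, stds = np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def class_posterior(x):
    likelihoods = norm.pdf(x, loc=means, scale=stds)  # p(x | y = c)
    joint = priors * likelihoods                      # numerator of Eq. 2.1
    return joint / joint.sum()                        # p(y = c | x)

x = 0.3
posterior = class_posterior(x)
y_hat = int(np.argmax(posterior))                     # MAP decision
```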

In contrast, discriminative models do not estimate the class-conditional density and directly find the class posterior. Examples of widely used generative models are the Naive Bayes classifier and Linear Discriminant Analysis (LDA). As examples of discriminative models, one can list Support Vector Machines (SVM) and neural networks. Based on [52], both generative and discriminative approaches to supervised learning have their pros and cons. For example, generative models should be used in cases where a parametric model exists that can be fitted to the data. Discriminative models, however, are data-driven and fit themselves to the data without any prior probability distribution, rendering them more time-consuming but more accurate relative to their generative counterparts. Unsupervised algorithms, on the other hand, do not have the luxury of labels or additional information about the data: D = {x_i}_{i=1}^N.

Difficulty or cost in acquiring information regarding the data is the main reason. Therefore, unsupervised models use the data as they are and fit a model to estimate the distribution of the input, p(x). Usually, the models have parameters which are either optimized or manually tuned based on a “goodness of fit” or based on constraints. For example, Principal Component Analysis (PCA) finds sub-spaces of the data in which the variance is maximized or minimized, and K-means clustering defines a series of points (centroids) in the data sample space and groups the data based on the proximity of each sample to the centroids. Other learning categories defined in machine learning can be viewed as combinations of different scenarios of supervised and unsupervised learning. For example, semi-supervised learning is a situation in which unlabeled data is used to aid a supervised learning case, and active learning [53] is a scenario in which it is decided whether unlabeled data should be sent to an oracle to be labeled or labeled based on a pre-existing model trained on similar labeled data. Multi-instance learning [54] defines a problem where one sample of the data is comprised of multiple data sub-samples or “instances”, and supervised or unsupervised algorithms must infer the sample label based on these underlying instances.

2.1.1 Machine Learning for BCI

Since the introduction of BCI systems, machine learning has played a crucial role in classification. Supervised algorithms such as SVM (with different kernels) [45, 33, 55, 56, 43], LDA [34, 40, 57, 35, 44, 43], sparse Bayesian methods [58], and logistic regression [59] have been used for classifying MI-EEG signals, with SVM and LDA being the most popular algorithms because of their simplicity and fast training [17]. These classifiers are often trained on the energy of the signals after spatial filtering or on covariance-based Riemannian features. More information regarding the types of features used in BCI is presented in Chapter 3.

2.2 From Neural Networks to Deep Learning

Neural Networks, as the name implies, are algorithms initially inspired by neurons and the way neurons transmit information. At neural connections, known as synapses, the axons of multiple cells connect to the dendrites of a single cell. For the axon of that cell to be activated and send an action potential (neural signal), the cumulative excitation and inhibition arriving from the dendrites must weigh towards activation, and the level of activation must be above a certain threshold for the axon to fire. This simple model of neuron activation inspired McCulloch and Pitts [60] to propose the MCP neural model (Figure 2.1). In the model, if the weighted sum of the elements of the input, \sum_i w_i x_i, is above a threshold level, the output is one; otherwise, the output is zero. In the original MCP model, an inhibitory input is also considered as a constraint on whether the activation takes place.

Fig. 2.1. The McCulloch-Pitts (MCP) neuron model.

After the MCP neural model, Rosenblatt [61] proposed the perceptron model (Figure 2.2). This model was similar to the MCP model but was mostly inspired by the visual system [62] and first introduced a nonlinear


activation function in the neural model. Instead of a hard threshold and zero-one output, the output was defined as y = f(\sum_i w_i x_i + b), where b is the threshold (bias) term and f is the sigmoid function, f(t) = \frac{1}{1 + e^{-t}}. For a given dataset with binary labels, the perceptron model can be used to derive a separating hyperplane in high-dimensional space; consequently, the perceptron was only useful in cases where the data was linearly separable [63]. Interestingly, after the perceptron was analyzed in [63], the authors discouraged researchers from pursuing these models [64].
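A minimal sketch of the perceptron computation described above (the weights, bias and input here are arbitrary illustrative numbers):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Arbitrary example weights, bias and input (not taken from the thesis).
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 0.4, 2.0])

# Perceptron output: y = f(sum_i w_i * x_i + b) with a sigmoid non-linearity.
y = sigmoid(np.dot(w, x) + b)
print(y)   # a value in (0, 1); thresholding at 0.5 gives a binary decision
```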

Fig. 2.2. The perceptron model.

After the perceptron model, the multilayer perceptron (MLP) with one hidden layer was proposed and shown to be a universal approximator for Boolean and continuous outputs [65]; a two-layer perceptron can approximate any function [62]. Alongside the backpropagation (BP) algorithm proposed by Hecht-Nielsen [66], which offered a fast and straightforward method for optimizing neural networks, it seemed that some of the major issues of the perceptron algorithm were solved. The upgrade from a single perceptron layer to multiple layers operating in a distributed manner carried an important implication: adding layers could increase the capacity of the network to learn more complicated functions. This increased expressive power led to the idea of adding even more layers, which, in turn, became a computational bottleneck for multilayer perceptrons due to limited computational power and the inability to converge to


a global minimum. Inspired by the visual system, the neocognitron model [67, 68] and, later, the convolutional neural network (CNN) [69, 70, 71] introduced the convolution operation into neural networks (Figure 2.3). The convolution operation shares weights among different sections of the input, in contrast to a multilayer perceptron, which has separate weights for every point of the input. This weight sharing reduces the amount of computation significantly, and the shared weights can be interpreted as local feature extractors. For example, the first layer of a convolutional neural network usually learns filters that are sensitive to edges and bear a strong resemblance to Gabor filters. At the same time, limited computational resources, the small amount of available data, and the convergence problem led to a decline of interest in the field of neural networks, to the extent that, with the introduction of SVM, some researchers believed neural networks would be lost to history.

Fig. 2.3. The LeNet-5 model (http://deeplearning.net/tutorial/lenet.html).

In 2006, Hinton et al. [72, 73] published two papers that sparked renewed interest in the field of neural networks. These papers focused on "pre-training" neural network models in a stacked manner using restricted Boltzmann machines (RBM). RBMs or, in general, Boltzmann machines are a class of generative graphical models in which nodes in a graph are connected, and each connection expresses a dependency in the probability distribution. Figure 2.4 shows the RBM model and the equations governing it, where E(v, h) is the energy of the model. The fast optimization algorithm proposed by Hinton [72] tries to reduce the energy of the model, and after optimization, the model is reconfigured to be feed-forward rather

than bi-directional. This procedure is continued by training another RBM on top of the previous layer's outputs. This stacked "pre-training" of multiple layers and, later, fine-tuning of the model using back-propagation made it possible to train an MLP with a larger number of layers [74].

Fig. 2.4. The Restricted Boltzmann Machine (RBM) and its governing equations.

Although Hinton's work sparked interest in neural networks, it was the 2012 paper by Krizhevsky [75], proposing the AlexNet architecture, that changed the course of machine learning research. This work, using a unique CNN architecture and several tricks for training the large network, marked the first time a neural network won the world-renowned ImageNet challenge (Figure 2.5). ImageNet, at the time, focused its challenge on classifying a large number of images. AlexNet used several clever tricks for its training. First, instead of sigmoid non-linearities, rectified linear units (ReLU) [76] were used; these units improved training by allowing the gradient to flow better through the network. Second, Dropout [77] was used to prevent overfitting. Third, augmentation of the inputs was used to increase the number of data samples. And last, to train the massive network, the graphical processing unit (GPU) was used instead of CPU processing [78]. The use of GPUs lowered the training time of the network significantly, reducing it from a month of training to only 10 days. This success in training a deep neural architecture or, as it is now known, Deep Learning, was the first stepping stone towards better algorithms, better software, smarter network designs and faster hardware. Another class of neural networks is auto-encoders.

Fig. 2.5. AlexNet, the winner of the 2012 ImageNet competition [75].

These networks mainly focus on receiving an input, reducing its dimensionality (the encoder) and reconstructing the initial input from the lower-dimensional code (the decoder). Many strategies exist for constructing and training auto-encoders; an overview of auto-encoders and their training can be found in [79]. As an example, the denoising auto-encoder [80] adds noise to the input so that the reconstruction becomes robust to input noise. Another example is the variational auto-encoder [81], in which the encoder estimates the parameters of a distribution and values sampled from that distribution are fed to the decoder in order to reconstruct the input. Auto-encoders have also been used for pre-training feed-forward neural networks to provide a better initialization of the network. In this scheme, the encoder is the desired network, and the decoder is constructed to mirror its operation. This pre-training can either be end-to-end (training the whole network in one structure) or done in a stacked manner (training one layer at a time) [82, 83]. Up to this point, this section has mainly focused on feed-forward neural networks: networks whose flow of data is only in one direction. Other classes of neural networks with recurrent or feedback connections also exist, such as recurrent neural networks and Long Short-Term Memory (LSTM) networks. Because we have not utilized them in this thesis, we do not describe them in detail and leave them as future research directions.
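Returning to the auto-encoder idea above, a generic toy sketch in PyTorch (this is not the architecture used later in this thesis; all layer sizes and the noise level are arbitrary assumptions):

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Toy fully-connected auto-encoder: compress a 128-d input to 16-d and back."""
    def __init__(self, in_dim=128, code_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinyAutoencoder()
x = torch.randn(32, 128)                           # a batch of 32 random "signals"
x_noisy = x + 0.1 * torch.randn_like(x)            # denoising variant: corrupt the input
loss = nn.functional.mse_loss(model(x_noisy), x)   # reconstruct the clean input
loss.backward()
```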

2.2.1 Modern Deep Learning

Since 2012, many advances have been made in the area of deep architectures and deep learning, across a multitude of applications, to the extent that many different architectures have been revived or developed to better address challenges in machine learning. Many of these architectures can be seen in Figure 2.6. The success of deep learning quickly spread from academia to industry, and this has greatly accelerated research progress. For example, after Google set up its research and development effort for machine learning in 2012 and acquired DeepMind, a UK-based artificial intelligence company:

• we have seen GoogLeNet and the Inception architecture win categories of the ImageNet challenge in 2014 [84], with further development in the following years ([85, 86]),

• Batch Normalization [87] and its recent extension, Batch Renormalization [88], were proposed as methods for faster convergence in the training of neural networks,

• neural machine translation models [89] were designed to translate between languages and, surprisingly, developed an internal intermediate representation shared across languages,

• Google released its software package "TensorFlow" [90], currently one of the leading software packages for designing deep learning systems,

• DeepMind published two Nature papers [91, 92] in the area of deep reinforcement learning, and


Fig. 2.6. The different types of neural networks (source: http://www.asimovinstitute.org/neural-network-zoo/).

• DeepMind's AlphaGo and AlphaGo Zero [93] beat the world champion of the game "Go", a game whose search space is so vast that play is often said to rely on intuition rather than explicit strategy.

This example of Google is just the tip of the iceberg. Big companies such as Facebook [94], Microsoft [95], NVIDIA [96], Amazon [97], and

Disney [98] have all jumped on the deep learning bandwagon, publishing many papers, acquiring smaller deep learning companies, and recruiting fresh graduates from top machine learning universities. Well-known universities such as the University of Montreal, the University of Toronto and Stanford University publish papers and develop new ideas which are later applied in algorithms used by these large corporations. Speech recognition and speech generation, as signal-based input/output problems, have been targeted by deep learning researchers since the deep learning "boom", and interesting publications emerged from the start. For example, works by Abdel-Hamid et al. [99, 100, 101, 102, 103] have focused on feeding pre-processed speech signals into various networks for speech recognition. Recent publications such as WaveNet [104], music generation [105], end-to-end speech recognition [106], and speech intelligibility enhancement for hearing-impaired listeners [107] have also shown promising results and demonstrated the capabilities of deep learning in the area of speech and acoustics. Looking across these applications, one can notice that the goal of deep learning is to mimic the human brain in its understanding of raw input. In traditional computer vision algorithms, hand-crafted features are extracted from an image and then used for classification, whereas in deep learning, images are given with minimal processing beforehand, and classification is performed on them directly. This illustrates the capacity and potential of deep learning to capture information from inputs with minimal human intervention. Observing the potential of deep learning methods for images, text, and speech, we set out to explore its potential for a class of signals that intrinsically differs from speech and other types of data: EEG signals.


2.3 Deep Learning for BCI

The high dimensionality (multiple channels and a high sampling rate) of EEG data, the correlation between channels, and the presence of artifacts (e.g., movement) and noise pose a challenge in designing the right framework for EEG classification using deep learning. Given the nature of the data, the framework must include a data preparation stage in which the signal is reduced and transformed into a new representation without any significant loss of information. Based on this representation, the next stage of the framework, the network architecture, must be designed to extract meaningful features from the input. With these challenges in mind, DL methods have nevertheless been successfully applied to EEG classification. Cecotti introduced a CNN classifier for a P300 (a positive peak seen in the EEG signal 300 ms after presenting a stimulus to the subject) speller task [108]. The CNN in this paper was used in both a spatial and a temporal manner: a convolution was performed over the spatial EEG channels, thereby mixing them, and in the next layer, a convolution was performed along the temporal dimension of the EEG signal. Filtered EEG signals were used as inputs. The author also used a similar architecture in other papers for an SSVEP [109] and a rapid serial visual presentation (RSVP) task [110]. In another work, Stober used two representations (raw signal and spectral features) for classifying music imagery EEG signals using CNNs [111]. The results showed the capacity of CNNs to classify imagery-based EEG. In a follow-up paper, Stober used a convolutional auto-encoder (CAE) to pre-train a CNN on the same dataset in a unique fashion using cross-trial encoding and similarity-constraint encoding [112]. These two techniques increase the number of samples for the network to learn from and can be utilized as a solution in problems with a low number of data samples in

future studies. In a more recent paper, Bashivan introduced a novel representation of EEG signals: an image of the topological map of the EEG signal's Fourier transform over the scalp, computed in a specified time interval [113]. By doing so, a sequence of images is generated for the whole EEG trial and then fed into a combined CNN and Long Short-Term Memory (LSTM) network for classification. This paper is interesting in two aspects: first, it introduces a new take on how to represent EEG data, and second, it exploits the time-series nature of the data by defining a recurrent neural network (RNN) on it. Hajinoroozi focused on predicting driver cognitive performance using Deep Belief Networks (DBN), RBMs, and CNNs on raw representations of the EEG [114, 115]. In that study, a convolutional neural network specifically designed for EEG (channel-wise convolution) is used to determine cognitive performance in a two-class (drowsy or alert) setting. The above papers are not the only ones in the area of DL for EEG; they are cited as applications that demonstrate the potential for future studies in different parts of EEG research. As for the use of DL for MI-EEG, the number of papers is also limited. An et al. [116] proposed using manually extracted features from the channels based on the fast Fourier transform (FFT), which were then fed into a Deep Belief Network (DBN). Although this can be considered a demonstration of DL, it only uses the DBN as a classifier and does not treat the network as a feature learning algorithm. Santana [117] proposed a neural network based on the CSP algorithm in which all parts of the network can be trained, but without utilizing a CNN or introducing new elements into the original CSP algorithm, and the work focused on the right/left two-class problem. In a similar paper, [118] also


focused on optimizing a neural network for spatial filter optimization with constraints and a more sophisticated update algorithm. A filter-bank extension of the algorithm can also be seen in [119, 120]. In the paper by Yang [121], building upon the success of FBCSP, an augmented-CSP (ACSP) algorithm is suggested that uses overlapping frequency bands. The log-energy features are extracted for each frequency band and arranged in a two-dimensional matrix. By training a convolutional network on this frequency-energy matrix, the network learns to discriminate the features. Furthermore, a map selection algorithm is proposed to select specific feature maps after the convolution operation. However, the interpretation of the weights in the network is unknown, and the selected features neglect the time dynamics. Rezai Tabar et al. [122] published a paper which uses a stacked auto-encoder (SAE) and a CNN to classify the two-class BCI competition II-3 and IV-2a MI data. The data is first transformed using a short-time Fourier transform (STFT), and only 3 EEG channels are used (C3, Cz, C4). Although successful, the focus was on improving the two-class classification problem rather than on multiple classes. This restriction to a two-class problem may be due to the limited number of EEG channels used for classification, and the exponential growth of the feature dimensionality if additional channels are added. Schirrmeister [123] applies a CNN to the raw EEG signal for classifying motor imagery. The CNN contains the readily available units for designing neural networks. Their main contributions are a novel loss function and a scheme for augmenting the input data in order to increase the data size. A visualization of the learned network parameters is provided and correlated with the input to the architecture. Researchers from the same group have also extended the above study to interpret the different stages of processing of the input inside the network [124]. The study sheds light


on the different aspects of the input (phase, amplitude, frequency content) and the layers of the network that focus on processing them.

2.4 Motivation & Thesis Organization

Deep learning, as a powerful tool for representation learning and classification, has achieved state-of-the-art performance in many computer vision and NLP challenges by training deep neural networks with unique structural and computational elements, capable of learning information from the data in a hierarchical structure. With this said, the amount of research published on utilizing deep learning for BCI systems, especially MI-BCI systems, is limited, and the handful of studies that have employed deep neural networks have done so for constrained problems such as two-class MI-EEG classification. Given the challenges listed in Section 1.5 and our motivation for building a temporal representation of the EEG, we hypothesize that computational elements in deep networks, such as the convolution operation in CNNs, are compatible with a temporal representation and can increase classification performance if designed properly. Furthermore, the ability of deep networks to learn from large amounts of data has motivated us to design networks for transferring information from a large pool of subject data to a new subject, thus tackling the subject-to-subject variability stated in Section 1.5. Based on these motivations, the organization of this thesis is as follows: • In Chapter 3, we ask the question: what is the most suitable representation for EEG? We explore different representations of EEG based on current papers and then propose a novel representation to be used with a DL architecture. Our novelty is inspired by and built upon the success of the FBCSP method. We extend the method to


consider the dynamics of the EEG signal and prepare it to be fed into a CNN architecture. • In Chapter 4, we explore the different types of architectural elements that can be used in designing a network for EEG classification and propose several novel network architectures. Our classification accuracy results exceed the state-of-the-art, showing the classification power of deep neural networks and the suitability of our representation for the designed networks. Furthermore, we probe the network architecture to understand more about what has been learned from the data and provide quantitative and qualitative analyses. • We extend the application of deep neural networks and explore the concept of transfer learning in Chapter 5. Using transfer learning, we tackle the problem of reducing training samples and lowering the calibration time. We do so by training an auto-encoder on a pool of multiple subjects and transferring the network to a new subject. • Finally, we conclude on the success of our method and approach to solving the MI-EEG classification challenge and point out possible future research directions in Chapter 6.


Chapter 3

EEG Representation

In this chapter, we will explore the different ways to preprocess and represent the EEG signal. This process is fundamental to the EEG classification methodology because of the nature of EEG data. We represent a raw recorded EEG signal by a matrix X ∈ R^{C×T}, where C is the number of electrodes/channels and T is the number of time points in a given time segment. Observing the EEG signal, we notice its high dimensionality and the correlation between channels due to the positioning of the electrodes. Two main preprocessing operations can be carried out on the EEG signal to reduce its dimensionality without removing valuable information: operations on the channel dimension or operations on the time dimension. In this chapter we explore both approaches, in Sections 3.1.1 and 3.1.2. Furthermore, using these operations, we introduce previously used representations for MI-EEG (Section 3.3) and propose our novel methodology for EEG feature extraction (Section 3.4). As a reminder, the main focus of this study is analyzing MI-EEG signals, and for this reason, most of the examples and discussion concern these signals.


3.1 Preprocessing Methods Used for EEG Data

3.1.1 Temporal Filtering

Based on the nature of the EEG signal, information is nested in the changes of amplitude in a specific frequency band related to the task performed by a subject. Preprocessing algorithms tend to use either the whole frequency spectrum between 0.5 Hz and 30 Hz (broadband), a specific frequency band related to a task (pass band), or a filter bank of several specific frequency bands. The choice of frequency bands depends on the application and the subsequent algorithms. For example, the FBCSP algorithm uses a filter-bank approach and then resorts to feature selection algorithms to choose the best frequency bands. As another example, in some analytical studies of MI-EEG data, it has been shown that the ratio of the theta (4 Hz-7 Hz) band to a combination of the alpha (8 Hz-12 Hz) and beta (12 Hz-25 Hz) bands can be used to predict the performance of subjects in an MI task [125]. In addition to the importance of the frequency band, there is also the concept of a dominant frequency within the frequency band, which is subject-specific [126]. The dominant frequency is essential for narrowing down the information for a specific subject. These characteristics of the frequency content have been explored in many studies in conjunction with spatial filtering, mainly because the results of temporal filtering directly influence the performance of spatial filtering. Therefore, in some studies, the temporal and spatial algorithms are applied sequentially, with the former first, or both are optimized simultaneously. An example of the difference between broadband filtered and pass-band filtered signals can be seen in Figure 3.1.


Fig. 3.1. Sample of the differences between broadband and pass-band filtered signals. In each plot, there are 22 signals corresponding to the 22 electrodes used in the BCI competition dataset IV-2a. The broadband signal (top) has been filtered between 8 and 30 Hz.
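As a rough illustration of the filter-bank idea mentioned above (the nine 4 Hz-wide bands follow the FBCSP description in Section 3.4.1, but the filter order and stop-band attenuation here are assumptions, not the settings used in this thesis):

```python
import numpy as np
from scipy.signal import cheby2, sosfiltfilt

fs = 250                                               # sampling rate of the IV-2a data (Hz)
bands = [(4 + 4 * k, 8 + 4 * k) for k in range(9)]     # 4-8, 8-12, ..., 36-40 Hz

def filter_bank(eeg, order=4, stop_atten_db=30):
    """eeg: array of shape (channels, time). Returns (bands, channels, time)."""
    out = []
    for lo, hi in bands:
        # Type II Chebyshev band-pass filter, applied forward-backward (zero phase).
        sos = cheby2(order, stop_atten_db, [lo, hi], btype="bandpass", fs=fs, output="sos")
        out.append(sosfiltfilt(sos, eeg, axis=-1))
    return np.stack(out)

x = np.random.randn(22, 4 * fs)    # a fake 4 s, 22-channel trial
print(filter_bank(x).shape)        # (9, 22, 1000)
```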

3.1.2 Spatial Filtering

Spatial filtering, as a data preparation stage, is key for extracting meaningful features from EEG signals due to a high correlation between the


recorded EEG channels. Spatial filtering methods can be categorized into data-independent and data-driven methods. The spatial filtering operation can be written as Z = WᵀX, where X ∈ R^{C×T} is the input, Z ∈ R^{C'×T} is the output, W ∈ R^{C×C'} is the spatial filter, and C' is the number of channels after the spatial filtering operation. In data-independent spatial filtering, the spatial filter is fixed and is based on the electrode locations. In data-driven methods, the spatial filters are determined via an optimization algorithm that is either supervised or unsupervised and depends on the subject(s) data. In this section, we describe some of the methods used in spatial filtering for both categories and introduce our method in Section 3.4.

Data-independent Methods

In these methods, each channel is filtered by subtracting a group of other channels from it [127]. Two of the more famous spatial filtering algorithms in this category are the Laplacian filter and the Common Average Reference (CAR) filter. In Laplacian filtering, each channel value is replaced by its value minus the mean of a surrounding neighborhood of channels, yielding a relatively sparse spatial filter. The neighborhood size is a meta-parameter to be chosen or optimized. In mathematical form, the Laplacian can be written as X_i^{Lap} = X_i − (1/|N_i|) Σ_{j∈N_i} X_j, where X_i^{Lap} is the filtered i-th channel, X_i is the channel's original value, and N_i is the surrounding neighborhood of channels. In CAR, the value of a channel is replaced by its value minus the average of all other channels; in other words, CAR is a Laplacian whose neighborhood consists of all the channels. These methods are applicable in settings that do not require filter optimization and where an isolated analysis of certain channels is needed.
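A small sketch of these two fixed spatial filters (illustrative only; the neighborhood definition in practice depends on the electrode montage):

```python
import numpy as np

def car_filter(eeg):
    """Common Average Reference: subtract the mean over channels from every channel
    (the common CAR convention, averaging over all channels).
    eeg: array of shape (channels, time)."""
    return eeg - eeg.mean(axis=0, keepdims=True)

def laplacian_filter(eeg, neighbors):
    """Laplacian: for channel i, subtract the mean of its listed neighbors.
    neighbors: dict mapping channel index -> list of neighbor indices (montage-dependent)."""
    out = eeg.copy()
    for i, nbrs in neighbors.items():
        out[i] = eeg[i] - eeg[nbrs].mean(axis=0)
    return out

x = np.random.randn(22, 1000)                     # fake 22-channel recording
x_car = car_filter(x)
x_lap = laplacian_filter(x, {0: [1, 2, 3, 4]})    # hypothetical neighborhood for channel 0
```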


Data-Driven Methods

As mentioned, data-driven methods focus on optimizing a certain criterion based on the statistical or deterministic relationships between the channels and can be divided into unsupervised and supervised methods. An example of unsupervised methods is the family of blind source separation (BSS) methods, with Independent Component Analysis (ICA) being one of the most popular in this category. ICA focuses on decorrelating the channels with the objective that the newly transformed channels become "independent" from each other, based on the kurtosis or fourth-order moment metric [128]. Given an input X ∈ R^{M×T}, where T is the number of signal time points and M is the number of recorded sources, C independent components of the matrix X are extracted via the weight matrix W ∈ R^{M×C} using the following recursive algorithm (Algorithm 1):

Algorithm 1 FastICA Algorithm
1: procedure FastICA(X ∈ R^{M×T}, C)
2:   g(x) ← tanh(x)
3:   g'(x) ← 1 − tanh²(x)
4:   1 ← column vector of ones of length T
5:   for p ← 1, C do
6:     w_p ← random sample from R^M
7:     while w_p changes do
8:       w_p ← (1/T) X g(w_pᵀX)ᵀ − (1/T) [g'(w_pᵀX) 1] w_p
9:       w_p ← w_p − Σ_{i=1}^{p−1} (w_pᵀ w_i) w_i
10:      w_p ← w_p / ‖w_p‖
11:    end while
12:  end for
13:  return W = {w_1, . . . , w_C}
14:  return S = WᵀX ∈ R^{C×T}    ▷ Independent sources
15: end procedure

Principal Component Analysis (PCA) is also an unsupervised BSS method, which finds spatial filters that are orthonormal and that result in a diagonal channel covariance matrix. The spatio-spectral decomposition (SSD) [129] is a decomposition algorithm designed specifically for EEG, which receives a frequency band (freq) as an input and finds a spatial


filter that maximizes the energy of the given band and reduces the energy of the frequencies surrounding the band (Algorithm 2). It has proven to be a successful algorithm for isolating the information in a frequency band based on the spatial filter.

Algorithm 2 Spatio-Spectral Decomposition (SSD) Algorithm
1: procedure SSD(X ∈ R^{M×T×N}, freq)
2:   X_f ← bandpassfilter(X, freq)
3:   X_f' ← X − X_f
4:   C_f ← ⟨Cov(X_f)⟩_N        ▷ ⟨·⟩ := expectation over the N trials
5:   C_f' ← ⟨Cov(X_f')⟩_N
6:   return W = GEVD(C_f, C_f')
7: end procedure
8: procedure GEVD(C_1, C_2)     ▷ Generalized Eigenvalue Decomposition
9:   return eigenvectors of C_2^{−1} C_1
10: end procedure

In supervised methods, the spatial filters are designed to discriminate between the different class labels of the signal at hand. As stated in Section 1.4, the common spatial patterns (CSP) algorithm [29] is widely used, and many variants of the algorithm have been proposed. In addition to CSP, algorithms such as Source Power Comodulation (SPoC) [130], a generalization of CSP, use the correlation between a target variable and the input to find spatial filters or to decorrelate the input (Algorithm 3).

Algorithm 3 Source Power Comodulation (SPoC) Algorithm
1: procedure SPoC_λ(X ∈ R^{M×T×N}, z ∈ R^N)
2:   C ← ⟨Cov(X)⟩_N
3:   C_z ← ⟨Cov(X) z⟩_N
4:   maximize wᵀ C_z w  s.t.  wᵀ C w = 1
5:   return W = GEVD(C_z, C)
6: end procedure

In this study, we have built upon the FBCSP algorithm, which incorporates filter banks, CSP and feature selection. Details will be explained

in Section 3.4.1.

3.2 BCI Competition IV-2a Dataset

In this thesis, we focus on the 2008 BCI competition IV-2a EEG dataset [131] for all experiments. This dataset consists of four-class motor imagery (Left, Right, Feet and Tongue) EEG recorded from 22 Ag/AgCl electrodes at a 250 Hz sampling rate, in two sessions from nine subjects, with the two sessions recorded on different days. Each session has 72 trials per class, resulting in 288 data samples per session. The timing scheme consists of a fixation period of 2 s and a cue of 1.25 s, followed by a motor imagery period of 4 s. The timing scheme can be seen in Figure 3.2. The subject starts MI after the cue, and there is no questionnaire regarding how MI was conducted by the subject. The training data was provided first by the competition committee, and the test data labels (true labels) were released after the competition finished. The training data also contains additional information on whether a certain trial was noisy and should be rejected. More information about the details of the experiments is provided on the competition page. Previous attempts at classifying the data show a variety of subject performances based on accuracy score and Cohen's kappa. This variability across subjects shows the challenge in designing a methodology that is robust against subject variability.


Fig. 3.2. Timing scheme in the BCI Competition IV-2a data set ([131]).

3.3 A Brief Review of EEG Representations

Up to this point, this chapter has focused on the preprocessing stages that are usually performed on EEG signals. In this section, we introduce the "representations" of EEG signals found throughout the literature. Here, representation refers to the input given to a machine learning algorithm after preprocessing. This stage is crucial for machine learning algorithms applied to EEG: if the representation is too high-dimensional, too correlated, or too noisy, the machine learning algorithm cannot find discriminative information in the data and will overfit if the ratio of feature dimension to the number of data samples is too high. On the other hand, if the dimension of the raw EEG is reduced too aggressively, valuable discriminative information may be lost. Therefore, using the right representation is crucial. We introduce some representations of EEG in the following subsections and also introduce our novel representation of the EEG data, which is used for CNNs.

3.3.1 Energy & Relative Energy

Channel energy is one of the most widely used representations of EEG in the literature. In algorithms that utilize CSP, the relative energy of the spatially filtered channels is used to represent the EEG signals for classification. This feature naturally correlates with ERD and ERS, which


is the change of synchronization in brain activity or, equivalently, the change of variance of the signal, as stated in Section 1.4. In most methods, after spatio-temporal filtering, the variance of the newly constructed signals is used directly, or the energy of each signal relative to the overall energy of all signals is used instead. When relative energy is used, the logarithm of the energy features is taken to map the features from a hypersphere to a Euclidean space. Compacting the time points of the signal into one value results in a low-dimensional representation of the EEG signal, which is the main benefit of this representation and has provided good classification performance. The downside of this representation, however, is that it neglects the dynamics of the signal, in other words, the changes of energy over time.
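A small sketch of the log relative-energy feature described above (the shapes and values are illustrative):

```python
import numpy as np

def log_relative_energy(z):
    """z: spatially filtered trial of shape (channels, time).
    Returns one log relative-energy feature per channel."""
    energy = z.var(axis=1)                 # per-channel energy (variance over time)
    rel_energy = energy / energy.sum()     # normalize by the total energy of all channels
    return np.log(rel_energy)              # log maps the simplex-like features to Euclidean space

z = np.random.randn(4, 1000)               # e.g. 4 CSP-filtered channels, 4 s at 250 Hz
print(log_relative_energy(z))              # 4 features, one per filtered channel
```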

3.3.2 Covariance Matrix

In recent years, the covariance matrix has become a contender in the BCI classification arena and holds the state-of-the-art results in motor imagery classification [43]. It has also won almost all BCI-related competitions on Kaggle [132]. This method, first introduced by Barachant [41], uses the channel covariance matrix as the representation of EEG signals for motor imagery classification. Because covariance matrices lie in a Riemannian space, their geodesic (i.e., the distance between two matrices) is more complicated than a simple Frobenius or Euclidean distance metric. For the matrices to be classified using Euclidean metric-based classifiers, they are first projected into a new space called the Riemannian tangent space using the geometric mean and then classified in that space (Algorithm 4). The advantage of this method is that no preprocessing is needed (unless focusing on specific frequency bands). However, the main disadvantage is the relatively large feature vector that is created relative to the number of data samples available (x ∈ R^{M(M−1)/2}, where M is the number of channels). One way of resolving this issue is to use dimension reduction techniques such as LDA. This Riemannian geometry approach was extended by [43] by finding sub-manifolds of the covariance matrices that are also separable in the Riemannian tangent space, producing state-of-the-art results.

Algorithm 4 Riemannian Feature Extraction Algorithm
1: procedure TangentSpaceMapping(C ∈ R^{M×M})
2:   C = Uᵀ diag(σ_1, . . . , σ_M) U              ▷ eigenvalue decomposition
3:   log(C) := Uᵀ diag(log(σ_1), . . . , log(σ_M)) U
4:   Log_C(C_i) := C^{1/2} log(C^{−1/2} C_i C^{−1/2}) C^{1/2}
5:   upper(X) := upper-triangular part of X
6:   ‖C_i‖_C := ‖upper(C_i)‖_2
7:   δ_R(C, C_i) := ‖Log_C(C_i)‖_C                ▷ Riemannian distance
8:   C_G := arg min_C Σ_{i=1}^{N} δ_R(C, C_i)     ▷ Riemannian mean
9:   for i ← 1, N do
10:    s_i ← upper(C_G^{−1/2} Log_{C_G}(C_i) C_G^{−1/2})
11:  end for
12:  return {s_i}
13: end procedure

Many works exist in the area of Riemannian geometry outside the realm of BCI systems [133, 134, 135, 136, 137, 138, 139, 140] which could potentially be used for BCI. In general, the covariance representation for BCI applications has much potential and is a promising research direction for the future.
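A rough sketch of the tangent-space mapping in Algorithm 4 (for brevity, the arithmetic mean is used in place of the Riemannian mean, so this is an approximation of the algorithm above, not a faithful implementation):

```python
import numpy as np
from scipy.linalg import sqrtm, logm, inv

def tangent_space_features(covs):
    """covs: array of shape (trials, M, M) of channel covariance matrices.
    Returns one vectorized tangent-space feature per trial."""
    C_ref = covs.mean(axis=0)              # reference point (arithmetic mean, as a shortcut)
    W = inv(sqrtm(C_ref))                  # C_ref^{-1/2}, used to whiten each matrix
    iu = np.triu_indices(covs.shape[1])    # indices of the upper-triangular part
    feats = []
    for C in covs:
        S = logm(W @ C @ W)                # matrix log of the whitened covariance
        feats.append(S[iu])                # keep the upper-triangular entries as the feature
    return np.real(np.array(feats))

covs = np.array([np.cov(np.random.randn(22, 1000)) for _ in range(5)])
print(tangent_space_features(covs).shape)   # (5, 253) for 22 channels
```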

3.3.3 Other Representations

Other representations for EEG can be found in the literature, though they are not as widely used. These include Empirical Mode Decomposition [141, 142, 143], bispectrum energy [144, 145, 146], autoregressive parameters [147, 148], spectral decompositions (wavelet and STFT) [149, 150, 151], entropy [152, 153, 154], and many more.


3.3.4 EEG Representations Used in Deep Learning

Because of our focus on utilizing deep learning for EEG signal classification, this section discusses some of the representations presented in the literature. When using deep neural networks to learn information from EEG, the first intuitive representation would be either the raw signal or the raw time-frequency spectrum. These two representations dominate most of the publications in the area [111, 117, 114, 112, 118, 155, 119, 120, 115, 123, 122, 156]. Papers using either of these features commonly use a deep learning architecture designed specifically to classify EEG signals. For example, in [122], the time series for three channels and two frequency bands are used as the representation fed to the deep architecture (Figure 3.3). STFT is used for extracting the representation without any other preprocessing performed on the signal. Alternatively, in [114, 115], a channel-wise convolutional layer is used to classify the signal.

Fig. 3.3. The representation used in [122]. The signal time series for three channels (Cz, C3, C4) and two frequency bands (mu, beta) is used as the representation. STFT is used for extracting the features.

Another representation of the EEG can be seen in [113], where the EEG

time series and electrode positions are mapped onto a series of images. The mapping is produced using a topology-preserving mapping from a 3D space into a 2D space. The data is filtered into three main frequency components, the theta, alpha, and beta frequency bands, and spectral topography maps are constructed for these frequencies. By representing each of the frequencies as an RGB channel, a final series of images is constructed and then passed into a CNN for single-frame classification and into a Recurrent Neural Network (RNN) for multi-frame classification. This representation was used for a working memory task and was the first time that an image-based representation of EEG was used in combination with deep learning methods. The methodology for representation extraction and feature classification can be seen in Figure 3.4.

Fig. 3.4. The representation used in [113]. Signals are represented as a series of images based on information extracted from three frequency bands. Each frequency band is mapped to a colour channel in the final image.

Other representations of EEG data proposed in the literature are limited to energy features extracted from specific frequency bands [116] or energy features in combination with spatial filtering [121].


3.4 Novel Temporal Features

The representations stated above have been used for two-class or four-class MI-EEG classification. Although they have achieved 1%-3% higher accuracy than the FBCSP method, they still perform below the state-of-the-art presented in [43]. We speculate that, although the raw signal or raw STFT are valuable in terms of the interpretability of the network and direct representation learning, the low number of data samples per class and the correlation between the channels make it very likely that the network will overfit and fail to learn to decorrelate the channels. This speculation has motivated us to add a preprocessing stage to prepare the data before feeding it to a deep neural network. This preprocessing algorithm must preserve the temporal and spectral information of the EEG while decorrelating the channels and significantly reducing the dimension. In the following sections, we propose a novel feature representation of the EEG signal which resolves the caveats of previous representations and will later be used as the input to a convolutional neural network. The methodology for extracting the representation can be seen in Figure 3.5 and consists of two main parts: 1) the FBCSP method and 2) temporal feature extraction from FBCSP. A detailed description of each part is presented in the following sections.


Fig. 3.5. The novel representation proposed in this thesis. In the FBCSP stage, a raw EEG signal is passed through a filter-bank. The CSP spatial filtering algorithm is performed for each frequency band. The log-energy features are used for the feature selection algorithm. In the second part, the spatially filtered signals corresponding to the selected features are selected, the envelope is extracted and the signal is down-sampled.

3.4.1 Filter-Bank Common Spatial Patterns (FBCSP)

The Filter-Bank Common Spatial Patterns (FBCSP) algorithm was first introduced as an extension of the original Common Spatial Patterns (CSP) algorithm and gained attention by winning the 2008 BCI Competition IV-2a [32, 31]. The algorithm for processing the EEG signals and extracting the features is as follows (a code sketch of steps 2-4 follows this list):

1. The EEG signals from all recorded channels are filtered using a filter bank with nine bandpass filters starting at 4 Hz. Each filter has a bandwidth of 4 Hz. All filters are type II Chebyshev filters.

2. Spatial filters for each output of the filter bank are computed using CSP. A detailed description of the CSP algorithm can be seen in Section 1.4 and Equation 1.1.

3. The spatial filters corresponding to the 2 × N_W extreme eigenvalues (the N_W largest and N_W smallest eigenvalues) are selected. The extreme spatial filters are then paired with each other correspondingly.

4. The energy (variance) E_C of each spatially filtered channel is calculated and normalized by the total energy of the channels in a given frequency band, \tilde{E}_C = \frac{E_C}{\sum_{i=1}^{2 N_W} E_{C_i}}. The logarithm of the normalized energy is computed as the final feature.

5. Features coming from all filter bands are concatenated, and a mutual information-based feature selection is performed on the 2 × N_W spatially filtered channels, choosing N_S channels and their pairs.

6. Because CSP is designed for a two-class problem, a one-vs-rest or one-vs-one strategy must be adopted for multi-class tasks. In FBCSP, the former is chosen, leading to a maximum of classNumber × 2 × N_S features.
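The sketch below illustrates steps 2-4 for a single frequency band and a single binary (one-vs-rest) split; it follows the standard CSP formulation via a generalized eigenvalue problem and is an illustration, not the exact implementation used in this thesis:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_w=2):
    """trials_*: arrays of shape (trials, channels, time) for the two classes.
    Returns 2*n_w spatial filters (columns) for the extreme eigenvalues."""
    cov_a = np.mean([np.cov(t) for t in trials_a], axis=0)
    cov_b = np.mean([np.cov(t) for t in trials_b], axis=0)
    # Generalized eigenvalue problem: cov_a w = lambda (cov_a + cov_b) w.
    vals, vecs = eigh(cov_a, cov_a + cov_b)
    order = np.argsort(vals)                     # ascending eigenvalues
    picks = np.r_[order[:n_w], order[-n_w:]]     # n_w smallest and n_w largest
    return vecs[:, picks]                        # shape (channels, 2*n_w)

def log_energy_features(trial, W):
    """Step 4: log of the normalized variance of each spatially filtered channel."""
    z = W.T @ trial
    e = z.var(axis=1)
    return np.log(e / e.sum())

a = np.random.randn(20, 22, 1000)    # fake trials: 20 trials, 22 channels, 4 s at 250 Hz
b = np.random.randn(20, 22, 1000)
W = csp_filters(a, b)
print(log_energy_features(a[0], W))  # 4 features for this band and class split
```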

The values N_W = 2 and N_S = 4 were selected using cross-validation. With the competition data having four classes, the maximum number of features used for classification is 32. It should be noted that the features can be handled in two ways: concatenating all features into one large vector, or using the features extracted by the class-specific spatial filters individually. In [32], the latter is used. Because of the success of FBCSP in a classification framework, we decided to build a feature extraction procedure for temporal features utilizing a modified version of the FBCSP algorithm, which is described in Section 3.4.2. From the above algorithm it can be observed that: • The FBCSP method performs the spectral and data-dependent spatial filtering that is required for enhancing the EEG signal and reducing its dimensionality. • Inherently, it can be modified to include temporal information. Knowing this, we propose to extract the temporal representation from the FBCSP algorithm as follows.

3.4.2 Temporal Representation from FBCSP

Assuming we have performed the FBCSP process as described in Section 3.4.1, we extract the temporal features using the following procedure:

1. After FBCSP, we have the indices of the selected channels for each frequency band and each class. We use these indices to extract the corresponding spatially filtered EEG signals. It should be noted that we force the FBCSP algorithm to select 2 × N_S channels and use the whole period of motor imagery (0 s to 4 s) as the time segment to estimate the covariance matrices.

2. The envelope of each signal is extracted using the Hilbert transform [157]. The Hilbert transform produces the analytic form of a signal, which is complex-valued and can be interpreted as the one-sided version of the original signal's frequency spectrum. Taking the magnitude of the analytic form gives an estimate of the envelope.

3. After extracting the envelope, we have considered three possible representations for the EEG:
• using the raw (or a smoothed version of the) EEG envelope (R1);
• taking the power of the envelope, which can be interpreted as instantaneous energy (R2);
• dividing the envelope power by the total energy of each of the channels in each trial, similar to step 4 of the FBCSP algorithm described in Section 3.4.1 (R3).

4. Because the envelope is inherently a low-frequency signal, we can downsample it without information loss. Downsampling is useful for lowering the dimension of the data, especially when the number of data samples is limited. The ability to downsample is a natural benefit of using the FBCSP algorithm: the filter bank intrinsically reduces the number of samples needed to represent the original data.

5. After extracting the envelope signals for each class, instead of taking a one-vs-rest strategy for classification, we concatenate all four classes into a single matrix of signals. With the same values of N_W = 2 and N_S = 4 and with 4 classes, the number of channels will be 32. As for the dimension of the features in time, the original sampling frequency of the data is 250 Hz, and by choosing a 4 s interval of data, we have 1000 time points. Note that the envelope has a cutoff frequency of 4 Hz, which means a sampling frequency of 8 Hz is sufficient for the signal (Nyquist rate), but we have chosen to downsample to 10 Hz, yielding 40 time points for the 4 s interval. Overall, the dimension of the data fed into the network will be 32 × 40. From here on we will refer to the channel dimension as feature channels.

To confirm whether this representation is better than the energy features, we conduct a simple 10-fold cross-validation experiment using an SVM classifier. The results are presented in the next section.
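A sketch of the envelope extraction and downsampling in steps 2-4 (the two-stage decimation is an implementation convenience of this sketch, not necessarily how the thesis performed it):

```python
import numpy as np
from scipy.signal import hilbert, decimate

def envelope_representation(z):
    """z: selected spatially filtered signals, shape (feature_channels, time) at 250 Hz.
    Returns the downsampled envelope (an R1-style representation)."""
    env = np.abs(hilbert(z, axis=-1))      # magnitude of the analytic signal = envelope
    env = decimate(env, 5, axis=-1)        # 250 Hz -> 50 Hz (anti-aliasing filter + downsample)
    return decimate(env, 5, axis=-1)       # 50 Hz -> 10 Hz

z = np.random.randn(32, 1000)              # 32 selected channels, 4 s at 250 Hz
print(envelope_representation(z).shape)    # (32, 40), the 32 x 40 input described above
```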

3.5 A Quantitative Analysis of the Proposed Representation

The results of the 10-fold cross-validation can be seen in Table 3.1. The "FBCSP" column shows the cross-validation accuracies for the FBCSP energy features using a one-vs-rest classification scheme. The "R1", "R2", and "R3" columns show the same cross-validation using our proposed features. It is evident that our representation is superior to the FBCSP features for the R1 and R3 representations. It should be noted that the R1 representation is not significantly higher than the R3 representation, but it is preferred because it requires one less computational step.

Table 3.1 Cross-validation accuracy results using SVM for different representations.

Subject     FBCSP    R1       R2      R3
Subject 1   81.96    82.36    78.46   87.00
Subject 2   51.47    69.52    67.66   65.54
Subject 3   88.94    90.84    86.91   87.65
Subject 4   58.20    62.09    59.49   60.34
Subject 5   65.63    74.32    72.29   70.32
Subject 6   47.86    52.82    43.98   47.93
Subject 7   91.27    91.91    91.38   91.01
Subject 8   87.72    86.57    82.96   88.25
Subject 9   83.05    83.89    79.27   86.65
Average     72.90    77.15¹   73.60   76.08¹

¹ significant with p ≤ 0.05

3.6 Conclusion

When using a neural network, it is important to provide the right input to the network. In this chapter, we covered many of the preprocessing methods and representations of the EEG signal and proposed a new representation based on the FBCSP method and envelope extraction. Our representation contains the temporal information of the EEG signal and also identifies the potentially most discriminative spatially filtered channels. The cross-validation accuracy results also support our representation's richer information content relative to the standard energy features.


Chapter 4

Learning Temporal Information for BCI using Convolutional Neural Networks (CNN)

After introducing a new representation for EEG in Chapter 3, in this chapter we focus on architecture design for the new representation. Architecture design concerns how to combine and initialize known neural network elements in a fashion best suited to the given problem. The contents of this chapter are the results published in two of our papers. We first discuss the initial viewpoint introduced in [28] and then extend the idea to our paper under review ("Learning temporal information for brain-computer interface using convolutional neural networks," in IEEE Transactions on Neural Networks and Learning Systems). It should be noted that almost all of the architectures were designed using Torch7 [158], a Lua-based software package for designing, training and deploying neural networks.


4.1 First Design: A Parallel Convolutional-Linear Neural Network

The network design presented in this section was the initial design for our experiments. The results of this section motivated us to delve deeper into the design of the network and the selection of the right EEG representation, which yielded the results of Section 4.2.

4.1.1 A Mathematical Review of CNNs

The classic CNN architecture [159] is composed of a sequence of convolution and sub-sampling layers in which the values of the convolutional kernels are learned via an optimization algorithm. The governing equation of the convolution operation is as follows:

h_i^{l} = f\left( \sum_{n=1}^{M} W_{ni}^{l} \ast h_n^{l-1} + b_i^{l} \right)    (4.1)

where h_i^{l} is the i-th output of layer l, W_{ni}^{l} is the convolutional kernel operating on the n-th map of layer l−1 and used for the i-th output of layer l, and b_i^{l} is the bias term. f(·) is an activation function or non-linearity imposed on the output of the convolution. The optimization algorithm focuses on optimizing the convolutional kernels W. The output of each convolutional operation (or any other operation) is referred to as a feature map. After convolution, the sub-sampling layer reduces the dimension of the feature maps by representing a neighborhood of values in the feature map by a single value. For example, max-pooling uses the maximum value of a neighborhood of values as the representation. Each convolution operation, based on the contents of the learned kernel, extracts information regarding the contents of its input, and max-pooling focuses on the most relevant information in the neighborhood


of a pixel. This sequence of convolution and reduction leads to a lower dimensional representation of the image, containing information on different aspects of the input. For example, a feature map can be “active” when a certain geometrical shape is seen in the image. The extracted information is concatenated over all feature maps and fed into a multilayer perceptron (MLP) for classification.
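A tiny numerical sketch of Equation 4.1 and pooling for a one-dimensional input (all values are arbitrary; a single input map and a single output map are used, so the sum over n has one term):

```python
import numpy as np

def conv1d_valid(x, w, b=0.0, f=np.tanh):
    """One feature map: slide kernel w over x ('valid' positions only), add bias, apply f."""
    k = len(w)
    out = np.array([np.dot(x[t:t + k], w) for t in range(len(x) - k + 1)])
    return f(out + b)

def avg_pool(x, size=2, stride=2):
    """Sub-sampling: represent each neighborhood of `size` values by their mean."""
    return np.array([x[t:t + size].mean() for t in range(0, len(x) - size + 1, stride)])

x = np.array([0.1, 0.5, -0.2, 0.8, 0.3, -0.1])   # toy input signal
w = np.array([1.0, -1.0, 0.5])                   # toy learned kernel (shared across positions)
fmap = conv1d_valid(x, w)                        # feature map of length 4
print(fmap, avg_pool(fmap))
```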

4.1.2 Architecture Design Details

Inspired by LeNet-5’s design and based on the nature of our data, our designed network can be seen in Figure 4.1. Our network is comprised of two independently trained networks (a CNN and an MLP). The CNN receives an envelope-based temporal representation of the EEG (similar to Section 3.4.2), and the MLP receives energy features based on the FBCSP method. The inference is made by selecting the maximum value of each network for each class, and the highest value among the classes is chosen as the prediction. We have chosen to have two independent networks to use the new temporal features to boost the classification performance relative to when only using the energy features in an MLP network. For the temporal representation, we choose the instantaneous energy of the envelope (sub-sampling with a rate of 5, CSP performed in a 2 s window) which has been channelnormalized similar to the FBCSP. The features have been chosen this way because the average of the feature’s values over time results in the FBCSP features. Furthermore, we distribute the EEG data matrix into 4 channels based on the one-vs-rest extraction of the EEG features, resulting in the input to have a shape of 4 × 8 × 100 (class × channel × time). Note that this representation is extracted similarly to the representation in 3.4.2, but has a different dimension. This representation of the EEG signal is, later on, evolved into the dimension presented there. 59

Fig. 4.1. Our designed network for EEG classification. Two types of features are extracted from the data in every trial: trial energy (static energy) and instantaneous envelope energy (dynamic energy). A CNN is trained on the dynamic energy and an MLP is trained on the static energy. The output of each of the networks (class probability) is max-pooled for the inference of the class.

The convolutional neural network is trained in a parallel manner on each of the class dimensions of the EEG input. The details of each parallel CNN branch are as follows:

• Parallel convolutional layers with a one-dimensional kernel in time. A one-dimensional kernel in time means the information in the channel dimension is not affected. A more detailed description of one-dimensional kernels is presented in Section 4.2.

• Average-pooling. In CNN architectures for images, max-pooling is used because of its spatial invariance-inducing property and the abil-


ity to select the most important information and neglect less relevant information. However, training on EEG data with max-pooling yields poor results because single-trial EEG is inherently noisy, and average-pooling is more robust to noise than max-pooling.

• Convolutional layer with a one-dimensional kernel in time. In this stage, the kernel is chosen to be the same size as the input and can be viewed as template matching for each channel, or as a shared linear layer applied to each of the channels.

• Linear layers before and after concatenation. The linear layer after the convolutional layers is an embedding that can be viewed as a modified energy feature. After concatenating the data from the convolutional and linear layers above, the data is fed into another linear layer for classification.

The size of the first layer's convolutional kernel is 51, and the average pooling has a neighborhood of 5 with a stride of 5. The last convolutional layer's kernel is of size 10. The number of hidden nodes for each parallel convolutional branch is 50, which results in a feature vector of size 50 at the end of the second convolution. A linear layer with 50 hidden nodes is added to the end of each parallel convolutional branch. Because we are classifying 4 classes, and therefore the number of parallel branches is 4, the concatenated feature vector is of size 200. Finally, another linear layer of width 200 is added for classification, whose output is given to the output layer. The MLP used for the classification of the energy features is a simple two-layer network with 50 hidden nodes. For both the static and dynamic energy architectures, dropout regularization is used. Dropout, proposed by [77], is a semi-ensemble regularization that helps the network avoid overfitting. The activation function chosen

for the two architectures is the Rectified Linear Unit (ReLU). Each architecture is trained individually using Stochastic Gradient Descent (SGD) with a Negative Log-Likelihood (NLL) criterion.
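For illustration, a PyTorch sketch of one possible reading of a single parallel branch with the hyper-parameters quoted above (the thesis implementation used Torch7, so this is a re-expression under assumptions; the exact layer ordering, padding and how the 8 feature channels are handled are guesses):

```python
import torch
import torch.nn as nn

# One parallel branch: input is one class dimension, i.e. 1 x 8 x 100
# (8 feature channels over 100 time points). Kernel sizes follow the text
# (51, average pool 5/5, 10, 50 feature maps); everything else is assumed.
branch = nn.Sequential(
    nn.Conv2d(1, 50, kernel_size=(1, 51)),              # time-only kernel: channel dim untouched -> 50 x 8 x 50
    nn.ReLU(),
    nn.AvgPool2d(kernel_size=(1, 5), stride=(1, 5)),     # -> 50 x 8 x 10
    nn.Conv2d(50, 50, kernel_size=(1, 10)),              # kernel spans the remaining time points -> 50 x 8 x 1
    nn.ReLU(),
    nn.Flatten(),
    nn.Dropout(0.5),
    nn.Linear(50 * 8, 50),                               # per-branch embedding of width 50
)

x = torch.randn(16, 1, 8, 100)    # a batch of 16 trials for one class dimension
print(branch(x).shape)            # torch.Size([16, 50])
```

Four such branches would then be concatenated into a 200-dimensional vector and passed through the final linear classification layer, as described above.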

4.1.3 Classification Results

A support vector machine (SVM), applied to the static FBCSP energy features, is used as the benchmark and implemented using LIBSVM [160]. To verify the significance of an increase or decrease in accuracy, we once again use the one-sided Wilcoxon signed-rank test [161]. This test is appropriate when the number of paired statistical samples to compare is relatively small and non-Gaussian. The Wilcoxon signed-rank test computes the Wilcoxon score (W-score) by the following procedure:

1. Given two sets of data, the sign and absolute value of the difference between the paired values are computed.

2. Based on the absolute values, a rank is given to each value, with 1 corresponding to the lowest value.

3. The ranks are multiplied by the sign of the difference, creating the signed ranks. If there are more negative values than positive values, the positive signed ranks are summed to create the W-score. If there are more positive values, the absolute value of the sum of the negative signed ranks is the W-score. If there are no negative or positive values, the score is zero.

For example, when there is an improvement in the accuracy values of all subjects, the W-score of the Wilcoxon signed-rank test is 0, indicating a significant increase. The W-score threshold for a significance level of 0.05 or 0.01 is taken from a look-up table. Accuracies are obtained with an ensemble of ten models (Table 4.1). Results of the independently trained static and dynamic energy networks (MLP

& CNN) and their combination (CNN||MLP) are presented, so these three results can be used to interpret which feature contributes more to the accuracy of the combined network. Although we see an increase in average classification accuracy for both the CNN and CNN||MLP, only the latter is a significant increase. An interesting observation is the amount of growth in accuracy for some subjects, specifically subjects 5, 6 and 9. By comparing the different features and classifier accuracies, it is evident that the convolutional network and dynamic energy features are boosting performance in these subjects.

Table 4.1 Classification results for the Parallel Convolutional-Linear Network with comparison to the baseline over the BCI competition IV-2a test data. The SVM column is the benchmark static energy features using an SVM classifier. The MLP and CNN columns show results of the linear and convolutional neural networks using static and dynamic energy features respectively. CNN||MLP shows the combined network using both features. The p-value for the Wilcoxon signed-rank test is given in the row labeled p-value. Results show a significant increase in classification accuracy when static and dynamic energy are used together.

          Sub 1  Sub 2  Sub 3  Sub 4  Sub 5  Sub 6  Sub 7  Sub 8  Sub 9  Mean   p-value
SVM       79.16  52.08  83.33  62.15  54.51  39.24  83.33  82.64  66.67  67.01      -
MLP       75.69  48.96  75.35  64.93  52.08  39.93  82.99  84.72  67.36  65.78  0.4065
CNN       78.82  53.47  82.64  60.76  59.03  43.75  82.64  83.68  81.25  69.56  0.2127
CNN||MLP  80.55  53.82  84.72  64.58  59.03  44.10  84.03  86.80  77.77  70.60  0.0091
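For reference, W-scores and p-values of the kind reported in Table 4.1 can be reproduced with SciPy's implementation of the one-sided Wilcoxon signed-rank test; the sketch below compares the SVM and CNN||MLP rows of the table. Note that SciPy computes an exact or approximate p-value directly, so it may differ slightly from the look-up-table thresholds used in the text.

```python
import numpy as np
from scipy.stats import wilcoxon

svm     = np.array([79.16, 52.08, 83.33, 62.15, 54.51, 39.24, 83.33, 82.64, 66.67])
cnn_mlp = np.array([80.55, 53.82, 84.72, 64.58, 59.03, 44.10, 84.03, 86.80, 77.77])

# one-sided test: is CNN||MLP significantly higher than the SVM baseline?
w_score, p_value = wilcoxon(cnn_mlp, svm, alternative='greater')
print(w_score, p_value)
```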

With the improved performance, we look into how the CNN is contributing to the overall results. For this, we derive the per-class classification accuracies and also the average confusion matrix over all subjects. These values can be seen in Tables 4.2 and 4.3. In Table 4.2, we see class accuracies for the SVM given the static energy features and the CNN given the dynamic energy features for each of the four classes. The table shows that the contribution of the CNN is not in increasing the performance on right and left motor imagery but rather in boosting the performance on the two other classes, feet and tongue. Table 4.3 also supports this by showing the overall reduction of confusion between the tongue/feet and left/right classes.

Table 4.2 Class accuracy for BCI competition IV-2a data. SVM is the benchmark FBCSP features using an SVM classifier; CNN shows the results of the convolutional neural network. L, R, F and T correspond to the Left, Right, Feet and Tongue classes of the MI task. Ovr is the overall accuracy. It can be seen that the convolutional neural network has increased the classification accuracy on the Feet and Tongue classes for almost all subjects.

                 Energy SVM                           CNN
         L     R     F     T     Ovr     L     R     F     T     Ovr
Sub 1   0.92  0.79  0.57  0.85  0.79    0.85  0.79  0.64  0.88  0.79
Sub 2   0.54  0.38  0.64  0.43  0.52    0.64  0.31  0.81  0.39  0.53
Sub 3   0.90  0.96  0.76  0.78  0.83    0.75  0.97  0.81  0.78  0.82
Sub 4   0.46  0.69  0.76  0.61  0.62    0.39  0.75  0.53  0.76  0.61
Sub 5   0.81  0.67  0.11  0.50  0.54    0.75  0.88  0.17  0.57  0.59
Sub 6   0.50  0.58  0.07  0.38  0.39    0.46  0.40  0.28  0.61  0.44
Sub 7   0.86  0.94  0.63  0.83  0.83    0.88  0.93  0.56  0.94  0.83
Sub 8   0.93  0.82  0.85  0.68  0.83    0.93  0.74  0.86  0.82  0.84
Sub 9   0.75  0.89  0.67  0.39  0.67    0.78  0.82  0.85  0.81  0.81
Mean    0.74  0.75  0.56  0.60  0.67    0.71  0.73  0.61  0.73  0.70

Table 4.3 Average confusion matrix for BCI competition IV-2a data over all subjects. It can be seen that the network is reducing misclassification between classes.

SVM         L       R       F       T
  L       74.07   15.28    5.86    4.78
  R       16.05   74.69    4.48    4.78
  F       10.19   15.28   56.17   18.36
  T       13.12   15.12   11.27   60.49

CNN         L       R       F       T
  L       71.30   13.12    5.71    9.88
  R       11.11   73.15    7.25    8.49
  F        6.33    8.80   60.96   23.92
  T        8.49    8.33   10.34   72.84

4.1.4

Discussion

This initial study concluded that the convolutional neural network has the potential to be used in combination with the right temporal representation of the EEG. The CNN yields a significant increase in classification accuracy and a reduction in confusion between the classes. The main caveats of the designed network are the heuristic selection of the parameters and the complicated structure of the network. In the following section, we explore the different representations proposed in Section 3.4.2 and propose a simplified network utilizing these features, with the network parameters selected via cross-validation.

4.2

Second Design: Channel-Wise Convolutions with a Channel Mixing Layer

Since the introduction of LeNet5 and the success of AlexNet, most neural network architectures follow a similar design procedure with additional computational layers or connections such as dropout [77], batch normalization [87], Inception [84] and Identity Mapping [162]. Depending on their nature, these additions can lead to faster training of the network, better conservation of information throughout the hierarchical process, and less overfitting. For designing a network, the nature of the input should be taken into consideration. Focusing on each of the representations introduced in Section 3.4.2, one can notice that each of the 32 selected feature channels may have different characteristics. They may come from different frequency bands (based on the feature selection algorithm). Regarding spatial filters, each feature channel has a unique spatial filter (based on the selected eigenvalues in the CSP algorithm) which is designed for discriminating one class against the other classes.

Furthermore, it is possible for the spatial filters to be correlated, since one spatial pattern can be discriminative for two classes or more. These characteristics of our representation beg the question of what a convolutional operation on such an input actually means. We consider different scenarios for applying convolution to our EEG representation and their interpretations.

Scenario 1: Convolution only across time with a common kernel shape for all feature channels. In this type of convolution, the assumption is that the feature channels selected for classification, although intrinsically different and independent, share a common morphology. This morphology can be captured from each channel using a common kernel which learns the shapes that lead to the discrimination of classes. This choice of convolutional kernel preserves the channels throughout the convolutional layers of the network (channel-wise convolution) and reduces the temporal dimension. After convolution, the fully-connected layers mix all the channels and temporal values and then classify them. We will call the network architecture utilizing this convolution “Channel-wise CNN” (CW-CNN).

Scenario 2: Convolution only across channels. This operation can be interpreted as mixing the channel signals with each other. For example, if the size of the convolutional kernel for this layer is the same as the number of channels, the output of the convolution operation is a new signal which is a linear combination of all the given channels. A kernel size smaller than the number of channels is not ideal because it implies that a common linear combination can be shared among the channels; however, the EEG feature channels at this stage are independent, and their order in the input matrix is not important. We call this scenario “Channel Mixing CNN” (CM-CNN). This type of convolutional layer is better used together with CW-CNN, because the FBCSP input is already a linear mixture of the original EEG channels and an extra channel mixing is redundant; only after some processing does a new mixture of the channels make sense.

Scenario 3: Convolution across both time and channels using a two-dimensional kernel. The visible result of this type of convolution, in addition to convolution in time, is the mixing of the feature channels after they are convolved. This scenario produces a new time series which captures information from all the channels simultaneously. The receptive field of the kernel in the channel dimension determines which channels are mixed with each other. For example, if all 32 channels of the input are mixed, the output is a single-channel feature which is the summation of the convolutions of all the feature channels with their unique filters. Alternatively, if only the feature channels related to the class-specific spatial filters are mixed, the output is a channel summarizing the information of a class. We call this type of architecture the “two-dimensional convolution scenario” (2D-CNN), because of its similarity to most two-dimensional CNN architectures.

Another way of performing the two-dimensional convolution is by breaking it up into two one-dimensional convolutions in two separate layers. This implementation makes the convolution in time and space independent of each other and increases the flexibility of the network, at the cost of increasing the number of parameters due to the introduction of a new computational layer. It can be viewed as adding a channel mixing layer to the CW-CNN network. Therefore, this architecture is called “Channel-wise Convolution with Channel Mixing” (C2CM). Figure 4.2 illustrates the convolution in time, the convolution in channel and the two-dimensional convolution in detail.

Fig. 4.2. The three types of convolution possible to be implemented on any feature map (from left to right): convolution in time, convolution in channel and 2D convolution. A mapping from multiple squares to one square is a linear combination of the multiple squares into a single value. The number of squares in the output of each convolution corresponds with the actual effect of the convolution. Colour shows an independent time-series. The gray values in the output mean the channel's values are mixed.

Among all the models, 2D-CNN has the smallest number of parameters, followed by C2CM. The main contributor to the number of parameters trained in a CNN is the connection of the last convolutional layer to the fully-connected layers. In all CNN architectures, the last convolutional feature maps are vectorized and stacked into one large vector and fed into fully-connected layers. The smaller the dimension of the feature map of the last layer, the lower the number of linear units used. In the case of both 2D-CNN and C2CM, because of the reduction of feature map size due to convolution in both dimensions, the number of parameters is significantly lower compared to CW-CNN. A lower number of parameters is desired, especially when the number of training data samples is relatively low, to avoid overfitting and allow for better training of the network. We present the results for the architectures C2CM and CW-CNN in Section 4.2.2. A sample visualization of the architectures can be seen in Figure 4.3. As an example, considering the input size calculated in Section 3.4.2, Table 4.4 shows the input size at each layer for a sample two-layer CW-CNN architecture.


Table 4.4 A sample of the CW-CNN architecture. The table shows the input size at each layer and the number of parameters for a sample two-layer CW-CNN architecture.

Layer Type    Patch size / Stride   Input Size   Hidden Units   Parameters
Convolution   1 × 7 / 1 × 3         32 × 40      32             256
ReLU          -                     32 × 12      -              0
Convolution   1 × 3 / 1 × 3         32 × 12      32             3,104
ReLU          -                     32 × 4       -              0
Linear        -                     4096         512            2,097,664
ReLU          -                     512          -              0
Linear        -                     512          4              2,052
LogSoftMax    -                     4            4              0
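As an illustration of Table 4.4, the following is a minimal PyTorch sketch of the CW-CNN architecture (time-only, channel-wise kernels) with an optional channel-mixing layer that turns it into C2CM. Module names and the batch size are illustrative assumptions, not taken from the thesis code.

```python
import torch
import torch.nn as nn

class CWCNN(nn.Module):
    def __init__(self, n_channels=32, n_classes=4, mix_channels=False):
        super().__init__()
        layers = [
            # Scenario 1: channel-wise convolution, the kernel slides over time only
            nn.Conv2d(1, 32, kernel_size=(1, 7), stride=(1, 3)), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(1, 3), stride=(1, 3)), nn.ReLU(),
        ]
        if mix_channels:
            # C2CM: an extra 32x1 kernel mixes all feature channels (Scenario 2)
            layers += [nn.Conv2d(32, 32, kernel_size=(n_channels, 1)), nn.ReLU()]
        self.features = nn.Sequential(*layers)
        # flatten and classify; 4096 = 32 maps x 32 channels x 4 time steps for CW-CNN
        flat = 32 * (1 if mix_channels else n_channels) * 4
        self.classifier = nn.Sequential(
            nn.Linear(flat, 512), nn.ReLU(),
            nn.Linear(512, n_classes), nn.LogSoftmax(dim=1),
        )

    def forward(self, x):              # x: (batch, 1, 32 feature channels, 40 time points)
        h = self.features(x)
        return self.classifier(h.flatten(1))

x = torch.randn(8, 1, 32, 40)          # one batch of FBCSP envelope representations
log_probs = CWCNN()(x)                 # (8, 4) log class probabilities
```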

Fig. 4.3. Visualization of the sample architectures CW-CNN (above) and C2CM (below). After the convolutional layers, the feature maps are flattened into a single vector and fed to the fully-connected network. It can be seen that the more convolutional layers there are, the lower the dimension of the last convolution output and, therefore, the fewer parameters are needed to connect to the fully-connected layers.

4.2.1

Parameter Selection via Cross-Validation

For each of the scenarios in Section 4.2, we need to choose the network parameters. The number of layers (convolutional, fully-connected), kernel size, number of hidden nodes, convolution stride, pooling method, regularization methods (batch normalization, dropout), and other network-related settings are considered hyper-parameters and can be optimized or selected heuristically. These hyper-parameters, in addition to the different EEG representations introduced in Section 3.4.2, must be chosen correctly based on cross-validation. Practically, it is not feasible to search the full parameter space due to time and computation limitations. Instead, we use coordinate descent as a sub-optimal method to perform cross-validation for the network parameters [163]. In coordinate descent, a set of parameters, Θ = [θ1, θ2, · · · , θN], is initialized, and then the objective or score function is optimized for each θi (i = 1, · · · , N) independently, while updating the initial Θ with the newly optimized parameters. After N optimizations, the Θ vector is completely updated, and a new iteration of optimization can be initiated. For better results, the algorithm can be repeated for several iterations.
We have chosen two values to be selected via cross-validation: the size of the kernel and the number of convolutional nodes. Table 4.5 shows the candidate values for each of the two, plus the initialization values of the other parameters. Multiple values in curly brackets indicate the number of layers used during cross-validation. For convolution kernels, the values in the brackets are as follows: kernel width, kernel height, stride in width and stride in height. For example, {{4, 1, 3, 1}, {3, 1, 2, 1}} indicates that the convolution has two layers, where the first layer has a kernel size of 1×4 with a stride of 3 and the second layer has a kernel size of 1×3 with a stride of 2. We perform a 10-fold cross-validation only once to select the parameters. The convolutional layer parameters (convParams) are first selected using cross-validation, and then the chosen values are used for cross-validation for selecting the number of convolutional nodes (hidNodes). For the C2CM structure, which has an additional computational layer, every value is the same as in the CW-CNN structure. The channel mixing layer is positioned after the channel-wise convolutions and has the same number of hidden nodes as the previous convolutional layers.
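A minimal sketch of this coordinate-descent selection is shown below. The `evaluate` callable stands for "train the network with a given parameter setting and return the mean 10-fold cross-validation accuracy"; it is a placeholder supplied by the caller, not code from the thesis, and the dummy scorer in the usage example is there only so the snippet runs.

```python
import numpy as np

def coordinate_descent(param_grid, init, evaluate, n_iters=1):
    theta = dict(init)                                    # e.g. {'convParams': ..., 'hidNodes': ...}
    for _ in range(n_iters):
        for name, candidates in param_grid.items():       # optimise one coordinate at a time
            scores = [evaluate({**theta, name: c}) for c in candidates]
            theta[name] = candidates[int(np.argmax(scores))]
    return theta

# usage with (a truncated version of) the values of Table 4.5: convParams first, then hidNodes
grid = {'convParams': [((4, 1, 3, 1), (3, 1, 2, 1)), ((7, 1, 3, 1), (3, 1, 3, 1))],
        'hidNodes':   [(8, 8), (16, 16), (32, 32), (64, 64)]}
best = coordinate_descent(grid,
                          init={'convParams': grid['convParams'][0], 'hidNodes': (32, 32)},
                          evaluate=lambda p: np.random.rand())   # dummy scorer for illustration
```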


Table 4.5 Parameters used for cross-validation of CNN architectures and their candidate values.

Cross Validation Parameter                 | Initial value for other parameters             | Cross Validation Values
Convolutional kernel sizes (convParams)    | MLP Nodes = none, Conv. Nodes = {32, 32}       | {{4,1,3,1},{3,1,2,1}}, {{7,1,3,1},{3,1,3,1}}, {{8,1,2,1},{5,1,3,1}}, {{10,1,3,1},{3,1,2,1}}, {{10,1,2,1},{4,1,2,1}}, {{16,1,3,1},{3,1,2,1}}, {{20,1,2,1},{5,1,2,1}}
Number of convolutional nodes (hidNodes)   | Chosen values for convParams, MLP Nodes = none | {8,8}, {16,16}, {32,32}, {64,64}, {128,128}, {256,256}

The kernel size and stride combinations for the first and second layers are chosen based on the input size and in such a way that the output of the convolutional network has an integer size. It should be noted that in the conventional CNN architectures used in computer vision, the kernel size is large in the first layer but is reduced in subsequent layers. To explore the effect of the kernel size, we consider values for the first layer ranging from a size of 4, corresponding to an interval of 400 ms, to a size of 20, which corresponds to an interval of 2 s. Furthermore, compared with conventional architectures, we choose not to use any sub-sampling method and instead rely solely on the convolution stride. The training of the networks is performed with the following configurations:
• ADAM [164] is used as the optimization method. The parameters are set to the default values as in [164].
• Negative log-likelihood is taken as the optimization criterion.
• In all layers, we insert a batch normalization [87] layer before the activation and a dropout layer after the activation with a probability of 50%.
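A short sketch of this training configuration is given below, assuming PyTorch; the kernel sizes, hidden widths and batch size in the snippet are placeholders chosen only so that the shapes are consistent, not the cross-validated values.

```python
import torch
import torch.nn as nn

def conv_block(in_maps, out_maps, kernel, stride):
    return nn.Sequential(
        nn.Conv2d(in_maps, out_maps, kernel_size=kernel, stride=stride),
        nn.BatchNorm2d(out_maps),     # batch normalization before the activation
        nn.ReLU(),
        nn.Dropout(p=0.5),            # dropout after the activation
    )

model = nn.Sequential(
    conv_block(1, 32, kernel=(1, 10), stride=(1, 2)),   # placeholder kernel/stride values
    conv_block(32, 32, kernel=(1, 4), stride=(1, 2)),
    nn.Flatten(),
    nn.Linear(32 * 32 * 7, 4),                           # 32 maps x 32 channels x 7 time steps
    nn.LogSoftmax(dim=1),
)
optimizer = torch.optim.Adam(model.parameters())         # ADAM with default parameters
criterion = nn.NLLLoss()                                  # negative log-likelihood criterion

x, y = torch.randn(16, 1, 32, 40), torch.randint(0, 4, (16,))
loss = criterion(model(x), y)
loss.backward(); optimizer.step()
```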

4.2.2

Classification Results

Baseline Method. In our study, we use both Cohen's kappa and accuracy to evaluate our method. The FBCSP feature extraction algorithm in combination with a linear C-SVM classifier is used as the baseline. Features are extracted from 0.5 s to 2.5 s after the cue for both the train and test datasets. Kappa values are taken from the original FBCSP paper. We also include the results from the paper of Bashashati et al. [59], which used Bayesian optimization to find the best parameters for FBCSP and reported the classification accuracy of their method on the BCI competition IV-2a dataset. Regarding kappa, we use the values in [43], which are shown to be the highest among many methods (accuracy was not reported in the paper). The baseline results can be seen in Table 4.12.

EEG Representation & Architecture Comparison To select one of the three EEG representations described in Section 3.4.2, we use a simple architecture with a set of chosen parameters and perform a 10-fold cross-validation over each of the representations. The architecture is similar to that in Table 4.4, with two convolutional layers (32 nodes each) and without the fully-connected layer. The average cross-validation values are obtained by repeating the measurements for 10 networks (10 networks×10 folds). Our assumption is that with a common architecture, the better representation would have a higher cross-validation average. The results can be seen in Table 4.6. The table shows that the CNN architecture selects the R1 representation for all subjects and the cross-validation average accuracy is higher than the other two representations. Although the difference in accuracy is minimal between the representations, the R1 is preferred because it has less computational operations than R2 and R3. For classifier comparison, we perform cross-validation of the given representation on two additional classifiers: linear SVM and MLP (two hidden layers with 32 hidden nodes, 10 networks). The cross-validation results are given in Table 4.7. It can be seen that on average, most subjects have a higher cross-validation accuracy for the CNN architecture; thus CNN is selected.


Table 4.6 Cross-validation results for feature representation given the CW-CNN architecture in Table 4.4 (all rows use the CNN classifier).

Feature                              Sub 1  Sub 2  Sub 3  Sub 4  Sub 5  Sub 6  Sub 7  Sub 8  Sub 9  Average
Envelope (R1)                        85.23  69.73  90.15  65.57  77.42  52.41  93.68  90.04  84.75  78.78¹
Instantaneous Energy (R2)            83.38  69.42  87.38  65.27  75.43  47.71  92.38  87.45  82.40  76.76
Relative Instantaneous Energy (R3)   85.04  65.67  87.07  64.28  73.72  48.59  91.97  87.80  83.55  76.41
¹ The R1 representation is significantly higher than the two other representations with p ≤ 0.01.

Table 4.7 Cross-validation results for different classifiers given the R1 (envelope) feature.

Classifier   Parameters   Sub 1  Sub 2  Sub 3  Sub 4  Sub 5  Sub 6  Sub 7  Sub 8  Sub 9  Average
CNN          19748        85.23  69.73  90.15  65.57  77.42  52.41  93.68  90.04  84.75  78.78¹
MLP          42180        85.20  68.16  91.04  60.86  73.68  52.34  90.20  88.69  85.83  77.33
SVM          5120         82.36  69.52  90.84  62.09  74.32  52.82  91.91  86.57  83.89  77.15
¹ The CNN classifier is significantly higher than the two other classifiers with p ≤ 0.05.

Architecture Parameter Selection. With the chosen representation, we conduct cross-validation over the network parameters based on the values in Table 4.5. Parameters are selected by averaging the results of each fold as well as multiple network initializations. Here, a 10-fold cross-validation is performed on 10 network initializations, and the average accuracy over these 100 networks is used for parameter selection. The cross-validation results for all parameter values for all subjects can be seen in Tables 4.8 and 4.9. These results belong to the C2CM architecture. The top portion of each table shows the cross-validation accuracy, and the bottom portion shows the test accuracies for each of the parameters. Ideally, we want the highest cross-validation accuracy to correspond to the maximum test accuracy. For many of the selected values, this is not the case. This difference between the selected and test accuracies can be due to a mismatch between the domains of the data or to the initialization. Nevertheless, the selection process can still be performed. Also, as seen in both tables, for each subject the cross-validation accuracy versus parameter trend is not consistent, but there is a peak. This peak in the cross-validation is our parameter selection criterion.


Table 4.8 Cross-validation accuracy and corresponding test accuracy results for the convolution kernel in the C2CM architecture. The test accuracy is the result of a 10-model ensemble when the parameters in each row are used.

Cross-Validation Accuracies
CNN Kernel                            Sub 1  Sub 2  Sub 3  Sub 4  Sub 5  Sub 6  Sub 7  Sub 8  Sub 9  Average
{{4,1,3,1},{3,1,2,1},{1,32,1,1}}      86.98  66.74  93.39  62.51  72.68  49.27  93.71  89.64  82.62  77.50
{{7,1,3,1},{3,1,3,1},{1,32,1,1}}      87.35  65.92  93.29  63.17  70.50  50.23  92.90  87.34  82.60  77.03
{{8,1,2,1},{5,1,3,1},{1,32,1,1}}      87.50  66.45  93.59  63.83  73.00  50.74  92.90  88.87  82.29  77.69
{{10,1,3,1},{3,1,2,1},{1,32,1,1}}     87.60  65.59  92.79  64.28  73.87  50.54  92.75  89.07  83.54  77.78
{{10,1,2,1},{4,1,2,1},{1,32,1,1}}     88.47  66.01  93.86  64.75  73.12  51.86  93.41  89.40  82.89  78.20
{{16,1,3,1},{3,1,2,1},{1,32,1,1}}     86.96  66.37  91.97  62.93  73.90  51.31  93.35  87.75  82.13  77.41
{{20,1,2,1},{5,1,2,1},{1,32,1,1}}     86.68  66.44  92.17  64.92  76.34  53.72  93.64  87.84  82.95  78.30

Test Accuracies
{{4,1,3,1},{3,1,2,1},{1,32,1,1}}      87.15  65.28  88.89  68.75  63.54  42.36  88.19  80.90  76.04  73.46
{{7,1,3,1},{3,1,3,1},{1,32,1,1}}      87.85  61.11  90.63  67.71  60.07  46.88  87.15  78.82  80.90  73.46
{{8,1,2,1},{5,1,3,1},{1,32,1,1}}      88.89  61.11  90.28  66.32  61.11  43.40  87.15  79.17  78.13  72.84
{{10,1,3,1},{3,1,2,1},{1,32,1,1}}     88.54  62.15  88.89  69.10  62.15  42.71  87.85  79.17  77.43  73.11
{{10,1,2,1},{4,1,2,1},{1,32,1,1}}     86.46  59.72  89.58  67.71  59.72  47.22  87.50  80.21  79.86  73.11
{{16,1,3,1},{3,1,2,1},{1,32,1,1}}     87.15  66.32  87.85  68.40  63.19  45.49  86.46  80.90  78.13  73.77
{{20,1,2,1},{5,1,2,1},{1,32,1,1}}     85.42  62.85  90.63  67.01  60.76  46.53  88.19  79.86  77.08  73.15

Table 4.9 Cross-validation accuracy and corresponding test accuracy results for convolutional hidden nodes in the C2CM architecture. The test accuracy is the result of a 10-model ensemble when the parameters in each row are used.

Cross-Validation Accuracies
CNN Hidden Nodes   Sub 1  Sub 2  Sub 3  Sub 4  Sub 5  Sub 6  Sub 7  Sub 8  Sub 9  Average
{8,8}              87.97  63.47  91.65  61.44  73.72  48.71  91.11  87.62  79.37  76.12
{16,16}            87.08  65.41  92.65  62.02  74.26  49.42  92.47  87.71  80.41  76.83
{32,32}            88.47  66.74  93.86  64.92  76.34  53.72  93.71  89.64  83.54  78.99
{64,64}            89.40  66.67  94.09  66.56  77.07  54.44  93.86  89.84  84.14  79.56
{128,128}          89.58  66.73  93.78  66.78  77.08  55.18  94.74  90.28  85.64  79.98
{256,256}          89.08  67.72  94.09  66.82  76.91  54.95  94.26  90.78  86.12  80.08

Test Accuracies
{8,8}              87.85  62.15  88.19  63.19  58.68  41.67  86.11  80.90  73.26  71.33
{16,16}            85.76  60.07  89.24  66.32  60.42  44.79  86.46  80.56  76.39  72.22
{32,32}            86.46  65.28  89.58  67.01  60.76  46.53  88.19  80.90  77.43  73.57
{64,64}            85.76  63.19  89.58  66.32  62.50  45.83  86.46  80.56  77.43  73.07
{128,128}          86.81  64.58  89.24  67.01  62.85  45.14  88.54  80.21  78.82  73.69
{256,256}          88.54  63.19  90.28  69.79  62.15  46.18  89.58  80.21  79.17  74.34

The selected parameters for the CW-CNN and C2CM architectures can be seen in Tables 4.10 and 4.11 respectively. Selection is based on the highest cross-validation value. The final test accuracy reported in both tables is obtained by averaging an ensemble of 50 model initializations trained on the training data with the selected parameters. For the CW-CNN architecture, most of the subjects select a larger kernel size with a smaller number of hidden nodes, whereas for the C2CM architecture, most subjects select smaller kernel sizes but more hidden nodes, with the channel mixing layer being the only difference between the two architectures. This difference suggests that the channel mixing makes the network wider and thereby increases the number of features in the output of the convolution. In contrast, without the channel mixing, the network places more emphasis on the receptive field rather than on widening the network. With this said, however, we are unable to derive a specific rule for the size of the kernel.


Table 4.10 Classification accuracies for CW-CNN using a 50-model ensemble with the selected parameters.

           Test Results (50 Ensemble)   Selected Kernel Sizes      Selected Hidden Nodes
Subject 1  86.11                        {{20,1,2,1},{5,1,2,1}}     {64,64}
Subject 2  60.76                        {{7,1,3,1},{3,1,3,1}}      {32,32}
Subject 3  86.81                        {{20,1,2,1},{5,1,2,1}}     {32,32}
Subject 4  67.36                        {{20,1,2,1},{5,1,2,1}}     {32,32}
Subject 5  62.50                        {{10,1,2,1},{4,1,2,1}}     {32,32}
Subject 6  45.14                        {{20,1,2,1},{5,1,2,1}}     {8,8}
Subject 7  90.63                        {{20,1,2,1},{5,1,2,1}}     {8,8}
Subject 8  81.25                        {{7,1,3,1},{3,1,3,1}}      {32,32}
Subject 9  77.08                        {{20,1,2,1},{5,1,2,1}}     {32,32}
Average    73.07

Table 4.11 Classification accuracies for C2CM using a 50-model ensemble with the selected parameters.

           Test Results (50 Ensemble)   Selected Kernel Sizes                 Selected Hidden Nodes
Subject 1  87.50                        {{10,1,2,1},{4,1,2,1},{1,32,1,1}}     {256,256}
Subject 2  65.28                        {{4,1,3,1},{3,1,2,1},{1,32,1,1}}      {32,32}
Subject 3  90.28                        {{10,1,2,1},{4,1,2,1},{1,32,1,1}}     {256,256}
Subject 4  66.67                        {{20,1,2,1},{5,1,2,1},{1,32,1,1}}     {256,256}
Subject 5  62.50                        {{20,1,2,1},{5,1,2,1},{1,32,1,1}}     {128,128}
Subject 6  45.49                        {{20,1,2,1},{5,1,2,1},{1,32,1,1}}     {32,32}
Subject 7  89.58                        {{4,1,3,1},{3,1,2,1},{1,32,1,1}}      {256,256}
Subject 8  83.33                        {{4,1,3,1},{3,1,2,1},{1,32,1,1}}      {8,8}
Subject 9  79.51                        {{10,1,3,1},{3,1,2,1},{1,32,1,1}}     {256,256}
Average    74.46

Table 4.12 shows the results for multiple classification methods, including our proposed method using C2CM. The values in parentheses are Cohen's kappa. As shown in the table, our method has superior performance in both kappa and accuracy. The first “FBCSP” uses the SVM classifier on the FBCSP features and is reported from [32]. The second “FBCSP” is based on the FBCSP algorithm results in [59], and “BO” is the Bayesian optimization method proposed in the same paper. “SVM” denotes the use of an SVM on the R1 envelope representation features proposed in Section 3.4.2. Based on the Wilcoxon signed-rank test, the mean accuracy of C2CM is significantly higher, with a bound of p < 0.05, relative to BO, SVM, and CW-CNN. Regarding average kappa, our method is not significantly higher than the method in [43] based on the Wilcoxon test, due to the ranking of the differences between the kappa values of subjects 4 and 6. However, overall, there is an increase in kappa for the other subjects. For the other methods, the increase in mean kappa is significant with a bound of p < 0.05.


Table 4.12 Table of accuracy and kappa for baseline methods and our methods.

           FBCSP           FBCSP [59]   BO [59]   TSSM+SVM [43]   SVM             CW-CNN          C2CM
Subject 1  79.51 (0.676)   76.00        82.12     (0.77)          82.29 (0.764)   86.11 (0.815)   87.50 (0.833)
Subject 2  51.04 (0.417)   56.5         44.86     (0.33)          60.42 (0.472)   60.76 (0.477)   65.28 (0.537)
Subject 3  81.25 (0.745)   81.25        86.60     (0.77)          82.99 (0.773)   86.81 (0.824)   90.28 (0.870)
Subject 4  65.63 (0.481)   61           66.28     (0.51)          72.57 (0.634)   67.36 (0.565)   66.67 (0.556)
Subject 5  54.17 (0.398)   55           48.72     (0.35)          60.07 (0.468)   62.50 (0.500)   62.50 (0.500)
Subject 6  38.54 (0.273)   45.25        53.30     (0.36)          44.10 (0.255)   45.14 (0.269)   45.49 (0.273)
Subject 7  85.07 (0.773)   82.75        72.64     (0.71)          86.11 (0.815)   90.63 (0.875)   89.58 (0.861)
Subject 8  79.86 (0.755)   81.25        82.33     (0.72)          77.08 (0.694)   81.25 (0.750)   83.33 (0.778)
Subject 9  73.96 (0.606)   70.75        76.35     (0.83)          75.00 (0.667)   77.08 (0.694)   79.51 (0.727)
Average    67.67 (0.569)   67.75        68.13     (0.593)         71.18 (0.616)   73.07 (0.641)   74.46¹ (0.659)
¹ Detail of the significance of this method relative to other methods can be seen in the text (page 82).

4.3

Analysis of the Trained Network

To further interpret the learned network weights, and to verify whether the result obtained by the trained network is due to chance or whether the network learns important patterns in the data, we conduct several experiments.

4.3.1

Importance of the Learned Kernel Morphology

As shown in the results of Section 4.2.2, each subject chooses a specific kernel length and a certain number of hidden nodes for the first two layers. Note that the selected parameter values for each subject are based on cross-validation and ensembles. Therefore, the kernel sizes depend on each fold of the data used in cross-validation and on the initializations of the network for each model in the ensemble. This complication, in turn, makes it difficult to determine whether or not the kernel sizes have a meaning from a neuroscientific perspective. Because we are handling brain signals, it would be ideal if we could map the parameters learned by the network to neurological phenomena related to the brain. In this study, we have not conducted any analysis on this topic, but it can potentially be a future research direction.

Nevertheless, in this section, we seek to verify whether the convolutional kernels learned in the first two layers truly matter, or whether their shapes are random and do not contribute to the classification performance. We modify the convolutional kernels such that each convolutional kernel is replaced by its mean value. We adopt the kernel mean because we do not want to disrupt the scale of the network by changing the values of the kernel to unrelated values. After modifying the values and calculating the ensemble accuracy based on the new kernel values, we perform a statistical test of the following hypothesis: the mean accuracy obtained by the modified network is lower than the original accuracy. It is expected that the classification results will drop, but our hypothesis emphasizes whether this decrease is significant or not.

Table 4.13 Analysis of the effect of modifying the kernels of the first and second layer of the trained network. L1 & L2 refer to the kernels of Layer 1 and 2. Each column shows the test accuracy when the kernels of the corresponding layer(s) are modified to be the mean value of the kernel.

Subject     Trained CNN   Mod. (L1)   Mod. (L2)   Mod. (L1+L2)
Subject 1      87.85        84.38       84.72        83.68
Subject 2      65.28        61.81       61.46        57.99
Subject 3      89.58        84.72       86.46        84.03
Subject 4      67.71        61.11       63.54        59.38
Subject 5      63.19        53.47       57.99        53.47
Subject 6      45.49        48.26       43.75        48.26
Subject 7      89.58        90.97       90.63        91.67
Subject 8      81.94        82.99       80.56        81.60
Subject 9      79.51        61.11       60.76        59.38
Average        74.46        69.87¹      69.98¹       68.83¹
¹ All values are significantly lower than the baseline with p ≤ 0.05.

While performing this analysis, we first modify the kernel values of each of the layers independently and eventually change the values for both layers simultaneously. Table 4.13 shows the accuracy values for each subject after the network modification, obtained from an ensemble of 50 networks, the same 50 networks used to derive the test accuracy. The W-score calculated between the original network and the network with the first-layer parameters changed is 6; based on the hypothesis, the maximum value at which the hypothesis is significant with a p-value of 0.05 is 8, showing that the network with the first layer modified performs worse than the original network. Modification of both layers simultaneously shows an even larger decrease relative to the original network, demonstrating that the morphology captured by the kernels of the first two layers is important.
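The kernel modification itself is straightforward; below is a minimal PyTorch sketch, assuming a trained model whose convolutional layers can be addressed by name (the name 'features.0' is purely illustrative, and averaging over the spatial kernel dimensions is one possible interpretation of "replaced by its mean value").

```python
import torch
import torch.nn as nn

@torch.no_grad()
def replace_kernels_with_mean(model, layer_names):
    """Replace every convolutional kernel of the named layers by its own mean value."""
    for name, module in model.named_modules():
        if name in layer_names and isinstance(module, nn.Conv2d):
            w = module.weight                           # shape: (out_maps, in_maps, kH, kW)
            mean = w.mean(dim=(2, 3), keepdim=True)     # one scalar per (output, input) kernel
            module.weight.copy_(mean.expand_as(w))

# e.g. flatten the first-layer kernels of one ensemble member and recompute its test accuracy:
# replace_kernels_with_mean(trained_model, layer_names={'features.0'})
```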


4.3.2

Qualitative Analysis of Kernel Shapes

After training the network, we also want to gain an understanding of what the network has learned and, if possible, obtain a visualization of the learned kernels. In image processing, methods such as deconvolution [165], back-propagation-based visualization [166, 167], and layer-wise relevance propagation (LRP) [168] have been used to interpret trained networks, with LRP recently being used for EEG analysis as well [169]. Here we choose the back-propagation methods proposed in [166]. In this group of methods, an initial image is fed to the network and, based on the desired activation of any specific node in the network, the initial image is recursively changed to match the activation at that node. This recursive algorithm can be accompanied by a smoothing function to remove high-frequency noise during the recursion. From a mathematical point of view, the following optimization problem is solved:

x∗ = arg max_x ( a_i(x) − R_θ(x) )        (4.2)

In this equation, x∗ is the desired input that will lead to the desired activation feature map a_i(x). The term R_θ(x) is a regularization term imposed on the desired input. In order to solve the problem in Equation 4.2, the following recursive solution based on gradient descent is used:

x ← r_θ ( x + η ∂a_i/∂x )        (4.3)

Here, η can be considered the learning rate and r_θ a regularization function, where θ denotes the parameters of the function. In [166], four different methods for regularization are discussed. As an example, to penalize the high-frequency information which may arise from using pure gradient descent, a Gaussian blur function can be used. Now that we have the recursive solution, to use Equation 4.3 we must first determine the following: the initial point for optimization (x_0), the activation feature map (a_i), and the regularization function (r_θ). In our case, we initialize the input with the average of all training examples and then recursively change the input in such a way that its classification label vector (i.e., the network output layer) corresponds to class C. The result of the recursion is an input that, when fed to the network, results in class C. In other words, this perceived input is what the network recognizes as class C. We use the Gaussian blur as our regularization function.

Figures 4.4, 4.5, 4.6, and 4.7 show a sample of the average signals for all four classes of subject 9 (blue solid) and the perceived signals for the classes for a number of channels that have a high, significant correlation (the correlation, r, and p-values, p, are provided). This reconstruction is based on the network that has the highest classification accuracy on the test data. By visually inspecting the graphs, in most cases the perceived input signal follows the average signal. It should be noted that the algorithm converges and the perceived input correlates with each of the classes. Figure 4.8 shows the high positive correlation coefficient values (r > 0.5) with a p-value lower than 0.05 for all feature channels of subject 9 in each of the 4 classes. The dashed vertical lines correspond to each of the one-vs-rest classes in the CSP algorithm. The figure illustrates a high, significant correlation only in certain channels for each class, implying that the perceived signals and average signals are similar only in these channels for these classes. Figure 4.9 shows the signals sorted based on the p-value for the original average channel (left) and the perceived input (right), arranged vertically from class 1 to class 4, validating the similarity between the original data and the perceived input in some, but not all, channels. Further analysis must be conducted to verify whether the similarity between the perceived and average inputs in specific feature channels implies the importance of those channels or not.
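A minimal sketch of this perceived-input reconstruction is given below: starting from the average training trial, the input is repeatedly moved up the gradient of the target class activation and then smoothed with a Gaussian blur acting as r_θ. The step size, blur width and iteration count are illustrative assumptions, and the blur is applied over all axes only to keep the sketch short.

```python
import torch
from scipy.ndimage import gaussian_filter

def perceived_input(model, x_avg, target_class, steps=200, lr=0.1, sigma=0.5):
    """Gradient-based reconstruction of the input the network 'perceives' as target_class."""
    x = x_avg.clone()                              # x_avg: average training trial, shape (1, ...)
    for _ in range(steps):
        x.requires_grad_(True)
        score = model(x)[0, target_class]          # a_i(x): output-layer activation of the class
        score.backward()
        with torch.no_grad():
            x = x + lr * x.grad                    # move the input towards higher class activation
        # Gaussian blur acts as the regularizer r_theta
        x = torch.from_numpy(gaussian_filter(x.detach().numpy(), sigma=sigma)).float()
    return x
```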


Fig. 4.4. Sample signal visualization for network perceived input for significantly correlated channels from subject 9 and class 1. The blue (solid) signal is the average signal and the red (dash) signal is the perceived signal from the network. Only signals with significant high positive correlation (p < 0.05, r > 0.5) have been shown. The correlation and p-value have been written in the image. The x-axis represents time and y-axis is amplitude. The y-axis has been adjusted for each subplot for better visualization.

Fig. 4.5. Sample signal visualization for network perceived input for significantly correlated channels from subject 9 and class 2 (description similar to Figure 4.4).


Fig. 4.6. Sample signal visualization for network perceived input for significantly correlated channels from subject 9 and class 3 (Description similar to Figure 4.4).

Fig. 4.7. Sample signal visualization for network perceived input for significantly correlated channels from subject 9 and class 4 (Description similar to Figure 4.4).


Fig. 4.8. Correlation between perceived signal and average signal in class 1 for subject 9. The vertical dashed lines show the channels belonging to the one-vs-rest CSP channels for each class. Only high positive correlations (r > 0.5) and significant correlations (p < 0.05) have been shown.

Fig. 4.9. Sample visualization for average class signals and network perceived input for all channels sorted based on p-value for subject 9. Left column: average input for each class. Middle column: perceived input for each class. Right column: correlation p-value between average and perceived classes.


4.4

Discussion & Conclusion

The results in Section 4.2.2 have identified two important facts regarding the application of deep learning in EEG. First, the representation of the signal fed into the deep learning framework is important: previous methods based solely on energy values neglect valuable temporal information. When considering the new representation, the way the information is processed must be updated as well because new representations need new processing methods. This demand for new processing and classification tools leads to the second important fact: deep learning methods can be used in the context of EEG signal classification and can yield superior results relative to other methods such as SVM and MLP. Our analysis shows that our results are not from random matrix multiplications, and the network is indeed learning something from the input EEG data, proving that the network’s architecture is meaningful for EEG. Visualizing the architecture further verifies that the network has learned important relationships from the data and can construct a perceived input which is similar to the original data. To date, the classification accuracy results produced in this Chapter are the state-of-the-art results for multi-class classification on the BCI competition IV-2a dataset, the baseline dataset for multi-class MI-BCI.

4.5

Guideline

Based on the contents of this chapter, we propose the following guideline for the classification of motor imagery BCI data:
1. Apply spatial and spectral filtering in conjunction with signal processing algorithms to the EEG data in order to enhance the EEG representation and lower the initial dimensionality of the signal. In this chapter, we have utilized the FBCSP algorithm, which incorporates both spectral and spatial filtering, and used envelope extraction and down-sampling to reduce the dimensionality of the signal. Other methods for spectral and spatial filtering can also be used.
2. Design the architecture based on the input. In this study, although we chose to experiment with two types of network architectures, CW-CNN and C2CM, we recommend using the C2CM architecture because of the computation on both dimensions of the input. Furthermore, the output of the convolutional layers (before the fully-connected layers) is lower in dimension and leads to fewer connections in the fully-connected layer.
3. Optimize the parameters using cross-validation. Almost all parameters can be optimized via cross-validation, but it is crucial to identify which values have more effect on the classification output and also lead to a better interpretation of the network. We did not exhaustively search through all of the parameters to find the best ones, but this can be a good direction for future work.
4. Use ensembles to boost performance. Each network can be initialized with different values, and this results in an unlimited number of configurations for the initial state of the network. Using ensembles helps by averaging inference over many networks and boosting the performance (a minimal sketch is given below).
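As a concrete example of step 4, a minimal sketch of ensemble inference is shown below; `models` is assumed to be a list of independently initialized and trained networks that output log-probabilities (as with the LogSoftMax outputs used in this chapter), and the commented training line is purely hypothetical.

```python
import torch

def ensemble_predict(models, x):
    """Average the class probabilities of all ensemble members and pick the best class."""
    with torch.no_grad():
        probs = torch.stack([m(x).exp() for m in models]).mean(dim=0)  # exp() converts log-probs
    return probs.argmax(dim=1)

# models = [train_one(seed) for seed in range(50)]   # hypothetical per-initialization training loop
# predictions = ensemble_predict(models, test_trials)
```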


Chapter 5
Deep Transfer Learning for Subject-to-Subject Transfer

In the previous chapter, we mainly focused on the session-to-session variability problem by transferring knowledge from one session to a new, unseen session. In this chapter, we focus on solving the subject-to-subject transferability problem using deep learning methodology. This chapter's contents are related to the results published in our paper [170]. The extended content is being prepared for a journal publication.

5.1

Introduction to Transfer Learning

Transfer learning is a branch of machine learning that focuses on transferring knowledge learned in one domain, on a specific task or tasks, to another domain with another task. Mathematically speaking, assuming we have a source domain (D_S), a task in this domain (T_S), a target domain (D_T), and a task in the target domain (T_T), transfer learning improves the decision or regression problem in the target domain (f_{T_T}(·)) by using information from the source domain [171]. This definition of transfer learning is abstract; depending on the relationship between the source and target domains, the annotation availability in the two domains, and the relationship between the tasks, it can be categorized into different groups [171].

In traditional machine learning algorithms, the source and target domain tasks are usually the same, and if a classifier is well trained and generalized on the source domain data, the classification accuracy is high when the classifier is transferred to the test domain. Inductive Transfer Learning refers to the case where the source and target domains are the same but the tasks are different yet related. The emphasis on being related must be made because some tasks are fundamentally distinct from one another. A typical example of inductive learning would be a deep neural network which has been trained for image classification and is to be transferred to another task such as object detection or image segmentation. Transductive Transfer Learning is the opposite of its inductive counterpart and focuses on transferring the knowledge from the source domain and task to a target domain in which the task is the same but the data is different. In this case, it is expected that some data samples from the target domain exist so that the classifier can be fine-tuned on the new data. Lastly, Unsupervised Transfer Learning is a case where the target and source domains are different but related and no labels are provided for either domain. A chart of the different categories and methods can be seen in Figure 5.1.

Almost all methods of transfer learning fall into the three categories of transductive, inductive or unsupervised. Other subcategories of transfer learning are defined based on the availability of annotations and the number of data samples available, and can be considered subcategories of the three main methods. Zero-shot learning is a subcategory of transductive transfer learning in which there are no labels in the target dataset, and only some information regarding the relationship of the source and target domains is available. One-shot learning is a case where only the annotation of one sample from the target domain is known, and there is no extra information [172].


Fig. 5.1. The different categories of transfer learning, based on the survey in [171].

Domain adaptation and covariate shift are two of the more popular methods for transfer learning, with both being transductive [171]. In domain adaptation, the assumption is that the probability distributions of the source and target data are different but similar. In this case, data from the source domain is adapted to the target domain (which may or may not be labeled). In covariate shift, similar to domain adaptation, the distribution of the data differs between domains; however, the class conditional probability is similar, P_S(Y|X) ∼ P_T(Y|X). To handle this situation, the parameters of the models are tweaked to fit the distributions.

In conclusion, transfer learning plays a major role in solving problems that may not be solvable conventionally due to existing constraints or limitations. In some situations, the amount of data in the target domain is not enough, and therefore, if the data of the target domain were to be used for training a model, it could be prone to overfitting and under-generalization. In some cases, the annotations in the target data are insufficient or absent. In other applications, the main reason for using transfer learning is to speed up the training process, because obtaining new labeled or unlabeled data is costly. For all of these reasons, transfer learning is a promising direction for solving many real-world problems.

5.1.1

Transfer Learning in BCI Systems

Many of the categories of transfer learning have also been used in BCI research. Covariate shift adaptation [173] has been utilized in a Linear Discriminant Analysis (LDA) model to adapt the bias to the target domain in session-to-session transfer. Data space adaptation [57, 46] has been used to transfer the data distribution of the target domain (new session) to the source domain (older session or other subjects). Spatial filter adaptation [174] has been proposed to adapt the spatial filters from the CSP algorithm to the target domain's data, and in [175] the adaptation is made by using a tensor decomposition model. In many of the above algorithms, a small labeled subset of the target domain is available. In [49], both transductive and adaptive learning approaches have been used for changing or estimating the bias of the classifier model. For an in-depth survey, we refer to Lotte [176], who has referenced many of the transfer learning methods used for reducing calibration time. In most of the cases stated above, the target domain is a new session recorded from the same subject and usually a small amount of labeled data is available. However, transfer learning for subject-to-subject transfer must take another approach. Challenges in solving the subject-to-subject problem in BCI systems are mainly attributed to anatomical differences between the subjects [44] or statistical variations in the data [177]. As discussed in a survey by Jayaram et al. [177], there are mainly two approaches to transfer learning, and more specifically to subject-to-subject transfer, in BCI systems: domain adaptation (DA) and rule adaptation (RA). In DA, the solution proposed for transferring knowledge between subjects is to bring their data into a common space. These algorithms have mainly focused on finding spatial filters that are common among subjects [178, 179, 180, 46] and have been the go-to method for transfer learning. In RA, the classifier is changed based on a distribution of classifiers between the subjects [181]. Reiterating some examples from the literature review in Chapter 1, sensor space transfer [44] has been proposed, which focuses on transferring the data space of multiple subjects from the electrodes on the scalp to the sources in the brain by estimating their relationship. This approach can be viewed as a DA solution. Composite CSP [45] considers a weighted sum of covariance matrices estimated from multiple subjects as a way of transferring knowledge to a new subject. This method can also be viewed as a DA solution. The number of papers focusing on the RA approach is quite limited.

5.2

Motivation & Proposed Method

We believed that by using a deep neural network model trained on data from multiple subjects and sessions, common information could be learned from all subjects, and this common knowledge could be transferred to new subjects without the need to re-train the network entirely (fine-tuning). The existence of such a network would lower the time required to train large networks by only focusing on training parts of the network, allowing the network to be used immediately for fast deployment or, if needed, online adaptation. Also, by having a trained network, the number of data samples used to fine-tune the network can be reduced, thus reducing the time needed to collect recordings from subjects. To our knowledge, this is one of the first attempts to use a neural network-based method for transfer learning in BCI.
We have taken an RA-based approach to the problem of classifying motor imagery EEG signals: rather than bringing the data to a common space, we train a neural network model that captures information from multiple subjects and stores the information in the parameters of the network. This network can be trained using a generative (auto-encoder) and discriminative (classifier) process. A depiction of a sample transfer learning pipeline can be seen in Figure 5.2.

Fig. 5.2. The sample transfer learning pipeline. In step 1, a generative and discriminative model is first trained on a pool of subjects and the encoder weights are then transferred. A classifier is then trained on the outputs of the encoder in step 2.

In this study, we have tackled the problem of "classifying unknown trials" using different scenarios of training data availability in the target domain and use transfer learning algorithms to improve the results. These scenarios are:
• Only a small sample of labeled training data in hand (Small Sample).
• A large amount of labeled data in the target domain (Full Data).
A depiction of the different scenarios can be seen in Figure 5.3. One other scenario that can be seen in Figure 5.3 is semi-supervised learning. Although this scenario of data availability is highly common in image research, in BCI it is rarely the case that we have unlabeled data that has been recorded and corresponds to a task. Therefore, we will not be focusing on this topic in this study. In general, in this study, transferring is done by using data from other subjects to train a generalized network architecture and transferring some of the trained network's parameters to be utilized for the new subject's data. How we prepare the data for training is described in Section 5.3. An auto-encoder has been chosen as the main network to be trained on the multiple subjects. This auto-encoder is also accompanied by a classifier so that the model is both generative and discriminative. Details regarding the network and training can be seen in Section 5.4.

Fig. 5.3. Different transfer learning scenarios. In the first (left), we assume only a small amount of labeled data is available in the target domain. In the second (middle), in addition to labeled data, unlabeled information is also provided. In the third scenario, a large amount of data is available in the target domain.

5.3

Data Preparation

To avoid confusion, we define "the subject pool" as the union of the data from all subjects excluding the subject in the target domain. The "subject pool" in the context of transfer learning is the source domain. "New subject" refers to the subject in the target domain to which information from the subject pool will be transferred. We use the BCI competition IV-2a data as our dataset. In the original competition, the data is divided into train and test data for all subjects. These two datasets are recordings from the same subject but in different sessions. In the current study, when training the network on the subject pool, instead of dividing the data into train and test (evaluation) sessions, we do the following:
• Assume the two sessions are independent sessions recorded from a single subject (S1 & S2).
• Extract the novel representation proposed in Chapter 3 for each of the two sessions and also from the union of the two sessions. This augmentation of the data results in a large pool of data for each subject. Because of the nature of the FBCSP algorithm, the estimated spatial filters for the individual sessions and for their union will be different, producing a variety of signals which are similar in labels but different in content. We hope that with this augmentation, a more generalized network will be trained.
• When transferring the trained network to the new subject, the network can be trained on all subjects in the current subject pool or a subset of it (e.g., subjects that have high performance).
To evaluate the transfer learning paradigm for the scenarios which use a subset of the data, we randomly sample the subsets 10 times. We have chosen the number of data samples per subset to be 5, 10, 20, and 40 samples per class. Results will be provided for all sample sizes.

5.4

Network Architecture for Deep Transfer Learning

For designing the network, we have considered several design aspects:
1. Unlike the network in Chapter 4, we have decided to select fixed values for the kernel widths and the depth, so there is not much room for optimizing their sizes. For this reason, we have chosen the architecture in Table 5.1.

Table 5.1 The parameters of the architecture used for transfer learning in this chapter.

Layer Type    Patch size / Stride   Input Size   Hidden Units   Parameters
Convolution   1 × 4 / 1 × 2         32 × 40      32             160
ReLU          -                     32 × 19      -              0
Convolution   1 × 3 / 1 × 2         32 × 19      32             3,104
ReLU          -                     32 × 9       -              0
Convolution   1 × 3 / 1 × 2         32 × 9       32             3,104
ReLU          -                     32 × 4       -              0
Convolution   32 × 1 / 1 × 2        32 × 4       128            131,104
Linear        -                     512          4              2,052

As can be seen in the table, the kernel sizes are smaller and the network is deeper relative to the architectures trained in Chapter 4.
2. We have removed batch normalization entirely from the network architecture. Batch normalization stores information regarding the mean and variance of the activations in each layer. When there is a mismatch between the source and target domains, these stored statistics can lead to an unwanted shift in the data and therefore a poor representation of the data after the convolution stages.
For training the auto-encoder network, as mentioned, we train in a generative-discriminative manner, which means that a convolutional auto-encoder and a classification model are trained simultaneously. If the encoder network is designated as f_enc, the decoder as f_dec, and the classifier as f_class, then for an input X with label y we have the following loss function:

L_total = (1 − α) L_Gen{f_dec(f_enc(X)), X} + α L_Dis{f_class(f_enc(X)), y}        (5.1)

In this equation, α is the parameter that specifies the weight given to the generative or discriminative loss. In our experiments, it is set to 0.5. Furthermore, the classifier function f_class is the last linear layer in Table 5.1, and the remaining layers of the table form the encoder. L_Gen is chosen to be the MSE criterion and L_Dis is the cross-entropy criterion.
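A minimal sketch of this joint objective in PyTorch is given below, assuming three modules f_enc, f_dec and f_class with compatible shapes; these module names are illustrative stand-ins for the encoder, decoder and the final linear layer of Table 5.1.

```python
import torch.nn as nn

mse = nn.MSELoss()            # L_Gen: reconstruction criterion
ce  = nn.CrossEntropyLoss()   # L_Dis: classification criterion
alpha = 0.5                   # weighting between the two losses, as used in our experiments

def total_loss(f_enc, f_dec, f_class, X, y):
    z = f_enc(X)                           # shared latent representation
    loss_gen = mse(f_dec(z), X)            # reconstruct the input from the code
    loss_dis = ce(f_class(z), y)           # classify from the same code
    return (1 - alpha) * loss_gen + alpha * loss_dis
```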

Based on the values in Table 5.1, the network can be broken up into two main sub-networks: the temporal sub-network and the channel sub-network. The temporal sub-network, which consists of three layers, can be viewed as a nonlinear filtering network which acts on the input regardless of the orientation of the channels. In the context of learning from the subject pool, it can be viewed as capturing all possible signal morphologies from the subjects in the pool. As for the channel sub-network, which is a single layer mixing all the channels, the aim is to learn different mixtures of the channels which, first, can reconstruct the input properly and, second, can capture different discriminative mixtures from the subjects in the pool. An overall number of 5 auto-encoder networks with different initializations are trained over 9 different subject pools (leave-one-subject-out). Each of the networks is transferred separately and fine-tuned/re-trained on the new subject's data. During inference, the ensemble of these networks is used, and the predicted outcome is given based on the median of the prediction of each class over all 5 networks. In addition to the results of training the model on a subject pool of 9 subjects, we also provide the results of transferring from a single subject pool to a new subject. By looking at the accuracy on the training set of the new subject without fine-tuning, we hope to identify which subject's data pool is similar to the new subject.

5.5

Results

In this section, we will present the results for the different scenarios mentioned in Section 5.2.


5.5.1

Subject Pool to Small Sample Transfer

In this scenario, after training the auto-encoder, f_enc and f_dec, on the subject pool, the encoder is frozen and a new classifier function, f*_class, is re-trained on the data in the target domain without any generative loss function (purely discriminative). To evaluate the transfer learning pipeline, we choose as baselines an SVM classifier trained on the small data samples, and a randomly initialized network of which only the last layer is trained on the data samples (CNN-R). We present as results the average classification accuracy over 10 random data samples selected from the original train data. The results can be seen in Figure 5.4 and Table 5.2. In the table and figure, CNN-R is a randomly initialized network for which only the last linear layer is trained; we compare it with our proposed network to show whether the training of the network matters or not. CNN-T uses the proposed transfer learning algorithm.
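A minimal sketch of the CNN-T step is shown below: the transferred encoder is frozen and only a new last linear layer is fitted on the target subject's small labelled sample. The encoder is assumed to produce 512-dimensional features after flattening (as in Table 5.1); names, epoch count and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fit_new_last_layer(encoder, X, y, n_classes=4, epochs=200, lr=1e-3):
    for p in encoder.parameters():            # freeze the transferred encoder
        p.requires_grad = False
    encoder.eval()

    classifier = nn.Linear(512, n_classes)    # the only trainable part: the new last layer
    opt = torch.optim.Adam(classifier.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        feats = encoder(X).flatten(1)         # features from the frozen encoder
        loss = loss_fn(classifier(feats), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return classifier
```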

Fig. 5.4. Results of the average classification accuracy for different classifiers and sample sizes for the proposed transfer learning pipeline. SVM uses the FBCSP features for classification without any transfer. CNN-R randomly initializes a network and only trains the last layer. CNN-T transfers the network from a pool of subjects and only re-trains the last layer.


Table 5.2 Average classification accuracy results for all subjects using 3 classifiers and 4 different sample sizes. The results are the average over 10 random samples selected from the original train data. SVM is classification using a linear support vector machine classifier. CNN-R is a random initialization of the network with training done only on the last linear layer. CNN-T uses transfer learning as a way of increasing the performance of the classification.


             5 Samples                 10 Samples                20 Samples                40 Samples
             SVM    CNN-R  CNN-T       SVM    CNN-R  CNN-T       SVM    CNN-R  CNN-T       SVM    CNN-R  CNN-T
Subject 1    42.67  51.94  55.94       54.24  61.94  65.38       64.20  71.67  75.80       75.31  72.85  80.56
Subject 2    28.58  34.30  34.83       33.19  39.62  40.14       34.83  46.53  49.44       42.57  53.30  57.05
Subject 3    54.41  56.53  59.34       67.15  65.42  68.58       73.23  71.98  80.14       79.31  72.33  82.85
Subject 4    33.82  37.88  38.06       40.07  40.94  43.06       43.68  47.67  49.93       55.66  54.41  56.18
Subject 5    30.76  29.76  30.90       35.76  33.47  34.97       40.45  40.07  45.56       47.36  43.71  53.89
Subject 6    28.02  30.59  31.63       31.11  31.01  33.19       33.92  36.04  36.60       38.82  41.11  42.85
Subject 7    43.68  44.62  49.41       55.00  51.39  61.01       69.76  60.10  73.23       78.89  68.72  81.49
Subject 8    42.26  44.48  50.14       58.54  52.26  58.47       69.13  65.62  72.47       76.94  70.45  77.99
Subject 9    55.83  45.97  49.93       61.42  58.51  57.81       62.67  65.97  66.88       67.15  72.05  75.42
Average      40.00  41.79  44.46¹      48.50  48.28  51.40²      54.65  56.18  61.11¹      62.45  60.99  67.58¹

¹ The value is significantly higher than both the CNN-R and SVM counterparts with p ≤ 0.05.
² The value is significantly higher than only the CNN-R counterpart with p ≤ 0.05.

As can be seen in Table 5.2, there is a significant difference between the SVM method and the CNN methods. The gap between SVM and the CNN methods at the small sample sizes reflects how much the representation and the network architecture benefit the classification. By comparing CNN-R with the transferred network (CNN-T), we see that a random convolutional feature extractor, although it contains information that can be used for classification, still needs to be trained. This result also confirms that the CNN-T network learns valuable information from the inputs that cannot be obtained by relying on a random network alone.

5.5.2

Subject Pool to Large Sample Transfer

In this scenario, similar to the previous one, we train a discriminative convolutional auto-encoder on the subject pool and then transfer the learned parameters to the new subject's data. The main difference here is the new subject's sample size: the full training set is now available. We again compare our proposed pipeline with SVM, and, to allow comparison with the results reported in [32], we also report the results using Cohen's kappa. The results can be seen in Table 5.3. They show a significant increase in kappa from the SVM to the CNN methods. These results are interesting when compared to the ones reported in Section 4.2.2: using a more generalized and deeper network, without cross-validation for choosing the parameters, and with a smaller ensemble size, we have managed to reach similar results. This shows that a convolutional auto-encoder pre-trained on data from a pool of subjects helps to improve the overall performance of the network.
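For reference, Cohen's kappa measures chance-corrected agreement; for a balanced four-class problem the chance level is 0.25, so kappa is approximately (accuracy − 0.25)/0.75 (e.g., an accuracy of 72.22% corresponds to a kappa close to 0.63). The helper below computes the general form from a confusion matrix and is a generic sketch, not the exact evaluation code used to produce Table 5.3.

    import numpy as np

    def cohens_kappa(confusion):
        """Cohen's kappa from a square confusion matrix (rows: true labels, columns: predictions)."""
        confusion = np.asarray(confusion, dtype=float)
        n = confusion.sum()
        p_o = np.trace(confusion) / n                                           # observed agreement (accuracy)
        p_e = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / n ** 2    # agreement expected by chance
        return (p_o - p_e) / (1 - p_e)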

5.5.3

Single Subject Pool to Subject Transfer

The results in this section are different from the previous two sections because, instead of using multiple subjects, we use only one subject in the pool while still applying the data preparation method of combining the sessions to generate more data.

Table 5.3 Classification results (Cohen's kappa, with accuracy in parentheses) for multi-subject transfer with full use of the new subject's data. SVM uses the FBCSP features for classification without any transfer. CNN-R randomly initializes a network and only trains the last layer. CNN-T transfers the network from a pool of subjects and only re-trains the last layer.

            SVM      CNN-R            CNN-T
Subject 1   0.676    0.699 (77.43)    0.787 (84.03)
Subject 2   0.417    0.482 (61.11)    0.509 (63.19)
Subject 3   0.745    0.620 (71.53)    0.768 (82.64)
Subject 4   0.481    0.454 (59.03)    0.528 (64.58)
Subject 5   0.398    0.357 (51.74)    0.514 (63.54)
Subject 6   0.273    0.218 (41.32)    0.292 (46.88)
Subject 7   0.773    0.546 (65.97)    0.782 (83.68)
Subject 8   0.755    0.667 (75.00)    0.759 (81.94)
Subject 9   0.606    0.690 (76.74)    0.727 (79.51)
Average     0.569    0.526 (64.43)    0.630 (72.22)¹

¹ The value is significantly higher than the other average values with p ≤ 0.01.

There are two results presented in this section. First, the classification accuracy on the training data of the new subject using a network trained only on the single-subject pool, without fine-tuning (Table 5.4). Second, the classification accuracy on the test data of the new subject after the last layer of the transferred model has been fine-tuned on the new subject's training data (Table 5.5). We present the results in Table 5.4 to see whether good results on the training data without fine-tuning correlate with good results on the test data after fine-tuning; the results in this table are produced on the full training dataset. For clarification, the rows of the tables refer to the models that have been transferred, and the columns are the results on the new subject. Both tables show that, for almost every new subject, a group of single-subject pools performs well when their models are transferred.


Table 5.4 Accuracy results on the new subject's train data without re-training the model trained on the single subject pool. The columns are the new subjects and the rows are the single subject pools.

Pool    S1      S2      S3      S4      S5      S6      S7      S8      S9
S1       —     37.78   79.26   45.80   48.86   44.75   59.41   67.42   51.48
S2      64.10    —     70.00   62.21   75.19   59.82   77.49   75.38   65.82
S3      65.57   60.00    —     63.36   68.32   56.16   90.77   76.14   73.00
S4      74.36   70.00   82.96    —     76.72   68.49   89.30   76.89   71.73
S5      80.59   68.89   79.63   69.47    —     66.21   85.61   83.71   73.00
S6      77.29   78.52   84.07   71.37   79.77    —     88.93   87.12   74.68
S7      74.36   63.33   85.19   73.28   77.86   63.93    —     88.26   77.64
S8      81.32   65.93   84.81   59.16   76.34   60.27   88.19    —     67.09
S9      78.39   58.15   85.19   65.27   63.74   57.53   83.39   85.23    —

Table 5.5 Accuracy results on the new subject's test data with re-training the last layer of the model trained on the single subject pool. The columns are the new subjects and the rows are the single subject pools.

Pool    S1      S2      S3      S4      S5      S6      S7      S8      S9
S1       —     53.47   72.22   57.29   63.54   46.18   75.00   81.94   72.22
S2      75.69    —     75.69   64.24   59.38   43.40   79.51   73.61   71.88
S3      84.72   58.68    —     57.99   60.42   46.18   83.33   78.82   77.43
S4      84.03   59.72   79.17    —     63.19   46.88   80.56   81.25   72.22
S5      84.03   58.33   77.08   62.15    —     46.18   81.25   80.90   78.82
S6      82.99   60.76   75.69   63.19   60.76    —     83.68   83.33   70.49
S7      80.56   59.03   78.13   64.93   60.42   45.14    —     79.51   77.08
S8      82.29   56.94   73.61   59.03   59.38   45.14   81.94    —     77.43
S9      84.72   59.38   77.43   57.29   61.81   46.53   84.03   80.56    —

As for correlation, considering a tolerance of 2% around the maximum value in Table 5.4, five of the subjects show agreement between the two tables, indicating that it is possible to identify the best single-subject pool for a new subject. This information can later be used to build better ensemble models or to select a pre-trained model that combines the best subjects.
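A minimal sketch of the selection rule implied by Tables 5.4 and 5.5 is shown below: each candidate pool model is scored on the new subject's training data without fine-tuning, and the highest-scoring model is chosen. The assumption that each model maps raw input trials directly to class scores is made for illustration only.

    import torch

    def select_best_pool_model(pool_models, X_train, y_train):
        """Rank pre-trained single-subject-pool models by accuracy on the new
        subject's training data, evaluated without any fine-tuning."""
        accuracies = []
        with torch.no_grad():
            for model in pool_models:
                preds = model(X_train).argmax(dim=1)
                accuracies.append((preds == y_train).float().mean().item())
        best_index = max(range(len(pool_models)), key=lambda i: accuracies[i])
        return best_index, accuracies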

5.6

Conclusion & Guideline

Using a pool of subjects, we have trained a deep convolutional auto-encoder which can be transferred to new subjects and boost classification performance when fine-tuned. The convolutional kernel sizes have been selected


so that there is no need for cross-validation or heuristic value selection. We have considered different scenarios of data availability for the new subject and have seen an improvement in performance in all of them relative to the baseline SVM method. Furthermore, from a data-size point of view, when the number of available samples is very low the network will almost surely overfit, so using a model that has been trained on more data samples is better. Based on the findings of this chapter, and by comparing the results with the previous chapter, we propose the following guideline for designing and training a network architecture suitable for transferring to new subjects:

1. Perform the pre-processing stages on the input as in Chapter 4. Even when the subjects in the pool have distinct characteristics in their data, we speculate that after pre-processing (spectral filtering, spatial filtering, signal processing) the data lie, to some extent, in a common space, especially if the processing is subject-dependent. This can be an interesting direction for future research.

2. Design a generalized network to learn from multiple subjects. As stated in this chapter, the network should leave minimal room for the parameters to change and must not contain layers that store information specific to the pool of subjects. For example, batch normalization stores the characteristics of the activations, so when the network is transferred to a new subject this layer further increases the mismatch between the source and target domains.

3. Train the network in a discriminative and generative fashion. An auto-encoder with an auxiliary discriminative loss function is one way to perform this task.

Chapter 6

Contributions & Future Work

6.1

Contributions

In this thesis, we have focused on solutions for challenges presented in MI-EEG classification for BCI systems. We have contributed the following methodologies:

• We have proposed a new representation of the EEG data using a combination of FBCSP and envelope extraction. To our knowledge, up until the time of writing this thesis, this method had not been used by any research group. Furthermore, this new representation, combined with a CNN, has been shown to give superior results relative to when common energy features are used for classification.

• Two novel architectures have been proposed which utilize the envelope representations and yield superior results relative to previous attempts to classify MI-EEG signals. These architectures were optimized using cross-validation to find the best combination of parameters. The visualization of the networks has also confirmed that they learn important structures in the data, which can be key in understanding the nature of the data.

• Using DL methods, we have also reduced the time and data needed to classify new data from unseen subjects by transferring networks trained on previously recorded data. This result shows the potential of the network to, first, learn from other subjects and, second, use a small number of samples to estimate the FBCSP features and still obtain significantly higher results than a simple classifier such as an SVM trained on the same small number of samples.

6.2

Publications

From the above contributions, we have published the following papers:

1. Yang, Huijuan, Siavash Sakhavi, Kai Keng Ang, and Cuntai Guan. "On the use of convolutional neural networks and augmented CSP features for multi-class motor imagery of EEG signals classification." In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC).

2. Sakhavi, Siavash, Cuntai Guan, and Shuicheng Yan. "Parallel convolutional-linear neural network for motor imagery classification." In 2015 23rd European Signal Processing Conference (EUSIPCO).

3. Jie, Zequn, Wen Feng Lu, Siavash Sakhavi, Yunchao Wei, Eng Hock Francis Tay, and Shuicheng Yan. "Object proposal generation with fully convolutional networks." In IEEE Transactions on Circuits and Systems for Video Technology (2016).

4. Sakhavi, Siavash, Cuntai Guan, and Shuicheng Yan. "Learning temporal information for brain-computer interface using convolutional neural networks." In IEEE Transactions on Neural Networks and Learning Systems (under third revision).

5. Sakhavi, Siavash, and Cuntai Guan. "Convolutional Neural Network-based Transfer Learning and Knowledge Distillation using Multi-Subject Data in Motor Imagery BCI." In 2017 8th International IEEE EMBS Conference on Neural Engineering.

6.3

Future Work

In this section, we propose some research directions that can be considered in future designs of architectures and learning methods for MI-EEG or, more generally, for all types of EEG classification and analysis.

6.3.1

Neural Network Integration of the Preprocessing

In our view, one of the ultimate goals for MI-EEG classification is designing an end-to-end system that can learn spatial filters, identify significant frequency bands, select the most prominent features, and classify, all in one forward pass. The significance of such a system would be the reduction in time and resources needed to train the network. The following are some of the caveats of the method proposed in this thesis and possible solutions:

• A time-consuming step in the FBCSP method is the calculation of the filter-bank outputs. This step is usually done separately for each band, but the filter bank is inherently parallel and could be implemented using a CNN, which would speed up the system significantly, especially when fast online processing of the EEG is needed (a rough sketch of such a parallel filter bank is given after this list).

• In the current pipeline, mutual information-based feature selection is used to select envelope features based on energy features. Because the selection is based on the energy features, the selection method is not optimized for the proposed envelope features. If the selection method could be incorporated into the preprocessing stages or inside the network itself, it would save time and be more data-driven.

• Envelope extraction is a non-linear process that depends heavily on the calculation of the analytic version of the signal. Replacing this function with a neural network could make integration easier.
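The sketch below illustrates the first point: a fixed filter bank can be expressed as a single parallel 1-D convolution whose kernels are band-pass FIR filters, so all bands are computed in one pass. The sampling rate, band edges, and filter order are assumed values for illustration, not the settings of the FBCSP pipeline used in this thesis.

    import numpy as np
    import torch
    import torch.nn.functional as F
    from scipy.signal import firwin

    fs, numtaps = 250, 65                              # assumed sampling rate (Hz) and FIR order
    bands = [(4, 8), (8, 12), (12, 16), (16, 20)]      # assumed band edges of the filter bank (Hz)

    # One FIR band-pass kernel per band, stacked as conv1d weights of shape (n_bands, 1, numtaps).
    kernels = np.stack([firwin(numtaps, band, pass_zero=False, fs=fs) for band in bands])
    weight = torch.tensor(kernels, dtype=torch.float32).unsqueeze(1)

    def filter_bank(eeg):
        """Apply all band-pass filters in one convolution.

        eeg: tensor of shape (n_signals, 1, time); returns (n_signals, n_bands, time).
        """
        return F.conv1d(eeg, weight, padding=numtaps // 2)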

6.3.2

Beyond Handcrafted Representations

In the methodology proposed in this thesis for the classification of MI-EEG, the FBCSP algorithm plays a major role in extracting the representation of the EEG signals. From a systems point of view, this representation is handcrafted and independent of the classification stage. Although the performance of the current framework is state-of-the-art, it is desirable for the preprocessing to also have a learning element so that it can generalize to all data. For example, we would like to implement FBCSP in a way that can learn from multiple subjects and generalize to unseen subjects, or that allows the preprocessing to be fine-tuned when transferring to a new subject. To obtain a good estimate of the spatial filters, preprocessing operations such as FBCSP need to operate on the average covariance matrix of each class rather than on the covariance matrix of each individual trial. This is potentially the direction to take when designing networks to perform the preprocessing: updating weights based on batch statistics rather than on batch values. By introducing an auxiliary objective function closer to the preprocessing layers, the update of these layers can then be driven by statistics (such as the covariance matrix) rather than by direct backpropagation. A similar approach has been used in the Neural Statistician [182] to train a variational auto-encoder [81].
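As a rough sketch of what "batch statistics rather than batch values" could mean in practice, the helper below averages the trial covariance matrices of each class within a mini-batch; this is the quantity a CSP-like auxiliary objective would operate on. The shapes, normalization, and the assumption that every class appears in the batch are illustrative choices.

    import torch

    def class_average_covariances(X, y, n_classes):
        """Average spatial covariance per class from a batch of trials.

        X: (batch, channels, time) EEG trials; y: (batch,) integer labels.
        Returns a tensor of shape (n_classes, channels, channels).
        """
        X = X - X.mean(dim=2, keepdim=True)                  # remove the per-channel mean
        covs = X @ X.transpose(1, 2) / (X.shape[2] - 1)      # per-trial covariance matrices
        # Assumes every class is present in the batch; otherwise the per-class mean is undefined.
        return torch.stack([covs[y == c].mean(dim=0) for c in range(n_classes)])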


6.3.3

Regularization

Currently, no constraints or regularization elements are used in training the network other than dropout and batch normalization. The data structure of the FBCSP-based envelope input is unique: it contains information regarding the different classes and their relationships with each other. This "side information" [183] can be added to the optimization function to obtain better regularization of the loss landscape, especially for EEG data, where the number of samples is usually smaller than the dimension of the data. There is also room to put more constraints on the convolutional kernels. For example, in the channel convolution, where the channels are mixed, a sparsity constraint on the kernel weights would ensure that not all channels are selected to mix, or a unit-norm constraint on the kernels would prevent the kernel weights from growing arbitrarily. Generally speaking, the architectures, learning methods, and regularization developed for image processing and classification should differ from those used for signal processing and, more specifically, EEG signal processing.
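As a small illustration of the kernel constraints suggested above, an L1 penalty on the channel-mixing kernel encourages only a subset of channels to participate, and a renormalization step keeps each kernel at unit norm. The penalty weight and the idea of renormalizing after each optimizer step are assumptions for this sketch, not part of the models trained in this thesis.

    import torch

    def sparsity_penalty(channel_conv, weight_l1=1e-4):
        """L1 penalty on the channel-mixing kernel, added to the task loss during training."""
        return weight_l1 * channel_conv.weight.abs().sum()

    def renormalize_kernels(channel_conv):
        """Project each kernel back onto the unit sphere after an optimizer step."""
        with torch.no_grad():
            w = channel_conv.weight
            norms = w.flatten(1).norm(dim=1).clamp_min(1e-8)
            w.div_(norms.view(-1, *([1] * (w.dim() - 1))))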

6.3.4

New Learning Methods and Architectures

Residual Neural Networks

Residual neural networks [162, 184], which are CNNs with an additional identity mapping, have significantly improved the accuracy of computer vision models. The underlying reason for the success of such networks is still a topic of research, but it has been shown that they behave like an ensemble of smaller networks [185] or perform unrolled iterative estimation [186]. If designed properly, these networks may also have application in better convergence


for networks with signals as inputs.
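A minimal residual block for one-dimensional signal inputs is sketched below, assuming equal input and output channel counts so that the identity shortcut can be added directly; the kernel size and activation are placeholder choices.

    import torch.nn as nn

    class ResidualBlock1d(nn.Module):
        """y = x + F(x): two temporal convolutions wrapped by an identity shortcut."""
        def __init__(self, channels, kernel_size=9):
            super().__init__()
            pad = kernel_size // 2
            self.body = nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size, padding=pad),
                nn.ELU(),
                nn.Conv1d(channels, channels, kernel_size, padding=pad),
            )

        def forward(self, x):
            return x + self.body(x)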

Adversarial Networks

Generative Adversarial Networks (GANs) [187] have been called one of the major breakthroughs in deep learning since Hinton's work in 2006. The initial idea was based on two modules: a generator and a discriminator. The generator produces an output, such as an image, from a random input, and the discriminator has to identify whether that output came from the generator or from real data. The "minimax" game played by the discriminator and the generator leads to a generator that produces images so realistic that the discriminator cannot tell whether they are real or generated. Although this was the initial idea of GANs, they have since been used in multiple applications such as super-resolution [188], text-to-image synthesis [189], and domain-adversarial learning [190]. Based on this idea of network training, applications in MI-EEG classification can also be imagined. For example, in the context of transfer learning, a network can be trained to be independent of the subject while still classifying the given labels. Alternatively, another potential application of the adversarial model is to learn a class-conditional generative model [191] for the EEG data and then use that model to understand the underlying process that generates the EEG.

Riemannian Networks

Symmetric Positive Definite (SPD) matrices have special properties and cannot be handled like ordinary vectors or matrices; they cannot simply be vectorized and used for classification. To handle SPD matrices, they are assumed to lie on a Riemannian manifold and are therefore projected onto a tangent space. In a recent study, Huang [138] proposed a convolution-like neural network operating on SPD matrices. This network, named the Riemannian network, defines new operations that replace the conventional operations used in CNNs to make them suitable for SPD matrices. For example, instead of a linear projection layer, a bilinear layer is defined, and instead of a ReLU unit, an EigReLU unit is defined that operates on the eigenvalues of the matrix. These operational modules can be stacked and designed to process SPD matrices in multiple layers. Covariance matrices are SPD matrices and the backbone of CSP-based algorithms in MI-BCI systems. The Rayleigh quotient defined in the CSP algorithm can be viewed as a bilinear transformation with a certain objective function, and the network approach to the classification of covariance matrices can also be seen as an extension of the sub-bilinear transformation optimized in [43].

Overall, the possibilities for applying deep learning methods to EEG processing are vast and challenging. Finding the right representation and the right preprocessing methods is key to using deep learning successfully.


Bibliography

[1] O. Braddick and J. Atkinson, “Development of human visual function,” Vision Research, vol. 51, no. 13, pp. 1588–1609, 2011. [2] S. J. Paterson, S. Heim, J. T. Friedman, N. Choudhury, and A. A. Benasich, “Development of structure and function in the infant brain: implications for cognition, language and social behaviour.” Neuroscience and biobehavioral reviews, vol. 30, no. 8, pp. 1087–105, 2006. [3] N. Kruger, P. Janssen, S. Kalkan, M. Lappe, A. Leonardis, J. Piater, A. J. Rodriguez-Sanchez, and L. Wiskott, “Deep Hierarchies in the Primate Visual Cortex: What Can We Learn for Computer Vision?” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1847–1871, Aug. 2013. [4] D. T. Stuss, “Frontal lobes and attention: Processes and networks, fractionation and integration,” Journal of the International Neuropsychological Society, vol. 12, no. 02, pp. 261–71, Mar 2006. [5] M. A. Lebedev and M. A. L. Nicolelis, “Brain-Machine Interfaces: From Basic Science to Neuroprostheses and Neurorehabilitation,” Physiological Reviews, vol. 97, no. 2, 2017. [6] “Physionet Charts.” [Online]. Available: https://physionet.org/physiobank/charts/chbmit.png

[7] R. J. Barry, A. R. Clarke, S. J. Johnstone, C. A. Magee, and J. A. Rushby, “EEG differences between eyes-closed and eyes-open resting conditions,” Clinical Neurophysiology, vol. 118, no. 12, pp. 2765– 2773, Dec 2007. [8] “Brain Waves.” [Online]. Available:

https://cdn.imotions.com/wp-content/uploads/2016/02/Brain-waves.jpg [9] R. A. Ramadan and A. V. Vasilakos, “Brain computer interface: control signals review,” Neurocomputing, vol. 223, pp. 26–44, 2017. [10] M. van Gerven, J. Farquhar, R. Schaefer, R. Vlek, J. Geuze, A. Nijholt, N. Ramsey, P. Haselager, L. Vuurpijl, and S. Gielen, “The brain-computer interface cycle,” Journal of Neural Engineering, vol. 6, no. 4, p. 41001, 2009. [11] E. Yin, Z. Zhou, J. Jiang, Y. Yu, and D. Hu, “A dynamically optimized SSVEP brain-computer interface (BCI) speller,” IEEE Transactions on Biomedical Engineering, vol. 62, no. 6, pp. 1447–1456, Jun 2015. [12] M. Nakanishi, Y.-T. Wang, T.-P. Jung, J. K. Zao, Y.-Y. Chien, A. Diniz-Filho, F. B. Daga, Y.-P. Lin, Y. Wang, and F. A. Medeiros, “Detecting Glaucoma With a Portable Brain-Computer Interface for Objective Assessment of Visual Function Loss,” JAMA Ophthalmology, vol. 11, no. 4, p. 119, Apr 2017. [13] A. Materka and P. Poryzala, “A Robust Asynchronous SSVEP Brain-Computer Interface Based on Cluster Analysis of Canonical Correlation Coefficients.” Springer International Publishing, 2014, pp. 3–14.


[14] “A new approach for EEG feature extraction in P300-based lie detection,” Computer Methods and Programs in Biomedicine, vol. 94, no. 1, pp. 48–57, 2009. [15] B. Rivet, H. Cecotti, R. Phlypo, O. Bertrand, E. Maby, and J. Mattout, “EEG sensor selection by sparse spatial filtering in P300 speller brain-computer interface,” in 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology. IEEE, Aug 2010, pp. 5379–5382. [16] A. Bashashati, M. Fatourechi, R. K. Ward, and G. E. Birch, “A survey of signal processing algorithms in brain-computer interfaces based on electrical brain signals,” Journal of Neural engineering, vol. 4, no. 2, p. R32, 2007. [17] F. Lotte, M. Congedo, A. L´ecuyer, F. Lamarche, and B. Arnaldi, “A review of classification algorithms for EEG-based brain-computer interfaces,” Journal of Neural Engineering, vol. 4, 2007. [18] C. Spence and O. Deroy, Multisensory Imagery.

New York, NY: Springer New York, 2013, pp. 157–183. [19] S. de Vries and T. Mulder, “Motor imagery and stroke rehabilitation: A critical discussion,” pp. 5–13, Jan 2007. [20] H. C. Dijkerman, M. Ietswaart, M. Johnston, A. Guillot, and C. Collet, “Motor imagery and the rehabilitation of movement disorders: an overview,” The neurophysiological foundations of mental and motor imagery, pp. 127–144, 2010. [21] M. Ietswaart, M. Johnston, H. C. Dijkerman, S. Joice, C. L. Scott, R. S. MacWalter, and S. J. C. Hamilton, “Mental practice with motor imagery in stroke recovery: Randomized controlled trial of efficacy,” Brain, vol. 134, no. 5, pp. 1373–1386, May 2011.

[22] K. K. Ang, K. S. G. Chua, K. S. Phua, C. Wang, Z. Y. Chin, C. W. K. Kuah, W. Low, and C. Guan, “A Randomized Controlled Trial of EEG-Based Motor Imagery Brain-Computer Interface Robotic Rehabilitation for Stroke.” Clinical EEG and neuroscience, pp. 1 550 059 414 522 229–, Apr. 2014. [23] S. Teillet, F. Lotte, B. N’Kaoua, and C. Jeunet, “Towards a Spatial Ability Training to Improve Mental Imagery based Brain-Computer Interface (MI-BCI) Performance: a Pilot Study,” in IEEE International Conference on Systems, Man, and Cybernetics (SMC 2016), 2016, p. 6. [24] C. Vidaurre and B. Blankertz, “Towards a cure for BCI illiteracy,” Brain Topography, vol. 23, no. 2, pp. 194–198, Jun 2010. [25] C. Jeunet, E. Jahanpour, and F. Lotte, “Why standard braincomputer interface (BCI) training protocols should be changed: an experimental study,” Journal of Neural Engineering, vol. 13, no. 3, p. 036024, Jun 2016. [26] G. Pfurtscheller and F. H. Lopes Da Silva, “Event-related EEG/MEG synchronization and desynchronization: Basic principles,” pp. 1842– 1857, Nov 1999. [27] G. Pfurtscheller, C. Brunner, A. Schl¨ogl, and F. H. Lopes da Silva, “Mu rhythm (de)synchronization and EEG single-trial classification of different motor imagery tasks.” NeuroImage, vol. 31, no. 1, pp. 153–9, 2006. [28] S. Sakhavi, C. Guan, and S. Yan, “Parallel convolutional-linear neural network for motor imagery classification,” in 2015 23rd European Signal Processing Conference (EUSIPCO). 2736–2740. 119

IEEE, Aug 2015, pp.

[29] H. Ramoser, J. Muller-Gerking, and G. Pfurtscheller, “Optimal spatial filtering of single trial EEG during imagined hand movement,” Rehab. Eng., IEEE Trans. on, vol. 8, no. 4, pp. 441–446, 2000. [30] F. Lotte and C. Guan, “Regularizing common spatial patterns to improve BCI designs: unified theory and new algorithms,” Biomedical Engineering, IEEE Trans. on, vol. 58, no. 2, pp. 355–362, 2011. [31] K. K. Kai Keng Ang, Z. Y. Zhang Yang Chin, H. Haihong Zhang, and C. Cuntai Guan, “Filter Bank Common Spatial Pattern (FBCSP) in Brain-Computer Interface,” in 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, Jun 2008, pp. 2390–2397. [32] K. K. Ang, Z. Y. Chin, C. Wang, C. Guan, and H. Zhang, “Filter bank common spatial pattern algorithm on BCI competition IV datasets 2a and 2b,” Front. in neurosci., vol. 6, 2012. [33] M. Arvaneh, C. Guan, K. K. Ang, and C. Quek, “Optimizing the channel selection and classification accuracy in EEG-based BCI.” IEEE Transactions on Biomedical Engineering, vol. 58, no. 6, pp. 1865–73, Jun. 2011. [34] W. Samek, C. Vidaurre, K.-R. M¨ uller, and M. Kawanabe, “Stationary common spatial patterns for brain-computer interfacing,” Journal of Neural Engineering, vol. 9, no. 2, p. 26013, 2012. [35] W. Samek, M. Kawanabe, and K.-R. Muller, “Divergence-based framework for common spatial patterns algorithms,” Biomedical Engineering, IEEE Reviews in, vol. 7, pp. 50–72, 2014. [36] W. Wu, Z. Chen, X. Gao, Y. Li, E. Brown, and S. Gao, “Probabilistic common spatial patterns for multichannel eeg analysis,” Pattern


Analysis and Machine Intelligence, IEEE Trans. on, vol. 37, no. 3, pp. 639–653, March 2015. [37] W. Wu, S. Nagarajan, and Z. Chen, “Bayesian Machine Learning: EEG\/MEG signal processing measurements,” IEEE Signal Processing Magazine, vol. 33, no. 1, pp. 14–36, Jan 2016. [38] W. Wu, X. Gao, B. Hong, and S. Gao, “Classifying single-trial EEG during motor imagery by iterative spatio-spectral patterns learning (ISSPL),” IEEE Transactions on Biomedical Engineering, vol. 55, no. 6, pp. 1733–1743, Jun 2008. [39] F. Qi, Y. Li, and W. Wu, “RSTFC: A novel algorithm for spatiotemporal filtering and classification of single-trial EEG,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 12, pp. 3070–3082, Dec 2015. [40] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, “Multiclass BrainComputer Interface Classification by Riemannian Geometry,” IEEE Transactions on Biomedical Engineering, vol. 59, no. 4, pp. 920–928, Apr 2012. [41] Barachant, Alexandre and Bonnet, Stephane and Congedo, Marco and Jutten, Christian, “Classification of covariance matrices using a Riemannian-based kernel for BCI applications,” Neurocomputing, vol. 112, pp. 172–178, 2013. [42] F. Yger, M. Berar, and F. Lotte, “Riemannian approaches in BrainComputer Interfaces: a review,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, pp. 1–1, 2016. [43] X. Xie, Z. L. Yu, H. Lu, Z. Gu, and Y. Li, “Motor Imagery Classification based on Bilinear Sub-Manifold Learning of Symmetric Positive-


Definite Matrices,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. PP, no. 99, p. 1, 2016. [44] M. Wronkiewicz, E. Larson, and A. K. C. Lee, “Leveraging anatomical information to improve transfer learning in braincomputer interfaces,” Journal of Neural Engineering, vol. 12, no. 4, p. 046027, Aug 2015. [45] H. Kang, Y. Nam, and S. Choi, “Composite common spatial pattern for subject-to-subject transfer,” IEEE Signal Processing Letters, vol. 16, no. 8, pp. 683–686, Aug 2009. [46] M. Arvaneh, I. Robertson, and T. E. Ward, “Subject-to-subject adaptation to reduce calibration time in motor imagery-based braincomputer interface,” Conference proceedings, Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE Engineering in Medicine and Biology Society, vol. 2014, pp. 6501–6504, Aug 2014. [47] M. Sugiyama, M. Krauledat, and K.-R. M¨ uller, “Covariate Shift Adaptation by Importance Weighted Cross Validation,” Journal of Machine Learning Research, vol. 8, no. May, pp. 985–1005, 2007. [48] Yan Li, H. Kambara, Y. Koike, and M. Sugiyama, “Application of Covariate Shift Adaptation Techniques in BrainComputer Interfaces,” IEEE Transactions on Biomedical Engineering, vol. 57, no. 6, pp. 1318–1324, Jun 2010. [49] H. Raza, H. Cecotti, Y. Li, and G. Prasad, “Adaptive learning with covariate shift-detection for motor imagery-based braincomputer interface,” Soft Computing, vol. 20, no. 8, pp. 3085–3096, Aug 2016.


[50] E. Haselsteiner and G. Pfurtscheller, “Using time-dependent neural networks for eeg classification,” IEEE Transactions on Rehabilitation Engineering, vol. 8, no. 4, pp. 457–463, 2000. [51] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, “Brain–computer interfaces for communication and control,” Clinical neurophysiology, vol. 113, no. 6, pp. 767–791, 2002. [52] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012. [53] B. Settles, “Active Learning Literature Survey,” University of Wisconsin-Madison, Tech. Rep. Computer Sciences Technical Report 1648, 2009. [54] J. Foulds and E. Frank, “A review of multi-instance learning assumptions,” The Knowledge Engineering Review, vol. 25, no. 01, p. 1, Mar 2010. [55] Suk, Heung-Il and Lee, Seong-Whan, “A novel bayesian framework for discriminative feature extraction in brain-computer interfaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 2, pp. 286–299, 2013. [56] C. Park, C. Took, and D. Mandic, “Augmented complex common spatial patterns for classification of noncircular eeg from motor imagery tasks,” Neural Systems and Rehabilitation Engineering, IEEE Trans. on, vol. 22, no. 1, pp. 1–10, Jan 2014. [57] M. Arvaneh, C. Guan, K. K. Ang, and C. Quek, “EEG data space adaptation to reduce intersession nonstationarity in brain-computer interface.” Neural comp., vol. 25, no. 8, pp. 2146–71, Aug. 2013.


[58] Y. Zhang, G. Zhou, J. Jin, Q. Zhao, X. Wang, and A. Cichocki, “Sparse bayesian classification of eeg for brain–computer interface,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 11, pp. 2256–2267, 2016. [59] H. Bashashati, R. K. Ward, and A. Bashashati, “User-customized brain computer interfaces using Bayesian optimization.” Journal of Neural Engineering, vol. 13, no. 2, p. 026001, Jan 2016. [60] W. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” Bulletin of Mathematical Biophysics, vol. 7, pp. 115–133, 1943. [61] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological review, vol. 65, no. 6, p. 386, 1958. [62] H. Wang and B. Raj, “On the Origin of Deep Learning,” Feb 2017. [Online]. Available: http://arxiv.org/abs/1702.07800 [63] M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry, 1969. [64] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, oct 2014. [65] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989. [66] R. Hecht-Nielsen, “Theory of the backpropagation neural network,” in International Joint Conference on Neural Networks (IJCNN). IEEE, 1989, pp. 593–605.


[67] K. Fukushima, “Neural network model for a mechanism of pattern recognition unaffected by shift in position,” Trans. IECE, vol. J62A(10), pp. 658–665, 1979. [68] Fukushima, Kunihiko, “Neocognitron: A hierarchical neural network capable of visual pattern recognition,” Neural networks, vol. 1, no. 2, pp. 119–130, 1988. [69] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Back-Propagation Applied to Handwritten Zip Code Recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989. [70] Y. LeCun, “A theoretical framework for Back-Propagation,” in Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, Eds. CMU, Pittsburgh, Pa: Morgan Kaufmann, 1988, pp. 21–28. [71] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998. [72] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A Fast Learning Algorithm for Deep Belief Nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, May 2006. [73] G. Hinton and R. Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks,” Science, vol. 313, no. 5786, pp. 504– 507, 2006. [74] D. Erhan, P. A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, “The difficulty of training deep architectures and the effect of unsupervised pre-training,” in Proceedings of The Twelfth International


Conference on Artificial Intelligence and Statistics (AISTATS-09). Citeseer, 2009, pp. 153–160. [75] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105. [76] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in International Conference on Machine Learning (ICML), 2010. [Online]. Available:

https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf [77] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [78] K.-S. Oh and K. Jung, “GPU implementation of neural networks,” Pattern Recognition, vol. 37, no. 6, pp. 1311–1314, 2004. [79] Y. Bengio, Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, V2(1). Now Publishers, 2009. [80] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning, ser. ICML ’08. New York, NY, USA: ACM, 2008, pp. 1096–1103. [Online]. Available: http://doi.acm.org/10.1145/1390156.1390294 [81] D. P. Kingma and M. Welling, “Auto-encoding Variational Bayes,” The International Conference on Learning Representations (ICLR), 2014.


[82] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layerwise training of deep networks,” in Neural Information Processing Systems (NIPS), 2007. [83] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion,” Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010. [84] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going Deeper with Convolutions,” Google, Tech. Rep. arXiv:1409.4842 [cs.CV], 2014. [85] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” arXiv preprint, Dec 2015. [Online]. Available: http://arxiv.org/abs/1512. 00567 [86] C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, InceptionResNet and the Impact of Residual Connections on Learning,” Feb 2016. [Online]. Available: http://arxiv.org/abs/1602.07261 [87] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in Proceedings of The 32nd International Conference on Machine Learning, 2015, pp. 448–456. [88] S. Ioffe, “Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models,” arXiv preprint, Feb 2017. [Online]. Available: http://arxiv.org/abs/1702.03275 [89] Y. Wu,

M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws,

Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation,” ArXiv e-prints, pp. 1–23, Sep 2016. [Online]. Available: http://arxiv.org/abs/1609.08144 [90] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Man´e, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Vi´egas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/ [91] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb 2015. [92] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwi´ nska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis, “Hybrid computing using a neural network with dy-


namic external memory,” Nature, vol. 538, no. 7626, pp. 471–476, Oct 2016. [93] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. Van Den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, oct 2017. [Online]. Available: http://www.nature.com/doifinder/10.1038/nature24270 [94] “Facebook AI Research (FAIR) Facebook Research.” [Online]. Available: https://research.fb.com/category/facebook-ai-research-fair/ [95] “Microsoft Research Emerging Technology, Computer, and Software Research.” [Online]. Available: https://www.microsoft.com/en-us/ research/ [96] “Deep Learning — NVIDIA Developer.” [Online]. Available: https://developer.nvidia.com/deep-learning [97] “Deep Learning on AWS.” [Online]. Available: https://aws.amazon. com/deep-learning/ [98] “Home - Disney Research.” [Online]. Available:

https://www.disneyresearch.com/ [99] A. Mohamed and G. E. Hinton, “Phone Recognition using Restricted Boltzmann Machines,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 4354–4357. [100] O. Abdel-hamid, L. Deng, and D. Yu, “Exploring Convolutional Neural Network Structures and Optimization Techniques for Speech Recognition,” in 14th Annual Conference of the International Speech

Communication Association, (INTERSPEECH 2013), no. August, 2013, pp. 3366–3370. [101] D. Y. O. Abdel-Hamid L. Deng and H. Jiang, “Deep Segmental Neural Network for Automatic Speech Recognition,” in Proc. of Interspeech, Lyon, France, no. 1, 2013. [102] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional Neural Networks for Speech Recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, Oct 2014. [103] A. Graves, A.-R. Mohamed, and G. E. Hinton, “Speech recognition with deep recurrent neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645–6649. [104] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” Sep 2016. [Online]. Available: http://arxiv.org/abs/1609.03499 [105] V. Kalingeri and S. Grandhe, “Music Generation with Deep Learning,” Dec 2016. [Online]. Available:

http://arxiv.org/abs/1612.04928 [106] Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. L. Y. Bengio, and A. Courville, “Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks,” Jan 2017. [Online]. Available: http://arxiv.org/abs/1701.02720 [107] J. Chen, Y. Wang, S. E. Yoho, D. Wang, and E. W. Healy, “Large-scale training to increase speech intelligibility for hearing-impaired


listeners in novel noises,” The Journal of the Acoustical Society of America, vol. 139, no. 5, pp. 2604–2612, May 2016. [108] H. Cecotti and A. Gr¨aser, “Convolutional neural networks for P300 detection with application to brain-computer interfaces.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 3, pp. 433–45, Mar 2011. [109] H. Cecotti, “A time–frequency convolutional neural network for the offline classification of steady-state visual evoked potential responses,” Pattern Recognition Letters, vol. 32, no. 8, pp. 1145–1153, 2011. [110] H. Cecotti, M. P. Eckstein, and B. Giesbrecht, “Single-trial classification of event-related potentials in rapid serial visual presentation tasks using supervised spatial filtering,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 11, pp. 2030–2042, 2014. [111] S. Stober, D. J. Cameron, and J. A. Grahn, “Using Convolutional Neural Networks to Recognize Rhythm Stimuli from Electroencephalography Recordings,” in Advances in Neural Information Processing Systems, 2014, pp. 1449–1457. [112] S. Stober, A. Sternin, A. M. Owen, and J. A. Grahn, “Deep Feature Learning for EEG Recordings,” Arxiv, Nov 2015. [Online]. Available: http://arxiv.org/abs/1511.04306 [113] P. Bashivan, I. Rish, M. Yeasin, and N. Codella, “Learning Representations from EEG with Deep Recurrent-Convolutional Neural Networks,” pp. 1–15, Nov 2015. [Online]. Available: http://arxiv.org/abs/1511.06448


[114] M. Hajinoroozi, Z. Mao, and Y. Huang, “Prediction of driver’s drowsy and alert states from EEG signals with deep learning,” in 2015 IEEE 6th International Workshop on Computational Advances in MultiSensor Adaptive Processing (CAMSAP). IEEE, Dec 2015, pp. 493– 496. [115] M. Hajinoroozi, Z. Mao, T.-P. Jung, C.-T. Lin, and Y. Huang, “EEGbased prediction of driver’s cognitive performance by deep convolutional neural network,” Signal Processing: Image Communication, May 2016. [116] X. An, D. Kuang, X. Guo, Y. Zhao, and L. He, “A Deep Learning Method for Classification of EEG Data Based on Motor Imagery,” in Intelligent Computing in Bioinformatics.

Springer, 2014, pp. 203–

210. [117] E. Santana, A. J. Brockmeier, and J. C. Principe, “Joint optimization of algorithmic suites for EEG analysis,” in Engineering in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE. IEEE, 2014, pp. 2997–3000. [118] A. Yuksel, T. Olmez, R. Williams, K. Muller, J. Wolpaw, and A. Schlogl, “A Neural Network-Based Optimal Spatial Filter Design Method for Motor Imagery Classification,” PLOS ONE, vol. 10, no. 5, p. e0125039, May 2015. [119] P. Merinov, M. Belyaev, and E. Krivov, “Filter bank extension for neural network-based motor imagery classification,” in 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, Sep 2016, pp. 1–6.


[120] A. Yuksel and T. Olmez, “Filter Bank Common Spatio-Spectral Patterns for Motor Imagery Classification.” Springer, Cham, 2016, pp. 69–84. [121] Huijuan Yang, S. Sakhavi, Kai Keng Ang, and Cuntai Guan, “On the use of convolutional neural networks and augmented CSP features for multi-class motor imagery of EEG signals classification,” in 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, Aug 2015, pp. 2620– 2623. [122] Y. R. Tabar and U. Halici, “A novel deep learning approach for classification of EEG motor imagery signals,” Journal of Neural Engineering, vol. 14, no. 1, p. 016003, Feb 2017. [123] R. T. Schirrmeister, L. Gemein, K. Eggensperger, F. Hutter, and T. Ball, “Deep learning with convolutional neural networks for decoding and visualization of EEG pathology,”

Human Brain Mapping, vol. 38, no. 11, pp. 5391–5420, Nov 2017. [Online]. Available: http://doi.wiley.com/10.1002/hbm.23730; http://arxiv.org/abs/1708.08012 [124] K. G. Hartmann, R. T. Schirrmeister, and T. Ball, “Hierarchical internal representation of spectral features in deep convolutional networks trained for EEG decoding,” Nov 2017. [Online]. Available: http://arxiv.org/abs/1711.07792 [125] A. Bamdadian, C. Guan, K. K. Ang, and J. Xu, “The predictive role of pre-cue EEG rhythms on MI-based BCI classification performance.” Journal of Neuroscience Methods, vol. 235, pp. 138–44, Sep. 2014.


[126] K. P. Thomas, C. Guan, Lau Chiew Tong, and V. A. Prasad, “An adaptive filter bank for motor imagery based Brain Computer Interface,” in 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE, Aug 2008, pp. 1104–1107. [127] D. J. McFarland, L. M. McCane, S. V. David, and J. R. Wolpaw, “Spatial filter selection for EEG-based communication,” Electroencephalography and clinical Neurophysiology, vol. 103, no. 3, pp. 386– 394, 1997. [128] A. Hyv¨arinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural networks, vol. 13, no. 4, pp. 411– 430, 2000. [129] V. V. Nikulin, G. Nolte, and G. Curio, “A novel method for reliable and fast extraction of neuronal EEG/MEG oscillations on the basis of spatio-spectral decomposition,” NeuroImage, vol. 55, no. 4, pp. 1528–1535, Apr 2011. [130] S. D¨ahne, F. C. Meinecke, S. Haufe, J. H¨ohne, M. Tangermann, K.R. M¨ uller, and V. V. Nikulin, “SPoC: A novel framework for relating the amplitude of neuronal oscillations to behaviorally relevant parameters,” NeuroImage, vol. 86, pp. 111–122, Feb 2014. [131] M. Tangermann, K.-R. M¨ uller, A. Aertsen, N. Birbaumer, C. Braun, C. Brunner, R. Leeb, C. Mehring, K. J. Miller, G. R. M¨ uller-Putz, G. Nolte, G. Pfurtscheller, H. Preissl, G. Schalk, A. Schl¨ogl, C. Vidaurre, S. Waldert, and B. Blankertz, “Review of the BCI Competition IV,” Frontiers in Neuroscience, vol. 6, p. 55, 2012.


[132] M. Congedo, A. Barachant, and R. Bhatia, “Riemannian geometry for EEG-based brain-computer interfaces; a primer and a review,” Brain-Computer Interfaces, pp. 1–20, Mar 2017. [133] M. T. Harandi, M. Salzmann, and R. Hartley, “From Manifold to Manifold: Geometry-Aware Dimensionality Reduction for SPD Matrices,” Jul 2014. [Online]. Available: http://arxiv.org/abs/1407. 1120 [134] Z. Huang, R. Wang, S. Shan, X. Li, and X. Chen, “Log-Euclidean Metric Learning on Symmetric Positive Definite Manifold with Application to Image Set Classification,” Proc.ICML (2015), vol. 37, pp. 720–729, 2015. [135] Z. Huang, J. Wu, and L. Van Gool, “Building Deep Networks on Grassmann Manifolds,” Nov 2016. [Online]. Available: http: //arxiv.org/abs/1611.05742 [136] Z. Huang and L. Van Gool,

“A Riemannian Network for SPD Matrix Learning,” arXiv, Aug 2016. [Online]. Available: http://arxiv.org/abs/1608.04233 [137] Z. Huang, R. Wang, X. Li, W. Liu, S. Shan, L. Van Gool, and X. Chen, “Geometry-aware Similarity Learning on SPD Manifolds for Visual Recognition,” Aug 2016. [Online]. Available: http://arxiv.org/abs/1608.04914 [138] Z. Huang and L. V. Gool, “A Riemannian Network for SPD Matrix Learning,” AAAI Conference on Artificial Intelligence; Thirty-First AAAI Conference on Artificial Intelligence, 2017. [139] Z. Huang, C. Wan, T. Probst, and L. Van Gool, “Deep Learning on Lie Groups for Skeleton-based Action Recognition,” Dec 2017. [Online]. Available: https://arxiv.org/abs/1612.05877

[140] M. Harandi, M. Salzmann, and R. Hartley, “Dimensionality Reduction on SPD Manifolds: The Emergence of Geometry-Aware Methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2017. [141] Guo Xiaojing, Wu Xiaopei, and Zhang Dexiang, “Motor imagery EEG detection by empirical mode decomposition,” in 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, Jun 2008, pp. 2619– 2622. [142] C. Park, D. Looney, A. Ahrabian, and D. P. Mandic, “Classification of motor imagery BCI using multivariate empirical mode decomposition,” Neural Systems and Rehabilitation Engineering, IEEE Trans. on, vol. 21, no. 1, pp. 10–22, 2013. [143] S. K. Bashar and M. I. H. Bhuiyan, “Classification of motor imagery movements using multivariate empirical mode decomposition and short time Fourier transform based hybrid method,” Engineering Science and Technology, an International Journal, vol. 19, no. 3, pp. 1457–1464, 2016. [144] S. M. Hazarika, “Bispectrum analysis of EEG for motor imagery based BCI,” in 2012 2nd National Conference on Computational Intelligence and Signal Processing (CISP). IEEE, Mar 2012, pp. 27–27. [145] N. Kotoky and S. M. Hazarika, “Bispectrum analysis of EEG for motor imagery classification,” in 2014 International Conference on Signal Processing and Integrated Networks (SPIN). IEEE, Feb 2014, pp. 581–586. [146] B. Das, M. Talukdar, R. Sarma, and S. M. Hazarika, “Multiple Feature Extraction of Electroencephalograph Signal for Motor Imagery 136

Classification through Bispectral Analysis,” Procedia Computer Science, vol. 84, pp. 192–197, 2016. [147] M. T. F. Talukdar, S. K. Sakib, N. S. Pathan, and S. A. Fattah, “Motor imagery EEG signal classification scheme based on autoregressive reflection coefficients,” in 2014 International Conference on Informatics, Electronics & Vision (ICIEV).

IEEE, May 2014, pp.

1–4. [148] I. T. Hettiarachchi, T. T. Nguyen, and S. Nahavandi, “Multivariate Adaptive Autoregressive Modeling and Kalman Filtering for Motor Imagery BCI,” in 2015 IEEE International Conference on Systems, Man, and Cybernetics. IEEE, Oct 2015, pp. 3164–3168. [149] L. Qin and B. He, “A wavelet-based timefrequency analysis approach for classification of motor imagery for braincomputer interface applications,” Journal of Neural Engineering, vol. 2, no. 4, pp. 65–72, Dec 2005. [150] W.-Y. Hsu and Y.-N. Sun, “EEG-based motor imagery analysis using weighted wavelet transform features,” Journal of Neuroscience Methods, vol. 176, no. 2, pp. 310–318, 2009. [151] J. Kevric and A. Subasi, “Comparison of signal decomposition methods in classification of EEG signals for motor-imagery BCI system,” Biomedical Signal Processing and Control, vol. 31, pp. 398–406, 2017. [152] B. Kamousi, A. N. Amini, and B. He, “Classification of motor imagery by means of cortical current density estimation and Von Neumann entropy,” Journal of Neural Engineering, vol. 4, no. 2, pp. 17–25, Jun 2007. [153] D. Xiao, Z. Mu, and J. Hu, “Classification of Motor Imagery EEG Signals Based on Energy Entropy,” in 2009 International Symposium 137

on Intelligent Ubiquitous Computing and Education.

IEEE, May

2009, pp. 61–64. [154] L. Gao, J. Wang, and L. Chen, “Event-related desynchronization and synchronization quantification in motor-related EEG by Kolmogorov entropy,” Journal of Neural Engineering, vol. 10, no. 3, p. 036023, Jun 2013. [155] V. J. Lawhern,

A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance, “Eegnet: A compact convolutional network for eeg-based brain-computer interfaces,” CoRR, vol. abs/1611.08024, 2016. [Online]. Available: http:

[161] J. Demˇsar, “Statistical comparisons of classifiers over multiple data sets,” The Journal of Machine Learning Research, 2006. [162] K. He, X. Zhang, S. Ren, and J. Sun, “Identity Mappings in Deep Residual Networks,” Mar 2016. [Online]. Available: http://arxiv.org/abs/1603.05027 [163] S. J. Wright, “Coordinate descent algorithms,” Mathematical Programming, vol. 151, no. 1, pp. 3–34, Mar 2015. [164] D. Kingma and J. Ba, “Adam:

A Method for Stochastic

Optimization,” Dec 2014. [Online]. Available: http://arxiv.org/abs/ 1412.6980 [165] M. D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks,” NYU, Tech. Rep. arXiv:1311.2901 [cs.CV], 2013. [166] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson, “Understanding Neural Networks Through Deep Visualization,” International Conference on Machine Learning - Deep Learning Workshop 2015, p. 12, Jun 2015. [Online]. Available:

http:

//arxiv.org/abs/1506.06579 [167] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,” ICLR, pp. 1–, Dec 2014. [Online]. Available: http://arxiv.org/abs/1312.6034 [168] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. M¨ uller, and W. Samek, “On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation,” PLOS ONE, vol. 10, no. 7, p. e0130140, Jul 2015.

139

[169] I. Sturm, S. Lapuschkin, W. Samek, and K.-R. M¨ uller, “Interpretable deep neural networks for single-trial eeg classification,” Journal of Neuroscience Methods, vol. 274, pp. 141–145, 2016. [170] S. Sakhavi and C. Guan, “Convolutional neural network-based transfer learning and knowledge distillation using multi-subject data in motor imagery BCI,” in 2017 8th International IEEE/EMBS Conference on Neural Engineering (NER).

IEEE, may 2017, pp.

588–591. [Online]. Available: http://ieeexplore.ieee.org/document/ 8008420/ [171] S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, Oct 2010. [172] T. Tommasi, “Learning to Learn by Exploiting Prior Knowledge,” Ph.D. dissertation,

EPFL, 2013. [Online]. Available:

http:

//infoscience.epfl.ch/record/186007/files/EPFL{ }TH5587.pdf [173] C. Vidaurre, M. Kawanabe, P. von Bunau, B. Blankertz, and K. R. Muller, “Toward Unsupervised Adaptation of LDA for BrainComputer Interfaces,” IEEE Transactions on Biomedical Engineering, vol. 58, no. 3, pp. 587–597, Mar 2011. [174] Xinyang Li, Cuntai Guan, Kai Keng Ang, Haihong Zhang, and Sim Heng Ong, “Spatial filter adaptation based on the divergence framework for motor imagery EEG classification,” in 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE, Aug 2014, pp. 1847–1850. [175] X. Li, C. Guan, H. Zhang, K. K. Ang, and S. H. Ong, “Adaptation of motor imagery EEG classification model based on tensor decom-

140

position,” Journal of Neural Engineering, vol. 11, no. 5, p. 056020, Oct 2014. [176] F. Lotte, “Signal Processing Approaches to Minimize or Suppress Calibration Time in Oscillatory Activity-Based Brain-Computer Interfaces,” Proceedings of the IEEE, vol. 103, no. 6, pp. 871–890, Jun 2015. [177] V. Jayaram, M. Alamgir, Y. Altun, B. Scholkopf, and M. GrosseWentrup, “Transfer Learning in Brain-Computer Interfaces,” IEEE Computational Intelligence Magazine, vol. 11, no. 1, pp. 20–31, Feb 2016. [178] M. Krauledat, M. Tangermann, B. Blankertz, and K.-R. R. M¨ uller, “Towards Zero Training for Brain-Computer Interfacing,” PLoS ONE, vol. 3, no. 8, p. e2967, Aug 2008. [179] H. Kang and S. Choi, “Bayesian common spatial patterns for multisubject EEG classification,” Neural Networks, 2014. [180] D. Heger, F. Putze, C. Herff, and T. Schultz, “Subject-to-subject transfer for CSP based BCIs: Feature space transformation and decision-level fusion,” in Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS. IEEE, July 2013, pp. 5614–5617. [181] M. Alamgir, M. Grosse-Wentrup, and Y. Altun, “Multitask learning for brain-computer interfaces,” in International Conference on Artificial Intelligence and Statistics, 2010, pp. 17–24. [182] H. Edwards and A. Storkey, “Towards a Neural Statistician,” Jun 2016. [Online]. Available: http://arxiv.org/abs/1606.02185

141

[183] A. Mollaysa, P. Strasser, and A. Kalousis, “Regularising Non-linear Models Using Feature Side-information,” Mar 2017. [Online]. Available: http://arxiv.org/abs/1703.02570 [184] K. He,

X. Zhang,

S. Ren,

and J. Sun,

“Deep Residual

Learning for Image Recognition,” Dec 2015. [Online]. Available: http://arxiv.org/abs/1512.03385 [185] A. Veit, M. Wilber, and S. Belongie, “Residual Networks Behave Like Ensembles of Relatively Shallow Networks,” May 2016. [Online]. Available: http://arxiv.org/abs/1605.06431 [186] K. Greff, R. K. Srivastava, and J. Schmidhuber, “Highway and Residual Networks learn Unrolled Iterative Estimation,” Dec 2016. [Online]. Available: http://arxiv.org/abs/1612.07771 [187] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” arXiv preprint arXiv:1406.2661, 2014. [188] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” Sep 2016. [Online]. Available: http://arxiv.org/abs/1609.04802 [189] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas, “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks,” Dec 2016. [Online]. Available: http://arxiv.org/abs/1612.03242 [190] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial train-

142

ing of neural networks,” Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016. [191] M. Mirza and S. Osindero, “Conditional Generative Adversarial Nets,” Nov 2014. [Online]. Available: http://arxiv.org/abs/1411. 1784

