THEORETICAL AND EMPIRICAL EXTENSIONS OF THE DENDRITIC CELL ALGORITHM
FENG GU
Thesis submitted to the University of Nottingham for the degree of Doctor of Philosophy
November 2011
Dedicated to my family. . .
Abstract

The area of Artificial Immune Systems (AIS), which bridges immunology, computer science, engineering and mathematics, has gained great interest in the last decade. One well-known AIS, the Dendritic Cell Algorithm (DCA), has shown promising performance on the anomaly detection and attribution problem. A number of interesting properties of the algorithm could be the reason for its successful applications in the past. However, there are several issues with the algorithm that limit its applicability and accessibility. As a result, this thesis aims to address and resolve these issues, by investigating the algorithmic properties of the DCA from both theoretical and empirical perspectives. This leads to research questions related to formalisation and runtime analysis, online analysis, and automated pre-processing. In order to answer these questions, a formalisation of the DCA is first provided and runtime analyses of the algorithm are subsequently performed to show its low runtime complexity; an online analysis component is then proposed and tested to improve the algorithm's online detection capability; and an automated pre-processing module is developed and validated to enable the algorithm to automatically adapt to the underlying characteristics of a given problem domain. In summary, this work extends the original DCA to be more accessible to future users and more applicable to the problem of interest. The findings of this work provide novel contributions to the development and analysis of the DCA, as well as useful implications for the AIS community.
Acknowledgements

First of all, I would like to thank my Mum and Dad for their support and encouragement that led me to where I am. Without them, I would never have come to the completion of my PhD. Huge thanks go to my supervisors, Prof. Uwe Aickelin and Dr. Julie Greensmith, for their countless advice, support, feedback and encouragement throughout my PhD. Great thanks go to Mr. Wei Chen and Dr. Thomas Jansen for their valuable comments on the theoretical investigation of this thesis. I gained a large amount of knowledge on theoretical computer science through discussions with them. I would also like to thank many of my colleagues and friends from the Intelligent Modelling and Analysis research group at the University of Nottingham. In particular, my thanks go to Dr. Yousof Al-Hammadi, Dr. Jan Feyereisl, Dr. Robert Oates, Dr. Peer-Olaf Siebers and Dr. William Wilson, who not only helped me with my initial steps into the wonderful world of research, but always found the time to discuss any topic that fascinated me at that particular point in time. Finally, I want to give my special thanks to the examiners of my PhD defence, Prof. Emma Hart and Prof. Natalio Krasnogor, who provided very constructive comments on improving my thesis.
Publications

The following publications were produced during the course of this thesis. Where applicable, the corresponding chapter numbers indicate which parts of this thesis a given publication is relevant to.

F. Gu, J. Greensmith and U. Aickelin (2011) 'An Investigation of the Automated Data Pre-Processing of the Dendritic Cell Algorithm'. Applied Soft Computing, Elsevier (submitted). This features most of the content of Chapter 6; I contributed the majority of the work.

F. Gu, W. Chen, J. Greensmith and U. Aickelin (2011) 'Formulation and Analysis of the Dendritic Cell Algorithm'. BioSystems, Elsevier (submitted). This features most of the content of Chapter 4; I contributed the majority of the work.

F. Gu, J. Greensmith and U. Aickelin (2011) 'The Dendritic Cell Algorithm for Intrusion Detection'. Bio-inspired Communications and Networks, IGI publication (in print). This features the material that Chapter 2 is partially based on; I contributed the majority of the work.

U. Aickelin, D. Dasgupta and F. Gu (2010) 'Artificial Immune Systems'. Search Methodologies: Introductory Tutorials in Optimization and Decision Support Techniques (2nd Edition), Springer (in print). For this book chapter, I contributed the updating of the newer content in the area.

F. Gu, J. Greensmith, R. Oates and U. Aickelin (2009) 'PCA 4 DCA: the Application of Principal Component Analysis to the Dendritic Cell Algorithm'. In Proceedings of the 9th Annual Workshop on Computational Intelligence (UKCI). This features in Chapter 6, Section 6.2; I contributed the majority of the work.

F. Gu, J. Greensmith and U. Aickelin (2009) 'Integrating Real-Time Analysis with the Dendritic Cell Algorithm through Segmentation'. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 1203-1210. This features in Chapter 5; I contributed the majority of the work.

F. Gu, J. Greensmith and U. Aickelin (2009) 'Exploration of the Dendritic Cell Algorithm with the Duration Calculus'. In Proceedings of the 8th International Conference on Artificial Immune Systems (ICARIS), no. 5666 in LNCS, pp. 54-66, Berlin-Heidelberg. Springer. This features in Chapter 4, Section 4.3; I contributed the majority of the work.

F. Gu, J. Greensmith and U. Aickelin (2008) 'Further Exploration of the Dendritic Cell Algorithm: Antigen Multiplier and Time Windows'. In Proceedings of the 7th International Conference on Artificial Immune Systems (ICARIS), no. 5132 in LNCS, pp. 142-153, Berlin-Heidelberg. Springer. This was a validation of the DCA, and the dataset and antigen multiplier method used are included in Chapter 6; I contributed the majority of the work.

F. Gu, U. Aickelin and J. Greensmith (2007) 'An Agent-based Classification Model'. In the 9th European Agent Systems Summer School, Durham, UK. This paper was my first attempt to understand the DCA through implementing it in an agent-based simulation environment, and it is not included in the thesis; I contributed the majority of the work.
Contents

Publications
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
    1.1.1 Artificial Immune Systems
    1.1.2 The Dendritic Cell Algorithm
    1.1.3 Theoretical and Empirical Investigations
  1.2 Aim and Scope
    1.2.1 Research Questions
    1.2.2 Contributions to Knowledge
  1.3 Thesis Structure

2 Background and Context
  2.1 Introduction
  2.2 The Human Immune System
    2.2.1 Immune Components
    2.2.2 Computability of Immune System
  2.3 Artificial Immune Systems
    2.3.1 Overview of AIS
    2.3.2 Negative Selection Algorithms
    2.3.3 Clonal Selection Algorithms
    2.3.4 Immune Network Based Algorithms
    2.3.5 The Dendritic Cell Algorithm
  2.4 Intrusion Detection
    2.4.1 Denning's Intrusion Detection Model
    2.4.2 Approaches of Intrusion Detection
    2.4.3 Conventional Solutions to Intrusion Detection
  2.5 AIS in Intrusion Detection
    2.5.1 AIS Approaches
    2.5.2 Negative Selection Based Approaches
    2.5.3 The Immune Network Approach
    2.5.4 The Dendritic Cell Algorithm
  2.6 Summary

3 Overview of Techniques
  3.1 Introduction
  3.2 The Dendritic Cell Algorithm
    3.2.1 Application of the DCA
    3.2.2 Pre-Processing Phase
    3.2.3 Detection Phase
    3.2.4 Analysis Phase
  3.3 Formal Methods
    3.3.1 Duration Calculus
    3.3.2 Example of a Gas Burner
  3.4 Pre-Processing Techniques
    3.4.1 Dimensionality Reduction
    3.4.2 Statistical Inference
    3.4.3 Automated Data Pre-Processing Methods
  3.5 Online Analysis Techniques
    3.5.1 Segmentation Approaches
    3.5.2 Antigen Based Segmentation
    3.5.3 Time Based Segmentation
  3.6 Machine Learning Techniques
    3.6.1 Supervised Learning Techniques
    3.6.2 K-Nearest-Neighbour
    3.6.3 Decision Trees
    3.6.4 Support Vector Machines
  3.7 Summary

4 Theoretical Aspects of the DCA
  4.1 Introduction
  4.2 A Single-Cell Model
    4.2.1 Model Overview
    4.2.2 Formal Specifications
  4.3 Formalisation of the DCA
    4.3.1 Data Structures
    4.3.2 Procedural Operations
  4.4 Analysis of Runtime Complexity
    4.4.1 The Standard DCA
    4.4.2 The DCA with Segmentation
    4.4.3 The DCA with Automated Pre-Processing
  4.5 Formulation of Runtime Properties
    4.5.1 Number of Matured DCs
    4.5.2 Number of Processed Antigens
    4.5.3 Moving Window Effect
  4.6 Summary

5 Online Analysis Component
  5.1 Introduction
  5.2 System Modifications
    5.2.1 Online Analysis Component
    5.2.2 Segmentation Approaches
  5.3 Experimental Conditions
    5.3.1 Experimental Design
    5.3.2 The SYN Scan Dataset
    5.3.3 Experimental Setup
  5.4 Analytical Methods
    5.4.1 Statistical Tests
    5.4.2 Calculation of Detection Performance
  5.5 Results and Analysis
    5.5.1 Analysis of Offline Performance
    5.5.2 Analysis of Online Performance
    5.5.3 Internal and External Comparisons
  5.6 Summary

6 Automated Pre-Processing Module
  6.1 Introduction
  6.2 Automated Data Pre-Processing
    6.2.1 Previous Work
    6.2.2 Correlation Based Method
    6.2.3 Information Gain Based Method
    6.2.4 PCA Based Method
  6.3 Experimental Conditions
    6.3.1 Experimental Design
    6.3.2 Testing Datasets
    6.3.3 Data Pre-Processing
    6.3.4 Experimental Setup
  6.4 Experimental Results
    6.4.1 The 10% KDD Dataset
    6.4.2 The Whole KDD Dataset
    6.4.3 Comparison with the State-of-the-Art
  6.5 Analysis of Results
    6.5.1 Statistical Testing
    6.5.2 Interpretation of Selected Features
  6.6 Summary

7 Conclusions and Future Work
  7.1 Conclusions of Research Questions
    7.1.1 Aim Satisfaction
    7.1.2 Formalisation and Runtime Analysis
    7.1.3 Online Analysis Component
    7.1.4 Automated Pre-Processing Module
  7.2 Future Work
    7.2.1 Short-Term Advances
    7.2.2 Long-Term Advances
  7.3 The Future of AIS

References

A Additional Figures of Chapter 5
B Additional Tables of Chapter 6
C Samples of the SYN Scan Dataset
List of Figures

2.1 An illustration of the key immune components involved in the interaction between the innate and adaptive immune systems [44].

2.2 A state-chart describing the three states of an individual DC.

2.3 An illustration of the signal transformation process of the DCA.

2.4 An illustration of different steps of the DCA, where the initialisation and analysis steps are performed at the population level and the rest of the steps (bounded within the two vertical lines) are performed at the individual DC level.

2.5 Development pathway of the DCA, where the iDCA is a hypothetical integrated system this thesis is intended to develop towards.

3.1 An illustration of the integrated system, where the modified pre-processing and post-processing phases are presented. The rectangular boxes in grey represent the optional components of each step.

3.2 An example of the linear discriminant function in two-dimensional space, where the red dashed line represents the decision boundary found.

3.3 An example of a separable problem for SVM in a two-dimensional space, where the two red dashed lines represent the two hyperplanes that define the maximised margin, and the data points lying on either of the hyperplanes are the support vectors.

4.1 The behavioural flowchart of the single-cell model of the DCA, where events related to the algorithm's functionality are included.

4.2 Interpretation of E1, E2, E3, E4 and E5, where the whole interval is divided into subintervals by the events.

5.1 Values of input signals against the time series (moving average with intervals of 100, per selected signal category, used for plotting the graph, but not the actual input of the system), showing the inherent noise in the dataset. The number of each antigen type per second against the time series (moving average with intervals of 100, per antigen type of interest, used for plotting the graph, but not the actual input of the system).

5.2 Plots of the number of correct classifications against the segment index and corresponding histograms of the percentages of each unique value's frequency in the antigen based segmentation (z = 10^2), where both versions of the anomaly metrics (Kα and MCAV) are used. Due to the large number of segments generated in this setting, the plots on the left side might seem overly crowded; however, each point only corresponds to a single segment index. In addition, on the right side some of the bars are very low; this only means that they have an extremely small number of points with respect to the total number of segments, but they are generally not zero.

5.3 Plots of the number of correct classifications against the segment index and corresponding histograms of the percentages of each unique value's frequency in the time based segmentation (z = 1), where both versions of the anomaly metrics (Kα and MCAV) are used. Due to the large number of segments generated in this setting, the plots on the left side might seem overly crowded; however, each point only corresponds to a single segment index. In addition, on the right side some of the bars are very low; this only means that they have an extremely small number of points with respect to the total number of segments, but they are generally not zero.

6.1 Boxplots of the ROC results of all applied methods on the 10% KDD dataset, including the Manual (MAN), Correlation based (COR), Information Gain based (IFG) and PCA based (PCA) methods, as well as K-Nearest Neighbour (KNN), Decision Trees (DTree), and Support Vector Machines (SVM).

6.2 Boxplots of the ROC results of all applied methods on the whole KDD dataset, including the Manual (MAN), Correlation based (COR), Information Gain based (IFG) and PCA based (PCA) methods, as well as K-Nearest Neighbour (KNN), Decision Trees (DTree), and Support Vector Machines (SVM).

6.3 Barplot of the detection accuracy of all the above methods, including the Manual (MAN), Correlation based (COR), Information Gain based (IFG), and PCA based (PCA) methods of the DCA, as well as the state-of-the-art techniques, i.e. Naive Bayes (Bayes), Decision Trees (DTree) [3], Self-Organising Map (SOM) [95], Genetic Algorithm (GA) based Support Vector Machines (SVM) [98], and the V-detector Negative Selection Algorithm (VNSA) [151].

A.1 Plots of the number of correct classifications against the segment index and corresponding histograms of the percentages of each unique value's frequency in the antigen based segmentation (z = 10^3), where both versions of the anomaly metrics (Kα and MCAV) are used.

A.2 Plots of the number of correct classifications against the segment index and corresponding histograms of the percentages of each unique value's frequency in the antigen based segmentation (z = 10^4), where both versions of the anomaly metrics (Kα and MCAV) are used.

A.3 Plots of the number of correct classifications against the segment index and corresponding histograms of the percentages of each unique value's frequency in the antigen based segmentation (z = 10^5), where both versions of the anomaly metrics (Kα and MCAV) are used.

A.4 Plots of the number of correct classifications against the segment index and corresponding histograms of the percentages of each unique value's frequency in the antigen based segmentation (z = 10^6), where both versions of the anomaly metrics (Kα and MCAV) are used.

A.5 Plots of the number of correct classifications against the segment index and corresponding histograms of the percentages of each unique value's frequency in the time based segmentation (z = 10), where both versions of the anomaly metrics (Kα and MCAV) are used.

A.6 Plots of the number of correct classifications against the segment index and corresponding histograms of the percentages of each unique value's frequency in the time based segmentation (z = 10^2), where both versions of the anomaly metrics (Kα and MCAV) are used.

A.7 Plots of the number of correct classifications against the segment index and corresponding histograms of the percentages of each unique value's frequency in the time based segmentation (z = 10^3), where both versions of the anomaly metrics (Kα and MCAV) are used.

C.1 A sample of the antigen instances in the SYN scan dataset.

C.2 A sample of the signal instances in the SYN scan dataset.
List of Tables

4.1 Details of primitive operations of Algorithm 5.

5.1 Weights of signal transformation for the SYN scan application.

5.2 Experimental parameters of the SYN scan application.

5.3 The p-values (significance level α = 0.05) of one-sample one-sided Mann-Whitney tests for H5.1, with respect to Kα values and the antigen based segmentation. The highlighted cells represent the cases where the antigen based segmentation outperforms the standard DCA ('non-seg' stands for non-segmentation and 'g' indicates greater than).

5.4 The p-values (significance level α = 0.05) of one-sample one-sided Mann-Whitney tests for H5.1, with respect to MCAV values and the antigen based segmentation. The highlighted cells represent where the antigen based segmentation outperforms the standard DCA ('non-seg' stands for non-segmentation and 'l' indicates less than).

5.5 The p-values (significance level α = 0.05) of one-sample one-sided Mann-Whitney tests for H5.1, with respect to Kα values and the time based segmentation. The highlighted cells represent the cases where the time based segmentation outperforms the standard DCA ('non-seg' stands for non-segmentation and 'l' indicates less than).

5.6 The p-values (significance level α = 0.05) of one-sample one-sided Mann-Whitney tests for H5.1, with respect to MCAV values and the time based segmentation. The highlighted cells represent the cases where the time based segmentation outperforms the standard DCA ('non-seg' stands for non-segmentation and 'l' indicates less than).

5.7 The p-values (significance level α = 0.05) of two-sample one-sided Mann-Whitney tests for H5.2, with respect to Kα values and the antigen based segmentation. The highlighted cells represent the cases where a smaller segment size outperforms a larger one ('g' indicates greater than and 'N.A.' represents not applicable).

5.8 The p-values (significance level α = 0.05) of two-sample one-sided Mann-Whitney tests for H5.2, with respect to MCAV values and the antigen based segmentation. The highlighted cells represent the cases where a smaller segment size outperforms a larger one ('l' indicates less than and 'N.A.' represents not applicable).

5.9 The p-values (significance level α = 0.05) of two-sample one-sided Mann-Whitney tests for H5.2, with respect to Kα values and the time based segmentation ('N.A.' represents not applicable).

5.10 The p-values (significance level α = 0.05) of two-sample one-sided Mann-Whitney tests for H5.2, with respect to MCAV values and the time based segmentation. The highlighted cells represent the cases where a smaller segment size outperforms a larger one ('l' indicates less than and 'N.A.' represents not applicable).

5.11 The detection accuracy based on the Kα values for all segment sizes applied in both segmentation approaches.

5.12 The detection accuracy based on the MCAV values for all segment sizes applied in both segmentation approaches.

5.13 Detection accuracies of DCA_O and DCA_N, with respect to the antigen based segmentation and the same series of segment sizes.

6.1 Weight matrix for signal transformation of the DCA.

6.2 The p-values (significance level α = 0.05) of one-sided Wilcoxon tests for comparing the manual method and the automated methods on both the 10% and the whole KDD datasets, including True Positive Rate (TPR), False Positive Rate (FPR) and Detection Accuracy (ACC) of the Manual (MAN), Correlation based (COR), Information Gain based (IFG) and PCA based (PCA) methods. The highlighted cells represent the cases where an automated method outperforms the manual method.

6.3 The p-values (significance level α = 0.05) of one-sided Wilcoxon tests for comparing the DCA-related methods and the machine learning techniques on both the 10% and the whole KDD datasets, including True Positive Rate (TPR), False Positive Rate (FPR) and Detection Accuracy (ACC) of the Manual (MAN), Correlation based (COR), Information Gain based (IFG) and PCA based (PCA) methods of the DCA, as well as K-Nearest Neighbour (KNN), Decision Trees (DTree), and Support Vector Machines (SVM). The highlighted cells represent the cases where a DCA-related method outperforms a machine learning technique.

6.4 The input features generated by the information gain based method for all the subsets of the 10% KDD dataset.

6.5 The input features generated by the correlation based method for all the subsets of the 10% KDD dataset.

B.1 Results of the DCA on the 10% KDD dataset, including True Positive Rate (TPR), False Positive Rate (FPR) and Detection Accuracy (ACC) for the Manual (MAN), Information Gain based (IFG), PCA based (PCA) and Correlation based (COR) methods.

B.2 Results of the machine learning techniques on the 10% KDD dataset, including True Positive Rate (TPR), False Positive Rate (FPR) and Detection Accuracy (ACC) for the K-Nearest Neighbour (KNN), Decision Trees (DTree), and Support Vector Machines (SVM) algorithms.

B.3 Results of the DCA on the whole KDD dataset, including True Positive Rate (TPR) and False Positive Rate (FPR) for the Manual (MAN), Information Gain based (IFG), PCA based (PCA) and Correlation based (COR) methods.

B.4 Results of the machine learning techniques on the whole KDD dataset, including True Positive Rate (TPR), False Positive Rate (FPR) and Detection Accuracy (ACC) for the K-Nearest Neighbour (KNN), Decision Trees (DTree), and Support Vector Machines (SVM) algorithms.
Chapter 1
Introduction
1.1 Motivation

1.1.1 Artificial Immune Systems
From a computational perspective, the natural immune system has a series of appealing properties, e.g. 'pattern recognition', 'diversity', 'self organisation', 'noise tolerance' and 'anomaly detection' [53]. Such a powerful and diverse set of features is hard to find in other biological systems [36]. As a result, it is believed that novel solutions can be developed by capturing some of the computationally plausible features of the immune system. The resulting research is known as Artificial Immune Systems (AIS), which bridges multiple disciplines, i.e. immunology, computer science, engineering and mathematics. According to Timmis et al. [77], "the AIS community has produced a prolific amount of research ranging from modelling natural immune systems, solving artificial or benchmark problems, to tackling real-world applications, using an equally diverse set of immune inspired algorithms".

One responsibility of the natural immune system is to protect the body from a wealth of invading micro-organisms, and a series of immune-inspired algorithms have been developed to provide similar defensive properties within a computing context. Initially, immune-inspired algorithms were based on simple or outdated models of the human immune system, e.g. 'self/non-self discrimination' [13], where any non-self substances are considered harmful. As noted by Stibor et al. [151], 'first generation algorithms', including negative and clonal selection, do not necessarily produce the same high performance as the human immune system. These algorithms, negative selection in particular, are prone to problems with scalability and the generation of excessive false alarms.

Recently developed AIS solutions use more rigorous and up-to-date immunology and are developed in close collaboration with immunologists. The resulting algorithms are believed to encapsulate some of the more desirable properties of immune systems, such as robustness and error tolerance [127]. One such 'second generation' AIS is the Dendritic Cell Algorithm (DCA) [62], which is
inspired by the functions of dendritic cells in the innate immune system. It is also based on the principles of a novel theory in immunology, known as the danger theory [109], which states that the recognition of a pathogen is based on environmental context (signals) rather than the simple self/non-self principle. The algorithm exhibits several interesting properties, such as data fusion, noise filtering, temporal correlation, and multi-scale aggregation [120]. Such a set of potentially beneficial properties could be the reason for the successful applications of the algorithm, including port scan detection [63, 68], anomaly detection on a benchmark intrusion detection dataset [70], and botnet detection in networks based on the IRC or P2P protocols [1, 2]. The algorithm has also been shown to have good performance in terms of detection rate, and the ability to reduce the rate of false alarms in comparison to other systems, such as Self Organising Maps (SOM) [67].

However, the DCA is still a fairly new algorithm whose algorithmic properties have not yet been fully explored to reveal its strengths and weaknesses. This led to the author's initial interest in the DCA, from an algorithmic analysis and development point of view. As a result, it is necessary to conduct theoretical and empirical investigations of the algorithm, to further understand and explore its various properties.
1.1.2 The Dendritic Cell Algorithm
In terms of information processing, the DCA receives two types of input from a variety of sources, namely signals and antigens. They are usually derived from a data pre-processing phase that interfaces a given problem domain with the algorithm’s input space. Signals are represented as vectors of real-valued numbers and are periodic measures of features within the problem environment. An assumption made by the algorithm is that the presence or absence of an anomaly can be detected from these features [69]. Antigens are symbols (typically represented as an enumerated type), which represent items of interest within the environment. It is assumed that some of the antigens have a causal relationship with observed anomalies [69]. The DCA is a population-based algorithm, where several heterogeneous agents (cells) monitor the same inputs (signals
and antigens) in parallel. Each cell stores a history of the received input signals, while maintaining a weighted sum of their magnitudes. As soon as the sum of the input signal magnitudes reaches a predefined decision threshold, the cell makes a decision about the sampled antigens based on the signal history. Once the decision has been recorded, the cell is reset and instantaneously returned to the population. Each cell is assigned a different decision threshold generated from a predefined probability mass function, ensuring that cells observe the data over different time scales [69].

In addition to the essential anomaly detection ability, the DCA exhibits several potentially beneficial properties. Firstly, the algorithm is empirically lightweight in terms of running time, due to the computationally and algorithmically simple linear functions it employs [64, 121]. The algorithm is also capable of removing relatively high frequency noise within the input data due to its filtering property, which is similar to a moving average filter in the area of signal processing [122, 123]. Moreover, the algorithm performs temporal correlation that links identified anomalies to potential causes in a monitored system, where multiple data sources (signals and antigens) are fused according to some causal relationship [1, 66]. These additional features could potentially make the DCA advantageous over conventional techniques for problems involving large, noisy and time-ordered data.

However, there are also issues with the DCA that may limit its applicability to problems and its accessibility to users. One criticism is the lack of a formal definition of the algorithm [73], which could result in ambiguities in understanding its algorithmic properties. The algorithm's computational complexity has not yet been theoretically analysed [71], and thus its scalability to the data size or dimensionality of a given dataset cannot be generalised. In addition, the analysis phase of the algorithm is often performed once at the end [72], rather than periodically and continuously in parallel with the detection, and this limits the algorithm's online detection capability. Finally, the data pre-processing phase of the algorithm is performed manually by users based on their expertise in a given problem domain [74], instead of being an automated and adaptable process. This can be problematic for cases where expert knowledge is difficult or even infeasible to obtain.

In order to resolve these issues, two investigations are to be conducted, one theoretical and one empirical. The theoretical investigation will focus on improving the algorithm's accessibility, and is aimed at producing a formal definition and runtime analysis of the algorithm. The empirical investigation will concentrate on enhancing the algorithm's applicability, with a focus on integrating techniques from other fields, such as signal processing, machine learning, and statistical inference. This also involves validating each of the newly developed components through well designed experimentation. As a result, a clear formulation, regarding what the DCA is, and when and how to apply the algorithm, will be produced. These investigations are intended to prevent incorrect implementations and applications due to misunderstanding in the future.
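To make the per-cell behaviour described earlier in this section concrete, the following is a minimal sketch of a single cell's sampling cycle and of a population observing a shared input stream. It is illustrative only: the class and variable names, the single aggregated 'anomaly evidence' value per time step, and the uniform threshold distribution are assumptions made for this example, not the exact formulation of the DCA (which is formalised in Chapter 4).

    import random

    class DendriticCell:
        """One artificial cell of a DCA-style population (illustrative sketch)."""

        def __init__(self, threshold):
            self.threshold = threshold   # per-cell decision threshold
            self.reset()

        def reset(self):
            self.signal_sum = 0.0        # weighted sum of signal magnitudes so far
            self.context = 0.0           # accumulated anomaly-vs-normal evidence
            self.antigens = []           # antigens sampled during this cycle

        def step(self, magnitude, evidence, antigen=None):
            """Process one time step; return a decision once the threshold is reached."""
            self.signal_sum += magnitude
            self.context += evidence
            if antigen is not None:
                self.antigens.append(antigen)
            if self.signal_sum >= self.threshold:
                label = 'anomalous' if self.context > 0 else 'normal'
                decision = (label, list(self.antigens))
                self.reset()             # the cell instantly returns to the population
                return decision
            return None

    # Cells with different thresholds observe the same input stream in parallel,
    # so each cell effectively aggregates the data over a different time scale.
    population = [DendriticCell(random.uniform(5.0, 50.0)) for _ in range(10)]
    stream = [(3.0, 1.0, 'process_42'), (4.0, -0.5, 'process_7'), (6.0, 2.0, 'process_42')]
    for magnitude, evidence, antigen in stream:
        for cell in population:
            decision = cell.step(magnitude, evidence, antigen)
            if decision is not None:
                print(decision)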
1.1.3 Theoretical and Empirical Investigations
The theoretical investigation presented in this work has two facets, one dependent on the other. A formal definition of the DCA can be seen as the pre-condition of the subsequent runtime analysis. Both of these steps are application independent, meaning no particular problem domain is considered during this investigation. The empirical investigation, on the other hand, often involves testing a method on a series of datasets from a certain problem domain. As a result, it is necessary to conduct such an investigation with respect to some real-world application.

Despite the variety of the DCA's applications, it was originally designed and used as an anomaly detection and attribution algorithm [62]. In this thesis, the anomaly detection problem is defined as a binary (two-class) classification problem, performed on (potentially noisy) discrete time series data [69]. No assumptions are made about the relative persistence of anomalous states and normal states, though the persistence of both states is assumed to be sufficiently long to differentiate them from noise. This is in contrast to many alternative definitions of the anomaly detection problem, where there can be an implicit assumption that anomalies are transient, or an assumption that only normal behaviour can be studied a priori, reducing the problem to single-class classification.

For the investigation a separate, related problem is also defined, termed 'the anomaly attribution problem'. This is the problem of attributing causal relationships between the presence of elements in the environment and the occurrence of identified anomalies [69], e.g. through temporal correlation. As a result, the aim of this thesis is to investigate the algorithmic properties of the DCA from both the theoretical and empirical perspectives, with respect to the anomaly detection and attribution problem. The intention is to make the algorithm more accessible to future users and more applicable to the problem of interest.
1.2 Aim and Scope

1.2.1 Research Questions
The main objective of the thesis is to investigate a number of algorithmic properties of the DCA from both theoretical and empirical perspectives, in order to improve the algorithm's applicability and accessibility. With respect to this objective, three research questions need to be answered, involving formalisation and runtime analysis, the development of an online analysis component for post-processing, and the automation of data pre-processing. The research questions are listed as follows:

1. Formalisation and runtime analysis: how can the DCA be formally defined to avoid ambiguities for future users? How can the runtime analysis of the algorithm be performed based on the formal definition? This is the focus of Chapter 4;

2. Online post-processing: how can the analysis phase of the DCA be modified, in order to perform periodic and continuous (online) analysis in parallel with the detection? This is the focus of Chapter 5;

3. Automated pre-processing: how can the data pre-processing phase of the DCA be modified to make the system automated and adaptable to a given problem domain? This is the focus of Chapter 6.

The second and third research questions refer to the empirical investigation, which involves a real-world application of the anomaly detection and attribution problem. One such application is anomaly-based intrusion detection, which often requires an algorithm to have the ability not only to identify anomalies, but also to link them to potential causes. As a result, the empirical investigation in Chapter 5 and Chapter 6 will use anomaly-based intrusion detection datasets in experimentation.
1.2.2 Contributions to Knowledge
This thesis provides several contributions to the theoretical and empirical aspects of the DCA, as well as to the research of AIS in general, listed as follows.

• Formalisation and runtime analysis – Firstly, a number of important properties of the DCA are formally defined and clearly presented as formal logic expressions or mathematical functions in Chapter 4. Based on the formal definition, runtime analysis of the algorithm is performed to show that it has a low runtime complexity. This is the first attempt at performing any formalisation and runtime analysis of the algorithm. Further analyses also show that the modifications made at both the pre-processing and post-processing phases do not increase the system's runtime complexity. This implies that, in theory, the system in which both the online analysis component and the automated data pre-processing module are integrated with the DCA is scalable to the size of a given dataset. These findings provide a theoretical foundation and guidelines that could be insightful for the future development of the algorithm.

• Online post-processing component – An online analysis component based on segmentation is introduced to the DCA in Chapter 5, to replace the original offline one. Previous work in [67, 122] took brief glances at the use of segmentation for the online analysis of the DCA. However, this work examines the approaches in much greater detail, in terms of their effect on the algorithm's online detection capability. The modified system is able to perform periodic and continuous (online) analysis in parallel with the detection; the interval between analysis processes is controlled by an adjustable segment size (a sketch of this idea follows this list). According to the experimental results, the introduction of segmentation results in significant improvements in the algorithm's detection performance, compared to the standard DCA and a state-of-the-art machine learning technique. This indicates that the online detection capability of the DCA could be improved by the proposed online analysis component. Additional insights are also provided about choosing appropriate parameters, e.g. the version of the anomaly metric and the value of the segment size. This gives guidelines for future applications to similar problem domains.

• Automated pre-processing module – Three automated data pre-processing methods based on dimensionality reduction and statistical inference techniques are proposed and tested in Chapter 6. The work in [74] conducted a preliminary investigation of the possibility of using a dimensionality reduction technique for the data pre-processing of the DCA. This work extends those ideas and develops an automated and adaptive data pre-processing module for the algorithm. Such a module can generate the proper input to the DCA from a given problem domain, without requiring expert knowledge. According to the experimental results, the automated methods, the PCA based method in particular, could produce significantly better detection performance than the manual method. However, none of the DCA-related methods could outperform the machine learning techniques. The work shows the possibility of automating the data pre-processing phase of the DCA, which makes the algorithm potentially applicable to a larger set of problems. However, it also implies that the classification performance of the DCA depends on the effectiveness of the data pre-processing. As a result, further investigations should be carried out, e.g. developing more effective pre-processing methods, or more fundamentally, modifying the algorithm's internal components to improve its detection capability.

• Development of immune-inspired algorithms – This thesis also provides a constructive example for the future development of AIS. The theoretical work fits the current trend of adding mathematical rigour to AIS research. In addition, segmentation could be used for applications where online detection or large datasets are involved, by dividing the data into smaller and more manageable pieces. Even though the automated data pre-processing methods are tailored for the DCA, the work still shows the possibility of using techniques from other, more established research fields to enhance the applicability of immune-inspired algorithms. As a result, the importance of theoretical foundations and the usefulness of integrating techniques from more established fields are demonstrated, with respect to the development of immune-inspired algorithms.
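As a rough illustration of the segmentation idea summarised above, the sketch below performs the analysis every z processed items instead of once at the end of a run. The MCAV-style score (the fraction of presentations in which an antigen type appeared in an anomalous context) and the function names are assumptions made for this example; the precise antigen based and time based approaches are detailed in Chapter 5.

    from collections import defaultdict

    def analyse_segment(presentations):
        """Score each antigen type by the fraction of its presentations
        that occurred in an 'anomalous' context (an MCAV-style metric)."""
        counts = defaultdict(lambda: [0, 0])     # antigen -> [anomalous, total]
        for label, antigens in presentations:
            for antigen in antigens:
                counts[antigen][0] += (label == 'anomalous')
                counts[antigen][1] += 1
        return {a: anom / total for a, (anom, total) in counts.items()}

    def online_analysis(presentation_stream, z):
        """Antigen based segmentation: analyse every z presentations,
        rather than performing a single offline analysis at the end."""
        segment = []
        for presentation in presentation_stream:
            segment.append(presentation)
            if len(segment) == z:
                yield analyse_segment(segment)   # periodic, continuous output
                segment = []
        if segment:
            yield analyse_segment(segment)       # analyse the trailing partial segment

    stream = [('anomalous', ['process_42']), ('normal', ['process_7']),
              ('anomalous', ['process_42', 'process_7']), ('normal', ['process_7'])]
    for scores in online_analysis(stream, z=2):
        print(scores)

A smaller z yields more frequent (but noisier) analysis results, while a larger z approaches the behaviour of the original offline analysis; this trade-off is what the experiments in Chapter 5 investigate.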
1.3 Thesis Structure
This thesis is organised as follows. A literature review of the human immune system, the field of AIS, the area of intrusion detection and solutions to the intrusion detection problem, particularly those from AIS, is provided in Chapter 2. The methodologies, algorithms and techniques used throughout the thesis are detailed in Chapter 3. The first research question is explored in Chapter 4, by giving formal definitions of the DCA, performing theoretical analysis of its runtime complexity, and formulating the runtime variables of interest. The second research question is explored in Chapter 5, through the development of an online analysis component based on segmentation approaches. The third research question is explored in Chapter 6, through the development of an automated data pre-processing module based on a combination of dimensionality reduction and statistical inference techniques. Finally, in Chapter 7, conclusions with respect to the research questions are drawn, and possibilities for future work are presented.
Chapter 2
Background and Context
2.1 Introduction
Since the early 1990s, biologically-inspired computing has become one of the most active interdisciplinary research topics. It generally involves biology, computer science, engineering and mathematics. Researchers in this area draw inspiration from nature to create computer systems encapsulating the appealing properties of biological systems. Biologically-inspired computing is intimately related to the field of artificial intelligence, due to its tremendous influence on many machine learning techniques. Typical examples include: genetic algorithms, which are inspired by the evolutionary process of organisms [113]; cellular automata, which mimic cells evolving based on the states of their neighbouring cells through a number of discrete time steps [21]; artificial neural networks, which create a network or circuit of artificial neurones similar to those seen in the human brain [114]; and artificial life, which studies systems related to life and tries to represent them in artificial environments [102].

In addition to these, a new research area, Artificial Immune Systems (AIS), has emerged. This area focuses on drawing inspiration from theoretical immunology and observed immune functions, principles and models, to develop computer systems for real-world problem solving [36]. The natural immune system has evolved to protect the body from a wealth of invading micro-organisms, and AIS are designed and developed to provide the same defensive properties within a computing context. A large number of applications of immune-inspired algorithms were originally concentrated in the area of computer security, especially intrusion detection [99]. It is believed that AIS could be advantageous over other conventional techniques in certain cases, due to the rich set of properties mapped from the natural system.

This chapter is organised as follows. An introduction to the human immune system is provided, to demonstrate its computationally plausible properties. This is followed by an overview of the area of AIS, including the most popular AIS paradigms, as well as their immunology background, algorithmic details and development pathways. Additionally, intrusion detection (a subset of computer security) is described, along with the open issues of the area, well-known intrusion approaches and the solutions of conventional techniques. Finally, applications of AIS to intrusion detection are demonstrated, to show that AIS paradigms, e.g. the DCA, are suitable for the problem of interest.
2.2 The Human Immune System

2.2.1 Immune Components
The human immune system is often described as a defence system that has evolved to protect the body (host) from harmful micro-organisms such as bacteria and viruses, named pathogens [60]. The human immune system can be further divided into the innate immune system and the adaptive immune system, which feature different components and mechanisms [26]. The innate immune system is evolved and passed from one generation to another, while the adaptive immune system is developed throughout the lifespan of an individual. In addition, the innate immune system often responds to invading pathogens immediately, such as within hours of an initial encounter. Conversely, the adaptive immune system requires a much longer time to generate a more specific response, which can take days or even months to accomplish.

The immune system consists of specialised molecules, cells and organs that interact with each other. Interactions between immune components result in a rich and complex set of behaviours that are aimed at the recognition and elimination of threats to the host. Major players of the immune system are different populations of cells, including antigen presenting cells (APCs), T-cells, and B-cells. In order to carry out cell-to-cell (cellular) communications, one type of molecule, known as cytokines, is employed as a message carrier. Immune cells usually exhibit various receptors on their surface, to recognise particular cytokines and so receive the messages sent by other cells. These receptors can also bind with certain proteins, such as antigens. In fact, a commonly used definition of antigens is anything that can be recognised by or bind to the receptors of immune cells [26]. The ability of a receptor to bind with a single pattern of antigens is defined as the specificity of this receptor. The strength of the binding is measured by an affinity, and a higher affinity results in a tighter molecular binding [60].

The innate immune system, also referred to as non-specific immunity, is the first-line defence of the immune system against infections [91]. It starts with physical and chemical barriers, such as skin, acids, enzymes and particular types of soluble proteins. In addition, the innate immune system also employs a different group of cells, e.g. natural killer cells and Antigen Presenting Cells (APCs), to eliminate threats or interact with the rest of the immune system. The adaptive immune system is also known as antigen-specific immunity, which involves antigen-specific immune cells and immune memory to prevent reinfection by the same organisms [162]. It is designed to eliminate the remaining pathogens that escaped the initial defence of the innate immune system. Unlike the instant responses of the innate immune system, responses of the adaptive immune system take much longer, as T-cells and B-cells specific to the invading micro-organisms need to undergo clonal expansion before they differentiate into effector cells with various antigen specificities for eliminating infections.

The innate and adaptive immune systems are interconnected and inseparable, due to the need for collaboration between them to defend the body against various types of invading pathogens. The linkage between these two immune systems is managed by a group of APCs, called Dendritic Cells (DCs) [109]. In order to induce immune responses against specific pathogens, DCs are required to interact with T-cells. Based on the type of antigens, B-cells may also be involved. The key immune components involved in the interaction between the innate and adaptive immune systems are displayed in Figure 2.1 [44].

[Figure 2.1: An illustration of the key immune components involved in the interaction between the innate and adaptive immune systems [44].]
2.2.2 Computability of Immune System
Computationally plausible properties of the natural immune system are the inspiration for building artificial systems that exhibit similar functionality. Such an ability to compute is closely related to the immune system's decision making process. In order to achieve decision making, the immune system employs a process known as 'immune learning' [50] that involves two types of learning, namely long-term evolutionary learning at the species level and short-term reinforcement learning at the individual level. Long-term learning is the process of developing a repertoire of detectors used as the basis of pattern recognition. An individual normally inherits such a repertoire from their parents, who inherited theirs from their parents and so forth. This evolutionary process can take an extensive period of time to accomplish, as exemplified in the innate immune system. Short-term learning, on the other hand, is developed through the immune system being exposed to various proteins during the lifetime of an individual, as performed by the adaptive immune system. The immune system uses negative selection and clonal selection mechanisms [60] for short-term learning. This makes it possible to respond effectively and quickly to both previously known and unknown threats. The key to negative selection is to generate a population of T-cells that react only to non-self substances, as self/non-self theory [13]
states that any non-self substances are generally considered harmful to the host. Naive T-cells are continuously exposed to the environment; those which react to self substances are removed from the population, leaving those that would only respond to pathogens. In addition, the goal of clonal selection is to generate a population of B-cells that are able to produce effective antibodies to recognise invading antigens. The knowledge of pathogens is proposed to be 'remembered' by the immune memory [132], which is responsible for enhancing the effectiveness and strength of immune responses on a second encounter. The short-term learning of the immune system exhibits certain properties that are metaphorically similar to pattern recognition in computer science. The decision making process of the immune system leads to state changes of the immune system or the body, which exhibits similar computational properties to a Turing Machine [144]. The state of the immune system refers to the information inherent in its organisation and the way the system responds to the defined input [24]. The input to the immune system is the state of the body, and the output of the immune system is the healing process (the inflammatory response) that maintains a healthy body [24]. As a result, the immune system can be seen as a computation machine that transforms the body-state data into the immune-system data, where feedback is given simultaneously to modify the body's state and restore its health [23]. The decision making process regulates state changes of individual immune cells, which can result in state changes of the immune system. For instance, the shift of the immune system to a defensive state requires a range of immune cells to change their states, to develop the population of effective immune agents for immune responses. The area of AIS mainly focuses on capturing these intriguing computational properties of the immune system, to develop novel solutions in computer science.
2.3 Artificial Immune Systems
2.3.1 Overview of AIS
The area of AIS, as stated previously, bridges multiple disciplines, including immunology, mathematics, computer science and engineering. The development of immune-inspired algorithms is often accomplished through mathematical or computational modelling of immunology, abstraction from those models into algorithms, and subsequently design and implementation in the context of computer science and engineering [152]. Cohen defines three types of AIS researchers in [24]:
• the literal school, which mimics the exact behaviours of the natural immune system;
• the metaphorical school, which draws inspiration from the natural immune system to build computer systems with similar functionality;
• modellers, who use computational or mathematical modelling for the purpose of understanding the natural immune system.
The first two types of researchers focus on the application perspective of AIS, whereas the third type refers to those working on immune modelling. In addition to the application and modelling perspectives of AIS, theoretical analyses of existing immune algorithms have recently gained great interest; an overview of theoretical advances in AIS is given in [154]. The objective of this thesis is to investigate algorithmic properties of an immune-inspired algorithm and thus is application driven. As a result, this chapter examines the application and theoretical perspectives of AIS rather than immune modelling. Unlike evolutionary algorithms [113], there is no single archetypal AIS; instead there are four major AIS paradigms, termed Negative Selection Algorithms [83], Clonal Selection Algorithms [37], idiotypic network based algorithms [77], and various versions of the Dendritic Cell Algorithm (DCA) [62]. These are the most commonly used AIS approaches, although hybrid algorithms are also frequently applied to optimisation problems.
In the immune system, immune components interact with each other to form their immunological functions or behaviours. In order to represent these immune components in a computer system, a layered framework for designing AIS was proposed by de Castro and Timmis in [36] as a template, to describe the structure of AIS paradigms. Building up from the problem domain that AIS are applied to, three layers are usually involved, as follows:
• Component Representations to represent functional components of the system;
• Affinity Measures to quantify the interactions between the components of the system;
• Immune Algorithms to determine the system dynamics of the behaviours of all components in the system.
One well known type of component representation was introduced by Perelson and Oster in [130], where the notion of shape space is defined. In the shape space S, two types of immune components, antigen (Ag) and antibody (Ab), are defined as d-dimensional vectors. The type of these vectors determines the type of shape space, and it often falls into one of the following categories [36]:
• Real-Valued: features are represented as real-valued numbers;
• Hamming: features are elements of a set of finite alphabets;
• Symbolic: features are of nominal type, represented as symbols.
Antibodies can be replaced by other types of detectors if necessary, to interact with antigens. This modification is present in the DCA, where the notation Ab can be generalised to represent any type of detector that interacts with Ag. In most immune-inspired algorithms, an affinity measure function F(Ab, Ag) is employed to quantify the interaction between a detector and an antigen. This function calculates a distance metric between Ab and Ag in the shape space S, which is frequently the Euclidean distance between the two vectors.
Algorithm 1: Generic Negative Selection Algorithm [152].
input : Sseen = set of known self elements
output: D = set of generated detectors
begin
    repeat
        Randomly generate a set P of potential detectors;
        Determine the affinity of each element of P with each element of Sseen;
        if one element in Sseen recognises a detector in P then
            Remove the detector from P;
        else
            Add the detector to D;
        end
    until stopping criterion has been reached;
end
A threshold ε can be applied, and if F(Ab, Ag) < ε, then Ab is bound to Ag, or in other words, the detector recognises the antigen. Based on the satisfaction of this conditional statement, further actions, such as removing a detector from the current population or adding it to the detector repertoire, can be taken. The role of such an affinity measure function may vary in different algorithms. However, the threshold-based principle often stays the same.
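To make this principle concrete, the following minimal Python sketch (not taken from the thesis; all names, parameter values and the unit-square self region are illustrative assumptions) combines a Euclidean affinity measure with the detector generation loop of Algorithm 1, using a real-valued shape space over [0, 1]^dim:

import numpy as np

def affinity(ab, ag):
    # Euclidean distance in a real-valued shape space; smaller means stronger binding.
    return np.linalg.norm(ab - ag)

def negative_selection(self_set, n_detectors, dim, eps, seed=0, max_tries=100000):
    """Generate detectors that recognise no known self element (cf. Algorithm 1)."""
    rng = np.random.default_rng(seed)
    detectors = []
    tries = 0
    while len(detectors) < n_detectors and tries < max_tries:
        tries += 1
        candidate = rng.random(dim)  # a random point in [0, 1]^dim
        # Discard the candidate if it recognises (lies within eps of) any self element.
        if all(affinity(candidate, s) >= eps for s in self_set):
            detectors.append(candidate)
    return detectors

rng = np.random.default_rng(1)
self_samples = [np.array([0.5, 0.5]) + 0.05 * rng.standard_normal(2) for _ in range(20)]
detectors = negative_selection(self_samples, n_detectors=10, dim=2, eps=0.2)
anomaly = np.array([0.9, 0.1])
print(any(affinity(d, anomaly) < 0.2 for d in detectors))  # True if a detector fires

Note that the rejection loop is capped by max_tries: as discussed below in relation to the work of Stibor et al., detector generation can take an extensive amount of time, with no guarantee of producing a complete detector set.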
2.3.2 Negative Selection Algorithms
Negative selection algorithms are based on the biological maturation process of T-cells. This process is known as 'negative selection', and takes place in the thymus [128]. Firstly, the immune system generates a population of naive T-cells, each of which is assigned a random specificity capable of binding to one type of antigen. Any self-reactive naive T-cells (those able to bind with self substances presented by APCs) are then removed from the population. The remaining T-cells can only be reactive to non-self substances, which according to self/non-self theory [26] are generally threats to the host. A generic negative selection algorithm based on [55] is given in Algorithm 1. Negative selection in an AIS context was first proposed by Forrest et al. [55] as an anomaly-based detection method. The algorithm was purely based on the self/non-self theory, where the shape space was represented as bit strings. It consisted of three phases,
namely defining self, generating detectors and monitoring the occurrence of anomalies [99]. In subsequent research, a negative selection based architecture proposed by Hofmeyr and Forrest [82] for network security attracted great interest, due to its potential. It was one of the first systematic demonstrations of both the algorithmic and application aspects of negative selection algorithms. This research contributed significantly to the development of AIS as a field in its own right. Gonzalez and Dasgupta [61] extended the component representations of negative selection algorithms from binary strings to real-valued numbers. However, this modification to the representation greatly increased the computational complexity of the algorithm. Theoretical aspects of negative selection algorithms were initially shown by Esponda et al. [48, 49] to reveal a connection between the Boolean satisfiability problem and a negative database. However, the work carried out by Stibor et al. [148, 151], followed by Stibor's thesis [147], pointed out the extreme difficulty of generating detectors regardless of the type of component representation. The authors stated that the generation of detectors in negative selection algorithms could result in holes in the shape space, which leads to the problem of either under-fitting or over-fitting. Additionally, the process can take an extensive amount of time to complete, with no guarantee of producing a complete detector set. The string-based negative selection paradigm was then revisited by Elberfeld and Textor [46], who showed the possibility of reducing the worst-case runtime complexity of the algorithm from exponential to polynomial, through compressing r-chunk and r-contiguous detectors. More recently, the work of Liskiewicz and Textor [107] showed that negative selection can be implemented without generating detectors explicitly; instead a limited number of detectors is generated through non-exhaustive negative selection.
2.3.3 Clonal Selection Algorithms
Clonal selection algorithms are based on Burnet's theory of clonal selection and immune memory [26], which involves another population of immune cells, B-cells.
Algorithm 2: Generic Clonal Selection Algorithm [152].
input : S = set of patterns to be recognised, n = the number of worst elements to remove
output: M = set of memory detectors capable of recognising unseen patterns
begin
    Create an initial random set of antibodies A;
    forall the patterns in S do
        Determine the affinity with each antibody in A;
        Generate clones of a subset of the antibodies in A with the highest affinity, the number of clones being proportional to its affinity;
        Mutate attributes of these clones inversely proportional to its affinity;
        Add these clones to the set A, and place a copy of the highest affinity antibodies in A into the memory set M;
        Replace the n lowest affinity antibodies in A with new randomly generated antibodies;
    end
end
Generated in the bone marrow, B-cells are capable of producing antibodies to detect diverse and numerous patterns of invading pathogens. This mechanism consists of several adaptive and learning processes, including the antigen driven affinity maturation process of B-cells and the associated hypermutation mechanism. Immune memory is composed of 'memory cells' that are able to 'remember' patterns of previously seen pathogens, forming the secondary immune responses. Two features of B-cells make clonal selection computationally plausible: firstly, the proliferation of a B-cell is proportional to the affinity of its antibody-antigen binding, and thus the higher the affinity of the binding, the more clones of this B-cell are produced; secondly, the mutation rate of a B-cell's antibody is inversely proportional to the affinity of the antibody-antigen binding; as this process operates at an extremely high rate, such a mechanism is referred to as somatic hypermutation [26]. A generic clonal selection algorithm is presented in Algorithm 2. Clonal selection algorithms are closely related to evolutionary algorithms [35], but they have several distinct properties. For instance, the selection and mutation mechanisms are determined by the affinity of antibody-antigen binding, the cross-over mechanism of evolutionary algorithms is absent, and multiple solutions are usually allowed to be found
as optima. One of the most well known and widely used clonal selection algorithms is CLONALG, developed by de Castro and Von Zuben [39], which shares the same principles as Algorithm 2. The work of Timmis and Neal [155] introduced the concept of artificial recognition balls that contain a number of identical B-cells, for systems where resources are limited. Individual B-cells are therefore no longer represented in the system; instead clusters of identical B-cells are used in order to save resources. Adaptive immune mechanisms of B-cells are introduced in AIRS [161], which extended CLONALG by introducing the metaphor of the immune network theory. The resulting system was applied to both unsupervised and supervised learning tasks [131, 160]. Other clonal selection algorithms such as the B-cell algorithm [97] used a unique contiguous hypermutation operator to perform function optimisation tasks, and required significantly fewer evaluations than a hybrid genetic algorithm without compromising the quality of the solutions obtained. The work done by Cutello et al. [33] incorporated the criteria for setting up a generic type of clonal selection algorithm in [134]; the resulting algorithm is known as the Immune Algorithm (IA). The authors showed the possibility of using various schemes for hypermutation and ageing, and gave two sufficient conditions for convergence. Theoretical analysis in [154] showed that the evolution of a population in clonal selection algorithms can be defined in a discrete state space that changes based on probability rules. As the probability of a transition to a new state is only dependent on the current state, a Markov chain model can be used to describe clonal selection algorithms in a more mathematically sound manner. Preliminary work carried out by Villalobos-Arias et al. [159] used Markov chain theory to prove the convergence of a clonal selection based algorithm, called MISA [32]. The authors stated that the premise for the proof to hold is that an elitist memory set must be maintained. The same proof method was adopted by Clark et al. [20] through modelling the hypermutation operator of the B-cell algorithm, to prove its convergence. The authors showed that the algorithm could find at least one global optimum with probability one as the operator follows the limit t → ∞. More recently, the work of Zarges [168, 169] theoretically analysed a
vital operator, named the 'inversely proportional mutation rate'. The analysis focused on the effect of this operator on the system performance, and suggestions regarding its settings were given based on the theoretical analysis of runtime complexity. Jansen and Zarges [92] performed a theoretical analysis of a specific type of mutation operator, called immune-inspired somatic contiguous hypermutation. The authors showed that, despite its limitations, this operator can still perform much better for function optimisation than the standard bit mutation commonly used in evolutionary algorithms.
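As an illustration of the two affinity-driven mechanisms discussed above, the following Python sketch (illustrative only; the toy fitness function, the rank-based clone counts and the mutation widths are assumptions, not prescriptions from the cited works) runs the core loop of Algorithm 2 on a small function-optimisation task:

import numpy as np

rng = np.random.default_rng(42)

def affinity(x):
    # Toy fitness: higher is better, with the optimum at the origin (an assumed test function).
    return -np.sum(x ** 2)

def clonal_selection(dim=2, pop=20, n_select=5, n_replace=4, generations=100):
    A = rng.uniform(-5, 5, size=(pop, dim))  # initial random antibodies
    for _ in range(generations):
        order = np.argsort([affinity(a) for a in A])[::-1]  # best antibodies first
        clones = []
        for rank, idx in enumerate(order[:n_select]):
            n_clones = n_select - rank      # more clones for higher affinity
            sigma = 0.1 * (rank + 1)        # less mutation for higher affinity
            for _ in range(n_clones):
                clones.append(A[idx] + rng.normal(0, sigma, dim))
        A = np.vstack([A, clones])
        keep = np.argsort([affinity(a) for a in A])[::-1][:pop - n_replace]
        A = np.vstack([A[keep], rng.uniform(-5, 5, size=(n_replace, dim))])
    return max(A, key=affinity)

print(clonal_selection())  # should approach the optimum at the origin

Cloning concentrates the search around high-affinity antibodies, while the rank-dependent mutation width plays the role of the inversely proportional mutation rate analysed by Zarges.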
2.3.4 Immune Network Based Algorithms
Immune network based algorithms are derived from the immune network theory proposed by Jerne [94]. It suggests that the immune system can be seen as a network, in which immune entities interact with each other even when antigens are absent. Interactions can be initiated not only between antigens and antibodies, but also between antibodies. This can induce either stimulating or suppressive immune responses, which result in a series of immunological behaviours, including tolerance and the emergence of memory. There are three major factors that affect the stimulation level of B-cells [77], i.e. the contribution of the antigen binding, the contribution of neighbouring B-cells, and the suppression of neighbouring B-cells. As the stimulation level of a B-cell increases, the amount of clones it produces increases accordingly. At the population level, this results in a diverse set of B-cells. In addition, three mutation mechanisms are introduced, namely crossover, inverse and point mutation, in contrast to clonal selection algorithms. Immune network models can be either continuous (based on ordinary differential equations for understanding the behaviour of real immune networks), or discrete (based on iterative procedures of adaptation for the purpose of problem solving). The major difference of immune network based algorithms from other AIS paradigms is the allowance for interactions between immune components, e.g. between antibodies. Basic concepts of the immune network theory are implemented in immune algorithms, such as AINE [155] and
Algorithm 3: Generic Immune Network Algorithm [152].
input : S = set of patterns to be recognised, nt = network affinity threshold, ct = clonal pool threshold, h = number of highest affinity clones, a = number of new antibodies to introduce
output: N = set of memory detectors capable of recognising unseen patterns
begin
    Create an initial random set of network antibodies N;
    repeat
        forall the patterns in S do
            Determine the affinity with each antibody in N;
            Generate clones of a subset of the antibodies in N with the highest affinity, the number of clones being proportional to its affinity;
            Mutate attributes of these clones inversely proportional to its affinity;
            Place the h number of highest affinity clones into a clone memory set C;
            Eliminate all elements of C whose affinity with the antigen is less than ct;
            Determine the affinity amongst all the antibodies in C and eliminate those antibodies whose affinity with each other is less than ct;
            Incorporate the remaining clones in C into N;
        end
        Determine the affinity between each pair of antibodies in N and eliminate all antibodies whose affinity is less than nt;
        Introduce a number a of new randomly generated antibodies into N;
    until a stopping condition is met;
end
aiNet [38], for application to pattern recognition and data clustering respectively. The algorithm aiNet is essentially a modified version of CLONALG with the addition of suppressive interactions between antibody components. A generic immune network algorithm based on aiNet is presented in Algorithm 3. The aiNet algorithm gained great popularity for solving optimisation problems, as shown in [5, 22]. However, a recent theoretical analysis of aiNet performed by Stibor and Timmis [150] pointed out that the aiNet algorithm has weaknesses for problems that involve clustering non-uniformly distributed data. This is due to the distance metric based implementation of the suppressive mechanism in the algorithm, which results in either significant information loss caused by compression or an over-redundant representation of the input space. Other immune network based algorithms include the
work of Whitbrook et al. [163] that used the short-term learning mechanism presented in the immune network for the application to mobile robots. A combined short-term and long-term learning approach based on the behaviours of the immune network was proposed in [164]. This method was applied to solve mobile-robot navigation problems, which are presented and tested both in simulation and on actual robots.
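The suppressive mechanism that Stibor and Timmis analysed can be sketched in a few lines of Python; this is an illustrative, distance-based reading of the network suppression step of Algorithm 3 (with assumed parameter values), not the exact aiNet implementation:

import numpy as np

def suppress(antibodies, nt):
    # Network suppression: drop any antibody lying within distance nt of one already kept.
    # This compresses the network, and is where the information loss noted above can occur.
    kept = []
    for ab in antibodies:
        if all(np.linalg.norm(ab - k) >= nt for k in kept):
            kept.append(ab)
    return kept

rng = np.random.default_rng(7)
network = [rng.random(2) for _ in range(200)]  # a dense random network of 2-D antibodies
print(len(suppress(network, nt=0.15)))         # far fewer antibodies survive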
2.3.5 The Dendritic Cell Algorithm
Biological Background

The DCA is inspired by properties shown by the DCs of the innate immune system, which forms part of the body's first line of defence against invaders. DCs have the ability to combine a multitude of molecular information and to interpret this information for the T-cells of the adaptive immune system. This can result in the induction of various immune responses against perceived pathogenic threats. Therefore, DCs can be seen as detectors responsible for policing different tissues, as well as inductive mediators for a variety of immune responses. In general, two types of molecular information are processed by DCs, namely 'signal' and 'antigen'. Signals are collected by DCs from their local environment and consist of indicators of the health of the monitored tissue. Throughout its lifespan, an individual DC will exist in one of three states, namely 'immature', 'semi-mature' and fully 'mature', as shown in Figure 2.2. In the initial immature state, DCs are exposed to a combination of signals, and perform phagocytosis to ingest substances from their surroundings. Based on the concentration of the presented signals, DCs differentiate into either a 'fully mature' form to activate the adaptive immune system, or a 'semi-mature' form to suppress it. If a DC is exposed to a combination of signals generated from a healthy or steady-state tissue environment, with no occurrence of tissue damage, it is more likely to become a semi-mature DC. Conversely, if a DC is presented with a combination of signals generated from a damaged tissue environment, such as the presence of unregulated cell death, it is more likely to become a fully mature DC.
Natural DCs bind to and process many cytokine signals. In an abstract model of DC behaviour developed by Greensmith [62], the following categories are defined:
• PAMP: Pathogen-Associated Molecular Patterns are molecular signatures of pathogens which are recognised by Toll-Like Receptors (TLRs) on the surface of DCs; they are highly influential to the transition from the immature state to the fully mature state;
• Danger: Danger signals are released by damaged tissue cells subject to necrosis (unregulated cell death); they have a lower effect than PAMPs on the maturation towards the fully mature state;
• Safe: Safe signals are derived from cells that undergo apoptosis (programmed cell death), with TNF-α (Tumour Necrosis Factor) being one candidate safe signal; they contribute to the maturation from the immature state to the semi-mature state.
During the immature state, DCs also collect debris in the tissues, which is subsequently combined with the environmental signals. Some of the 'suspicious' debris collected is known as antigen: proteins originating from potential invading entities. DCs combine the 'suspect' antigens with evidence in the form of signals to correctly instruct the adaptive immune system to respond, or to become tolerant to the presented antigens. For more detailed information regarding the underlying biological mechanisms, please refer to [62, 109].
Algorithmic Details

The DCA was designed and developed based on the abstract DC model created by Greensmith [62]. It incorporates the functionality of DCs, including data fusion, state differentiation and causal correlation. As per the natural system, there are two types of input data, namely 'antigen' and 'signal'. It is generally assumed that a certain causal relationship exists between the two data streams. Antigens are categorical values that can be various states of a problem domain or the entities of interest associated with a monitored system.
Figure 2.2: A state-chart describing the three states of an individual DC.

Signals are represented as vectors of real-valued numbers, and they are measures of a monitored system's status within certain time periods. In real-world applications, antigens represent what is to be classified within the given problem domain. For example, they can be process IDs in computer security problems [2, 63], a small range of positions and orientations of robots [121], the proximity sensors of online robotic systems [115], or the time stamps of records collected in biometric datasets [74]. Signals represent the system context of a host or a measure of network traffic [2, 63], the readings of various sensors in robotic systems [115, 121], or the biometric data captured from a monitored automobile driver [74]. Signals are normally pre-categorised as 'PAMP', 'Danger' or 'Safe'. The semantics of these signal categories are as follows:
• PAMP: increases in value with the observation of anomalous behaviour; it is a confident indicator of anomaly, usually presented as signatures of the events that can definitely cause damage to the system;
• Danger: reflects potential anomalies; as its value increases, the confidence of the abnormal status of the monitored system increases accordingly;
• Safe: increases in value in conjunction with observed normal behaviour; this is a confident indicator of normal, predictable or steady-state system behaviour.
Increases in the value of the Safe signal suppress the effect of the PAMP and Danger signals within the algorithm, as per what is observed in the natural system. This immunological property has been incorporated within the DCA in the form of predefined weights for each signal category, for the transformation from input signals to output signals, which are the 'CSM' and 'K' signals. The CSM signal reflects the amount of information a DC has processed, i.e. when to make decisions, while the K signal is a measure indicating the polarisation towards anomaly or normality, i.e. how to make decisions. The output signals are used to evaluate the status of the monitored system in the analysis component of the algorithm. Such a signal transformation process is displayed in Figure 2.3.

Figure 2.3: An illustration of the signal transformation process of the DCA.

In order to achieve its detection ability, the DCA initialises a population of artificial DCs operating in parallel as detectors. Each DC is given a particular limit on its lifespan, which creates a dynamic time window effect in the population [122].
This leads to the same signal and antigen data streams being processed by every DC during different time periods across the analysed time series. A temporal correlation between signals and antigens is also performed by each DC internally, to capture the causal relationship within the data. As suggested in [64], for correct correlation the signals are supposed to appear after the antigens, and the delay should be shorter than the time window created by each DC. During detection, an individual DC updates its antigen profile by storing the sampled antigens internally. In the meantime, the output signals produced by the signal transformation are accumulated, to update the DC's lifespan and signal profile. The cumulative CSM is subtracted from the DC's lifespan, which gives the difference between the amount of information initially allowed for a DC and the amount that has been processed by the DC so far. This difference indicates whether the DC has processed sufficient information and is ready to make decisions. On the other hand, the cumulative K is added to the DC's signal profile, to aggregate the polarisation towards anomaly or normality, indicated by its tendency toward +∞ or −∞. As soon as the DC's lifespan reaches zero, it stops performing signal transformation and temporal correlation. The association between the cumulative K and the sampled antigens within the DC, termed 'processed information', is then presented by the matured DC to the analysis phase. Once a matured DC has presented its processed information, a new immature DC is created in its place with default values. Here, the population size is generally kept constant, but can be user specified. The entire process of the different steps of the DCA is illustrated in Figure 2.4, and a generic implementation of the DCA is shown in Algorithm 4.
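The per-cell bookkeeping just described can be sketched as follows in Python. The weight values here are illustrative placeholders rather than the calibrated weights used in the thesis, and the class layout is an assumption made for exposition:

import numpy as np

# Illustrative weights mapping (PAMP, Danger, Safe) to the two output signals.
# The negative Safe entry in W_K realises the suppressive effect described above.
W_CSM = np.array([2.0, 1.0, 1.5])
W_K = np.array([2.0, 1.0, -3.0])

class DendriticCell:
    def __init__(self, lifespan=3.0):
        self.lifespan = lifespan  # information budget of this cell
        self.sum_k = 0.0          # signal profile: cumulative K
        self.antigens = []        # antigen profile: sampled antigen identifiers

    def update(self, signals):
        """Process one (PAMP, Danger, Safe) instance; return True once the cell matures."""
        self.lifespan -= float(W_CSM @ signals)  # subtract the cumulative CSM
        self.sum_k += float(W_K @ signals)       # accumulate the polarisation K
        return self.lifespan <= 0

dc = DendriticCell()
dc.antigens.append("pid17")
for s in [np.array([0.0, 0.1, 0.9]), np.array([0.8, 0.5, 0.1])]:
    if dc.update(s):
        print("matured; cumulative K =", round(dc.sum_k, 2), "antigens:", dc.antigens)

A matured cell would then hand its (antigens, cumulative K) pair to the analysis phase and be replaced by a fresh immature cell.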
Development Pathway

Numerous versions of the DCA have been developed over the past 10 years. In this section, a brief history of these different variants and their applications is presented. To demonstrate some of the diversity of the DCA's development, an overview of the development pathway is shown in Figure 2.5.
Figure 2.4: An illustration of the different steps of the DCA, where the initialisation and analysis steps are performed at the population level and the rest of the steps (bounded within the two vertical lines) are performed at the individual DC level.

Following the initial abstract model, the applicability of the DCA was first demonstrated in a prototype system (pDCA) [65]. Here, the pDCA was applied to a standard machine learning problem to demonstrate that this population-based algorithm was computationally capable of performing binary-class discrimination on an ordered dataset. In this application, a timestamp formed the antigen and a combination of pre-processed features composed the three signal categories. After the encouraging results achieved, the pDCA was further developed into a larger system, combined with the immune-inspired agent-based framework libtissue [157]. This version (ltDCA) [68] was stochastic in nature and contained numerous somewhat unpredictable elements, including random sampling of incoming antigen, signal decay and randomly assigned lifespans. Initially, the ltDCA was applied to problems in computer and network security, including port scan detection and sensor network security [66, 68]. While the ltDCA yielded positive results in a number of applications, it contained too many arbitrary and random components, rendering detailed study of its behaviour as an algorithm complex and difficult. In parallel to the development of the ltDCA, the algorithm steadily increased in popularity
Algorithm 4: A generic implementation of the DCA.
input : antigen and signal instances
output: antigen types and anomaly metric Kα
set DC population size;
initialise DCs;
while data do
    if antigen then
        agCounter++;
        cellIndex = agCounter % populationSize;
        DC of cellIndex assigned antigen;
        update DC's antigen profile;
    end
    if signal then
        calculate csm and k;
        foreach DC do
            DC.lifespan -= csm;
            DC.sumK += k;
            if DC.lifespan <= 0 then
                DC presents its processed information;
                replace DC with a new immature DC;
            end
        end
    end
end
calculate Kα of each antigen type;

The anomaly metric Kα of an antigen type is derived from the cumulative K values presented by all the DCs that sampled this antigen type, divided by the total number of DCs that sampled the antigen type. A threshold εm can also be applied for further classification; it is equal to the mean magnitude of the summation of the PAMP and Danger signals divided by the mean magnitude of the Safe signal in a given dataset [62].
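The analysis step can be sketched as follows in Python; this is a minimal, hedged reading of the Kα description above (the per-type averaging form is inferred from the account of aggregating cumulative K values in Section 3.4.3, and the data are invented):

from collections import defaultdict

def k_alpha(presentations):
    """presentations: one (sampled antigen types, cumulative K) pair per matured DC.
    Returns, per antigen type, the summed cumulative K values divided by the
    number of DCs that sampled that type."""
    sums, counts = defaultdict(float), defaultdict(int)
    for antigen_types, k in presentations:
        for a in set(antigen_types):
            sums[a] += k
            counts[a] += 1
    return {a: sums[a] / counts[a] for a in sums}

# Two matured DCs: antigen type 'pid17' was only seen in an anomalous context.
print(k_alpha([(["pid3", "pid17"], 4.2), (["pid3"], -1.5)]))
# {'pid3': 1.35, 'pid17': 4.2} -- a higher K_alpha indicates a more anomalous type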
3.3 Formal Methods
3.3.1 Duration Calculus
In order to formalise the DCA for the runtime analysis, several formal methods are used. Among them, duration calculus [172] is a temporal logic and calculus for describing and reasoning about properties of a real-time system [125] over time intervals. It can specify the safety properties, bounded responses and duration properties of a real-time system, which can be logically verified through induction. Unlike predicate calculus [43], which uses time points to express time-dependent state variables or observables of the specified system, duration calculus uses time intervals, with the focus on the implicit semantic level rather than the explicit syntactic level. As a result, it is more convenient and concise to use duration calculus to specify patterns or behaviour sequences of a real-time system over time intervals, compared to predicate calculus. Duration calculus was first introduced by Zhou and Hansen [171] as an extension of Interval Temporal Logic [116]. It uses continuous time for specifying desired properties of a real-time system without considering its implementation. The specifications are presented as formulas, which express the behaviour of time-dependent variables or observables of a real-time system within certain time intervals. In duration calculus specifications, both abstract high-level and detailed low-level specifications can be formulated according to the selected variables or observables. This makes it possible to specify the system from different perspectives at various levels. There are different versions of duration calculus [172], e.g. the classic duration calculus and the extended duration calculus. The following description is based on the classic duration calculus,
as it is sufficient for specifying the system presented. The syntax defining the structure of duration calculus specifications and the semantics explaining their meaning are described here. Duration calculus specifications consist of three elements: state assertions, terms and formulas. The formal definitions [125] are as follows.

Definition 1. State assertions are Boolean combinations of basic properties of state variables, defined as

P ::= 0 | 1 | X = d | ¬P | P1 ∧ P2.    (3.5)

As a state assertion, the Boolean value of the observable P can be either 0 or 1; it can contain a state variable X whose value is d of data type D; there are situations where P does not hold; and there are situations where the sub-states of P, P1 and P2, both hold. The semantics of a state assertion involves the interpretation of the time-dependent variables that occur within it. Let I be an interpretation; the semantics of a state assertion P is the function defined in 3.6,

I⟦P⟧ : Time → {0, 1}    (3.6)
where 0 or 1 represents the Boolean value of P at t ∈ Time. This can also be written as I(P)(t).

Definition 2. Terms are expressions that denote real numbers related to time intervals, defined as

θ ::= x | l | ∫P | f(θ1, ..., θn).    (3.7)

The expression above states that a term can be a global variable x, the length l of an interval, the duration ∫P of an interval during which the state assertion P holds, or the valuation of an n-ary function f. The semantics of a term depends on the interpretation of the state variables of the state assertion, the valuation of the global variables, and the given time interval.
The semantics of a term θ is defined as shown in 3.8,

I⟦θ⟧ : Val × Intv → ℝ    (3.8)

where Val stands for the valuation (V) of the global variables, and Intv is the given interval, which is defined in 3.9.

Intv ≝ {[b, e] | b, e ∈ Time and b ≤ e}    (3.9)
So this term can also be written as I⟦θ⟧(V, [b, e]).

Definition 3. Formulas describe properties of observables depending on time intervals, defined as

F ::= p(θ1, ..., θn) | ¬F1 | F1 ∧ F2 | ∀x • F1 | F1 ; F2.    (3.10)

This expression shows that a formula can be an n-ary predicate p over the terms θ1, ..., θn defined on the interval l, a formula F1 that does not hold, or two formulas F1 and F2 that both hold. The quantified part of the expression is separated by the symbol '•': it states that F1 holds for all x in the interval l. Finally, there are situations where F1 and F2 hold respectively in the subintervals of l; the symbol ';' is the chop operator used for dividing the given time interval into subintervals. The semantics of a formula involves an interpretation of the state variables, a valuation of the global variables and a given time interval, defined as shown in 3.11,

I⟦F⟧ : Val × Intv → {tt, ff}    (3.11)

where tt stands for true and ff for false. It can also be written as I⟦F⟧(V, [b, e]), which stands for the truth value of F under the interpretation I, the valuation V, and the interval [b, e].
3.3.2 Example of a Gas Burner
The authors of [133] show how to specify the external behaviour of a control program, by refining its abstract requirements in several steps until a formula is reached. The example used to demonstrate this process is the control study of a gas burner, which is regulated by a thermostat to directly control the gas valve and monitor the flame. Here we present, step by step, the process of using duration calculus to specify such a control program. In order to present the current state of the gas burner, two time-dependent and Boolean-valued state variables are defined as follows,
G, F : Time → {0, 1}    (3.12)
where G and F stand for the gas valve (gas on or off) and the flame (on or off) respectively. The values of the Boolean observables being '1' or '0' correspond to the states being on or off. The state expression modelling that gas is leaking can then be defined as,
L ::= G ∧ ¬F    (3.13)
where L is a state assertion. It states that the state of leaking is defined by the state of the gas being on and the state of the flame being off. Let [b, e] ⊆ Time be an interval; the safety requirement is that the gas must be leaking for at most 1/20 of the elapsed time, defined as

(e − b) ≥ 60 s ⇒ 20 ∫_b^e L(t) dt ≤ (e − b)    (3.14)
It states that if an interval is longer than 60 seconds, the duration of leaking must be shorter than 1/20 of that interval. In addition, a design decision in terms of safety, such as that leaks should be detectable and stoppable within one second, can be defined as,

⌈L⌉[c, d] ⇒ (d − c) ≤ 1 s    ∀c, d : b ≤ c < d ≤ e    (3.15)

where ⌈P⌉ ::= ∫_c^d P(t) dt = (d − c) > 0, indicating that P holds throughout the interval [c, d]. It states that, for all subintervals of [b, e], if a leak is detectable and stoppable, the duration of leaking [c, d] will be shorter than one second. Such a state expression, denoted by ⌈P⌉, holds throughout the entire duration of [c, d] if 0 < d − c ≤ 1 is valid. The example of the gas burner is rather simple and does not use all of the syntax and semantics of duration calculus, but it provides a demonstrative case of specifying a control program. This gives the reader a worked-out example to better understand the application of duration calculus for specifying real-time systems. The use of duration calculus for the DCA from a behavioural perspective will be further explored in Chapter 4, for the formalisation of a single-cell model of the algorithm.
3.4 Pre-Processing Techniques
3.4.1 Dimensionality Reduction
One of the most important steps in data pre-processing is dimensionality reduction [89]. It involves generating a new lower dimensional feature set that is representative of the original features. Methods used for dimensionality reduction often fall into two categories, namely feature selection or feature extraction. Feature selection involves selecting the best subset of input features from the original feature set for a given problem domain. Feature extraction, on the other hand, creates new features through transforming or combining all features in the original feature set. In feature selection, there are three sub-categories, namely ‘filters’, ‘wrappers’ and ‘embedded methods’ [75]. Filters select subsets of features as a pre-processing step that is independent of the chosen classifier. Wrappers treat the learner as a black box and score subsets of features according to their predictive power. Embedded methods perform feature selection during the training phase and are usually specific to given classifiers. Both wrappers and embedded methods depend on the chosen classifiers or learners, and they are not applicable here since no explicit learning is involved in the data pre-processing
phase of the DCA. As a result, only filter based methods and unsupervised feature extraction techniques are considered. The standard dimensionality reduction techniques explored in this investigation, namely the correlation coefficient, information gain and Principal Component Analysis (PCA), are described as follows.
Correlation Coefficient

In statistics, correlation is a means of measuring whether two continuous variables are correlated with each other. A common usage of correlation is to determine whether a feature is redundant in the original feature set. If two features are highly correlated, one of them can be replaced by the other. The correlation coefficient between a feature in the original feature set and the class labels can also indicate the predictive power of the feature. Selecting features based on their correlation coefficients with the class labels can therefore be suitable for the data pre-processing of the DCA. Let X and Y be two random variables; the correlation coefficient of X and Y can be calculated as shown in 3.16,

Corr(X, Y) = Cov(X, Y) / (σ(X)σ(Y)),    (3.16)
where Cov(X, Y) is the covariance of X and Y defined in Equation 3.17, and σ(X) and σ(Y) are the standard deviations of X and Y respectively. This equation refers to the Pearson product-moment correlation coefficient, or Pearson's correlation [25].

Cov(X, Y) = E[(X − E[X])(Y − E[Y])],    (3.17)
where E[·] is the expectation operator. The correlation coefficient, sometimes also called the cross-correlation coefficient, is a quantity that measures the quality of a least squares fit to the original data in regression [45]. It can be seen as the normalised covariance that measures the dependence between two random variables. Its value is bounded by [−1, 1]: if the value tends to 1, the two variables are increasingly positively correlated; if it tends to −1, the two variables are negatively (inversely) correlated; if it is 0, the two variables are uncorrelated. Features that are either highly positively correlated or highly negatively correlated with the class labels are usually considered useful for classification. This intuition can be used for the data pre-processing of the DCA, as it considers both a feature's predictive power and its analogy to the semantics of the algorithm's signal categories. Details are described in Section 3.4.3.
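Equations 3.16 and 3.17 translate directly into a few lines of Python; the synthetic data below are purely illustrative assumptions (labels in {−1, +1} and a feature that tracks them):

import numpy as np

def pearson(x, y):
    # Corr(X, Y) = Cov(X, Y) / (sigma(X) sigma(Y)), cf. Equations 3.16 and 3.17.
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

rng = np.random.default_rng(0)
labels = rng.choice([-1.0, 1.0], size=500)        # +1 anomalous, -1 normal
feature = 0.8 * labels + rng.normal(0, 0.5, 500)  # a feature tracking the anomaly
print(round(pearson(feature, labels), 3))         # strongly positive, close to 1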
Information Gain

In information theory, entropy is a measure of the randomness or uncertainty of a sequence of symbols drawn from a discrete distribution [45]. In certain cases, the more new information a feature contains, the more representative it might be of the problem domain. Let {v1, v2, ..., vm} be a discrete set of symbols associated with probabilities pi, where i ∈ {1, 2, ..., m}. The entropy of a random variable is defined as shown in 3.18.

H = −∑_{i=1}^{m} p_i log₂ p_i    (3.18)
The above calculation can also be presented as H = E[log 1/P], where P is a random variable whose possible values are p1, p2, ..., pm. This suggests that the entropy does not depend on the symbols themselves, but rather on their probabilities. If a random variable is drawn from a continuous distribution, Equation 3.18 can be extended as Equation 3.19,

H = −∫_{−∞}^{∞} p(x) log p(x) dx,    (3.19)
where p(x) is the probability density function of a random variable X, and x ∈ X. The relative entropy, also known as the Kullback-Leibler divergence, can be used to measure the distance between two distributions of the same variable. Let p(x) and q(x) be two probability density functions of the same variable x. The discrete version and the continuous version of the relative entropy are defined in Equation 3.20 and Equation 3.21,

KL(p(x), q(x)) = ∑_{x∈X} q(x) ln (q(x) / p(x))    (3.20)
KL(p(x), q(x)) = ∫_{−∞}^{∞} q(x) ln (q(x) / p(x)) dx    (3.21)
where KL(p(x), q(x)) ≥ 0, and KL(p(x), q(x)) = 0 if and only if p(x) = q(x). Relative entropy is not symmetric, that is, KL(p(x), q(x)) ≠ KL(q(x), p(x)). In feature selection, a measure of the uncertainty between two random variables is more commonly used. Let X and Y be two random variables with probability density functions p(x) and q(y) respectively, where x ∈ X and y ∈ Y. The mutual information is the reduction in uncertainty about one variable X due to knowledge of the other variable Y. It is calculated as shown in Equation 3.22,
I(p; q) = H(p) − H(p|q) = ∑_{x∈X} ∑_{y∈Y} r(x, y) log ( r(x, y) / (p(x)q(y)) ),    (3.22)
where r(x, y) is the joint probability of finding the values x and y at the same time. Mutual information is essentially the relative entropy between the joint distribution r(x, y) and the product distribution p(x)q(y). Variables X and Y are statistically independent if and only if r(x, y) = p(x)q(y). Therefore, mutual information measures how much the distribution of the variables differs from statistical independence. In the case of continuous variables, whose probability density functions are difficult and sometimes impossible to obtain, one can discretise the variables or approximate their probability density functions with a nonparametric method such as Parzen windows [156]. Mutual information can be applied as a measure between a feature in the original feature set and the class labels for feature selection, where it is also known as 'information gain'. Here p(x) represents the variable of the class labels and q(y) is any feature of the original feature set. Information gain measures how much the distribution of a feature is relevant to that of the class labels, and it is closely related to the feature's predictive power. As a result, selecting the features that have a higher information gain as the input to the DCA can be an effective dimensionality reduction method.
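As a hedged illustration of Equation 3.22 applied to feature scoring (the equal-width binning and the synthetic data are assumptions made for the example, not part of the method as specified):

import numpy as np

def information_gain(feature, labels, bins=8):
    # Mutual information I(label; feature), cf. Equation 3.22, with the continuous
    # feature discretised into equal-width bins.
    f = np.digitize(feature, np.histogram_bin_edges(feature, bins=bins))
    mi = 0.0
    for x in np.unique(labels):
        for y in np.unique(f):
            r = np.mean((labels == x) & (f == y))  # joint probability r(x, y)
            if r > 0:
                p = np.mean(labels == x)           # p(x): class label distribution
                q = np.mean(f == y)                # q(y): feature bin distribution
                mi += r * np.log2(r / (p * q))
    return mi

rng = np.random.default_rng(1)
labels = rng.choice([-1, 1], size=1000)
informative = labels + rng.normal(0, 0.3, 1000)
noise = rng.normal(0, 1.0, 1000)
print(information_gain(informative, labels) > information_gain(noise, labels))  # True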
Principal Component Analysis

One of the best known methods for feature extraction is Principal Component Analysis (PCA). It can be defined as the orthogonal projection of the data onto a lower dimensional linear space, known as the principal subspace, such that the variance of the projected data is maximised [85]. It is a mathematical operation that transforms a finite number of possibly correlated vectors into a smaller number of uncorrelated vectors, termed 'principal components'. PCA provides a simple way of reducing complex data to a lower dimension to reveal any hidden or simplified structure. It involves evaluating the sample set mean and the covariance matrix of a given dataset, to find the eigenvectors corresponding to the largest eigenvalues. In addition to the reduction of noise in the given dataset by maximising the variance, PCA can also reduce redundancy within the data, which is measured by covariance. A high covariance between two random variables indicates that one of them can be predicted from the other and is thus redundant. PCA produces a list of principal components ranked by their variances, which links to the DCA as the DCA's signal transformation process is sensitive to changes in the data streams. Given an original feature set X (an n × m matrix), the transformation of PCA is defined in Equation 3.23,
Y = PX    (3.23)
where Y is the derived n × m matrix of the new feature set, and P is an orthogonal matrix whose columns are eigenvectors. PCA aims to maximise the variance and minimise the covariance of the input data X. It can be seen as a process of finding an orthogonal matrix P such that C_P ≡ (1/(n−1)) YYᵀ is diagonalised. In this way, the diagonal elements of C_P are maximised, and thus the variance is maximised; additionally, all the non-diagonal elements are equal to zero, and thus the covariance is minimised. PCA is intimately related to an algebraic technique known as Singular Value Decomposition (SVD) [143]. In fact, a commonly used solution to PCA is performed through
SVD. Let XᵀX be a rank r, square, symmetric m × m matrix, {v̂1, v̂2, ..., v̂r} be the set of orthonormal m × 1 eigenvectors with the associated eigenvalues {λ1, λ2, ..., λr}, σi ≡ √λi be the singular values, and {û1, û2, ..., ûr} be the set of orthonormal n × 1 vectors defined by ûi ≡ (1/σi) X v̂i. We can construct a diagonal matrix Σ, where the diagonal entries are the rank-ordered set of singular values (σ1 ≥ σ2 ≥ ... ≥ σr). Similarly, we can also construct two orthogonal matrices V = [v̂1 v̂2 ... v̂m] and U = [û1 û2 ... ûn], each filled up with additional orthonormal vectors as required. As a result, X can be represented in the form of the singular value decomposition given in Equation 3.24, which states that any arbitrary matrix X can be converted into an orthogonal matrix, a diagonal matrix and another orthogonal matrix [143].

X = UΣVᵀ    (3.24)
When PCA is applied to classification problems, a common approach in the literature [90] is to obtain P through Y = PX from the training set X, and then apply this to the testing set X* using Y* = PX*. Here P can be seen as a weight matrix acquired through training. The data instances in the testing set are projected onto the principal subspace through this transformation, where both the noise and the redundancy of the input data can be reduced to a minimum.
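A minimal Python sketch of this train-then-apply usage, computing the principal directions via SVD of the centred training data (the centring step and the synthetic data are assumptions made for illustration; the text above does not fix these details):

import numpy as np

def pca_fit(X, d):
    # Learn the d x m projection P from the training set: the top-d right singular
    # vectors of the centred data, i.e. eigenvectors of X^T X (cf. Equation 3.24).
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return Vt[:d], mean

def pca_apply(P, mean, X):
    # Project instances onto the principal subspace: Y = PX in the text's notation.
    return (X - mean) @ P.T

rng = np.random.default_rng(2)
train = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features
test = rng.normal(size=(50, 5))
P, mu = pca_fit(train, d=3)
Y_train, Y_test = pca_apply(P, mu, train), pca_apply(P, mu, test)
print(Y_train.shape, Y_test.shape)  # (200, 3) (50, 3)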
3.4.2 Statistical Inference
In statistical inference, a method for quantifying the difference between an estimator and the true values of the quantity being estimated is the Mean Square Error (MSE). Let a random sample (x1, x2, ..., xn) be drawn from a probability distribution f(x; θ). The use of x1, x2, ..., xn to estimate θ involves finding a function of x1, x2, ..., xn, denoted θ̂(x1, x2, ..., xn), that best represents θ. The corresponding function θ̂(X1, X2, ..., Xn) is called an estimator of θ. MSE is one of the ways of assessing the quality of an estimator. It is defined in Equation 3.25 [58].

MSE(θ̂) = E[(θ̂ − θ)²]    (3.25)
MSE(θ̂) is a property of an estimator θ̂ that takes account of both its bias and variance, as shown in Equation 3.26,

MSE(θ̂) = E[{(θ̂ − θ̄) + (θ̄ − θ)}²]
        = E[(θ̂ − θ̄)²] + E[(θ̄ − θ)²] + 2(θ̄ − θ)E[(θ̂ − θ̄)]
        = Var(θ̂) + [bias(θ̂)]² + 0    (3.26)

where θ̄ = E(θ̂), and bias(θ̂) = E[θ̂ − θ] is the bias of θ̂; if E[θ̂] = θ, then θ̂ is an unbiased estimator. According to Equation 3.26, if bias → 0 and variance → 0 as n → ∞, then MSE → 0 as n → ∞. As a result, if the MSE of θ̂ is minimised, both its bias and variance are kept to a minimum and the quality of θ̂ as an estimator is maximised. MSE can be applied to the signal categorisation step of the DCA, after signal selection has been performed through dimensionality reduction. For instance, a new feature set is either selected or extracted from the original feature set of a given dataset, and subsequently mapped to the input signals of the DCA. In order to judge whether the current settings are effective, one can compare the predictions produced by the system with the truth, represented by the class labels. The lower the MSE the system produces, the better predictive power it has. Here, the predictions of the system form θ̂, which is an estimator of the truth θ whose elements are the class labels.
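In code, this comparison is a one-liner; the K̂ values below are invented stand-ins for normalised DCA outputs:

import numpy as np

def mse(estimates, truth):
    # MSE(theta_hat) = E[(theta_hat - theta)^2], cf. Equation 3.25.
    return np.mean((np.asarray(estimates) - np.asarray(truth)) ** 2)

labels = np.array([1, 1, -1, -1, 1])           # theta: class labels in {+1, -1}
k_hat = np.array([0.7, 0.9, -0.8, -0.2, 0.4])  # theta_hat: normalised K values in [-1, 1]
print(mse(k_hat, labels))                      # lower values indicate better predictions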
3.4.3 Automated Data Pre-Processing Methods
Three automated data pre-processing methods are developed through combining the dimensionality reduction and statistical inference techniques; an illustration of the integrated system is displayed in Figure 3.1. Let X be an n × m matrix (n data instances and m dimensions) that represents the input data space of a given problem domain, where each entry xij ∈ ℝ, and i and j are the indices of rows and columns respectively. Dimensionality reduction involves generating a new feature space S that is an n × d matrix with d < m, through either feature selection or feature extraction. This new
feature set S should be as representative of the original feature set X as possible; that is, the information loss due to dimensionality reduction should be kept to a minimum. At the same time, the new feature set S should contain as little noise and redundancy as possible; that is, dimensionality reduction is intended to remove as much noise and redundancy as possible from the original feature set X. After dimensionality reduction, the features in S need to be grouped into one of the three signal categories of the DCA. Depending on whether a statistical inference technique is used, two data pre-processing approaches are derived, namely the 'regression based approach' and the 'inference based approach'. In addition, according to the dimensionality reduction technique applied, three methods are developed, namely the 'correlation based', 'information gain based' and 'PCA based' methods.
Regression Based Approach

Based on the DCA's defined semantics of signal categories, the magnitude of the PAMP and Danger signals is correlated with anomalous situations, while the magnitude of the Safe signal is correlated with normal situations, or inversely correlated with anomalous situations. In addition, the PAMP signal usually presents a higher correlation than the Danger signal with anomalous situations. Anomalous situations and normal situations within a binary labelled dataset can be indicated by class labels, where '+1' stands for anomalous and '−1' stands for normal. As a result, the correlation coefficient between each feature in the original feature set and the class labels can be used for signal selection and categorisation. The principle for signal selection and categorisation is as follows:
• The feature with the highest positive correlation coefficient is the PAMP signal;
• The feature with the second highest positive correlation coefficient is the Danger signal;
• The feature with the lowest negative correlation coefficient is the Safe signal.
Figure 3.1: An illustration of the integrated system, where the modified pre-processing and post-processing phases are presented. The rectangle boxes in grey colour represent the optional components of each step.
As the correlation based method performs both signal selection and signal categorisation, the statistical inference techniques used for signal categorisation are not required here. This method is designed to identify the features that are most appropriately allocated to each signal category according to their correlation coefficients with the class labels. However, whether simple evaluation of the correlation coefficients is sufficient to encode the characteristics of each signal category is still uncertain. As a negative sign is assigned to the Safe signal, there is no guarantee that a linear combination of highly correlated features still results in a high correlation. For example, let p, d, s ∈ [0, 1]^n be the chosen features for the PAMP, Danger and Safe signals respectively. If the predefined weight vector for signal transformation is (2, 1, −3)ᵀ, the resulting linear combination 2p + d − 3s is not necessarily highly correlated with the class labels. More details will be discussed in Chapter 6.
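The selection principle above can be sketched as follows (a hedged illustration; the synthetic features and the use of np.corrcoef are choices made for the example, not mandated by the method):

import numpy as np

def categorise_by_correlation(X, labels):
    # Highest positive correlation -> PAMP, second highest -> Danger,
    # lowest (most negative) -> Safe; returns column indices of X.
    corr = [np.corrcoef(X[:, j], labels)[0, 1] for j in range(X.shape[1])]
    order = np.argsort(corr)  # ascending: most negative first
    return {"PAMP": int(order[-1]), "Danger": int(order[-2]), "Safe": int(order[0])}

rng = np.random.default_rng(3)
y = rng.choice([-1.0, 1.0], 300)
X = np.column_stack([y + rng.normal(0, 0.3, 300),   # strong anomaly indicator
                     y + rng.normal(0, 0.8, 300),   # weaker anomaly indicator
                     -y + rng.normal(0, 0.3, 300),  # normality indicator
                     rng.normal(0, 1.0, 300)])      # pure noise
print(categorise_by_correlation(X, y))  # {'PAMP': 0, 'Danger': 1, 'Safe': 2}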
Inference Based Approach

Unlike with the correlation based method, it is difficult to relate the features derived from the information gain based or PCA based methods to any of the signal categories of the DCA. The ultimate objective of data pre-processing is to generate proper input data that will eventually result in high classification accuracy. As a result, any method that can minimise the errors of prediction by searching for the best combination of derived features is able to fulfil this objective. In the DCA, Kα is used as the anomaly metric for determining whether an antigen type is anomalous or not: the higher the Kα, the more anomalous an antigen type is, and vice versa. Therefore, in order to maximise the classification performance, the distance between the Kα values of anomalous and normal antigen types should be maximised. The Kα of an antigen type is calculated by aggregating the cumulative K values associated with all the DCs that sampled this antigen type. Each DC individually acts as a moving time window which sums and averages the K values transformed from the input signal instances within certain time intervals.
To maximise the distance between anomalous Kα values and normal Kα values, it is necessary to maximise the distance between the cumulative K values of anomalous situations and the cumulative K values of normal situations. The class labels indicate whether a period in the time series is normal or anomalous. As a result, if the difference between the sequence of class labels and the sequence of K values within a duration is minimised, high classification performance can be expected. We adopt the notation used in pattern recognition and regression estimation to explain this process. Suppose we are given a training set (x1, y1), ..., (xn, yn) ∈ X × Y, where X ∈ ℝᵐ (m-dimensional) and Y ∈ {±1}. Each pair (xi, yi) represents a sample or data instance with the associated truth or class label. The task is to predict the output y for a previously unseen input x contained in a testing set, and the performance of the predictions is often measured by a defined loss function. The loss function that measures the empirical error of the DCA predictions on the given input data space X is defined in 3.27,

L_X(f(xi), yi) = (1/n) ∑_{i=1}^{n} (k̂i − yi)²,    (3.27)
where f(x_i) is the function that transforms an input data instance x_i ∈ X to the output signal of the DCA, the normalised K signal k̂_i, with respect to the range of the class label or truth y_i. This involves two steps: firstly, the data instance x_i is transformed to a signal instance s_i ∈ S through data pre-processing; secondly, the signal instance is transformed, through the signal transformation function of the DCA using the associated weight matrix, to the output signal k̂_i. The Mean Squared Error (MSE) can be applied to assess the distance or dissimilarity between the distributions of two random variables. Let θ be the distribution of the class labels y_i derived from a given dataset and θ̂ be the distribution of the k̂_i values that are transformed by the DCA from input signal instances. They are defined as θ = {y_i | y_i ∈ {±1}} and θ̂ = {k̂_i | k̂_i ∈ [−1, 1]}. The loss function in 3.27 is essentially equivalent to MSE(θ̂). As a result, the task now becomes the search for the combination of derived features that produces an estimator θ̂ with the minimum MSE(θ̂). The distance between this
estimator θ̂ and θ is minimised, and thus the best combination for signal categorisation can be found. Candidate features derived from PCA or information gain are evaluated using MSE to search for the optimal combination that minimises the error of predictions. PCA uses all the original features and transforms them onto the principal subspace through linear combination. The first few principal components are often used to capture the majority of the underlying characteristics within the dataset, due to their extensive contributions to the variance. Conversely, the information gain based method chooses a subset of the original feature set, in which the features have significantly higher information gains than the others. For the sake of fairness, the number of features chosen by the information gain based method is equal to the number of principal components used in the PCA based method. The use of MSE for signal categorisation is as follows. Let A = {A_1, A_2, ..., A_2d} be a set of features, where A¹ = {A_1, A_2, ..., A_d} contains the features generated through dimensionality reduction (thus the column vectors of S) and A² = {A_{d+1}, A_{d+2}, ..., A_{2d}} consists of their inverses. The proportionality of a derived feature with respect to each signal category can be either positive or negative, so both the feature and its inverse are included. The task is to search for a set A′ ⊂ A such that the θ̂ calculated from its elements gives the minimum MSE(θ̂). In the systems currently implemented, d = 3, so the PCA based method uses the first three principal components to transform the original feature set to A¹ on the principal subspace, whereas the information gain based method chooses the three features with the highest information gain to derive A¹. In each iteration of the search, three elements A_i, A_j, A_k (i ≠ j ≠ k) are selected from the set A, and θ̂ and MSE(θ̂) are derived. This process repeats until the estimator θ̂ that produces the minimum MSE(θ̂) is found. As a result, the error of predictions of the DCA is minimised and high classification accuracy can be expected. The combination of features that produces the smallest error of prediction during training can be identified. This knowledge can then be applied to
the testing phase when effective signal categorisation is needed. The disadvantage of this approach is that it can only handle three features at a time, and the greedy search raises issues of scalability and time requirements.
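To make the procedure concrete, the sketch below outlines the greedy search under simplifying assumptions: the candidate features are taken to be pre-normalised columns of a matrix, and the per-DC temporal aggregation of the DCA is approximated by a direct weighted combination with an assumed weight vector. The names and the weight values are illustrative, not those of the original implementation.

import numpy as np
from itertools import permutations

def greedy_signal_search(features, labels, weights=(2.0, 1.0, -3.0)):
    # features: n x d matrix of derived candidate features (columns of S);
    # labels: class labels in {-1, +1} as a numpy array.
    n, d = features.shape
    candidates = np.hstack([features, -features])    # set A: features and inverses
    best_combo, best_mse = None, np.inf
    for i, j, k in permutations(range(2 * d), 3):    # ordered triples, i != j != k
        k_hat = (weights[0] * candidates[:, i]
                 + weights[1] * candidates[:, j]
                 + weights[2] * candidates[:, k])
        scale = np.max(np.abs(k_hat))
        if scale > 0:
            k_hat = k_hat / scale                    # normalise into [-1, 1]
        mse = np.mean((k_hat - labels) ** 2)         # MSE of the estimator
        if mse < best_mse:
            best_combo, best_mse = (i, j, k), mse
    return best_combo, best_mse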
3.5 Online Analysis Techniques
3.5.1 Segmentation Approaches
Another issue with the DCA is the offline analysis phase, which is performed only after all the data has been processed. This severely limits the algorithm's online detection capability, as an online detection system often performs analysis in a periodic and continuous fashion [72]. To develop the DCA into an online detection system, an online analysis component clearly needs to be developed. If online analysis is to be performed during detection, one issue needs to be resolved, namely when to perform the analysis. This can be overcome through segmentation as a means of implementing online analysis. Segmentation in this context means the process of dividing a sequence of information into smaller, equally sized pieces. As processed information is presented by matured DCs over time, a sequence of processed information is generated during detection. Segmentation involves partitioning this sequence into relatively small segments, in terms of either the number of data items or time. All the generated segments have an identical size, and the analysis is performed within each individual segment. Therefore, in each segment one set of detection results is generated, in which the intrusions that appeared within the duration of this segment can be identified. Two segmentation approaches are applied, namely 'antigen based segmentation' and 'time based segmentation'.
3.5.2 Antigen Based Segmentation
The number of sampled antigens indicates the number of potential suspects sampled by the system, that is, the quantity of antigens to be classified. As certain antigen types may appear at different points across the entire time series, it is advisable to identify whether they
are responsible for the current situation of the monitored system as soon as sufficient information is obtained for decision making. The number of sampled antigens can be used as an indicator of whether an analysis of the current batch of processed information should be performed. The antigen based segmentation approach creates a segment whenever the number of sampled antigens reaches a predefined segment size, and the analysis is performed within this segment. Similar work was done in [101], in which the overall network traffic is partitioned into subsets of manageable size, and the analysis is performed within each partition.
3.5.3 Time Based Segmentation
The processed time determines the quantity of evidence that can be used to support classification, as signal instances are fed into the system on a regular basis, for example once per second. The accumulated evidence can improve the decision making regarding the current situation of the monitored system. Therefore, the processed time can be used as another indicator of when to perform an analysis on the current batch of processed information. The time based segmentation approach sets the segment size based on the processed time, which implies the quantity of signal instances processed. It creates a segment whenever the defined time period elapses, and the analysis is again performed within each segment. This approach is commonly used in real-time robotics control, for example to periodically compute the next steering command in motion planning to avoid collisions [56]. Both segmentation approaches can also be effectively applied to partitioning a large dataset into smaller segments. The main challenge of processing large datasets involves insufficient computational power and extensive memory usage. Once segmented, the amount of computational power and memory space required for each segment is reduced significantly. As a result, large datasets become more manageable in terms of the computational resources required. As in the application to online analysis, segments of the given dataset can be generated based on either the number of processed
data instances or the amount of elapsed time, as the sketch below illustrates. Further details of both segmentation approaches will be covered in Chapter 5.
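A minimal sketch of the two segmentation rules is given below, assuming the stream of matured-DC output records can be iterated over; the record layouts and function names are hypothetical.

def antigen_based_segments(output_stream, segment_size, analyse):
    # Antigen based segmentation: analyse whenever the number of presented
    # antigens in the current segment reaches segment_size.
    segment = []
    for record in output_stream:          # record = (antigen_type, k_value)
        segment.append(record)
        if len(segment) >= segment_size:
            analyse(segment)              # one set of results per segment
            segment = []
    if segment:
        analyse(segment)                  # flush the final partial segment

def time_based_segments(output_stream, period, analyse):
    # Time based segmentation: analyse whenever the defined time period
    # elapses; record = (timestamp, antigen_type, k_value).
    segment, window_end = [], None
    for record in output_stream:
        if window_end is None:
            window_end = record[0] + period
        elif record[0] >= window_end:
            analyse(segment)
            segment, window_end = [], record[0] + period
        segment.append(record)
    if segment:
        analyse(segment)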
3.6 Machine Learning Techniques
3.6.1 Supervised Learning Techniques
The development of automated data pre-processing methods involves a comparison to manual methods. Additionally, to ensure the experimental results are meaningful, three standard machine learning techniques are applied to the same datasets to obtain baseline performance. One of these techniques is also intimately related to the future work of the DCA. These machine learning techniques are K-Nearest-Neighbour (KNN), Decision Trees and Support Vector Machines (SVM). KNN and Decision Trees are chosen for their simplicity, whereas SVM is chosen for its state-of-the-art performance.
3.6.2 K-Nearest-Neighbour
A general goal of learning machines is to generalise a mathematical model from the training data which best approximates the distribution from which all data points are drawn. This can be seen as density estimation, where the density function p(x) is estimated from the given data points in the training set. The K-Nearest-Neighbour (KNN) technique is a popular technique for this purpose, and can be easily applied to a range of static classification problems. Assume a dataset contains N data points, in which N_k data points belong to class C_k, so that Σ_k N_k = N. In order to classify a new data point x, we can draw a sphere centred on x containing K points irrespective of their classes. Let V be the volume of this sphere and K_k the number of points from class C_k. Based on the properties of kernel density estimators
and Bayes’s theorem [10], the density estimation of each class is defined in 3.28.
p(x|C_k) = K_k / (N_k V)    (3.28)
The unconditional density and the class priors can be easily derived from Equation 3.28, and they are given in Equation 3.29.
p(x) = K / (N V)  and  p(C_k) = N_k / N    (3.29)
By combining Equation 3.28 and Equation 3.29, we obtain the posterior probability of class membership, as shown in Equation 3.30.
p(C_k|x) = p(x|C_k) p(C_k) / p(x) = K_k / K    (3.30)
The classification of a new data point is performed by assigning it to the class C_k that gives the highest posterior probability, corresponding to the largest value of K_k/K. In simple terms, the KNN algorithm works like a voting system, where the K points included in the sphere are the voters and the data point x is the object they vote on. As a result, if the majority of the voters consider that x belongs to class k, it is assigned the class label C_k. An interesting property of the KNN algorithm when K = 1 is that, as N → ∞, the error rate is never more than twice the minimum error rate achieved by an optimal classifier that uses the true class distributions [29]. The KNN algorithm generally has an advantage in terms of classification accuracy; however, it requires the entire training set to be stored during runtime. As a result, if applied to large datasets, the KNN algorithm can be slower and more computationally expensive than other machine learning techniques.
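The voting view translates directly into code. The following brute-force sketch (numpy assumed; not an optimised implementation) classifies a point by the largest K_k/K of Equation 3.30:

import numpy as np

def knn_classify(x, train_X, train_y, K=5):
    # Posterior p(C_k|x) = K_k / K: find the K nearest training points
    # and assign the majority class among them.
    dists = np.linalg.norm(train_X - x, axis=1)    # Euclidean distances
    nearest = train_y[np.argsort(dists)[:K]]       # labels of the K nearest
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[np.argmax(counts)]              # class with the largest K_k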
3.6.3 Decision Trees
Another common technique is Decision Tree classification. It involves partitioning the input space into a number of cuboid regions, the edges of which are aligned with the axes, and each region is assigned a model or simply a constant. The prediction or classification of a new data point is performed by identifying which particular region it belongs to. This decision making process corresponds to the traversal of a binary tree that splits into two branches at each node. One such tree-based technique is known as classification and regression trees (CART) [12]. In a decision tree, each node represents a feature or variable in the input data associated with a certain property, such as x > θ, on which a binary decision can be made. The binary decision made at each node produces outcomes, called splits, which correspond to splitting a subspace of the input space. Let t be a target variable to predict from a D-dimensional vector x = (x_1, x_2, ..., x_D)^T of input features. The training data contain input vectors {x_1, x_2, ..., x_N} associated with a set of labels {t_1, t_2, ..., t_N}. The objective is to create a tree in which the sum-of-squares error function is minimised, so the optimal target value of prediction within any given region is the average of the values of t_n for the data points included in that region. The training phase involves determining the input feature chosen for each node to form the split criterion, as well as the value of the threshold θ for a split. The process is listed below:
1. Select a feature as the root node of the decision tree, based on a measure of how informative each feature is;
2. Grow the tree by adding one node at a time, with the choice of which feature to split and the value of the threshold determined by the local average of the data;
3. Repeat step 2 for all possible features to be split, until the one that gives the minimum sum-of-squares error is found by exhaustive search;
4. Keep growing the tree until the number of data points associated with the leaf nodes reaches a defined threshold;
5. Prune back the resulting tree based on a criterion that optimises the balance between the sum-of-squares error and the model complexity.
Let the resulting tree before pruning be T_0, and let T ⊂ T_0 be a subtree of T_0 if it can be obtained by pruning nodes from T_0. In addition, the leaf nodes are indexed by τ = 1, 2, ..., |T|, where leaf node τ represents a region R_τ of the input space that contains N_τ data points and |T| is the total number of leaf nodes. The optimal prediction for region R_τ is defined in Equation 3.31.
y_τ = (1/N_τ) Σ_{x_n ∈ R_τ} t_n    (3.31)
The corresponding sum-of-squares error of the prediction is given in Equation 3.32.

Q_τ(T) = Σ_{x_n ∈ R_τ} (t_n − y_τ)²    (3.32)
The criterion for pruning back leaf nodes is defined in Equation 3.33.
C(T) = Σ_{τ=1}^{|T|} Q_τ(T) + λ|T|    (3.33)
where λ is a regularisation parameter that determines the trade-off between the residual sum-of-squares error and the model complexity measured by the number of leaf nodes |T|; its value is normally derived from cross-validation. The sum-of-squares error is the commonly used measure of performance when decision tree algorithms are applied to regression tasks, while the cross-entropy and the Gini index are the two popular measures for classification tasks [45]. They are defined in Equation 3.34 and Equation 3.35 respectively.
Q_τ(T) = − Σ_{k=1}^{K} p_{τk} ln p_{τk}    (3.34)
Q_τ(T) = Σ_{k=1}^{K} p_{τk} (1 − p_{τk})    (3.35)
where p_{τk} represents the proportion of data points in region R_τ assigned to class k (k = 1, 2, ..., K). Apart from the measures of performance, the process of growing and pruning the tree is essentially identical in decision tree algorithms for both regression and classification tasks. During tree growing, the cross-entropy and the Gini index show advantages over the misclassification rate for the following reasons: firstly, they are sensitive to the node probabilities; secondly, they are both differentiable, and thus suitable for gradient based optimisation methods. However, during tree pruning no such advantages apply, so the misclassification rate is generally used.
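For illustration, the three node-impurity measures can be computed from a vector of class proportions p_{τk} as in the sketch below (numpy assumed; the sign convention for the cross-entropy follows Equation 3.34):

import numpy as np

def cross_entropy(p):
    # Cross-entropy impurity of Equation 3.34; zero proportions are
    # skipped to avoid ln(0).
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def gini(p):
    # Gini index of Equation 3.35.
    return np.sum(p * (1 - p))

def misclassification(p):
    # Misclassification rate, typically used when pruning.
    return 1.0 - np.max(p)

p = np.array([0.8, 0.1, 0.1])   # example class proportions in one region
print(cross_entropy(p), gini(p), misclassification(p))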
3.6.4 Support Vector Machines
Support Vector Machine (SVM) models use a linear discriminant function for classification, kernel methods to deal with nonlinearly separable data, and quadratic optimisation to find the optimal decision boundary. A generic description of SVM is given below.
Linear Discriminant Function
A discriminant is a function that assigns an input vector x to one of K classes, denoted C_k, through some transformation. One such discriminant is the linear discriminant function, defined in Equation 3.36.
y(x) = w^T x + b    (3.36)
where w is the weight vector associated with the transformation and b is known as the bias. In the case of binary classification (K = 2), a data point x is assigned to C_1 if y(x) > 0, and to C_2 otherwise. The decision boundary of classification is thus defined by the relation y(x) = 0, which corresponds to an (m − 1)-dimensional hyperplane within an m-dimensional input space. Figure 3.2 shows an example where data points
Figure 3.2: An example of the linear discriminant function in two-dimensional space, where the red dashed line represents the decision boundary found.
of two classes are represented as green circles and blue squares respectively, and they are separated by the decision boundary represented by the red dashed line, which is perpendicular to the weight vector w. If the data points of an m-dimensional dataset are linearly separable, one can always find an (m − 1)-dimensional hyperplane to separate the two classes. Let x_1 and x_2 be two data points lying on the decision surface, that is, y(x_1) = y(x_2) = 0. This implies w^T(x_1 − x_2) = 0, and therefore the weight vector w is orthogonal to the decision surface and to any vector lying within it. Similarly, if x is a point on the decision surface, the normal distance from the origin to the decision surface can be calculated from x, as defined in Equation 3.37.

w^T x / ‖w‖ = − b / ‖w‖    (3.37)
It is apparent that the intercepts of the decision boundary are determined by the bias parameter b. In addition, the value of y(x) provides a signed measure of the perpendicular distance r of the data point x from the decision surface. Let x⊥ be the orthogonal
projection of the data point x onto the decision boundary; its relationship to the perpendicular distance r is defined in Equation 3.38.
x = x_⊥ + r w/‖w‖    (3.38)
From Equation 3.38 we can derive the expression for r by multiplying both sides by w^T and then adding b to both sides; this is given in Equation 3.39.
r = y(x) / ‖w‖    (3.39)
Kernel Methods
According to Cover's theorem [30], data that is nonlinearly separable is more likely to become linearly separable when projected onto a higher dimensional space. Additionally, Mercer's theorem [112] states that any continuous, symmetric, positive semidefinite kernel function k(x, x′) can be expressed as a dot product in a high-dimensional space, where x and x′ are two samples from the input feature space X. A kernel function is defined as in Equation 3.40.

k(x, x′) = Φ(x)^T Φ(x′)    (3.40)
where Φ is a feature space mapping from the original input feature space X to a possibly higher dimensional dot product feature space F. In order to produce kernels, a positive semidefinite matrix K_ij := k(x_i, x_j) should be formed first [136]. Since k(x_i, x_j) = Φ(x_i)^T Φ(x_j), K_ij can be derived by transforming, through the feature map Φ, a matrix whose entries are the dot products of x_i and x_j. Data points that are nonlinearly separable are thereby replaced by the entries of K_ij, which are linearly separable in the transformed space F. Therefore, with kernel methods, in theory any nonlinear learning task can be transformed into a linear one, resulting in a problem that is easier to solve and computationally less expensive. One of the most commonly used
kernel functions is the Gaussian kernel, defined in Equation 3.41.

k(x, x′) = exp(−‖x − x′‖² / (2σ²))    (3.41)
In addition to the Gaussian kernel, other alternatives exist, including the linear kernel, the polynomial kernel and the hyperbolic tangent kernel [16]. It is advisable to start with the Gaussian kernel in practical applications; if it fails to produce satisfactory results, one can switch to other options.
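As an illustration, the Gram matrix K_ij of the Gaussian kernel in Equation 3.41 can be computed for a whole dataset at once; the numpy sketch below uses the expansion ‖x_i − x_j‖² = ‖x_i‖² + ‖x_j‖² − 2 x_i^T x_j.

import numpy as np

def gaussian_gram(X, sigma=1.0):
    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) over all
    # pairs of rows of X.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * sigma ** 2))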
Support Vector Machines
Suppose that in an input feature space X there are n observations given by a trusted source, each of which consists of a vector x_i ∈ R^m, i = 1, 2, ..., n, and an associated truth y_i ∈ {±1}. If the data points x_i are linearly separable, then a hyperplane y(x) = w^T x + b = 0 can be found to separate the positive examples from the negative ones. Any data point lying on the hyperplane satisfies Equation 3.37 and Equation 3.39. Figure 3.3 shows an example of a separable case in a two-dimensional space, where green triangles and blue squares are the data points of the two classes, and H_1 and H_2 are two parallel hyperplanes generated by the SVM on which a series of support vectors are located. The Support Vector Machine (SVM) generates the hyperplane that maximises the margin. This can be formulated as follows.
w^T x_i + b ≥ +1  for y_i = +1    (3.42)
w^T x_i + b ≤ −1  for y_i = −1    (3.43)
If we combine Equation 3.42 and Equation 3.43, a set of inequalities can be derived, as shown in Equation 3.44.

y_i (w^T x_i + b) − 1 ≥ 0  ∀i    (3.44)
Figure 3.3: An example of a separable problem for SVM in a two-dimensional space, where the two red dashed lines represent the two hyperplanes that define the maximised margin, and the data points lying on either of the hyperplanes are the support vectors.
Data points that satisfy Equation 3.42 lie on the hyperplane H_1 : w^T x_i + b = 1, whose perpendicular distance from the origin is |1 − b|/‖w‖. Similarly, data points that satisfy Equation 3.43 lie on the hyperplane H_2 : w^T x_i + b = −1, whose perpendicular distance from the origin is |−1 − b|/‖w‖. The margin formed by the pair of parallel hyperplanes H_1 and H_2 is equal to 2/‖w‖. As a result, the learning task becomes finding the pair of hyperplanes that gives the maximum margin by minimising ‖w‖², subject to the constraints given in Equation 3.44. In order to find the solution, optimisation based on Lagrange multipliers is performed. The constraints in 3.44 are replaced by the Lagrange multipliers, and the optimisation problem becomes easier to solve. In addition, the training data in the reformulated problem appear only in the form of dot products between input vectors, so kernel methods can be applied for nonlinearly separable cases. Let α_i > 0, i = 1, 2, ..., n, be the Lagrange multipliers corresponding to the constraints in 3.44. If the constraints c_i ≥ 0 hold, the Lagrangian can be formed by subtracting from the objective function the products of the constraint equations and the positive Lagrange multipliers. This
is called the primal form, as shown in Equation 3.45.

L_P = (1/2)‖w‖² − Σ_{i=1}^{n} α_i y_i (w^T x_i + b) + Σ_{i=1}^{n} α_i    (3.45)
Now the optimisation task is to minimise L_P with respect to w and b, with the requirement that the derivatives of L_P with respect to all the α_i vanish, all subject to the constraints α_i ≥ 0. The optimisation becomes a convex quadratic programming problem, as the objective function is itself convex, and the data points that satisfy the constraints also form a convex set. As a result, an equivalent dual form, known as the Wolfe dual [52], can be derived, in which L_P is maximised subject to the constraints that the gradients of L_P with respect to w and b vanish, as well as the constraints α_i ≥ 0. This dual form is given in 3.46.

L_D = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j    (3.46)
During the training phase of the SVM, L_D is maximised with respect to all α_i. In the solution, the points that satisfy α_i > 0 are called the support vectors, and they lie on one of the hyperplanes H_1 and H_2. The training points that satisfy α_i = 0 lie either on H_1 or H_2, or on the side of H_1 or H_2 where the strict inequality in 3.44 holds. The support vectors are the critical elements of the training set, as they lie closest to the decision boundary. For cases where the data points are nonlinearly separable, x_i^T x_j can be replaced by Φ(x_i)^T Φ(x_j) through kernel methods, and the training phase is then performed in the dot product space F. Once a solution is found, the support vectors can be mapped back to the original feature space to form a nonlinear decision boundary.
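In practice the quadratic programme is rarely solved by hand; libraries such as scikit-learn expose the dual solution directly. The following usage sketch (scikit-learn assumed; the toy data and parameter values are arbitrary) fits an RBF-kernel SVM and inspects the support vectors, i.e. the points with α_i > 0:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)        # toy linearly separable labels

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)   # gamma = 1/(2 sigma^2)
print(len(clf.support_vectors_))   # number of points with alpha_i > 0
print(clf.dual_coef_)              # signed multipliers y_i * alpha_i
print(clf.predict([[0.5, 0.5]]))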
3.7 Summary
In this chapter the techniques used in this thesis have been outlined. The DCA is described in terms of its three phases: pre-processing, detection and analysis. An interval temporal logic, named duration calculus, is introduced as a formal method for specifying a single-
cell model of the DCA. We also propose automated data pre-processing methods for the DCA, based on dimensionality reduction techniques, namely correlation coefficient, information gain and PCA, and on statistical inference techniques, such as MSE. Moreover, an online analysis component based on segmentation is described. In addition, we introduced three standard machine learning techniques that are used to generate the baseline performance: KNN, Decision Trees and SVM. These methods are used to fulfil the main objective of this thesis. Duration calculus is used to formally specify the algorithm, so that theoretical analysis of the algorithm's runtime complexity can be performed. In this way the algorithm can be presented in a clearer and more understandable manner, as well as theoretically proved to have a low runtime complexity and thus to be lightweight. Automated data pre-processing methods are proposed to replace the manual methods currently used by the DCA. This can improve the applicability of the algorithm to a wider range of problems and make it application independent. The online analysis component based on segmentation is proposed to substitute the offline one used in most implementations of the algorithm. We propose that it will improve the algorithm's capability for online detection and for processing large datasets. The machine learning techniques are employed for the purpose of baseline comparison, since standard datasets in intrusion detection and machine learning are used to test the modified systems.
Chapter 4
Theoretical Aspects of the DCA
4.1 Introduction
In order to conduct a theoretical investigation of the DCA, the algorithm must first be formally defined. This relates to one criticism of the DCA, namely the lack of a formal definition, which could result in ambiguous understanding of the algorithm and thus lead to incorrect applications and implementations. In addition, previous investigations have mainly focused on its empirical aspects, evidenced by experimental results on a range of datasets in various problem domains. A geometrical analysis of the DCA was performed by Stibor et al. [149], and later extended in Oates's thesis [120]. In general, however, theoretical analysis of the DCA has barely been performed, and most of the algorithm's theoretical aspects have not yet been revealed. Other immune inspired algorithms, such as negative and clonal selection algorithms, have been treated theoretically in [154]. Elberfeld and Textor [46] theoretically analysed string-based negative selection algorithms, showing the possibility of reducing the worst-case runtime complexity from exponential to polynomial through compressing detectors. More recently, the work of Zarges [168, 169] theoretically analysed one of the vital components of clonal selection based algorithms, namely inversely proportional mutation rates. Jansen and Zarges [92] performed a theoretical analysis of immune inspired somatic contiguous hypermutations for function optimisation. It is therefore important to perform similar theoretical analyses of the DCA, to discover its runtime complexity and numerous algorithmic properties, in line with other algorithms in the field of AIS. The aim of this chapter is thus to present the DCA in a clear and accessible formulation, and to provide an initial theoretical analysis of the algorithm's runtime complexity and several algorithmic properties. Firstly, specifications of a simplified single-cell model of the DCA at the behavioural level are presented, using an interval temporal logic called duration calculus [172]. This shows how a single DC in the DCA operates, including the events involved during different intervals of its lifespan. This is followed by a formal definition of the algorithm (for the entire DC population) at the functional level. As future users may not have a deep understanding of advanced formal methods such as
the B-method [93], set theory and mathematical functions are used here instead. From the formal definition, theoretical analyses of the runtime complexity of the algorithm are performed, covering the standard DCA, an extended system with additional segmentation, and a modified system in which automated data pre-processing methods are implemented. Formulations of two important runtime variables are included, to present the algorithm's runtime behaviour and to give guidelines for future development. An analysis of the moving window effect of the algorithm is also performed, to show its unique filtering property. This chapter is organised as follows: a single-cell model is presented in Section 4.2, alongside its formal specifications; the formalisation of the complete algorithm is described in Section 4.3; analyses of the algorithm's runtime complexity are shown in Section 4.4; the formulation of two runtime variables and the analysis of the moving window effect are demonstrated in Section 4.5; and a summary is given in Section 4.6.
4.2 A Single-Cell Model
4.2.1 Model Overview
Each artificial Dendritic Cell (DC) in the population is capable of performing a set of identical behaviours to accomplish its functionality as a detector. In order to understand the theoretical perspectives of the algorithm, we start by describing a simplified single-cell model. A one-cell model is described in [122] for the purpose of analysing the effect of signal frequency. The single-cell model used here focuses on the behavioural level of a single DC from a temporal perspective. The flowchart of the behaviours involved in the single-cell model is displayed in Figure 4.1. Time-dependent events are performed by the DC in each state during particular time intervals. The states and events described here are similar to those defined in temporal logic. Therefore, states must hold over all subintervals of an interval in which they hold. Conversely, events do not hold over any subintervals of an interval in which they hold. In other words, states can be divided into multiple subintervals of an interval, whereas
events cannot. Each DC has two states, ‘immature’ and ‘matured’, where semi-mature and fully mature states are combined into one matured state. The states, the events in each state, and the relevant time intervals of the single-cell model are described in the following section.
4.2.2 Formal Specifications
Before defining the details of the duration calculus specifications, it is necessary to define the notation used in this section:
• I : Time → {0, 1} is a boolean observable indicating whether the DC is in the immature state;
• M : Time → {0, 1} is a boolean observable indicating whether the DC is in the matured state (semi-mature or fully-mature state);
• E_i : Time → {0, 1} is a boolean observable representing whether the ith event is being performed, where i ∈ {1, 2, 3, 4, 5} is an index of events.
More specifically, the definition of each event E_i is listed as follows, corresponding to the events shown in Figure 4.1:
• E_1 is a boolean observable of the data processing event;
• E_2 is a boolean observable of the signal transformation event;
• E_3 is a boolean observable of the antigen sampling event;
• E_4 is a boolean observable of the temporal correlation event;
• E_5 is a boolean observable of the information presenting event.
Figure 4.1: The behavioural flowchart of the single-cell model of the DCA, where events related to the algorithm’s functionality are included.
Figure 4.2: Interpretation for E1 , E2 , E3 , E4 , and E5 , and the whole interval is divided into subintervals by the events.
Each DC performs a set of particular events in each state, and the state of a DC can be indicated by the combination of events that occur; the value of each boolean observable is either '0' or '1', defined as in 4.1,

I ::= 0 | 1 | ¬I | E_1 ∨ (E_2 ∧ ¬E_3) ∨ (¬E_2 ∧ E_3) ∨ E_4
M ::= 0 | 1 | ¬M | E_5    (4.1)

where ¬ is logical 'NOT', ∨ is logical 'OR' and ∧ is logical 'AND'. In the immature state, the DC is fed with input data instances whose type can be either signal or antigen. The immature state is indicated when E_1 holds, E_2 ∧ ¬E_3 holds, ¬E_2 ∧ E_3 holds, or E_4 holds. Conversely, in the matured state, the DC presents the processed information from correlated signals and antigens. The matured state is indicated when E_5 holds. The specification in 4.1 can be expanded by including the time interval of each event, expressed in the form of formulas in which the temporal dependencies between events
are included. For example, both E_2 and E_3 depend on the completion of E_1; however, only one of E_2 and E_3 can be performed at any one point. E_4 depends on the completion of both E_2 and E_3, as the temporal correlation event requires both processed signals and sampled antigens. E_5 is performed as soon as the DC changes to the matured state; it does not depend on any other events. Two formulas that correspond to the immature state and the matured state of a DC are defined in 4.2.

F_1 ::= ⌈I⌉ | ¬E_5 | (E_1 ; E_2 ∧ ¬E_3) ; (E_1 ; ¬E_2 ∧ E_3) ; E_4
F_2 ::= ⌈M⌉ | ¬(E_1 ∨ E_2 ∨ E_3 ∨ E_4) | E_5    (4.2)

where ⌈I⌉ stands for the immature state holding almost everywhere within the time interval constrained by formula F_1, and ⌈M⌉ stands for the matured state holding almost everywhere within the time interval constrained by F_2. Thus, in the interval constrained by F_1, it is certain that E_5 does not hold. This interval can be divided into multiple subintervals in which E_1, E_2 ∧ ¬E_3, ¬E_2 ∧ E_3, or E_4 holds respectively. In the interval constrained by F_2, none of E_1, E_2, E_3 or E_4 holds; only E_5 holds. For instance, if the overall length of the time interval over which F_1 and F_2 hold is six, the time interval of each event is equal to one. Figure 4.2 shows the interpretation of the two formulas. The specifications of the single-cell model define the temporal properties of a DC in the algorithm at the behavioural level, which are identical across the entire DC population. However, the diversity of the DC population is created by the individually assigned lifespan of each DC, which limits the amount of information a DC processes over time. The same input data is processed by each DC with a distinct perspective on how much and how often to sample. This allows for a multi-perspective assessment by the algorithm. Such assessment uses an ensemble approach to classification, where the output of each DC is aggregated at the population level to perform anomaly detection and attribution. Having a population of DCs with varied lifespans contributes strongly to the moving average filtering and multi-perspective assessment properties of the algorithm, as described in Section 3.2.2, Chapter 3.
4.3 Formalisation of the DCA
4.3.1 Data Structures
Here, the data structures and procedural operations of the DCA are formally defined at the population level. To simplify the definitions, set theory and mathematical functions, e.g. addition, multiplication and recursion, are used. This aims to present the algorithm in a comprehensible way that can be easily accessed by future users who may not be familiar with formal logic. Define Signal ⊆ R^m and Antigen ⊆ N as the two types of input data. Within a discrete time space Time ⊆ N, the input data can be defined as the function S : Time → Signal ∪ Antigen, where S(t) is a data instance at a time point t ∈ Time. Elements of Signal are input signal instances of the algorithm, represented as m-dimensional real-valued vectors. These are usually normalised into a non-negative range as the input to the DCA. In many applications m = 3 is the standard case, corresponding to the three input signal categories of the DCA as mentioned in Chapter 2. Elements of Antigen are identifiers of the objects to be classified, often represented as natural numbers starting from one, where the order is ignored. Define the weight matrix of signal transformation as

W = ( w_11  · · ·  w_1m
      w_21  · · ·  w_2m )

where each entry w_ij ∈ R. The weight matrix W is used to transform the m-dimensional input signals into two categories of output signals. It is usually predefined by users and kept constant during runtime. The entries of the weight matrix are based on empirical results from the underlying immunology of natural DCs. Let Population be an index set of DCs and N = |Population| the population size. The index of a DC is i ∈ Population. The function assigning the initial lifespan to a DC is defined as I : Population → R. The function initialising the antigen profile of the
DC is defined as M : Population → (a_i1, a_i2, ..., a_ik, ...), where (a_i1, a_i2, ..., a_ik, ...) is a sequence storing the antigen instances sampled by a DC and a_ik ∈ Antigen. The initial signal profile of a DC is usually set to zero. The output of each DC is stored in a list of pairs (a_ik, r_i) ∈ Antigen × R, where r_i is the signal profile of a DC when it reaches a termination condition. We also define π_1 and π_2 as the projection functions obtaining the first and second coordinates of a pair respectively.
4.3.2 Procedural Operations
To access the data structures of the DCA, a series of one-step procedural operations is executed. Formally defining these operations is essential for the algorithm's runtime analysis. At the beginning (t = 1), the algorithm initialises all the DCs indexed by Population by assigning the initial values of lifespans and signal profiles; this is termed 'DC initialisation'. The value of I(i) depends on the distribution function used to generate the initial lifespans of the DCs; both the uniform distribution and the Gaussian distribution can be applied. The antigen profile of each DC is set to null or empty, while the signal profile is set to zero.
Definition 1 (signal transformation). The signal transformation function O : Time → R × R is defined as
O(t) =
    W^T S(t),   if S(t) ∈ Signal;
    0,          otherwise.
This operation is executed whenever S(t) ∈ Signal holds; it performs the multiplication between the transposed weight matrix and an m-dimensional vector to produce a two-dimensional vector of output signals, namely 'CSM' and 'K'. These are related to when and how to make decisions respectively. In the case that S(t) ∈ Antigen, the function returns the zero vector.
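A sketch of this operation in Python is given below; the weight values are placeholders for illustration, not the empirically derived weights of the original DCA.

import numpy as np

# Illustrative 2 x m weight matrix (m = 3): the first row yields the CSM
# output, the second the K output. The values are placeholders.
W = np.array([[2.0, 1.0, 2.0],
              [2.0, 1.0, -3.0]])

def O(instance):
    # Signal transformation of Definition 1: a signal instance (an
    # m-dimensional vector) maps to the pair (CSM, K); an antigen
    # instance (an integer identifier) maps to the zero vector.
    if isinstance(instance, np.ndarray):
        return W @ instance
    return np.zeros(2)

csm, k = O(np.array([0.4, 0.1, 0.2]))   # a signal instance
zero = O(7)                             # an antigen instance (identifier 7)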
Definition 2 (lifespan update). The lifespan update function F : Time × Population → R is defined as

F(t, i) =
    I(i),                        if t = 1;
    I(i) − π_1(O(t)),            if F(t − 1, i) ≤ 0;
    F(t − 1, i) − π_1(O(t)),     otherwise.
When t = 1, the initial value of F is I(i), the initial lifespan of the DC with index i. The CSM signal is repeatedly subtracted from it until the termination condition, F(t − 1, i) ≤ 0, is reached. The function is then reset to 'I(i) − π_1(O(t))' (not I(i)), because the function O(t) is executed on a regular basis, e.g. at every single time point t ∈ Time.
Definition 3 (signal profile update). The signal profile update function G : Time × Population → R is defined as

G(t, i) =
    0,                          if t = 1;
    0 + π_2(O(t)),              if F(t − 1, i) ≤ 0;
    G(t − 1, i) + π_2(O(t)),    otherwise.
When t = 1, the value of G is zero, the initial signal profile of the DC with index i. The K signal is repeatedly added to it until the termination condition is reached. The function is then reset to '0 + π_2(O(t))' (not 0), because the function O(t) is executed on a regular basis, e.g. at every single time point t ∈ Time.
Definition 4 (antigen profile update). The antigen profile update function H : Time × Population → (a_i1, a_i2, ..., a_ik) is defined as

H(t, i) =
    (a_i1, a_i2, ..., a_ik, a_i(k+1)) s.t. a_i(k+1) = S(t),   if S(t) ∈ Antigen;
    (a_i1, a_i2, ..., a_ik),                                  otherwise,
where H is initially empty. As a new antigen instance arrives, it is sampled by the DC with index i and its antigen profile is updated, until the termination condition is reached. As the length of the sequence (a_i1, a_i2, ..., a_ik, ...) is usually a small natural number, this function is still considered a one-step operation. It is performed individually by each DC, and the index of the DC selected to sample an incoming S(t) ∈ Antigen is defined as i ≡ θ mod N (i is congruent with θ modulo N), where θ is the number of antigen instances received up to time t. This is termed the 'sequential sampling' rule.
Definition 5 (output record). Let r_i = G(t, i) s.t. F(t − 1, i) ≤ 0 be the signal profile of a DC, and let L : N → Antigen × R denote the function that maps an index j ∈ N to an element of the output list. The output record function is defined as
L(j) = (a_ik, r_i)  ∀k
where L(j) is the jth element of the list. This function is responsible for recording the decision of a DC in the output list when the termination condition is reached. The list is then used to produce the final detection results in the analysis phase of the DCA.
Definition 6 (antigen counter). The antigen counter function C : N × Antigen → {0, 1} is defined as

C(j, α) =
    1,   if π_1(L(j)) = α;
    0,   otherwise.
Definition 7 (signal profile abstraction). The signal profile abstraction function R : N × Antigen → R is defined as
R(j, α) =
    π_2(L(j)),   if π_1(L(j)) = α;
    0,           otherwise.
In the two functions above, α ∈ Antigen is an antigen type. The function C counts the number of instances of antigen type α, and the function R calculates the sum of all K values associated with antigen type α. These two operations are performed for every
antigen type and involve scanning the sequence L(j) in its entirety.
Definition 8 (anomaly metric calculation). The anomaly metric calculation function is defined as

K(α) = γ/β  with  β = Σ_{j=1}^{n} C(j, α)  and  γ = Σ_{j=1}^{n} R(j, α)
As Antigen ≠ ∅ and α ∈ Antigen, the count β of an antigen type satisfies β ≥ 1. A threshold ε can be applied for further classification; its value depends on the underlying characteristics of the dataset used. An antigen type α is classified as anomalous if K(α) > ε, and as normal otherwise.
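Definitions 6 to 8 together amount to a single pass over the output list; the following sketch (illustrative names, Python) computes K(α) and the threshold classification for every antigen type:

from collections import defaultdict

def anomaly_metrics(output_list, epsilon=0.0):
    # K(alpha) = gamma / beta for every antigen type alpha, where beta
    # counts the records of that type (Definition 6) and gamma sums
    # their K values (Definition 7); classify against epsilon.
    beta = defaultdict(int)
    gamma = defaultdict(float)
    for antigen_type, k_value in output_list:     # one scan over L(j)
        beta[antigen_type] += 1
        gamma[antigen_type] += k_value
    return {a: (gamma[a] / beta[a], gamma[a] / beta[a] > epsilon)
            for a in beta}

# e.g. anomaly_metrics([(1, 0.7), (1, 0.3), (2, -0.5)])
#      -> {1: (0.5, True), 2: (-0.5, False)}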
4.4 Analysis of Runtime Complexity
4.4.1 The Standard DCA
By combining the procedural operations of the DCA with for and while loops and if statements, the algorithm can be presented as in Algorithm 5. Previous applications of the DCA have shown that the runtime of the algorithm is relatively short and its consumption of computational power is low [64]. However, a theoretical analysis of the runtime complexity of the DCA, given a set of input data, has not yet been performed. Runtime analysis [27] involves calculating the number of primitive operations or steps executed by an algorithm when the input data size n is relatively large. The analysis is based on asymptotic theory, and its aim is to theoretically show the runtime complexity of an algorithm as n → ∞. Let a be the number of antigen instances within the input data, b = |Antigen| the number of antigen types, and N the size of the DC population, which is set independently of n. According to previous applications, N is usually set to 100 independently of the data size n; it is therefore treated as a constant in the following analyses. As the type of each input data instance is either Antigen or Signal, if the number
Algorithm 5: Pseudocode of the DCA implementation.
input : input data S(t)
output: anomaly metric K(α)
1  foreach DC do                          /* Initialisation phase */
2      DC initialisation;
3  end
4  while input data do                    /* Detection phase */
5      if antigen then
6          antigen profile update;
7      end
8      if signal then
9          signal transformation;
10         foreach DC do
11             lifespan update;
12             signal profile update;
13             if termination condition then
14                 output record;
15             end
16         end
17     end
18 end
19 while output list do                   /* Analysis phase */
20     foreach antigen type do
21         antigen counter;
22         signal profile abstraction;
23         anomaly metric calculation;
24     end
25 end
of antigen instances is equal to a, the number of signal instances is n − a. For ease of analysis, the algorithm is divided into three phases as follows:
1. Initialisation phase - Line 1 to Line 3;
2. Detection phase - Line 4 to Line 18;
3. Analysis phase - Line 19 to Line 25.
The calculation of runtime is performed phase by phase. Let T_1(n), T_2(n) and T_3(n) be the runtimes of the three phases respectively, so that T(n) = T_1(n) + T_2(n) + T_3(n) is the overall runtime of the algorithm. Details of all the primitive operations of the algorithm are
Line | Description                 | Times
-----+-----------------------------+-------------
1    | for loop                    | N
2    | DC initialisation           | N
4    | while loop                  | n
5    | if statement                | a
6    | antigen profile update      | a
8    | if statement                | n − a
9    | signal transformation       | n − a
10   | for loop                    | (n − a) × N
11   | lifespan update             | (n − a) × N
12   | signal profile update       | (n − a) × N
13   | if statement                | (n − a) × N
14   | output record               | (n − a) × N
19   | while loop                  | a
20   | for loop                    | a × b
21   | antigen counter             | a × b
22   | signal profile abstraction  | a × b
23   | anomaly metric calculation  | a × b
Table 4.1: Details of primitive operations of Algorithm 5.
listed in Table 4.1, including the line number and description of each operation as well as the number of times each operation is executed, corresponding to Algorithm 5. As all operations are assumed to execute in constant time, let c be a constant representing the number of steps, or cost, required for each operation. The initialisation phase is executed only once, for the entire DC population, at the commencement of the algorithm. Its runtime is therefore independent of the data size n and determined solely by the population size N. The runtime of the initialisation phase is calculated as follows.
T_1(n) = cN + cN
⇒ {c is a constant}
T_1(n) = Θ(N) = Θ(1)
The runtime of the detection phase depends on the data size n, the number of antigen instances a, the number of signal instances n − a and the size of the DC population N.
Thus, the runtime of the detection phase is calculated as follows.
T_2(n) = cn + (c + c)a + (c + c)(n − a) + (c + c + c + c + c)(n − a)N = 3cn + 5cN(n − a)
⇒ {c is a constant and a < n}
T_2(n) = Θ(n) + Θ(N(n − a)) = Θ(n)
The runtime of the analysis phase depends on the size of the output list, which is equal to the number of antigen instances a in the input data, and on the number of antigen types b. The value of b is determined by the number of states or entities to classify within a problem domain, and the two extreme cases define the 'best-case' and 'worst-case' scenarios. The best-case scenario occurs when b = 1, namely one antigen type. Conversely, the worst-case scenario occurs when b = a, where the number of antigen types is equal to the number of antigen instances; thus we have 1 ≤ b ≤ a ≤ n. The runtime of the analysis phase is calculated as follows.
T_3(n) = ca + cab + (c + c + c)ab
⇒ {c is a constant}
Θ(a) + Θ(a) ≤ T_3(n) ≤ Θ(a) + Θ(a²) + Θ(a²)
⇒ {the order of growth}
Θ(n) ≤ T_3(n) ≤ Θ(n²)
Theorem 1. The runtime complexity of the standard DCA is bounded by Θ(n) and Θ(n²) for the best-case and worst-case scenarios respectively.
Proof.
T(n) = T_1(n) + T_2(n) + T_3(n)
⇒ {T_1(n) = Θ(1), T_2(n) = Θ(n), and Θ(n) ≤ T_3(n) ≤ Θ(n²)}
Θ(1) + Θ(n) + Θ(n) ≤ T(n) ≤ Θ(1) + Θ(n) + Θ(n²)
⇒ {the order of growth}
Θ(n) ≤ T(n) ≤ Θ(n²)
Bounds provided by Θ-notation are asymptotically tight. As suggested by Theorem 1, the DCA has a runtime complexity of Θ(n) in the best-case scenario, when the number of antigen types is equal to one or relatively small. Conversely, the algorithm has a runtime complexity of Θ(n²) in the worst-case scenario, when the number of antigen types is equal to the number of antigen instances and approximates the size of the input data. Therefore, the DCA is capable of processing large datasets while keeping the runtime complexity under control. This is important for the further development of the DCA into an online or real-time detection system, where detection speed and the capability to handle large datasets are essential.
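These bounds can also be checked numerically by tallying the operation counts of Table 4.1; the sketch below (our own check, with unit cost c = 1) shows linear growth for b = 1 and quadratic growth for b = a:

def primitive_ops(n, a, b, N=100, c=1):
    # Total operation count T1 + T2 + T3 from Table 4.1.
    t1 = 2 * c * N                                      # initialisation phase
    t2 = c * n + 2 * c * a + 2 * c * (n - a) + 5 * c * (n - a) * N
    t3 = c * a + 4 * c * a * b                          # analysis phase
    return t1 + t2 + t3

for n in (10_000, 20_000, 40_000):
    a = n // 2
    print(n, primitive_ops(n, a, 1), primitive_ops(n, a, a))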
4.4.2 The DCA with Segmentation
Segmentation is introduced to adapt the algorithm to online analysis [72]. Instead of analysing the processed information in a single operation at the termination of the detection phase, the output list is partitioned into smaller segments and the analysis is performed within each segment. We postulate that segmentation could potentially generate finer grained results, as well as allow analysis to be performed in parallel with the detection process. Here we focus on the antigen based segmentation approach, as it is more favourable in actual applications [72]. It also appears that the system with segmentation produces the final detection results much faster, as the analysis is performed during detection on a much smaller chunk of processed information. Based on the analysis of
the standard DCA, it is possible to theoretically analyse the effect of segmentation on the algorithm's runtime complexity. Let z be a predefined segment size with 1 ≤ z ≤ n. A segment is generated once the size of the output list reaches z, and an analysis of the current batch of processed information in the output list is performed. As segmentation is a post-processing mechanism, it affects only the analysis phase of the algorithm, not the initialisation or detection phases. The search space of the analysis of a segment is determined by the value of z: in the worst-case scenario there are z antigen types within a segment, while in the best-case scenario there is only one. Since n/z segments are generated, n/z analyses are performed. As a result, the runtime of the analysis phase with segmentation implemented is bounded by z · 1 · (n/z) ≤ T_3(n) ≤ z · z · (n/z).
Theorem 2. The runtime complexity of the DCA with segmentation is bounded by Θ(n) and Θ(n²) for the best-case and worst-case scenarios respectively.
Proof.
z · 1 · (n/z) ≤ T_3(n) ≤ z · z · (n/z)
⇒ {arithmetic}
n ≤ T_3(n) ≤ z · n
⇒ {1 ≤ z ≤ n}
Θ(n) ≤ T_3(n) ≤ Θ(n²)
⇒ {T_1(n) = Θ(1) and T_2(n) = Θ(n)}
Θ(1) + Θ(n) + Θ(n) ≤ T(n) ≤ Θ(1) + Θ(n) + Θ(n²)
⇒ {the order of growth}
Θ(n) ≤ T(n) ≤ Θ(n²)
As shown in Theorem 2, the introduction of segmentation does not change the overall runtime complexity of the algorithm. However, it provides a means of online analysis that continuously and periodically produces results during detection. Additionally, the DCA with segmentation produces significantly different and better results than the standard version [72]. Therefore, segmentation is an important and necessary addition to the DCA from a practical point of view. Thus far, only static segmentation with a fixed segment size has been applied to the DCA; the effect of variable segment sizes on detection performance still requires further investigation.
4.4.3 The DCA with Automated Pre-Processing
With the introduction of automated data pre-processing methods, the application of the DCA requires a so-called training phase to extract the expert knowledge for signal selection and categorisation from a given problem domain. Such a training phase may affect the runtime complexity of the system, so a theoretical analysis of its runtime complexity is required. It involves two steps, namely dimensionality reduction for finding the most interesting features, and a greedy search based on statistical inference for categorising the derived features. Let T_1(n) and T_2(n) be the runtimes of the two steps respectively. The given input data space X is represented as an n × m matrix, where x_ij stands for the value of the ith data instance and jth feature. For intrusion-detection datasets, it is a valid assumption that the dimensionality is usually much smaller than the data size (m ≪ n). This is because the number of features monitored in a system is often predefined, and thus the dimensionality is fixed, while the data size can grow towards infinity, especially in the case of online detection. In order to simplify the following analyses, the calculation of any measure between a pair of features in the original feature set, such as the correlation coefficient or information gain, is considered an operation dependent on the data size n whose runtime complexity is bounded by O(n). Firstly, the correlation based method involves calculating the correlation coefficient between each feature of the original feature set and the
class labels. The computational complexity of this calculation is determined by both the data size and the dimensionality of the input feature space. Therefore, the runtime complexity of the calculation of correlation coefficients is bounded by O(nm). Secondly, the information gain based method involves the calculation of the information gain between each feature of the original feature set and the class label, similar to the correlation based method; the runtime complexity of the calculation of information gains is therefore also bounded by O(nm). Thirdly, the PCA based method uses singular value decomposition (SVD) to derive the corresponding eigenvectors and eigenvalues of the input data. As shown in [142], the computational complexity of PCA using SVD is bounded by O(nm²). Since m ≪ n generally holds for intrusion-detection problems, the runtime complexity of all three dimensionality reduction techniques is bounded by O(n) (T_1(n) = O(n)) according to the order of growth. In addition, the greedy search based on statistical inference is used in both the information gain and PCA based methods, and its runtime complexity should also be considered. Let d be the dimensionality of the new feature set S generated through dimensionality reduction, where s_i is the ith instance, so that 2d is the dimensionality of the feature set used for the greedy search (see Chapter 3 for details). Algorithm 6 presents this process, where the temporary variable 'temp' can be initialised to any large number. The runtime of the greedy search is dominated by the four nested loops, so it can be represented as T_2(n) = 2cd(2d − 1)(2d − 2)n, where c is a constant representing the number of steps, or cost, of each operation. There are three signal categories in the DCA, thus d = 3 and T_2(n) = O(n).
Theorem 3. The runtime complexity of the automated data pre-processing methods developed for the DCA is bounded by O(n).
Algorithm 6: Pseudocode of the greedy search.
input : a feature space of n × 2d matrix
output: the best combination of features
temp ← 100;
for a ← 1 to 2d do
    for b ← 1 to 2d − 1 do
        for c ← 1 to 2d − 2 do
            for i ← 1 to n do
                s_i = {x_ia, x_ib, x_ic};
                calculate K(s_i);
            end
            calculate MSE(θ̂);
            if MSE(θ̂) < temp then
                temp ← MSE(θ̂);
            end
        end
    end
end
Proof.
⇒ {T_1(n) = O(n) and T_2(n) = O(n)}
T(n) = T_1(n) + T_2(n) = O(n) + O(n) = O(n)
Bounds provided by O-notation are asymptotic upper bounds. As proven in Theorem 1 and Theorem 2, the runtime complexity of the DCA with or without segmentation is bounded by Θ(n) and Θ(n²) for the best-case and worst-case scenarios respectively. Based on Theorem 3, regardless of the techniques used, the proposed automated data pre-processing methods do not change the overall runtime complexity of the system. As a result, an integrated system in which both segmentation and automated data pre-processing methods are implemented has a runtime complexity bounded by Θ(n) and Θ(n²) for the best-case and worst-case scenarios respectively. This makes the integrated system potentially suitable for online detection tasks where detection speed is the priority.
4.5 Formulation of Runtime Properties
4.5.1 Number of Matured DCs
The number of matured DCs within a time interval is related to the reset frequency of the DC population, which indicates the workload of the DC population. This can be used to determine whether the setup of the current system should be altered. If the frequency of DC resetting is too high, most of the DCs mature and are reset before they acquire a sufficient amount of information. In that case, the size of the DC population should be increased to cope with the overwhelming workload. Alternatively, the range of lifespans of the DC population can be increased to extend the time windows of the DCs, allowing more information to be obtained. This becomes crucial if the system is deployed online, as an online system is often required to perform continuous detection and to adapt to changes in real-time situations. The number of matured DCs in the DC population depends on the distribution function used for the generation of DC lifespans, in addition to the input data within the time interval of interest. To make the analysis manageable, two types of distributions for generating the initial DC lifespans are considered, namely the uniform distribution [6] and the Gaussian distribution [6].
Proposition 1 (uniform distribution). If the lifespans of the DC population are generated from an arithmetic series x_N = x_1 + (N − 1)d, where x_N is the Nth element, x_1 is the first element and d is the interval between two successive elements, the number of matured DCs δ in the DC population can be described as follows.

δ = ⌊ (N / ((e − b)(x_1 + ((N − 1)/2)d))) Σ_{t=b}^{e} π_1(O(t)) ⌋
Proof.
   {ϕ = (1/(e − b)) Σ_{t=b}^{e} π1(O(t)) and µ1 = (x1 + xN)/2 = x1 + ((N − 1)/2)d}
⇒  δ = ⌊Nϕ/µ1⌋ = ⌊ (N / ((e − b)(x1 + ((N − 1)/2)d))) Σ_{t=b}^{e} π1(O(t)) ⌋
Here ϕ is the mean value of the CSM signals within the interval [b, e] and µ1 is the mean lifespan of the DC population. The uniform distribution is used in the dDCA [64] to generate the initial lifespans of the DC population; this produces a set of values that are uniformly distributed within a certain range. According to Proposition 1, if we know the parameters of the arithmetic series (the first element x1 and the interval d), the exact number of matured DCs within the time interval [b, e] can be calculated.

Proposition 2 (Gaussian distribution). If the lifespans of the DC population are generated from a Gaussian distribution x ∼ N(µ, σ²), then the following equation holds.

    Pr( ⌊ (N / ((µ + 2σ/√N)(e − b))) Σ_{t=b}^{e} π1(O(t)) ⌋ ≤ δ ≤ ⌊ (N / ((µ − 2σ/√N)(e − b))) Σ_{t=b}^{e} π1(O(t)) ⌋ ) = 0.95
Proof.
   {ϕ = (1/(e − b)) Σ_{t=b}^{e} π1(O(t)) and µ2 ∼ N(µ, σ²/N)}
⇒  Pr(µ − 2σ/√N ≤ µ2 ≤ µ + 2σ/√N) = 0.95
⇒  {δ = ⌊Nϕ/µ2⌋ = ⌊ (N / (µ2(e − b))) Σ_{t=b}^{e} π1(O(t)) ⌋}
   Pr( ⌊ (N / ((µ + 2σ/√N)(e − b))) Σ_{t=b}^{e} π1(O(t)) ⌋ ≤ δ ≤ ⌊ (N / ((µ − 2σ/√N)(e − b))) Σ_{t=b}^{e} π1(O(t)) ⌋ ) = 0.95
Pr(·) is the probability operator. If the sample size is N, the sample mean µ2 follows a Gaussian distribution N(µ, σ²/N) [6]. The lower and upper bounds of the sample mean can be used to induce bounds on the number of matured DCs. In practice, the Gaussian distribution has not yet been used for generating the lifespans of the DC population, but it is of great interest and a priority for future investigation. According to Proposition 2, if we know the mean µ and variance σ² of the Gaussian distribution from which the lifespans of the DC population are generated, the size of the DC population N, and the input data instances within the time interval [b, e], we can show that with probability 0.95 the number of matured DCs lies between the lower and upper bounds. This could provide sufficient information for adjusting the system according to real-time scenarios.
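The two propositions translate directly into simple calculations. The following C sketch is an illustrative rendering only, assuming the CSM values π1(O(t)) are available in an array; all function and variable names are hypothetical.

#include <math.h>

/* Sum of the CSM signals pi1(O(t)) over [b, e]; csm[] is assumed to hold
 * one CSM value per time step. */
static double csm_sum(const double *csm, int b, int e)
{
    double s = 0.0;
    for (int t = b; t <= e; t++)
        s += csm[t];
    return s;
}

/* Proposition 1: exact number of matured DCs when the lifespans follow
 * the arithmetic series xN = x1 + (N - 1)d. */
long matured_uniform(const double *csm, int b, int e,
                     int N, double x1, double d)
{
    double mean_lifespan = x1 + (N - 1) * d / 2.0;   /* mu1 */
    return (long)floor(N * csm_sum(csm, b, e) / ((e - b) * mean_lifespan));
}

/* Proposition 2: bounds holding with probability 0.95 when the lifespans
 * are drawn from a Gaussian N(mu, sigma^2). */
void matured_gaussian_bounds(const double *csm, int b, int e, int N,
                             double mu, double sigma,
                             long *lower, long *upper)
{
    double s = csm_sum(csm, b, e);
    double half = 2.0 * sigma / sqrt((double)N);
    *lower = (long)floor(N * s / ((mu + half) * (e - b)));
    *upper = (long)floor(N * s / ((mu - half) * (e - b)));
}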
4.5.2 Number of Processed Antigens
As demonstrated in [72], segmentation is effective for maintaining or even improving detection accuracy on large datasets. This may be because the number of processed antigens determines whether an analysis of the current batch of processed information is required. Unlike input antigen instances, processed antigens are those presented by matured DCs. Investigating the relationship between the number of processed antigens and the input data is therefore essential for understanding the DCA, as well as for integrating segmentation into the algorithm. Additionally, prior knowledge of the number of processed antigens, based on the input data, may facilitate choosing an appropriate segment size. Here we focus on formulating the relationship between the number of processed antigens and the input data, in particular the number of input antigens. Let θ ∈ N at t ∈ Time be the number of input antigens fed into the system, and δ be the number of matured DCs in the population, which can be derived from either Proposition 1 or Proposition 2 in the last section.

The method of calculating the number of processed antigens within a given time interval [b, e] should be introduced first. It is similar to placing balls into a number of bins that are ordered by their indexes in a sequential manner. Placing starts from the first
bin, then the second bin, and so forth; if we reach the last bin, the process starts over again. In the end, a number of bins, starting from the first one, are taken and the number of balls they contain is counted. The balls are equivalent to input antigens, the bins are equivalent to DCs, and counting the balls is equivalent to counting the processed antigens. Proposition 3 formulates the relationship between the number of processed antigens and the input data in two cases.

Proposition 3 (number of processed antigens). Let ν be the number of processed antigens within a given interval [b, e], c ≡ δ mod N and d ≡ θ mod N. The following formula for ν holds.

    ν = (δ − N⌊δ/N⌋)(1 + ⌊θ/N⌋),      if c < d;
    ν = (δ − N⌊δ/N⌋ − N)⌊θ/N⌋ + θ,    otherwise.
Proof.
   {transform modulus to floor functions}
   c = δ − N⌊δ/N⌋ and d = θ − N⌊θ/N⌋
⇒  {sequential sampling}
   Case 1: c < d
      ν = c⌊θ/N⌋ + c = (δ − N⌊δ/N⌋)(1 + ⌊θ/N⌋)
   Case 2: c ≥ d
      ν = c⌊θ/N⌋ + d = (δ − N⌊δ/N⌋ − N)⌊θ/N⌋ + θ
The number of antigens sampled by each DC is determined by ⌊θ/N⌋ and θ mod N; as only matured DCs present processed antigens, the number of processed antigens is determined by δ and the relationship between c and d.
The formula for the number of processed antigens has two cases, depending on the relationship between c ≡ δ mod N and d ≡ θ mod N. These formulas link the runtime variables of the algorithm to the input data, without the need to run the algorithm. This provides theoretical insights into tuning the algorithm for a given problem.
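A minimal C rendering of Proposition 3 is given below; it assumes non-negative integer inputs and is an illustration rather than part of the DCA implementation.

/* Proposition 3: number of processed antigens nu within [b, e], given the
 * number of input antigens theta, the number of matured DCs delta and the
 * population size N (all assumed non-negative, N > 0). */
long processed_antigens(long theta, long delta, long N)
{
    long c = delta % N;     /* c = delta mod N */
    long d = theta % N;     /* d = theta mod N */
    long q = theta / N;     /* floor(theta / N) */

    if (c < d)
        return c * (1 + q);      /* (delta - N*floor(delta/N))(1 + floor(theta/N)) */
    return (c - N) * q + theta;  /* (delta - N*floor(delta/N) - N)floor(theta/N) + theta */
}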
4.5.3 Moving Window Effect
Dynamic Moving Windows

In the DCA, each DC keeps performing operations such as lifespan update, signal profile update and antigen profile update until the termination condition is reached. This creates a dynamic moving window effect in which input signals are averaged within the DC. The processed information of a DC is then presented as its decision, which is aggregated at the population level along with the decisions of other DCs. This exhibits a similar property to ensemble methods in machine learning [42], where the decisions of multiple classifier models are combined into a single decision. It is believed that each individual DC is functionally similar to a filter with a dynamically changing transfer function [122, 124]. This, in combination with the population-based aggregation, adds robustness against transient errors in the input data. Such an operation can be represented as a dynamic moving function as follows. The size of a moving window created by a DC (index i) is determined by two factors, namely the initial lifespan I(i) and the time dependent signal instances S(t) ∈ Signal. Let [bik, eik] ⊆ Time be the time interval of the kth moving window created by the DC with index i ∈ Population. Based on the termination condition demonstrated in Section 2, the relationship between the moving windows and the individually assigned lifespan can be defined as

    Σ_{t=bik}^{eik} π1(O(t)) ≤ I(i)                    (4.3)
where bi1 = 1 and bik = ei(k−1) for k > 1. As bi1 starts from one and bik depends on ei(k−1), the aim is to calculate eik when I(i) is given. To the best of the authors' knowledge, this formula is difficult to solve unless the probability density function of S(t), from which the input data are generated, is known. Therefore, a simplified version of the moving window effect in the DCA is provided for ease of analysis: we set the size of the moving windows as static, independent of the input data.
Static Moving Windows

Let S = {s1, s2, ..., sN} be a set of window sizes, where N = |Population| = |S| is the population size, and k ∈ {1, 2, ..., ⌈n/si⌉} is the index of the moving windows when a size si is applied, where n ∈ N is the size of the input data. As mentioned before, the algorithm transforms every input signal instance into two output signal instances (Definition 1), and π2(O(t)) determines the decision making of each DC. The function aggregating the decisions of a moving window is defined as P : N × N → R,

    P(si, k) = (1/si) Σ_{t=1+(k−1)si}^{ksi} π2(O(t))                    (4.4)
The function associating the decision of a moving window with the data instances within it is defined as Q : N × N → R,

    Q(t, si) = P(si, k)  s.t.  1 + (k − 1)si ≤ t ≤ ksi                    (4.5)
The function aggregating the decisions of all the moving windows on each data instance is defined as A : N → R,

    A(t) = (1/N) Σ_{si∈S} Q(t, si)                    (4.6)
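The three functions translate into straightforward code. The C sketch below is an illustrative rendering of Equations 4.4 to 4.6, assuming the output signals π2(O(t)) are stored in an array indexed from 1; all names are hypothetical.

/* Static moving-window aggregation (Equations 4.4 to 4.6). k2[] is assumed
 * to hold pi2(O(t)) for t = 1..n, with index 0 unused. */

/* Equation 4.4: mean decision of the k-th window of size s_i. */
double window_mean(const double *k2, int s_i, int k)
{
    double sum = 0.0;
    for (int t = 1 + (k - 1) * s_i; t <= k * s_i; t++)
        sum += k2[t];
    return sum / s_i;
}

/* Equation 4.5: decision of the window of size s_i covering instance t,
 * i.e. the k satisfying 1 + (k - 1)s_i <= t <= k*s_i. */
double window_decision(const double *k2, int t, int s_i)
{
    int k = (t - 1) / s_i + 1;
    return window_mean(k2, s_i, k);
}

/* Equation 4.6: decision aggregated over all window sizes in S. */
double aggregate_decision(const double *k2, int t, const int *S, int N)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        sum += window_decision(k2, t, S[j]);
    return sum / N;
}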
The formulation provided in this section may not capture every detail of the moving window effect of the DCA. However, it is an initial step towards presenting this property mathematically. It could help in understanding the theoretical underpinnings of some of the interesting properties of the algorithm, leading to further investigation of their usefulness. As a result, from an algorithmic point of view, this work provides a substantial basis for the analysis and development of the DCA.
4.6 Summary
In this chapter, we provide the specifications of a single-cell model based on the duration calculus, followed by a formal definition of the DCA in terms of the data structures and procedural operations involved. The aim is to present the algorithm clearly, to prevent future misunderstanding and ambiguity that could result in inappropriate applications and implementations. Based on the formal definition, a theoretical analysis of the runtime complexity of the algorithm is performed. The DCA achieves a best-case runtime complexity bounded by Θ(n) and a worst-case runtime complexity bounded by Θ(n²). The analysis of the system with segmentation introduced is also performed; we have shown that the introduction of segmentation does not change the algorithm's runtime complexity. In addition, regardless of the techniques used, the automated data pre-processing methods developed for the DCA do not affect the overall runtime complexity of the system. Therefore, an integrated system in which both segmentation and automated data pre-processing methods are implemented presents low runtime complexity. This makes it potentially beneficial for online detection tasks, e.g. anomaly-based intrusion-detection problems, where detection speed is a priority.

Moreover, two runtime variables are formulated, namely the number of matured DCs and the number of processed antigens. This shows how the algorithm behaves within a given time interval based on the input data, without the need to run the algorithm. As a result, the formulas of the two runtime variables can be used as indicators for adjusting the system's settings according to real-time situations during detection. Finally, the moving window effect of the algorithm is formulated by presenting two moving window approaches, namely the dynamic and the static approach, in mathematical form. This is
an important step towards understanding some of the interesting properties of the algorithm from a theoretical perspective, as well as the usefulness of these properties with respect to the problem of interest. This work gives application-independent insights into the algorithm, which can be used as guidelines for future development.

One of the goals of the future development of the DCA is to turn it into an automated and adaptive online detection system, and such a system has certain requirements to fulfil. Firstly, the system has to be computationally efficient. The analysis of the runtime complexity of the DCA shows that, even in worst-case scenarios, its runtime does not increase exponentially with the input size. Secondly, the system should be able to adapt to real-time scenarios encountered during detection. This requires insight into how the algorithm behaves during runtime, which can be assessed through the two runtime variables. As a result, new components can be developed and integrated within the algorithm, to adjust the system based on the assessment of these two runtime variables.

In terms of future work, the specifications can be further simplified and the algorithm can be presented using a functional programming approach [28], to reveal more algorithmic details. In addition, various distributions of input data should be tested using the formulas defined in this chapter. We can also explore other properties of the algorithm, e.g. the relationship between the population size and the detection performance. Different methods of generating initial lifespans should also be investigated, in addition to the relationship between the weight matrix and detection performance.
Chapter 5
Online Analysis Component
5.1 Introduction
The first step of the empirical investigation is to assess the possibility of improving the DCA's online detection capability by introducing an online analysis component based on segmentation. This relates to a fundamental requirement of a typical anomaly detection and attribution problem, namely anomaly-based intrusion detection. As demonstrated by Zhang et al. [170], in practice intrusion detection is a real-time critical mission: intrusions should be detected as soon as possible, or at least before an attack eventually succeeds. The detection speed, which reflects the time taken to identify intrusions, is the key to preventing attacks. Here detection speed refers to the amount of time a system requires to produce final detection results in which anomalies or intrusions can be identified. Consequently, the analysis process of a system should be performed in a periodic and continuous manner, namely 'online'. This can be problematic for the DCA, as it often requires an offline analysis phase after all the data have been processed. Therefore, it is desirable to develop an online analysis component for the DCA.

Knowing when to perform analysis is important for the operation of an online analysis component. In order to achieve this, segmentation approaches are introduced into the DCA. As the word suggests, segmentation involves slicing the output data into smaller segments, with a view to generating finer grained results, as well as performing analyses in parallel with the detection process [72]. Segmentation is performed based on a fixed quantity of output data items, or alternatively on the basis of a fixed time period. It enables the algorithm to perform periodic and continuous analysis whenever sufficient information is presented during detection. As a result, an online analysis component based on segmentation is derived to replace the original offline one.

The aim of this chapter is to assess two segmentation approaches for online analysis by exploring their effects on the DCA. Such an assessment focuses on the comparison between the modified systems and the standard DCA. A range of segment sizes, in both data quantity and time, are varied to demonstrate any statistically significant effects. As a result, four null hypotheses for assessing the statistical significance of effects resulting
from any algorithmic modifications are formed as follows:

• H5.1: segmentation approaches will not result in significantly better anomaly metrics, in comparison to the standard DCA (non-segmentation);
• H5.2: the use of a smaller segment size will not result in significantly better anomaly metrics than that of a larger segment size;
• H5.3: segmentation approaches will not result in significantly better detection performance, in comparison to the standard DCA (non-segmentation);
• H5.4: the detection performance of the DCA assessed through Kα is not significantly better than that assessed through MCAV.

The first and second null hypotheses focus on the effect of segmentation on the anomaly metrics produced by the DCA, in an offline sense across all the segments within each setting. This is based on the assumption that, for anomalous antigen types, the higher the anomaly metrics the better, while for normal antigen types, the lower the anomaly metrics the better. Details of this assumption and the corresponding assessment are described in Section 5.4.1. The third and fourth null hypotheses concentrate on segmentation's effect on the detection performance, with respect to the detection accuracy of every single segment within each setting, in an online sense. Details of the calculation and evaluation of detection performance are given in Section 5.4.2.

To test the null hypotheses, a large real-world dataset of a medium-scale port scan within a university computer network, named the SYN scan dataset [63], is used. The dataset represents an example of the problem of interest, which may not generalise to all scenarios in anomaly-based intrusion detection. However, it is considered adequate for an empirical investigation exploring the DCA's algorithmic properties, rather than solving any particular problem. The concept of using segmentation with the DCA for the purpose of online detection is not entirely novel: preliminary work on applying segmentation to the DCA has been performed in [67] and [122]. The corresponding experiments only took a brief glance at either of the segmentation approaches. Here,
the effect of segmentation on the algorithm's online detection capability is examined in much greater detail. This chapter is organised as follows: the system modifications, including the online analysis component and the segmentation approaches, are introduced in Section 5.2; the experimental conditions, including experimental design, testing dataset and experimental setup, are given in Section 5.3; the methods used for the analyses are described in Section 5.4; the experimental results and statistical analysis are reported in Section 5.5; and finally a summary of this chapter is given in Section 5.6.
5.2 System Modifications

5.2.1 Online Analysis Component
An online analysis component in the context of the DCA performs periodic analysis of the processed information presented by DCs, to continuously identify anomalies. Such a component is intended to enable the algorithm to detect anomalies as quickly and accurately as possible; detection speed and accuracy are therefore the two indicators of system performance. In this chapter, detection accuracy is considered the major performance indicator, as the proposed online approaches are clearly advantageous over the offline one in terms of detection speed.

Due to the periodicity of online analyses, when to perform an analysis should be determined first. This can be solved by introducing segmentation to the DCA as a post-processing method. During detection, a stream of processed information is presented by matured DCs over time. Segmentation involves partitioning this information stream into smaller segments, in terms of either the number of data items or the elapsed time. In each segment, an analysis is performed and one set of detection results is generated, in which intrusions occurring in the interval of that segment can be identified. Segmentation produces multiple sets of results corresponding to the segments generated, rather than the single set of results produced by the non-segmentation system. The system is
thus able to perform analysis online rather than offline, as all segments are processed during detection. Additionally, segmentation divides the analysis process into multiple steps, instead of performing it all at once. The computational power and time required for each analysis can be reduced, so segmentation could effectively enhance detection speed. Moreover, as processed information is presented by matured DCs at different time points over the entire detection duration, analysing the whole sequence of processed information at once ignores the temporal differences within it. An antigen type that causes malicious activities at one point but does nothing at another may then be classified as normal rather than as an intrusion, or vice versa. Segmentation preserves these temporal differences, which may reduce false positives and hence improve detection accuracy.
5.2.2 Segmentation Approaches
Two segmentation approaches are used here, namely antigen based and time based segmentation. These two approaches set the segment size according to two factors: the number of sampled antigens and the elapsed time, respectively. As data accumulate during detection, these two factors indicate at which point to perform an analysis. The number of sampled antigen instances reflects the number of potential suspects sampled by the system for classification. The antigen based segmentation approach generates a segment whenever the number of sampled antigens reaches a specified value, triggering an analysis process. Conversely, the elapsed time reflects the quantity of signal instances processed, as one signal instance is processed per iteration of the algorithm; it determines the quantity of evidence available for decision making. Time based segmentation creates a segment whenever the defined time period elapses, with an analysis performed for each segment.
A function that returns the number of sampled antigens is de-
fined as n : Time → N, and a function that returns the output list is defined as L : Time → Nn(t) × Rn(t) . Two segmentation approaches can be presented as Algo-
Algorithm 7: The antigen based segmentation approach.
input : signals and antigens of input data
output: antigen types and anomaly metrics

while detection phase do
    if (n(t) mod z) = 0 then
        analyse L(t);
        empty L(t);
    else
        t ← t + 1;
    end
end
Algorithm 8: The time based segmentation approach.
input : signals and antigens of input data
output: antigen types and anomaly metrics

while detection phase do
    if (t mod z) = 0 then
        analyse L(t);
        empty L(t);
    else
        t ← t + 1;
    end
end
Segment size z determines the quantity of sampled antigens or the amount of processed time contained in each segment. In order to perform sensitivity analyses, a range of segment sizes is tested, to find out their effects on the anomaly metrics and classification performance of the algorithm. Although both segmentation approaches employ a fixed segment size, both involve dynamics of various system factors, such as the rate at which signal instances are processed and the frequency with which antigen instances occur. In the antigen based segmentation the number of sampled antigens required for each segment is fixed, so the processed time contained in each segment is variable. For example, one segment can last 10 seconds, and another with the same number of processed antigens can last over 30
seconds. In the time based segmentation approach, by contrast, the time covered by each segment is fixed, so the number of sampled antigens contained in each segment is variable. For instance, one segment can have 100 processed antigens, and another can have over 500 or even 1,000, depending on the antigen frequency at that point in the experiment. As a result, by assessing both segmentation approaches, different aspects of system behaviour can be explored. This could reveal more insights into the behaviour of the DCA, providing a development platform for dynamic segmentation.
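A compact C sketch of the shared structure of Algorithms 7 and 8 is given below. It is illustrative only: n_sampled(), analyse() and empty_list() are hypothetical stand-ins for the DCA's internal bookkeeping, and the time counter is advanced on every iteration as a simplification of the pseudocode above.

/* Shared structure of Algorithms 7 and 8; all externals are hypothetical
 * stand-ins for the DCA's internal bookkeeping. */
extern long n_sampled(long t);   /* antigens sampled up to time t: n(t) */
extern void analyse(long t);     /* analyse the output list L(t)        */
extern void empty_list(long t);  /* empty the output list L(t)          */

void detection_loop(long z, int antigen_based, long t_end)
{
    for (long t = 1; t <= t_end; t++) {
        /* antigen based: trigger on n(t) mod z; time based: on t mod z */
        long trigger = antigen_based ? n_sampled(t) : t;
        if (trigger % z == 0) {
            analyse(t);          /* one analysis per completed segment */
            empty_list(t);
        }
    }
}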
5.3 Experimental Conditions

5.3.1 Experimental Design
In order to examine the effects of the segmentation approaches on the DCA, three sets of experiments are conducted as follows:

1. Experiments with the standard DCA without segmentation;
2. Experiments with the DCA and the antigen based segmentation approach, with various segment sizes applied;
3. Experiments with the DCA and the time based segmentation approach, with various segment sizes applied.
5.3.2 The SYN Scan Dataset
A SYN scan is a port scan technique used by attackers to exploit the vulnerabilities of targeted machines [119]. It involves sending SYN packets to all the IP addresses within a network, to identify active hosts. According to the TCP 'three-way handshake', if a host has open ports then, on receiving SYN packets, it will reply to the sender with SYN/ACK packets. Attackers can then determine which hosts in the network are available for further exploits. The SYN scan dataset [62] was generated under the scenario that the
scan is performed by an insider, who can be a legitimate user of the system performing unauthorised activities. The dataset was collected through an ssh connection, and both anomalous and normal processes are included. It consists of over 13 million antigen instances and more than 4,800 signal instances. A sample of the antigen instances and a sample of the signal instances are displayed in Figure C.1 and Figure C.2 respectively in Appendix C. This dataset is large and noisy, making the problem difficult to solve. It is therefore ideal for testing, as the segmentation approaches are intended to improve system performance. The dataset was originally demonstrated in [63], where seven features were collected. However, only the three most appropriate features are used here, as the aim is to explore the usefulness of segmentation for online analysis rather than to produce an optimal solution.

The PAMP signal is the number of ICMP 'destination unreachable' error messages received per second. When the closed ports of a host are scanned, a large number of destination unreachable error messages are generated by firewalls, and hence these messages are a strong signature of a scan. The Danger signal is based on the ratio of TCP packets to all other packets processed by the network card of the scanning host; a burst in the magnitude of this ratio is not usually observed under normal conditions, and so indicates that something malicious could have happened. The Safe signal is derived from the observation that during SYN scans the average network packet size reduces to a consistent 40 bytes: the scans tend to send small packets in large quantity, whereas large packet sizes indicate normal network behaviour. These signals were chosen using expert domain knowledge and through empirical investigation into the nature of port scans performed using nmap. One set of signals is captured per second. All signals are normalised into the range [0, 100], and they are plotted in Figure 5.1 (a). As most of the values of the PAMP signal are relatively small, they do not reach the upper limit of the signal range, 100, after the moving average is applied. The Danger and Safe signals, on the other hand, are not affected in this way.

The process IDs recorded whenever a system call is made on the host are treated as individual antigen instances.
        PAMP   Danger   Safe
CSM       4       2       6
K         8       4      -13

Table 5.1: Weights of signal transformation for the SYN scan application.
Parameter             Value
Population size       100
Lifespan              12x, x ∈ {1, 2, ..., 100}
Segment size (ABS)    10^n, n ∈ {2, 3, 4, 5, 6}
Segment size (TBS)    10^n, n ∈ {0, 1, 2, 3}

Table 5.2: Experimental parameters of the SYN scan application.
Only antigen types with high frequency are of interest, as a system call of low frequency cannot instigate major changes in the monitored signals. The number of each antigen type per second is plotted in Figure 5.1 (b). The antigen types of interest are 'nmap', 'pts' and 'firefox': nmap is the program used for invoking and performing SYN scans on the victim machines, the pts (pseudo-terminal slave) daemon process is the parent of the nmap process, and firefox performs web browsing throughout the recorded session. The main objective of the problem is to test whether the DCA is able to distinguish between anomalous and normal processes by assessing their anomaly metrics [63]. Within the period in which the scan (attack) was being performed, the nmap and pts processes are considered anomalous, since they were responsible for the system's misbehaviour, while the firefox process is considered normal. In the experimental results, the anomaly metrics of both nmap and pts are expected to be higher in value than those of firefox. This principle is used for evaluating the classification performance of all the different versions (with or without segmentation) of the DCA.
5.3.3 Experimental Setup
The implementation of the DCA is based on an extension of the dDCA in which the signal categories are increased from two to three, as demonstrated in [72]. All systems are programmed in C and compiled with gcc 4.0.1. Experiments are run on an Intel 2.2 GHz
Figure 5.1: (a) Input signals: values of the input signals against the time series, showing the inherent noise in the dataset. (b) Antigen frequency: the number of each antigen type per second against the time series. In both panels a moving average with intervals of 100 (per selected signal category or antigen type of interest) is used for plotting the graph, but is not the actual input of the system.
MacBook (OS X 10.5.5), and all the statistical tests are performed in R (2.8.1). The predefined weights used for signal transformation are displayed in Table 5.1; for consistency, they are the same as those used in [67]. The weights for the K signal are derived by summing the weights for the original 'Semi' and 'Mat' signals. Two versions of the anomaly metric are used in the experiments, namely MCAV and Kα. The reason for using both versions is not only to link this work to the experiments based on MCAV in [62], but also to bring in the novelty of using Kα. All the other experimental parameters are listed in Table 5.2. Previous work [64] has shown that 100 is an appropriate value for the DC population size. Since the implemented system is based on the dDCA, the DC lifespans are uniformly distributed, and a coefficient whose value is determined by the associated weights of the CSM signal is applied. The coefficient is defined so that the lifespans of most DCs are greater than the average signal strength of the input data, measured by the mean CSM signal, so that these DCs sample data instances over multiple iterations and mature once sufficient information has been collected. Both segmentation approaches use the series of segment sizes listed in Table 5.2, for the purpose of sensitivity analysis.
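For illustration, the lifespan initialisation implied by Table 5.2 can be sketched in C as follows; the constant names are hypothetical.

/* Uniformly spaced initial lifespans 12x, x in {1, ..., 100} (Table 5.2);
 * the constant names are hypothetical. */
#define POPULATION_SIZE 100
#define LIFESPAN_COEFF  12.0

void init_lifespans(double lifespan[POPULATION_SIZE])
{
    for (int x = 1; x <= POPULATION_SIZE; x++)
        lifespan[x - 1] = LIFESPAN_COEFF * x;
}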
5.4 Analytical Methods

5.4.1 Statistical Tests
The authors of [67] applied the DCA with antigen based segmentation to the same SYN scan dataset. The results produced by the DCA are presented in the form of MCAV values, and are compared with those produced by Self-Organising Maps (SOM). The statistical tests show that the DCA produces significantly better performance than SOM; more details are given in Section 5.5.3. This provides a guideline for comparing the performance of different methods through assessing anomaly metrics. Following [67], an assumption can be made by looking at the algorithm's performance in detecting anomalous or normal antigen types respectively. If the anomaly metrics of an
anomalous antigen type (nmap or pts) produced by method I are significantly greater than those produced by method II, then method I is considered significantly better than method II. Conversely, if the anomaly metrics of a normal antigen type (firefox) produced by method I are significantly less than those produced by method II, then method I is considered significantly better than method II. The focus of such comparisons is to determine the effect of segmentation on the algorithm's performance in an 'offline' sense, as the assessment is performed across all the segments within each setting. In addition to the comparison between anomaly metrics (H5.1 and H5.2), statistical tests are also applied to compare the detection accuracy of different methods (H5.3 and H5.4).

Two versions of statistical tests are performed, namely one-sample and two-sample tests. The one-sample tests are used to compare the DCA with segmentation against the standard DCA (H5.1 and H5.3), while two-sample tests are used for the remaining comparisons (H5.2 and H5.4). For the first null hypothesis (H5.1), let x ∈ R^n denote the sample of anomaly metrics of an antigen type produced by a segmentation method, and µ ∈ R be the anomaly metric of that antigen type produced by the standard DCA. The one-sample tests treat µ as the population mean or median, depending on whether parametric or nonparametric statistical tests are applied. They aim to determine whether the sample mean (or median) of x is significantly greater (or less) than the population mean (or median). For the third null hypothesis (H5.3), x represents a sample of detection accuracies and µ is the detection accuracy of the standard DCA. For the second null hypothesis (H5.2), let x1, x2 ∈ R^n denote two samples of anomaly metrics of an antigen type, produced by two segmentation methods respectively. The two-sample tests aim to identify whether the sample mean (or median) of x1 is significantly greater (or less) than that of x2. For the fourth null hypothesis (H5.4), x1 and x2 represent two samples of detection accuracies.

In both one-sample and two-sample tests, the Shapiro-Wilk test (α = 0.05) [31] is applied to test the normality of the samples. If the compared samples are normally distributed, the parametric Student's t-test (α = 0.05) [31] is applied. On the other hand, if
the compared samples are not normally distributed, the nonparametric Mann-Whitney test (α = 0.05) [31] is used instead. The significance level of all the tests is set at 0.05, which corresponds to a confidence level of 0.95 for rejecting the null hypothesis.
5.4.2 Calculation of Detection Performance
The anomaly metrics of all the antigen types (nmap, pts and firefox) are further assessed to perform a classification per segment, with respect to a predefined segment size in either of the segmentation approaches. As two versions of the anomaly metric are used in the experiments, two thresholds, namely εk = −1010.680 and εm = 0.246 (for Kα and MCAV respectively), are calculated for the classification. Details of the calculation of εk and εm can be found in Section 3.2.3 of Chapter 3. Let A = {nmap, pts, firefox} be the set of all the antigen types of interest, Na be the number of antigen instances and Ns be the number of signal instances. If z is the predefined segment size, the number of segments generated can be calculated as ⌈Na/z⌉ (antigen based segmentation) or ⌈Ns/z⌉ (time based segmentation) respectively, where ⌈·⌉ is the ceiling function. Let i ∈ {1, 2, ..., ⌈Na/z⌉} or i ∈ {1, 2, ..., ⌈Ns/z⌉} be the index of a segment in either segmentation approach. A function M(i, α) that returns the anomaly metric of an antigen type α ∈ A in the ith segment, with respect to a predefined segment size z, is defined as M : N × A → R. A true positive is identified if (M(i, α) > εk) ∧ (α = nmap) or (M(i, α) > εk) ∧ (α = pts) holds, and a true negative is identified if (M(i, α) < εk) ∧ (α = firefox) holds. The detection accuracy is equal to the total number of true positives (TP) and true negatives (TN) divided by the total number of positives (P) and negatives (N), calculated as follows.
    ACC = (TP + TN) / (P + N)                    (5.1)
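The per-segment classification and Equation 5.1 can be rendered as the following C sketch. It is illustrative: the metric array layout and function name are assumptions, and the threshold argument stands for either εk or εm.

enum antigen { NMAP, PTS, FIREFOX };

/* Per-segment classification and Equation 5.1. metric[i][a] is assumed to
 * hold M(i, a) for segment i; eps stands for eps_k or eps_m. */
double detection_accuracy(const double metric[][3], int n_segments, double eps)
{
    int correct = 0;
    for (int i = 0; i < n_segments; i++) {
        /* true positives: anomalous types above the threshold */
        if (metric[i][NMAP] > eps) correct++;
        if (metric[i][PTS]  > eps) correct++;
        /* true negative: the normal type below the threshold */
        if (metric[i][FIREFOX] < eps) correct++;
    }
    /* three classifications per segment, so P + N = 3 * n_segments */
    return (double)correct / (3.0 * n_segments);
}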
The same principle and calculation can be applied to MCAV with the threshold εm. These assessments concentrate on the effect of segmentation on the algorithm's performance in an 'online' sense, since the detection performance of every single segment within each setting will be reported.
segment size               10^2          10^3          10^4         10^5     10^6
nmap (µ = −966.207)
  non-seg               < 0.001 (g)   < 0.001 (g)     0.216        0.683    0.729
pts (µ = −1005.068)
  non-seg               < 0.001 (g)   < 0.001 (g)   < 0.001 (g)    0.162    0.554
firefox (µ = −1005.661)
  non-seg                 1.000         1.000         1.000        0.862    0.473

Table 5.3: The p-values (significance level α = 0.05) of one-sample one-sided Mann-Whitney tests for H5.1, with respect to Kα values and the antigen based segmentation. The highlighted cells represent the cases where the antigen based segmentation outperforms the standard DCA ('non-seg' stands for non-segmentation and 'g' indicates greater than).
As segmentation is introduced to the system, a set of anomaly metrics corresponding to all the antigen types of interest is produced per segment, and thus a classification is performed per segment. As one signal instance was collected per second, the duration of the SYN scan can be represented as [1, Ns] ∩ N ⊆ Time, where Time represents the discrete time domain of the given problem. According to [62], the nmap process was invoked between t = 651 and t = 4,120 (t ∈ Time), and therefore the assessed segments (regardless of the segmentation approach) are those contained within this interval ([651, 4120] ⊂ [1, Ns] ∩ N), where the nmap and pts processes are both active. Let σi ∈ {0, 1, 2, 3} be the number of correct classifications (true positives and negatives) made in the ith segment. If the results are assessed through Kα, it is calculated as

    σi = βi + γi + δi    ∀i,

where βi = 1{i | (M(i,α) > εk) ∧ (α = nmap)}(i), γi = 1{i | (M(i,α) > εk) ∧ (α = pts)}(i), and δi = 1{i | (M(i,α) < εk) ∧ (α = firefox)}(i).

Chapter 7

Conclusions and Future Work

When the number of input signal categories is greater than three (m > 3), the entire signal transformation function requires modification, as the original three-dimensional weight matrix is no longer applicable. If the linear fashion of
signal transformation is retained, a new 2 × m weight matrix must be created to transform the m input signals into the two output signals. As a result, the heuristics derived from immunology that define the relationships among the original signal categories of the DCA can no longer be applied; the new weight matrix needs to be acquired through a form of learning based on the underlying characteristics of the given data. The expansion of the input space does not eliminate the need for dimensionality reduction techniques to remove irrelevant or redundant features. Effective dimensionality reduction lowers the complexity of the input data, so the difficulty of creating a model that correctly classifies data instances is reduced accordingly, and the probability of significant information loss through the dimensionality reduction process can also be decreased.

Automated parameter tuning often involves optimisation, in this case multi-objective optimisation, since two objective functions can be derived. One objective function aims to minimise the error of the algorithm's predictions, and this is related to the distribution of the K values calculated from the input signal instances. Let θ̂ = {k̂1, k̂2, ..., k̂n} be a set of normalised K values with respect to the range of the class labels, and θ = {c1, c2, ..., cn} be the set of class labels associated with each input data instance, where n is the data size. An objective function for finding the decision boundary, denoted by f, can be optimised by minimising MSE(θ̂), as defined in Equation 3.25. However, the second objective function, denoted by g, which relates to the frequency of DC maturation, is rather difficult to formulate. It is advisable to have CSM signal shifts at the transitional points of class changes, and this is also determined by the predefined lifespan of each DC. If f is optimised, there is at least some guarantee of the classification performance. The greedy search method used in the inference approach is not practical when applied to a high dimensional feature space, due to its lack of scalability. As a result, other more efficient methods, such as simulated annealing [18], Tabu search [34] or Genetic Algorithms [35], can be applied to search for optima.
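The first objective f is simple to express in code. The following C sketch computes MSE(θ̂) against the class labels; it is a minimal illustration and assumes both arrays are aligned per data instance.

/* Objective f: mean squared error between the normalised K values (theta
 * hat) and the class labels (Equation 3.25); arrays are assumed aligned
 * per data instance. */
double mse_objective(const double *k_hat, const double *labels, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        double err = k_hat[i] - labels[i];
        sum += err * err;
    }
    return sum / n;
}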
Multi-layered perceptrons can also be introduced into the algorithm. The mapping between input and output signals then becomes nonlinear, and the associated weights can be adjusted automatically through training by a gradient-based error function. The work of Oates et al. [124] suggested that it is possible to create a repertoire of classifiers, each physically constructed to accept a different number of input instances and trained using training data smoothed over different time scales. Instead of aggregating the K signal transformed by a rather centralised function from each input signal instance, each DC would individually aggregate the input signal instances on its own and present these averages to the classifier in the repertoire whose time scale is most similar to the DC's lifespan. The classifiers in the repertoire can be any type of non-linear classifier, such as multi-layered perceptron based networks. The proposed solutions may improve the algorithm's capability of handling highly complicated problems; however, they could also significantly increase its computational complexity. Researchers in machine learning often adopt Occam's Razor when designing learning models, and simplicity is the key to successful applications. Vapnik suggested that when trying to solve some problem, one should not solve a more difficult problem as an intermediate step [158]. As a consequence, one should bear in mind the trade-off between computational complexity and classification performance when modifying the current system based on the proposed solutions.
7.3 The Future of AIS
Even though this thesis focuses on one particular immune-inspired algorithm, the findings could contribute to the wider field of AIS, especially to the analysis and development of immune algorithms in general. The theoretical investigation of the DCA adds mathematical foundations to the algorithm, which the author feels could have been used during the initial development, to avoid confusion and provide some guarantees. This was not the case, as no clear guidelines for developing biologically inspired algorithms were available at that time. A commonly used development process involves finding a biological system with computationally plausible properties, and developing an algorithm based on an abstract model of certain parts of that biological system. The resulting
algorithm is then tested on several datasets, to determine its effectiveness and efficiency for real-world applications. This approach has several issues. Firstly, the modelling is often insufficient, personal interests can introduce bias, and the derived models may not adequately represent the useful components of the natural system. Secondly, if the modelling is not based on mathematical methods, future users may be required to formalise the algorithm as mathematical functions in order to perform theoretical analyses. Reverse engineering mathematical foundations into an existing algorithm can be more problematic than basing the algorithm on sound theory to begin with. This work supports similar calls from researchers in the field to ground systems in rigorously defined models. For example, Stepney et al. [146] proposed a conceptual framework for developing biologically inspired algorithms that aims to capture biological richness and complexity without compromising the need for soundly engineered systems. This framework was then further extended by Timmis et al. [153] into an Immuno-Engineering approach for developing immune-inspired algorithms in particular. A successful example of such an approach is the work of Owens et al. [127], where the authors first defined a detailed mathematical model of the T-cell receptor signalling process and then reduced it to a more abstract model without losing its core properties. The abstract model was then combined with machine learning techniques, to develop an algorithm for kernel density estimation and anomaly detection. Both the biological inspiration and mathematical soundness are preserved in the algorithm.

In addition, the empirical investigation of the DCA also provides hints that could be useful for the future development of AIS. Segmentation can be used in situations involving online detection or large datasets, as it provides a means of dividing the data into smaller and more manageable pieces. Such an online solution differs from other techniques in AIS, which usually require modifications of the core immune-inspired algorithm [40, 115]. It can instead be seen as a plug-in component that could improve an artificial immune system's online detection capability without increasing its compu-
tational complexity. The automated data pre-processing methods are tailored for the DCA; however, they still demonstrate the possibility of using techniques from machine learning and statistics to widen the use and improve the performance of immune-inspired algorithms. Both the importance of rigorous foundations and the demonstrable usefulness of machine learning and statistical techniques highlight the significance of maintaining stronger links with other, more established fields.
References

[1] Y. Al-Hammadi. Behavioural Correlation for Malicious Bot Detection. PhD thesis, School of Computer Science, University of Nottingham, 2010.
[2] Y. Al-Hammadi, U. Aickelin, and J. Greensmith. DCA for Bot Detection. In Proceedings of the IEEE World Congress on Computational Intelligence (WCCI), pages 1807–1816, 2008.
[3] N. B. Amor, S. Benferhat, and Z. Elouedi. Naive Bayes vs decision trees in intrusion detection systems. In Proceedings of the 2004 ACM Symposium on Applied Computing, pages 420–424, 2004.
[4] D. Anderson, T. F. Lunt, H. Javitz, A. Tamaru, and A. Valdes. Detecting unusual program behaviour using the statistical component of the next-generation intrusion detection expert system (NIDES). Technical report, Computer Science Laboratory, SRI International, 1995.
[5] P. Andrews and J. Timmis. On diversity and artificial immune systems: Incorporating a diversity operation into aiNet. In Proceedings of the International Conference on Natural and Artificial Immune Systems (NAIS), pages 293–306, 2005.
[6] A. C. Atkinson, M. Riani, and A. Cerioli. Exploring Multivariate Data with the Forward Search. Springer Series in Statistics. Springer, 2004.
[7] T. Back. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford Press, 1996.
[8] R. Bejtlich. The Tao of Network Security Monitoring: Beyond Intrusion Detection. Addison-Wesley, 2004.
[9] L. Biacino and G. Gerla. Fuzzy logic, continuity and effectiveness. Archive for Mathematical Logic, 41(7):643–667, 2001.
[10] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[11] C. L. Blake, S. Hettich, and C. J. Merz. UCI repository of machine learning databases. 1998.
[12] L. Breiman, J. H. Friedman, and R. A. Olshen. Classification and Regression Trees. Wadsworth, 1984.
[13] P. Bretscher and M. Cohn. A theory of self-nonself discrimination. Science, 169(3950):1042–1049, 1970.
[14] S. M. Bridges and R. B. Vaughn. Fuzzy data mining and genetic algorithms applied to intrusion detection. In Proceedings of the 23rd National Information Systems Security Conference, pages 13–31, 2000.
[15] R. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1):14–23, 1986.
[16] C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.
[17] B. Bull and T. Kovacs. Foundations of learning classifier systems: An introduction. In Proceedings of Studies in Fuzziness and Soft Computing, pages 1–17, 2005.
[18] V. Cerny. A thermodynamical approach to the travelling salesman problem: an efficient simulation algorithm. Journal of Optimization Theory and Applications, 45:41–51, 1985.
[19] Z. Chelly and Z. Elouedi. FDCM: A Fuzzy Dendritic Cell Algorithm. In Proceedings of the 9th International Conference on Artificial Immune Systems (ICARIS), pages 102–115, 2010.
[20] E. Clark, A. Hone, and J. Timmis. A Markov Chain Model of the B-Cell Algorithm. In Proceedings of the 4th International Conference on Artificial Immune Systems (ICARIS), pages 318–330, 2005.
[21] E. F. Codd. Cellular Automata. Academic Press, 1968.
[22] G. Coelho and F. J. Von Zuben. OMNI-aiNet: An immune-inspired approach for omni optimisation. In Proceedings of the 5th International Conference on Artificial Immune Systems (ICARIS), pages 294–308, 2006.
[23] I. R. Cohen. Immune system computation and the immunological homunculus. Model Driven Engineering Languages and Systems, 4199:499–512, 2006.
[24] I. R. Cohen. Real and artificial immune systems: Computing the state of the body. Immunological Reviews, 7:569–574, 2007.
[25] J. Cohen, P. Cohen, S. G. West, and L. S. Aiken. Applied multiple regression/correlation analysis for the behavioral sciences. Psychology Press, 3rd edition, 2003.
[26] R. Coico, G. Sunshine, and E. Benjamini. Immunology: A Short Course. John Wiley & Sons, Inc., 5th edition, 2003.
[27] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Third Edition. The MIT Electrical Engineering and Computer Science Series. The MIT Press, 2009.
[28] G. Cousineau, M. Mauny, and K. Callaway. The Functional Approach to Programming. Cambridge University Press, 1998.
[29] T. Cover and P. Hart. Nearest neighbour pattern classification. IEEE Transactions on Information Theory, IT-11:21–27, 1967.
[30] T. M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, EC-14:326–334, 1965.
[31] M. J. Crawley. Statistics: An Introduction Using R. Wiley Blackwell, 2005.
[32] N. Cruz Cortes and C. A. Coello Coello. Multiobjective optimisation using ideas from the clonal selection principle. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 158–170, 2003.
[33] V. Cutello, G. Nicosia, P. Oliveto, and M. Romeo. On the convergence of immune algorithms. In Proceedings of Foundations of Computational Intelligence, pages 409–416, 2007.
[34] D. Cvijovic and J. Klinowski. Taboo search - an approach to the multiple minima problem. Science, 267:664–666, 1995.
[35] L. D. Davis and M. Mitchell. Handbook of Genetic Algorithms. Van Nostrand Reinhold, 1991.
[36] L. N. de Castro and J. Timmis. Artificial Immune Systems: A New Computational Intelligence Approach. Springer-Verlag, 2002.
[37] L. N. de Castro and F. J. Von Zuben. The clonal selection algorithm with engineering applications. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 36–37, 2000.
[38] L. N. de Castro and F. J. Von Zuben. An evolutionary immune network for data clustering. In Proceedings of the Brazilian Symposium on Artificial Neural Networks, pages 84–89, 2002.
[39] L. N. de Castro and F. J. Von Zuben. Learning and optimisation using the clonal selection principle. IEEE Transactions on Evolutionary Computation, 6(3):239–251, 2002.
[40] R. de Lemos, J. Timmis, S. Forrest, and M. Ayara. Immune-Inspired Adaptable Error Detection for Automated Teller Machines. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 37(5):873–886, 2007.
[41] D. E. Denning. An intrusion detection model. IEEE Transactions on Software Engineering, SE-13(2):222–232, 1987.
[42] T. G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems: First International Workshop (MCS 2000), pages 1–15, 2000.
[43] E. W. Dijkstra and C. S. Scholten. Predicate Calculus and Program Semantics. Springer-Verlag, 1990.
[44] G. Dranoff. Cytokines in cancer pathogenesis and cancer therapy. Nature Reviews Cancer, 4:11–22, 2004.
[45] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Blackwell, 2nd edition, 2000.
[46] M. Elberfeld and J. Textor. Efficient Algorithms for String-Based Negative Selection. In Proceedings of the 8th International Conference on Artificial Immune Systems (ICARIS), pages 109–121, 2009.
[47] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data, chapter 4. Kluwer, 2002.
[48] F. Esponda. Negative representation of information. PhD thesis, Department of Computer Science, University of New Mexico, 2005.
[49] F. Esponda, E. S. Ackley, S. Forrest, and P. Helman. Online negative databases. International Journal of Unconventional Computing, 1(3):201–220, 2005.
[50] J. D. Farmer, N. H. Packard, and A. S. Perelson. The immune system, adaptation, and machine learning. Physica D, 4:187–204, 1986.
[51] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 2006.
[52] R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, 2nd edition, 1987.
[53] D. R. Flower and J. Timmis. In Silico Immunology. Springer, 2007.
[54] D. B. Fogel. What is evolutionary computation? IEEE Spectrum, 37(2):28–32, 2000.
[55] S. Forrest, A. Perelson, L. Allen, and R. Cherukuri. Self-nonself discrimination in a computer. In Proceedings of the IEEE Symposium on Research in Security and Privacy, pages 202–212, 1994.
[56] D. Fox, W. Burgard, and S. Thrun. The dynamic window approach to collision avoidance. IEEE Robotics & Automation Magazine, 4(1), 1997.
[57] J. Frank. Artificial intelligence and intrusion detection: Current and future directions. In Proceedings of the 17th National Computer Security Conference, 1994.
[58] P. Garthwaite, I. T. Jolliffe, and B. Jones. Statistical Inference. Oxford Press, 2003.
[59] M. Glickman, J. Balthrop, and S. Forrest. A machine learning evaluation of an artificial immune system. Evolutionary Computation, 13(2):179–212, 2005.
[60] R. A. Goldsby, T. J. Kindt, B. A. Osborne, and J. Kuby. Immunology. W. H. Freeman and Company, 5th edition, 2003.
[61] F. A. Gonzalez and D. Dasgupta. Anomaly detection using real-valued negative selection. Genetic Programming and Evolvable Machines, 4(4):383–403, 2003.
[62] J. Greensmith. The Dendritic Cell Algorithm. PhD thesis, School of Computer Science, University of Nottingham, 2007.
[63] J. Greensmith and U. Aickelin. DCA for SYN Scan Detection. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 49–56, 2007.
[64] J. Greensmith and U. Aickelin. The Deterministic Dendritic Cell Algorithm. In Proceedings of the 7th International Conference on Artificial Immune Systems (ICARIS), pages 291–303, 2008.
[65] J. Greensmith, U. Aickelin, and S. Cayzer. Introducing Dendritic Cells as a Novel Immune-Inspired Algorithm for Anomaly Detection. In Proceedings of the 4th International Conference on Artificial Immune Systems (ICARIS), pages 153–167, 2005.
[66] J. Greensmith, U. Aickelin, and G. Tedesco. Information fusion for anomaly detection with the dendritic cell algorithm. Information Fusion, 11(1):21–34, 2010.
[67] J. Greensmith, J. Feyereisl, and U. Aickelin. The DCA: SOMe Comparison - A comparative study between two biologically-inspired algorithms. Evolutionary Intelligence, 1(2):85–112, 2008.
[68] J. Greensmith, J. Twycross, and U. Aickelin. Articulation and Clarification of the Dendritic Cell Algorithm. In Proceedings of the 5th International Conference on Artificial Immune Systems (ICARIS), pages 404–417, 2006.
[69] F. Gu, J. Feyereisl, R. Oates, J. Reps, J. Greensmith, and U. Aickelin. Quiet in Class: Classification, Noise and the Dendritic Cell Algorithm. In Proceedings of the 10th International Conference on Artificial Immune Systems (ICARIS), LNCS Volume 6825, pages 173–186, 2011.
[70] F. Gu, J. Greensmith, and U. Aickelin. Further Exploration of the Dendritic Cell Algorithm: Antigen Multiplier and Time Windows. In Proceedings of the 7th International Conference on Artificial Immune Systems (ICARIS), pages 142–153, 2008.
[71] F. Gu, J. Greensmith, and U. Aickelin. Exploration of the Dendritic Cell Algorithm with the Duration Calculus. In Proceedings of the 8th International Conference on Artificial Immune Systems (ICARIS), pages 54–66, 2009.
[72] F. Gu, J. Greensmith, and U. Aickelin. Integrating Real-Time Analysis With The Dendritic Cell Algorithm Through Segmentation. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 1203–1210, 2009.
[73] F. Gu, J. Greensmith, and U. Aickelin. The Dendritic Cell Algorithm for Intrusion Detection, chapter in Biologically Inspired Networking and Sensing: Algorithms and Architectures. IGI Global, 2011.
[74] F. Gu, J. Greensmith, R. Oates, and U. Aickelin. PCA 4 DCA: the application of Principal Component Analysis to the Dendritic Cell Algorithm. In Proceedings of the 9th Annual Workshop on Computational Intelligence (UKCI), 2009.
[75] I. Guyon and A. Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[76] P. K. Harmer, P. D. Williams, G. H. Gunsch, and G. B. Lamont. An artificial immune system architecture for computer security applications. IEEE Transactions on Evolutionary Computation, 6(3):252–280, 2002.
[77] E. Hart and J. Timmis. Application Areas of AIS: The Past, The Present and The Future. Journal of Applied Soft Computing, 8(1):191–201, 2008.
[78] H. He, X. Luo, and B. Liu. Detecting anomalous network traffic with combined fuzzy-based approaches. In Advances in Intelligent Computing, pages 433–442, 2005.
[79] J. Healey and R. Picard. Detecting stress during real-world driving tasks using physiological sensors. IEEE Transactions on Intelligent Transportation Systems, 6(2):156–166, 2005.
181
REFERENCES
[80] M. E. Hellman and J. Raviv. Probability of Error, Equivocation, and the Chernoff Bound. IEEE Transcations on Information Theory, IT-16(4):368–382, 1970. [81] A. Hofmann, C. Schmitz, and B. Sick. Rule extraction from neural networks for intrusion detection in computer networks. IEEE Transactions on Systems, Man and Cybernetics, 2:1259–1296, 2003. [82] S. Hofmeyr and S. Forrest. Architecture for an artificial immune system. Evolutionary Computation, 7(1):1289–1296, 2000. [83] S. A. Hofmeyr and S. Forrest. Immunity by Design: An Artificial Immune System. In Proceedings of the Genetic and Evolutionary Computation Conference, volume 2, pages 1289–1296, Orlando, Florida, USA, 13–17 1999. Morgan Kaufmann. [84] J. Holland and J. Reitman. Cognitive systems based on adaptive algorithms, chapter Pattern-Directed Inference Systems. Academic Press, 1978. [85] H. Hotelling. Analysis of a complex of statistial variables into principal componets. Journal of Educational Psychology, 24(417-441), 1933. [86] MIT Lincoln Lab Information System Technology Group. The 1998 Intrusion Detection Off-line Evaluation Plan. http://www.ll.mit.edu/IST/ideval/data/1998/, March 1998. [87] MIT Lincoln Lab Information System Technology Group. The 1999 Intrusion Detection Off-line Evaluation Plan. http://www.ll.mit.edu/IST/ideval/data/1999/, 1999. [88] P. Jackson. Introduction to expert systems. Addison Wesley, 3rd edition, 1999. [89] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical Pattern Recognition: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000.
182
REFERENCES
[90] A. G. Janecek, W. N. Gansterer, M. A. Demel, and G. F. Ecker. On the Relationship Between Feature Selection and Classification Accuracy. In JMLR: Workshop and Conference Proceedings 4, pages 90–105, 2008. [91] C. A. Janeway and R. Medzhitov. Innate immune recognition. Annual Review of Immunology, 20:197–216, 2002. [92] T. Jansen and C. Zarges. A theoretical analysis of immune inspired somatic contiguous hypermutations for function optimisation. In Proceedings of the 8th International Conference on Artificial Immune Systems (ICARIS), pages 80–94, 2009. [93] A. Jean-Raymond. The B-Book. Cambridge University Press, 1996. [94] N. K. Jerne. Towards a network theory of the immune system. Ann. Immunol. (Inst. Pasteur), 125C:373–389, 1974. [95] N. Kayacik, G. amd Zincir-Heywood and M. Heywood. On the Capability of an SOM based Intrusion Detection System. Proceedings of International Joint Conference on Neural Networks, 3:1808– 1813, 2003. [96] N. Kayacik, G. amd Zincir-Heywood and M. Heywood. Selecting Features for Intrusion Detection: A Feature Relevance Analysis on KDD 99 Intrusion Detection Datasets. In The 3rd Annual Conference on Privacy, Security and Trust (PST), 2005. [97] J. Kelsey and J. Timmis. Immune inspired somatic contiguous hypermutation for function optimisation. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 207–218, 2003. [98] D. S. Kim, H. N. Nguyen, S. Y. Ohn, and J. S. Park. Fusions of ga and svm for anomaly detection in intrusion detection system. In Advances in Neural Networks, pages 415–420, 2005.
183
REFERENCES
[99] J. Kim, P. Bentley, U. Aickelin, J. Greensmith, G. Tedesco, and J. Twycross. Immune System Approaches to Intrusion Detection - A Review. Natural Computing, 6(4):413–466, 2007. [100] J. W. Kim. Integrating Artificial Immune Algorithms for Intrusion Detection. PhD thesis, Department of Computer Science, University College London, 2002. [101] C. Kruegel, F. Valeur, G. Vigna, and R. Kemmerer. Stateful Intrusion Detection for High-Speed Neworks. In IEEE Symposium on Security and Privacy, 2002. [102] C. G. Langton. Artificial life: an overview. MIT Press, 1997. [103] N. Lay and I. Bate. Improving the reliability of real-time embedded systems using innate immune techniques. Evolutionary Intelligence, 1(2):113–132, 2008. [104] J. Le Boundec and S. Sarafijanovic. An artificial immune system approach to misbehaviour detection in mobile ad-hoc networks. In Proceedings of the 1st International Workshop on Biologically Inspired Approaches to Advanced Information Technology, pages 96–111, 2004. [105] W. Lee and S. J. Stolfo. Data mining approaches for intrusion detection. In Proceedings of the 7th conference on USENIX Security Symposium, 1998. [106] I. Levin.
KDD-99 Classifier Learning Contest: LLSoft’s Results Overview.
SIGKDD Explorations, 1(2):67–75, 2000. [107] M. Liskiewicz and J. Textor. Negative Selection Algorithms Without Generating Detectors. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1047–1054, 2010. [108] J. Luo and S. M. Bridges. Mining fuzzy association rules and fuzzy frequency episodes for intrusion detection. International Journal of Intelligent Systems, 15(8):687–703, 2001.
184
REFERENCES
[109] M. B. Lutz and G. Schuler. Immature, semi-mature and fully mature dendritic cells: which signals induce tolerance or immunity?
TRENDS in Immunology,
23(9):445–449, 2002. [110] J. McHugh. Intrusion and instruion detection. International Journal of Information Security, 1(1):14–35, 2011. [111] L. Me. GASSATA, a genetic algorithm as an alternative tool for security audit trails analysis. In Proceedings of the 1st International Workshop on the Recent Advances in Intrusion Detection, 1998. [112] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, A 209:415–446, 1909. [113] M. Mitchell. An Introduction to Genetic Algorithms. The MIT Press, 1998. [114] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997. [115] M. Mokhtar, R. Bi, J. Timmis, and A. M. Tyrrell. A Modified Dendritic Cell Algorithm for On-line Error Detection in Robotic Systems. In Proceedings of the 11th IEEE Congress on Evolutionary Computation (CEC), pages 2055–2062, 2009. [116] B. Moszkowski. A temporal logic for multilevel reasoning about hardware. Computer, 18(2):10–19, 1985. [117] C. Musselle. Insight into the Antigen Sampling Component of the Dendritic Cell Algorithm. In Proceedings of the 9th International Conference on Artificial Immune Systems (ICARIS), pages 88–101, 2010. [118] J. L. Myers and A. D. Well. Research Design and Statistical Analysis. Routledge, 2nd edition, 2002. [119] S. Northcutt and J. Novak. Network Intrusion Detection. New Riders Publishing, 3rd edition, 2003. 185
REFERENCES
[120] R. Oates. The Suitability of the Dendritic Cell Algorithm for Robotic Security Applications. PhD thesis, School of Computer Science, University of Nottingham, 2010. [121] R. Oates, J. Greensmith, U. Aickelin, J. Garibaldi, and G. Kendall. The Application of a Dendritic Cell Algorithm to a Robotic Classifier. In Proceedings of the 6th International Conference on Artificial Immune (ICARIS), pages 204–215, 2007. [122] R. Oates, G. Kendall, and J. Garibaldi. Frequency Analysis for Dendritic Cell Population Tuning: Decimating the Dendritic Cell. Evolutionary Intelligence, 1(2), 2008. [123] R. Oates, G. Kendall, and J. Garibaldi. The Limitations of Frequency Analysis for Dendritic Cell Population Modelling. In Proceedings of the 7th International Conference on Artificial Immune Systems (ICARIS), pages 328–339, 2008. [124] R. Oates, G. Kendall, and J. Garibaldi. Classifying in the presence of uncertainty: a DCA persepctive. In Proceedings of the 9th International Conference on Artificial Immune Systems (ICARIS), pages 75–87, 2010. [125] E. Olderog and H. Dierks. Real-Time Systmes: Formal Specification and Automatic Verification. Cambridge University Press, 2008. [126] M. Ostaszewski, P. Bouvry, and F. Seredynski. Denial of service detection and analysis using idiotypic networks paradigm. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 79–86, 2008. [127] N. Owens, A. Greensted, J. Timmis, and A. Tyrrell. T-cell receptor signalling inspired kernel density estimation and anomaly detection. In Proceedings of the 8th International Conference on Artificial Immune Systems (ICARIS), pages 122– 135, 2009.
186
REFERENCES
[128] E. Palmer. Negative selection - clearing out the bad apples from the t-cell repertoire. Nature Reviews Immunology, 3:383–391, 2003. [129] H. C. Peng, F. H. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005. [130] A. S. Perelson and G. F. Oster. Theoretical studies of clonal selection: Minimal antibody repertoire size and reliability of slef-non-self discrimination. Journal of Theoretical Biology, 81(4):645–670, 1979. [131] K. Polat, S. Sahan, and S. Gunes. A new method to medical diagnosis: Artificial immune recognition system (airs) with fuzzy weighted pre-processing and application to ecg arrhythmia. Expert Systems with Applications, 31(2):264–269, 2006. [132] K. Rajewsky. Clonal selection and learning in the antibody system. Nature, 381:751– 758, 1996. [133] A. P. Ravn, H. Rischel, and K. M. Hansen. Specifying and Verifying Requirements of Real-time Systems. IEEE Transactions on Software Engineering, 19(1):41–55, 1993. [134] G. Rudolph. Finite markov chain results in evolutionary computation: a tour d’horizon. Fundamenta Informaticae, 35(1-4):67–89, 1998. [135] J. Ryan, M. J. Lin, and R. Miikkulainen. Intrusion detection with neural networks. In Advances in Neural Information Processing Systems 10, pages 943–949, 1998. [136] S. Saitoh. Theory of Reproducing Kernels and its Applicaitons. Longman Scientific and Technical, 1988. [137] B. Scholkopf, S. Mika, C. J. C. Burges, P. Knirsch, K. R. Muller, G. Ratsch, and
187
REFERENCES
A. J. Smola. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, 1999. [138] B. Scholkopf and A. J. Smola. Learning with Kernels : Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2002. [139] B. Scholkopf, A. J. Smola, and K. R. Muller. Kernel principal component analysis. In Proceedings of the 7th International Conference on Artificial Neural Networks, pages 583–588, 1997. [140] K. Shafi, T. Kovacs, H. A. Abbass, and W. Zhu. Intrusion detection with evolutionary learning classifier systems. Natural Computing, 8(1):3–27, 2009. [141] H. Shah, J. Undercoffer, and A. Joshi. Fuzzy clustering for intrusion detection. In Proceedings of the 12th IEEE International Conference on Fuzzy Systems, pages 1274–1278, 2003. [142] A. Sharma and K. K. Paliwala. Fast principal component analysis using fixed-point algorithm. Pattern Recognition Letters, 28(10):1151–1155, 2007. [143] J. Shlens. A tutorial on principal component analysis. Technical Report 2, System Neurobiology Laboratory, Salk Institue for Biological Science, 2005. [144] M. Siper. Introduction to the Theory of Computation. Course Technology, 2nd edition, 2005. [145] Sourcefire. Snort: an open source network intrusion prevention and detection system (ids/ips). In http://www.snort.org/, 2011. [146] S. Stepney, R. Smith, J. Timmis, A. Tyrrell, M. Neal, and A. Hone. Conceptual frameworks for artificial immune systems. International Journal of Unconventional Computing, 1(3):315–338, 2006. [147] T. Stibor. On the appropriateness of negative selection for anomaly detection and network intrusion detection. PhD thesis, Darmstadt University of Technology, 2006. 188
REFERENCES
[148] T. Stibor, P. Mohr, J. Timmis, and C. Eckert. Is negative selection appropriate for anomaly detection?
In Genetic And Evolutionary Computation Conference
(GECCO), pages 321–328, 2005. [149] T. Stibor, R. Oates, G. Kendall, and J. Garibaldi. Geometrical insights into the dendritic cell algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 1275–1282, 2009. [150] T. Stibor and J. Timmis. An investigation into the compression quality of ainet. In Proceedings of Foundations of Computational Intelligence, 2007. [151] T. Stibor, J. Timmis, and E. Claudia. A Comparative Study of Real-Valued Negative Selection to Statistical Anomaly Detection Techniques. In Proceedings of the 4th International Conference on Artificial Immune Systems (ICARIS), pages 262–375, 2005. [152] J. Timmis, P. Andrews, N. Owens, and E. Clark. An Interdisciplinary Perspective on Artificial Immune Systems. Evolutionary Intelligence, 1(1):5–26, 2008. [153] J. Timmis, E. Hart, A. Hone, M. Neal, A. Bobins, S. Stepney, and A. Tyrrell. Immuno-Engineering. In Proceedings of the 2nd IFIP International Conference on Biologically Inspired Collaborative Computing, IEEE Press Vol: 268/2008, pages 3–17, 2008. [154] J. Timmis, A. Home, T. Stibor, and E. Clark. Theoretical advances in artificial immune systems. Theoretical Computer Sicence, 403(1):11–32, 2008. [155] J. Timmis and M. Neal. A resource limited artificial immune system for data analysis. Knowledge-Based Systems, 14(3-4):121–130, 2001. [156] K. Torkkola. Feature Extraction by Non-parametric Mutual Information Maximisation. Journal of Mahince Learning Research, 3:1415–1438, 2003. [157] J. Twycross and U. Aickelin. Libtissue - Implementing Innate Immunity. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), 2006. 189
REFERENCES
[158] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 2nd edition, 1999. [159] M. Villalobos-Arias, C. A. Coello Coello, and O. Hernandez-Lerma. Convergence Analysis of a Multiobjective Artificial Immune System Algorithm. In Proceedings of the 4th International Conference on Artificial Immune Systems (ICARIS), volume 226-235, 2004. [160] A. Watkins and J. Timmis. Artificial immune recognition system (airs): Revisions and refinements. In Proceedings of the 1st International Conference on Artificial Immune Systems (ICARIS 2002), pages 173–181, 2002. [161] A. Watkins, J. Timmis, and L. Boggess. Artificial immune recognition system (AIRS): An immune-inspired supervised learning algorithm. Genetic Programming and Evolvable Machines, 5(1):291–317, 2004. [162] N. P. Weng. Aging of the immune system: How much can the adaptive immune system adapt? Immunity, 24(5):495–499, 2006. [163] A. Whitbrook, U. Aickelin, and J. Garibaldi. An Idiotypic Immune Network as a Short-Term Learning Architecture for Mobile Robots. In Proceedings of the 7th International Conference on Artificial Immune Systems (ICARIS), pages 266–278, 2008. [164] A. Whitbrook, U. Aickelin, and J. Garibaldi. Two-Timescale Learning Using Idiotypic Behaviour Mediation for A Navigating Mobile Robot. Journal of Applied Soft Computing, 10, 2010. [165] M. Williamson. Biologically inspired approaches to computer security. Technical report, HP Laboratories Bristol, 2002. [166] S. X. Wu and W. Banzhaf. The use of computational intelligence in intrusion detection systems: a review. Applied Soft Computing, 10(1):1–35, 2010. [167] Q. Xu, W. Pei, L. Yang, and Q. Zhao. An intrusion detection approach based on
190
REFERENCES
understandable neural network trees. International Journal of Computer Science and Network Security, 6(11):229–234, 2006. [168] C. Zarges. Rigorous runtime analysis of inversely fitness proportional muation rates. In Proceedings of Parallel Problem Solving from Nature (PPSN), pages 112–122, 2008. [169] C. Zarges. On the utility of the population size for inversely fitness proportional mutation rates. In Proceedings of the 10th ACM SIGEVO Workshop on Fundations of Genetic Algorithms (FOGA), pages 39–46, 2009. [170] Z. H. Zhang and H. Shen. Application of online-training SVMs for real-time intrusion detection with different considersations. Computer Communications, 28(12):1428–1442, 2005. [171] C. Zhou and M. R. Hansen. A calculus of durations. Information Processing Letters, 40(5):269–276, 1991. [172] C. Zhou and M. R. Hansen. Duration Calculus: A Formal Approach to Real-Time Systems. Springer-Verlag, 2004.
191
Appendix A
Additional Figures of Chapter 5
[Plot panels: Classification Count (0–3) against Segment Index (0–12000), with histograms of Percentage (0.0–1.0); one row of panels per anomaly metric (Kα and MCAV).]
Figure A.1: Plots of the number of correct classifications against the segment index, and corresponding histograms of the percentage frequency of each unique value, for the antigen-based segmentation (z = 10^3), where both versions of the anomaly metrics (Kα and MCAV) are used.
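The two anomaly metrics shown throughout this appendix are computed per antigen type from the contexts in which dendritic cells present that antigen. The short Python sketch below follows the formulations commonly used in the DCA literature; the function names and sample values are illustrative, not taken from the thesis implementation.

def mcav(contexts):
    # MCAV: fraction of presentations made in the mature context (1)
    # rather than the semi-mature context (0).
    return sum(1 for c in contexts if c == 1) / len(contexts)

def k_alpha(k_values):
    # K-alpha: mean of the signed k values of the presenting cells;
    # in the deterministic DCA a positive value indicates anomaly.
    return sum(k_values) / len(k_values)

contexts = [1, 0, 1, 1]           # illustrative contexts for one antigen type
k_values = [2.5, -1.0, 3.2, 0.4]  # illustrative per-cell k values
print(mcav(contexts))             # 0.75
print(k_alpha(k_values))          # 1.275

An antigen type is then typically labelled anomalous when its MCAV exceeds a chosen threshold, or when its Kα is positive.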
[Plot panels: Classification Count (0–3) against Segment Index (0–1200), with histograms of Percentage (0.0–1.0); one row of panels per anomaly metric (Kα and MCAV).]
Figure A.2: Plots of the number of correct classifications against the segment index, and corresponding histograms of the percentage frequency of each unique value, for the antigen-based segmentation (z = 10^4), where both versions of the anomaly metrics (Kα and MCAV) are used.
[Plot panels: Classification Count (0–3) against Segment Index (0–120), with histograms of Percentage (0.0–1.0); one row of panels per anomaly metric (Kα and MCAV).]
Figure A.3: Plots of the number of correct classifications against the segment index, and corresponding histograms of the percentage frequency of each unique value, for the antigen-based segmentation (z = 10^5), where both versions of the anomaly metrics (Kα and MCAV) are used.
[Plot panels: Classification Count (0–3) against Segment Index (0–12), with histograms of Percentage (0.0–1.0); one row of panels per anomaly metric (Kα and MCAV).]
Figure A.4: Plots of the number of correct classifications against the segment index, and corresponding histograms of the percentage frequency of each unique value, for the antigen-based segmentation (z = 10^6), where both versions of the anomaly metrics (Kα and MCAV) are used.
[Plot panels: Classification Count (0–3) against Segment Index (0–350), with histograms of Percentage (0.0–1.0); one row of panels per anomaly metric (Kα and MCAV).]
Figure A.5: Plots of the number of correct classifications against the segment index, and corresponding histograms of the percentage frequency of each unique value, for the time-based segmentation (z = 10), where both versions of the anomaly metrics (Kα and MCAV) are used.
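The antigen-based and time-based segmentation schemes compared in this appendix differ only in how a segment is closed: by a count of processed antigens or by elapsed time. A minimal sketch of the two schemes, assuming a stream of (timestamp, antigen) records and using illustrative identifiers:

def antigen_based_segments(stream, z):
    # Close a segment after every z antigen records.
    segment = []
    for record in stream:
        segment.append(record)
        if len(segment) == z:
            yield segment
            segment = []
    if segment:
        yield segment  # final partial segment

def time_based_segments(stream, z):
    # Close a segment after every z time units; the stream is
    # assumed to be sorted by timestamp.
    segment, boundary = [], None
    for timestamp, antigen in stream:
        if boundary is None:
            boundary = timestamp + z
        while timestamp >= boundary:  # emit every elapsed window
            yield segment
            segment, boundary = [], boundary + z
        segment.append((timestamp, antigen))
    if segment:
        yield segment

Antigen-based segments therefore contain a fixed amount of evidence but span a variable period, whereas time-based segments span a fixed period but contain a variable amount of evidence.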
[Plot panels: Classification Count (0–3) against Segment Index (0–35), with histograms of Percentage (0.0–1.0); one row of panels per anomaly metric (Kα and MCAV).]
Figure A.6: Plots of the number of correct classifications against the segment index, and corresponding histograms of the percentage frequency of each unique value, for the time-based segmentation (z = 10^2), where both versions of the anomaly metrics (Kα and MCAV) are used.
[Plot panels: Classification Count (0–3) against Segment Index (0–4), with histograms of Percentage (0.0–1.0); one row of panels per anomaly metric (Kα and MCAV).]
Figure A.7: Plots of the number of correct classifications against the segment index, and corresponding histograms of the percentage frequency of each unique value, for the time-based segmentation (z = 10^3), where both versions of the anomaly metrics (Kα and MCAV) are used.
Appendix B
Additional Tables of Chapter 6
        MAN                  IFG                  PCA                  COR
      TPR   FPR   ACC    TPR   FPR   ACC    TPR   FPR   ACC    TPR   FPR   ACC
01  0.953 0.267 0.843  0.866 0.003 0.932  0.879 0.002 0.939  0.971 0.526 0.723
02  0.931 0.003 0.964  0.595 0.001 0.797  0.916 0.000 0.958  0.966 0.449 0.759
03  0.949 0.000 0.975  0.554 0.002 0.776  0.935 0.000 0.968  0.977 0.441 0.768
04  0.992 0.000 0.996  0.994 0.000 0.997  0.989 0.000 0.995  0.999 0.314 0.843
05  0.685 0.000 0.843  1.000 0.000 1.000  0.983 0.000 0.992  1.000 0.000 1.000
06  0.897 0.000 0.949  1.000 0.000 1.000  0.987 0.000 0.994  1.000 0.000 1.000
07  0.995 0.000 0.998  0.995 0.000 0.998  0.995 0.000 0.998  0.997 0.459 0.769
08  0.940 0.000 0.970  0.598 0.000 0.799  0.957 0.000 0.979  0.994 0.278 0.858
09  0.617 0.000 0.809  0.971 0.000 0.986  0.999 0.000 1.000  1.000 0.360 0.820
10  0.991 0.445 0.773  0.619 0.002 0.809  0.312 0.194 0.559  0.993 0.665 0.664

Table B.1: Results of the DCA on the 10% KDD dataset, including True Positive Rate (TPR), False Positive Rate (FPR) and Detection Accuracy (ACC) for the Manual (MAN), Information Gain based (IFG), PCA based (PCA) and Correlation based (COR) methods.
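The three measures reported in these tables follow the standard confusion-matrix definitions, summarised in the sketch below. The counts are illustrative; they are chosen so that, under an equal split of anomalous and normal instances, they reproduce the session 01 MAN entries above.

def rates(tp, fn, fp, tn):
    tpr = tp / (tp + fn)                    # True Positive Rate
    fpr = fp / (fp + tn)                    # False Positive Rate
    acc = (tp + tn) / (tp + fn + fp + tn)   # Detection Accuracy
    return tpr, fpr, acc

# 953 of 1000 anomalies detected, 267 of 1000 normal instances misclassified:
print(rates(tp=953, fn=47, fp=267, tn=733))  # (0.953, 0.267, 0.843)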
        KNN                  DTree                SVM
      TPR   FPR   ACC    TPR   FPR   ACC    TPR   FPR   ACC
01  0.973 0.003 0.985  0.974 0.009 0.983  0.959 0.010 0.975
02  0.973 0.003 0.985  0.967 0.006 0.981  0.962 0.016 0.973
03  0.980 0.008 0.986  0.960 0.003 0.979  0.951 0.019 0.966
04  1.000 0.010 0.995  0.999 0.005 0.997  0.996 0.012 0.992
05  1.000 0.000 1.000  1.000 0.000 1.000  1.000 0.000 1.000
06  1.000 0.000 1.000  1.000 0.000 1.000  1.000 0.000 1.000
07  0.990 0.003 0.994  0.995 0.005 0.995  0.995 0.021 0.987
08  0.990 0.005 0.993  0.941 0.005 0.968  0.939 0.005 0.967
09  0.964 0.003 0.981  1.000 0.000 1.000  1.000 0.003 0.999
10  0.993 0.007 0.993  0.988 0.005 0.992  0.979 0.006 0.987

Table B.2: Results of the machine learning techniques on the 10% KDD dataset, including True Positive Rate (TPR), False Positive Rate (FPR) and Detection Accuracy (ACC) for the K-Nearest Neighbour (KNN), Decision Trees (DTree), and Support Vector Machines (SVM) algorithms.
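Benchmark classifiers of the kind compared in Tables B.2 and B.4 are available in standard libraries. The sketch below uses scikit-learn with default hyperparameters, which are an assumption here rather than the settings used in the thesis, and presumes an already pre-processed, binary-labelled train/test split (X_train, y_train, X_test, y_test).

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

def evaluate(model, X_train, y_train, X_test, y_test):
    # Fit the model, predict, and derive TPR, FPR and ACC
    # from the binary confusion matrix.
    model.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    return tp / (tp + fn), fp / (fp + tn), (tp + tn) / (tp + tn + fp + fn)

models = {
    "KNN": KNeighborsClassifier(),
    "DTree": DecisionTreeClassifier(),
    "SVM": SVC(),
}
# for name, model in models.items():
#     print(name, evaluate(model, X_train, y_train, X_test, y_test))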
202
        MAN                  IFG                  PCA                  COR
      TPR   FPR   ACC    TPR   FPR   ACC    TPR   FPR   ACC    TPR   FPR   ACC
01  0.981 0.061 0.960  0.958 0.003 0.978  0.971 0.002 0.985  0.919 0.002 0.959
02  0.964 0.000 0.982  0.664 0.003 0.831  0.949 0.000 0.975  0.931 0.016 0.958
03  0.976 0.000 0.988  0.671 0.007 0.832  0.955 0.000 0.978  0.759 0.008 0.876
04  0.787 0.000 0.894  0.998 0.000 0.999  0.998 0.000 0.999  0.992 0.001 0.996
05  0.684 0.000 0.842  1.000 0.000 1.000  0.983 0.000 0.992  0.691 0.000 0.846
06  0.904 0.000 0.952  1.000 0.000 1.000  0.987 0.000 0.994  0.898 0.000 0.949
07  0.996 0.000 0.998  0.996 0.000 0.998  0.996 0.000 0.998  0.995 0.000 0.998
08  0.939 0.000 0.970  0.629 0.000 0.815  0.947 0.000 0.974  0.598 0.000 0.799
09  0.621 0.000 0.688  0.603 0.000 0.802  0.999 0.000 1.000  0.811 0.000 0.844
10  0.994 0.123 0.919  0.663 0.002 0.831  0.309 0.197 0.556  0.936 0.002 0.959

Table B.3: Results of the DCA on the whole KDD dataset, including True Positive Rate (TPR), False Positive Rate (FPR) and Detection Accuracy (ACC) for the Manual (MAN), Information Gain based (IFG), PCA based (PCA) and Correlation based (COR) methods.
        KNN                  DTree                SVM
      TPR   FPR   ACC    TPR   FPR   ACC    TPR   FPR   ACC
01  0.981 0.002 0.990  0.972 0.007 0.983  0.960 0.074 0.943
02  0.991 0.003 0.994  0.966 0.004 0.981  0.962 0.014 0.974
03  0.991 0.004 0.994  0.984 0.002 0.991  0.977 0.020 0.979
04  1.000 0.000 1.000  1.000 0.000 1.000  1.000 0.000 1.000
05  1.000 0.000 1.000  1.000 0.000 1.000  1.000 0.000 1.000
06  1.000 0.000 1.000  1.000 0.000 1.000  1.000 0.000 1.000
07  1.000 0.005 0.998  0.996 0.002 0.997  0.999 0.003 0.998
08  0.995 0.002 0.997  0.949 0.001 0.974  0.998 0.001 0.999
09  0.653 0.001 0.826  1.000 0.001 1.000  0.998 0.000 0.999
10  0.994 0.006 0.994  0.989 0.006 0.992  0.987 0.000 0.994

Table B.4: Results of the machine learning techniques on the whole KDD dataset, including True Positive Rate (TPR), False Positive Rate (FPR) and Detection Accuracy (ACC) for the K-Nearest Neighbour (KNN), Decision Trees (DTree), and Support Vector Machines (SVM) algorithms.
Appendix C
Samples of the SYN Scan Dataset
Figure C.1: A sample of the antigen instances in the SYN scan dataset.
Figure C.2: A sample of the signal instances in the SYN scan dataset.