Model-Based Subgroup Discovery
Hao Song
A dissertation submitted to the University of Bristol in accordance with the requirements for award of the degree of Doctor of Philosophy in the Faculty of Engineering.
Department of Computer Science University of Bristol September 2017
Abstract

This thesis investigates and extends the frameworks of Subgroup Discovery and Exceptional Model Mining for knowledge discovery in databases, where the task is to locate interesting subsets of the given data according to some specific statistical deviations. While subgroup discovery traditionally provides approaches to capture statistical information from a univariate target variable, exceptional model mining supports multivariate targets as well as a wider range of options for the statistical model. However, measures in both subgroup discovery and exceptional model mining are defined parametrically, which means that the measures are model-specific and hence limit the capability of the framework to adopt different statistical models and datasets. In this thesis we propose generic measures that make it possible to apply subgroup discovery to a wider variety of datasets based on richer kinds of models. Instead of defining parameter-driven measures, we design measures through an evaluation-driven approach, which means that the obtained measures can be generalised to any model class that can be evaluated in the same way. We define such approaches with generic, evaluation-driven measures as Model-Based Subgroup Discovery. Specifically, we adopt Proper Scoring Rules to derive measures to evaluate subgroups with any probabilistic model. We demonstrate that, with suitable model assumptions, the proposed measures can locate subgroups with better summaries compared to existing measures. The generality and computational efficiency of these measures are also investigated and demonstrated with applications on datasets from smart homes. With Hidden Markov Models, we further illustrate the capability of our approach on sequential datasets. We also propose measures formed with non-probabilistic regression models, together with applications on model reuse in the context of predictive machine learning.
Dedication
To ......
Acknowledgement

Here I would like to express my greatest appreciation to the people who have guided and accompanied me through the three and a half years of my PhD study. My supervisor: Peter Flach. My industrial supervisor: Georgios Kalogridis. My PhD progress reviewer: Oliver Ray. The director and work-package leaders of the SPHERE project: Ian Craddock, Peter Flach, Rachael Gooberman-Hill, Majid Mirmehdi, George Oikonomou, Robert Piechocki, Bernard Stark. My colleagues and friends in both the Machine Learning group and the SPHERE project: Reem Al-Otaibi, Yu Chen, Tom Diethe, Meelis Kull, Louise Millard, Miquel Perelló Nieto, Kacper Sokol, Niall Twomey, Alison Burrows, Massimo Camplani, Dima Damen, Xenofon Fafoutis, Sion Hannuna, Ben Miller, Emma Pritchard, Bo Tan, Lili Tao, Pete Wolonosiki, Yangdi Xu, Ni Zhu. My colleagues and friends at Toshiba Research Europe, Telecommunication Laboratory: Zhong Fan, Aftab Khan, Mahesh Sooriyabandara. And all my friends and my family. My acknowledgement goes to Toshiba Research Europe, Telecommunication Laboratory, for funding my research, and to the Engineering and Physical Sciences Research Council (EPSRC) for funding the SPHERE project.
Declaration

I declare that the work in this dissertation was carried out in accordance with the requirements of the University’s Regulations and Code of Practice for Research Degree Programmes and that it has not been submitted for any other academic award. Except where indicated by specific reference in the text, the work is the candidate’s own work. Work done in collaboration with, or with the assistance of, others, is indicated as such. Any views expressed in the dissertation are those of the author.
Contents

1 Introduction
  1.1 Subgroup: Modelling and Post-Modelling
  1.2 Discovery: With Richer Data and Richer Models
  1.3 The SPHERE Project
  1.4 Contributions and Achievements
  1.5 Thesis Structure

2 Preliminaries and Related Work
  2.1 Dataset and Sequential Dataset
  2.2 Subgroup Discovery
    2.2.1 Description Language
    2.2.2 Search Strategy and Search Algorithm
    2.2.3 Quality Measure: Definition and Properties
    2.2.4 Quality Measure: Traditional Subgroup Discovery
    2.2.5 Quality Measure: Exceptional Model Mining
    2.2.6 Related Work
  2.3 Proper Scoring Rules
  2.4 Hidden Markov Model
    2.4.1 Inference and Prediction
    2.4.2 Training
  2.5 Chapter Summary

3 Subgroup Discovery with Proper Scoring Rules
  3.1 Statistical Summary
  3.2 Generalised Information Gain under Proper Scoring Rules
  3.3 Generative Modelling for Improved Estimation
    3.3.1 The Generative Model
    3.3.2 Improved Quality Measures
  3.4 Isometric Analysis for Proposed Measures
  3.5 Experimental Evaluation
    3.5.1 Synthetic Data
    3.5.2 UCI Data
  3.6 Chapter Summary

4 Model-Based Subgroup Discovery
  4.1 Probabilistic Model Class and the Weighted Divergence
  4.2 Weighted Relative Likelihood
  4.3 Evaluation-Driven and Parameter-Driven
  4.4 Experiment with Synthetic Dataset
  4.5 Experiment with the SPHERE Challenge Dataset
  4.6 Experiment with the Casas Dataset
    4.6.1 Data Set and Experiment Settings
    4.6.2 Results
  4.7 Chapter Summary

5 HMM-Based Subgroup Discovery for Sequential Datasets
  5.1 Model-Based Subgroup Discovery with Sequential Data
  5.2 HMM-Based Subgroup Discovery
  5.3 Experiment on Synthetic Datasets
  5.4 Experiments on the SPHERE Challenge Dataset
    5.4.1 Intra-Sequence Discovery
    5.4.2 Inter-Sequence Discovery
  5.5 Chapter Summary

6 Applications for Predictive Machine Learning: Model Reuse with Subgroup Discovery
  6.1 Motivation
  6.2 Quality Measures
  6.3 Description Language and Search Strategy
  6.4 Model Reuse Strategy
    6.4.1 Baseline Method 1
    6.4.2 Baseline Method 2
    6.4.3 MBSD-reuse method
  6.5 Experiments with the MoReBikeS data
    6.5.1 Model-Based Subgroup Discovery with Single Bike Station
    6.5.2 Model Reuse
  6.6 Chapter Summary

7 Conclusions and Future Work
  7.1 Thesis Summary
  7.2 Future Work

Bibliography

List of Figures

List of Tables

List of Acronyms
Chapter 1

Introduction

Being able to handle a large variety of information is one of the major technological achievements of recent years [1]. In the early days of history, information was stored merely as rough figures on stone; its format has kept evolving together with our civilisation. From characters to words, from pictures to videos, a huge variety of forms of information has been invented and produced. With all these forms of information, we can conceptualise the world around us and eventually turn it into knowledge. As a result, each kind of information comes with a formalisation together with ways of interpreting it. People used to spend much of their lives being trained to handle just a few such formalisations. This limitation on processing information remained unchanged for a long time, until now: we have reached the point of dealing with all kinds of information through a single formalisation, digital data.

While early forms of information could be directly generated and interpreted by people, digital data comes from much richer sources and is therefore far harder to understand. For instance, early pieces of characters and words were primarily written by a single person, and mostly described a few things that were happening around the author. In the modern world, however, most characters and words are generated automatically by electronic devices performing certain tasks, such as making bank transfers through a computer, obtaining locations from the Global Positioning System (GPS) on a mobile phone, or communicating via social networks on a tablet. All these new sources feed data into our lives without pause. As people do not generate these data directly, interpreting them becomes a difficult task, even though the data may contain useful information.

With the help of modern computers, we can hand over the task of analysing data to machines. The fields of Machine Learning [2, 3, 4] and Data Mining [5, 6, 7, 8] were formed to study methods and approaches that allow computers to understand data and perform further tasks. One way to interpret machine learning is that it investigates approaches to fitting a model to some data, so that the model can adapt itself to the data under a given task. Data mining, on the other hand, focuses on the task of obtaining
valuable information from the data, which usually also requires fitting models to the given data, as in machine learning. One perspective on the relationship between machine learning and data mining is to consider the differences in the context of tasks. While machine learning aims to learn a model from the data, data mining involves further usage of the obtained models for knowledge discovery. Models therefore become the shared concept between the two areas. As a result, researchers commonly categorise a machine learning technique according to the way its model is trained. Supervised Learning and Unsupervised Learning characterise a model by whether its training requires a given target variable; Classification and Regression define a supervised learning approach by whether the target variable is discrete or continuous. The category of a data mining technique, on the other hand, depends on the type of knowledge obtained, which is linked to the source of the data. It is therefore common to see terms like rule mining, graph mining [9], and sequential pattern mining in the field of data mining, rather than exclusive categories of tasks.

However, even though researchers have invented and proposed many models to understand data better, it is still a common phenomenon that a model cannot fully explain the observed data. That is, there is always a part of the data that does not fit the model very well. One way to deal with these unfitted data is to treat them as noise and disregard them as meaningless. Alternatively, one can look for valuable information behind them. For example, consider a smart home with a set of sensors that monitor the behaviour of members of the household for general health-care purposes. It is a typical situation that the sensors observe some information that differs from the previously fitted model. This piece of atypical data can be caused by a simple schedule change, or result from a potential behaviour change under some health condition. It is therefore non-trivial to come up with approaches that can discover such valuable pieces of data. In this thesis, we investigate and extend one of these approaches, known as Subgroup Discovery. The task of subgroup discovery is to locate interesting subsets of some given data according to a predefined statistical deviation [10, 11, 12, 13]. Subgroup discovery hence directly addresses the task of obtaining knowledge from the unfitted data.
1.1 Subgroup: Modelling and Post-Modelling
As introduced above, subgroup discovery is proposed as an approach to discover interesting subsets within a dataset according to some given properties. In general, existing work formalises these properties via two aspects: being statistically unusual and being sufficiently large. Consider a synthetic toy dataset relating to someone’s dietary habits. It contains two (discretised) features: the time of the day, denoted as X1 ∈ {Morning, Afternoon, Evening}, and the calorie consumption of the diet, denoted as X2 ∈ {Low, Medium, High}. The target variable is Y ∈ {Weekday, Weekend}. Figure 1.1 visualises the data, with two potentially interesting subgroups (shaded areas).

Figure 1.1: An example bivariate data set with two subgroups (shaded areas) defined on the discretised features, both capturing an area of statistical deviation in comparison to the overall population. The subgroup on the left is preferred by a commonly used evaluation measure (WRAcc), while the right subgroup is preferred by one of the measures we propose in this thesis.

The subgroup on the right concentrates on the area of maximum statistical deviation (high calorie intake in the evening is more common during the weekend), while the one on the left covers both medium and high calorie intake in the evening. Concerning the aspect of size, small subgroups are less attractive since they might not be statistically significant. That is, the statistical information obtained from small subgroups might be less accurate compared to that estimated from large subgroups. From the application side, the benefit of larger subgroups can also be considerably higher. Assume the task is to detect subgroups of people who are more likely to buy a product from an online shopping website: a smaller group of people willing to purchase the product is naturally less attractive, given their limited total consumption of the product.

We have partly motivated the aspect of statistical unusualness in the previous section. A model fitted to the overall data always leaves some data points that do not fit very well. It is therefore a valuable task to look for further information in these statistically unusual subgroups. Using the example of shopping records again, the overall model here would be the probability of buying the given product among all people visiting the website; subgroups of people with either a higher or a lower probability of purchasing the product would be interesting to the website.

In this thesis, subgroup discovery is defined as a procedure with two steps: modelling and post-modelling. As the essential task is to find statistically unusual subgroups, a global statistic must first be obtained on the overall dataset. By doing so, we make assumptions about the dataset by assigning a model class to fit it. As traditional databases
are organised as a combination of attributes, early work in the field typically formalised the statistical model as the marginal class distribution of a univariate attribute, known as the target attribute or target variable. After the modelling step, a fitted model is obtained, which contains our model assumption together with the fitted statistical information. The task of subgroup discovery then comes to the post-modelling step, which is to use the trained model to evaluate different candidate subgroups, so that the most interesting ones can be selected. As a result, some measure is required to quantify the statistical deviation between a candidate subgroup and the overall dataset. As indicated above, early work defines this by comparing the class distribution of the target variable; subgroups that are both large and have a deviating class distribution are evaluated as interesting.
1.2 Discovery: With Richer Data and Richer Models
While the concept of class distribution is easy to interpret, it is also limited in the sense that it only carries a single form of statistical information. Let us go back to the previous example of a smart home, where multiple sensors of interest have been deployed. Assuming we apply traditional subgroup discovery to the data, using different time intervals as the candidate descriptions, the direct approach is to treat each sensor as a target variable individually. Consequently, the obtained subgroups only represent information regarding the particular sensor that was targeted. For instance, if the presence sensor in the kitchen was selected, the resulting subgroups might only reflect a specific usage pattern of the kitchen. While such subgroups are quite simple to understand, they are limited, as they ignore the information behind the dependencies among different sensors. As with these sensors, such high-dimensional data can now be found everywhere among recent data-capturing systems, where the requirement of knowledge discovery further becomes data fusion: merging different types of variables into a single form by applying proper statistical models.

Several attempts in this direction have been made in the field of subgroup discovery recently. In [14], the authors proposed Exceptional Model Mining as an extension of subgroup discovery, in which the framework can support multi-dimensional targets and richer statistical model classes. The main concept of exceptional model mining is to select a model class and then fit it to the overall dataset and to candidate subgroups respectively. For instance, in the same paper [14], Leman et al. show that, if the Pearson correlation is selected as the model class, subgroups can be found where the local correlation differs from the global correlation fitted to the overall dataset. The resulting measures allow finding subgroups with two-dimensional target variables. Later, in [15], Duivesteijn et al. proposed a Bayesian network class for Exceptional Model Mining (EMM). The idea is then to find subgroups where the dependency among the target variables differs from the population.
As exceptional model mining is designed to find subgroups according to fixed statistical information under the given model class, it requires defining a different measure for each model class. As seen in [14, 15, 16, 17], most measures in exceptional model mining are defined in a parameter-driven manner. That is, the measure quantifies the quality of a subgroup by comparing the local parameter estimated within the subgroup with the global parameter estimated from the overall dataset (or sometimes the parameter estimated from the complement of the subgroup). While this ensures that the obtained subgroups are interesting exactly according to the selected parameter, the generality of the approach is limited, as such measures cannot be applied across other model classes with different sets of parameters. However, as illustrated previously, modern data systems require highly structured models to capture the information within them, where larger numbers of parameters are common and differ from one model class to another. It is therefore reasonable to ask for a generic approach for finding subgroups across a family of model classes.

In this thesis, we investigate and develop generic approaches to apply subgroup discovery to richer forms of data based on richer forms of models. Instead of defining parameter-driven measures, we propose to design evaluation-driven measures, which can be generalised to any model class that can be evaluated in the same way. We define such approaches with evaluation-driven measures as Model-Based Subgroup Discovery. The main approach we propose uses Proper Scoring Rules, a widely adopted evaluation method for probabilistic models, to construct the quality measures. We show that the constructed measures can provide subgroups with better summaries of the datasets, compared to measures in traditional subgroup discovery. The generality and computational efficiency of the measures are also investigated and demonstrated with different models and applications on datasets from smart homes. Given the generality of the proposed measures, they can easily be adopted to find subgroups on datasets where highly structured models are necessary, such as sequential datasets. We illustrate this claim with Hidden Markov Models [18, 19, 4], one of the most broadly applied model classes for sequential modelling. We further define the idea of semi-evaluation-driven measures and propose measures formed with non-probabilistic regression models, together with applications on model reuse, under the task of predictive machine learning.
1.3 The SPHERE Project
Most of the work in this thesis was performed under the SPHERE (a Sensor Platform for Health-care in a Residential Environment) project. As the name suggests, the SPHERE project aims to deliver a platform of smart sensors that captures and analyses health-related data [20]. The role of machine learning and data mining in the project is therefore to make use of the captured sensor data to provide information towards better health-care.
The sensors involved in the project are categorised into three major modules: the wearable module, the environmental module, and the video module. The wearable module is designed to capture the movements of the members of a household directly; a typical sensor in this module is a tri-axial accelerometer mounted on a waist band. The environmental module, on the other hand, focuses on collecting indirect, activity-related information at certain locations within the house, and includes sensors like passive infrared (PIR) sensors, humidity sensors, and temperature sensors. Like the environmental module, the video module is also designed to capture information at fixed locations. Compared to the environmental module, the video module can provide richer information regarding the activities of the household, especially for emergency events like falling on stairs. However, as the video module relies on the deployment of cameras in the rooms, it is restricted to a few locations, excluding spaces like bathrooms and bedrooms, where the presence of video cameras might raise privacy concerns among the users.

As of 2017, the SPHERE project has published a dataset associated with a discovery challenge held at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) 2016. The dataset was collected from several participants performing daily activities according to a pre-designed script in a test house, and labelled accordingly. The project is also in the process of deploying the sensor platform in a number of houses in Bristol, to prepare datasets for long-term living environments. As illustrated in the previous example of a smart home, the data we are dealing with in SPHERE are high-dimensional and come from a variety of sources with different formats. Moreover, labels of the activities or health conditions of household members can be expensive to obtain in long-term daily life, and it is reasonable to ask for approaches that can detect abnormal behaviours from the sensor data only, instead of relying on such labels. These two requirements form the major motivation for model-based subgroup discovery from the application side.
1.4 Contributions and Achievements
The work covered in this thesis has made a number of contributions that enlarge the framework of subgroup discovery, which can be summarised as follows:

– This thesis formally defines parameter-driven measures and evaluation-driven measures in the field of subgroup discovery, which had not been addressed by existing work.

– This thesis investigates and develops generic, model-based measures upon probabilistic model classes, together with valuable applications on smart sensor data.

– The model-based approach is further applied to sequential subgroup mining, where we use hidden Markov models as the base model class and illustrate the benefits compared to non-sequential quality measures.

– This thesis establishes the link between subgroup discovery and predictive machine learning, where we show that with certain measures the obtained subgroups can be used to construct better predictive models.

In terms of publications, some work in this thesis has led to the following conference papers:

– Hao Song, Peter Flach, and Georgios Kalogridis. Dataset shift detection with model-based subgroup discovery. 2nd International Workshop on Learning over Multiple Contexts (LMCE 2015), 2015. ([21])

– Hao Song and Peter Flach. Model reuse with subgroup discovery. In Proceedings of the ECML PKDD 2015 Discovery Challenge, volume 1526, ECML PKDD DC’15, pages 5–18, Aachen, Germany, 2015. CEUR-WS.org. ([22])

– Hao Song, Meelis Kull, Peter Flach, and Georgios Kalogridis. Subgroup discovery with proper scoring rules. In Paolo Frasconi, Niels Landwehr, Giuseppe Manco, and Jilles Vreeken, editors, Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part II, pages 492–510, Cham, 2016. Springer International Publishing. ([23])

[21] covers the early idea of model-based subgroup discovery, which is introduced later in Chapter 4. In this paper, we proposed to use the mean value of the global model’s log-likelihood to find subgroups with potential dataset shifts. This work provided an initial direction for the development of the evaluation-driven measures in this thesis, as well as for the Relative Likelihood measures studied in both Chapter 4 and Chapter 5. Both the theoretical work and the experiments in this paper were performed by Hao Song, with discussion and supervision by Peter Flach and Georgios Kalogridis. Peter Flach further provided valuable suggestions for the final editing of the paper.

[22] describes the work of applying subgroup discovery to reuse pre-trained models and provides the primary content of Chapter 6. While the original paper mainly describes the approach as a solution to the Discovery Challenge of ECML-PKDD 2015, Chapter 6 introduces this work with more emphasis on the opportunities to apply subgroup discovery to tasks in predictive machine learning. Both the theoretical work and the experiments in this paper were performed by Hao Song, with discussion and supervision provided by Peter Flach.

[23] presents the research on adopting proper scoring rules in subgroup discovery, and gives the theoretical foundation for the later work on probabilistic model classes. The work in this paper is covered in Chapter 3, with additional discussion on the relationship with the information gain used in decision tree learning, as well as some further analysis on ROC isometrics. For this paper, the theoretical work on measures with proper scoring rules was jointly developed by Hao Song and Meelis Kull. The design and implementation of the experiments were performed by Hao Song, with suggestions from Peter Flach. This work was under the general supervision of Peter Flach and Georgios Kalogridis.
1.5 Thesis Structure
The rest of this thesis is organised as follows. Chapter 2 covers the preliminaries and related work. In Chapter 3, we start by proposing generic measures under the setting of traditional subgroup discovery. Derived using proper scoring rules, the proposed measures can provide subgroups that summarise the dataset statistically better than existing approaches. In Chapter 4, we investigate how to generalise the proposed measures to any probabilistic model class. With quality measures calculated according to the model-based likelihood, we discuss both the evaluation-driven option and the semi-evaluation-driven option, together with their interpretations and the scenarios in which local model training is or is not required. We compare the proposed measures with traditional subgroup discovery and exceptional model mining on synthetic data, and further demonstrate them with two applications. Chapter 5 investigates the application of model-based subgroup discovery to sequential datasets. By adopting the Hidden Markov Model as the base model class, we illustrate the generality of model-based subgroup discovery on real-life applications. Chapter 6 presents the work of applying subgroup discovery to model reuse, which provides better performance for predictive machine learning. Chapter 7 concludes this thesis and discusses potential future work.
Chapter 2

Preliminaries and Related Work

In this chapter, we give the preliminaries and notation that will be used later in this thesis, together with a detailed discussion of related work. We start by defining the general notation for a dataset in Section 2.1. The frameworks of subgroup discovery and exceptional model mining are reviewed and discussed in Section 2.2, together with the definitions of parameter-driven, evaluation-driven, and semi-evaluation-driven quality measures. Section 2.3 introduces proper scoring rules, one of the most widely adopted families of functions for the evaluation of probabilistic models. Finally, we give a general introduction to the Hidden Markov Model (HMM), a major sequential probabilistic model class in the field of structured modelling.
2.1 Dataset and Sequential Dataset
In this section, we introduce the basic notation for a dataset and a sequential dataset. We assume both the features (attributes) and target variables (labels) of an instance can be represented as vectors, denoted as $x \in \mathbb{X}$ and $y \in \mathbb{Y}$, with $\mathbb{X}$ and $\mathbb{Y}$ representing the feature space and target space respectively. We use the notation $x^{(m)}$ to represent the $m$-th dimension of the feature vector ($m \in \{1, ..., M\}$), so that $x = [x^{(1)}, ..., x^{(M)}]$. Similarly, $y^{(k)}$ is used to represent the $k$-th dimension of the target vector, with $k \in \{1, ..., K\}$ and $y = [y^{(1)}, ..., y^{(K)}]$. A dataset with $N$ instances is denoted as a multiset $I_U = \{(x_i, y_i)\}_{i=1}^{N}$.

We assume that a sequential dataset with $N$ sequences can be represented via a multiset $I_U = \{(X_i, Y_i)\}_{i=1}^{N}$. Each $X_i$ and $Y_i$ is defined as a sequence of consecutive instances (features and labels) with a shared length $T$:

$$X_i = [x_1^{[i]}, x_2^{[i]}, ..., x_T^{[i]}], \qquad Y_i = [y_1^{[i]}, y_2^{[i]}, ..., y_T^{[i]}] \qquad (2.1)$$

Again, we represent both $x_t^{[i]}$ and $y_t^{[i]}$ as vectors belonging respectively to $\mathbb{X}$ and $\mathbb{Y}$. We use the notation $x_t^{[i](m)}$ ($y_t^{[i](k)}$) to refer to the $m$-th feature ($k$-th target) value of the $t$-th instance within the $i$-th sequence. As a result, the corresponding spaces of $X_i$ and $Y_i$ can be denoted as $\mathbb{X}^T$ and $\mathbb{Y}^T$ respectively. While it is a strong assumption that all sequences are of the same length, most sequential datasets can be preprocessed into this form. For instance, a set of documents might initially have different numbers of words, and can be processed into chunks with the same number of words simply by using a sliding window. The primary benefit of this assumption is that all sequences can be assumed to be drawn from the same feature and target space, which is required by most models in the fields of machine learning and data mining.
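To make the equal-length assumption concrete, the following is a minimal sketch (ours, not taken from the thesis) of slicing variable-length sequences into fixed-length chunks with a sliding window; the chunk length T and the step size are illustrative choices.

```python
# Minimal sketch: cut variable-length sequences into chunks of a shared
# length T using a sliding window. T and `step` are illustrative values,
# not parameters prescribed by the thesis.
def sliding_window(sequence, T, step=1):
    """Return all length-T chunks of `sequence`, advancing by `step`."""
    return [sequence[start:start + T]
            for start in range(0, len(sequence) - T + 1, step)]

documents = [list("abcdefg"), list("hijk")]   # sequences of different lengths
chunks = [c for doc in documents for c in sliding_window(doc, T=3)]
# Every chunk now lies in the same space X^T, as required by most models.
```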
2.2 Subgroup Discovery
Subgroup Discovery is defined as the task of discovering statistically unusual subsets of a given dataset that are of sufficiently large size [24]. Following the survey paper by Herrera et al. [13], we also divide the framework of subgroup discovery into three modules: description language, quality measure and search strategy. We omit the discussion of the types of the target variable, as in this thesis we integrate the type of target variable into the statistical model class and the corresponding quality measure applied in the subgroup discovery process. Mathematically, the task of subgroup discovery is commonly defined as top-$Q$ subgroup discovery [25]:

Definition 1. Given a dataset $I_U$ with features $x \in \mathbb{X}$ and target variable $y \in \mathbb{Y}$, a description language $L$ and a quality measure $\phi$, the task of subgroup discovery is to generate a list of $Q$ subgroup descriptions $\{d_1, ..., d_Q\}$, such that $\phi(d_j) \geq \phi(d_k)$ for all $j, k \in \{1, ..., Q\}$ with $j \leq k$.

A subgroup discovery task can therefore be uniquely defined by the following four components: a dataset, a description language, a quality measure, and a search strategy. We cover each of them in detail in the rest of this section.
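As an illustration of Definition 1, the following is a minimal sketch (ours, with hypothetical function names) that scores every description in a given description space and returns the top-$Q$ list; the exhaustive loop stands in for whichever search strategy is actually used.

```python
import heapq

def top_q_subgroup_discovery(descriptions, quality, Q):
    """Return the Q descriptions with the highest quality phi(d),
    ordered so that phi(d_j) >= phi(d_k) whenever j <= k."""
    scored = ((quality(d), d) for d in descriptions)
    return [d for _, d in heapq.nlargest(Q, scored, key=lambda pair: pair[0])]

# Toy usage: descriptions are plain strings and quality is their length.
print(top_q_subgroup_discovery(['a', 'abc', 'ab'], quality=len, Q=2))
# -> ['abc', 'ab']
```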
2.2.1 Description Language
Before entering the details of subgroup descriptions and the description language, we first give the notion of the characteristic function (indicator function) of a subgroup description, which maps the feature space into a binary space:

Definition 2. Given a subgroup description $d$, its characteristic function is denoted as $I_d : \mathbb{X} \to \{0, 1\}$. $I_d(x_i) = 1$ implies that the $i$-th instance is covered by this description, and $I_d(x_i) = 0$ means the $i$-th instance is not covered by the corresponding description. The covered instances of description $d$ under dataset $I_U$ are denoted as $I_d = \{(x_i, y_i) : I_d(x_i) = 1 \wedge (x_i, y_i) \in I_U\}$. The uncovered instances are denoted as $I_{\neg d} = \{(x_i, y_i) : I_d(x_i) = 0 \wedge (x_i, y_i) \in I_U\}$.

A language is then defined as a description space that contains all possible descriptions, denoted as $\mathcal{D}$. Following this notation, we define a language scheme as a function $L$, which maps a given feature space to a corresponding language (description space):

Definition 3. A language scheme is defined as a function $L$, so that $\mathcal{D} = L(\mathbb{X})$, where $\mathbb{X}$ is the feature space of a given dataset, and $\mathcal{D}$ is the corresponding language (description space) for the subgroup discovery task.

The interpretation is that, given a feature space, the language scheme specifies the way of constructing all possible descriptions. For instance, if we take the language scheme to be the conjunctive normal form as in standard logical expressions, the description space is generated as all possible conjunctions of feature-value pairs under the given feature space. The language, or description space, is essentially the hypothesis space in the field of generalised concept learning, which includes areas like rule-based learning and tree-based learning, as shown by Flach in [3]. In general, as the characteristic function of a given description behaves in the same way as a binary classifier, a language scheme can be adopted from any existing binary model class. In most existing literature, the language scheme is set to logical rules as in the example above, since such rules are easier for most users to interpret than other non-parametric binary models, like Support Vector Machines (SVMs) [26], Gaussian Processes [27] and other generalised kernel-based approaches [28], as well as Deep Neural Networks [29, 30, 31, 32]. This advantage in interpretability arises because logical rules are combinations of feature values, which gives the user readable information about what kinds of instances are covered by the subgroup (the extension of the description). For instance, assuming we have some sales records from a shopping website, price ≥ 500 GBP can be directly interpreted as the description covering all products that are no cheaper than 500 GBP. On the other hand, for the same extension, an SVM might provide a description like $w \cdot \text{price} + t \geq 0$ for some parameters $w$ and $t$. In general, such a description is more difficult for typical users to read (although in this case $w = 1$ would do the trick, in practice these parameters can take any value). While in this example we have only one feature, the difference becomes more distinct once there are multiple features, where models like SVMs will have parameters on different scales, while a logical expression remains on the scale of the original features. On the other hand, these non-logical models have their own strengths: for instance, deep neural networks are good at learning hierarchical representations over the feature space, and Gaussian processes can output predictions with associated uncertainties. However, as the primary purpose of subgroup discovery is to obtain interpretable knowledge from data, we use logical rules as the language scheme in this thesis.
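As a concrete illustration (a sketch of ours, not the thesis's implementation), the characteristic function $I_d$ of a conjunctive description over feature-value pairs can be written as follows; the shopping example with the price condition mirrors the one above.

```python
import operator

# A conjunctive description: a list of (feature index, operator, value)
# conditions. Its characteristic function I_d maps an instance x to {0, 1}.
OPS = {'==': operator.eq, '!=': operator.ne,
       '>': operator.gt, '>=': operator.ge,
       '<': operator.lt, '<=': operator.le}

def make_characteristic(description):
    """Return I_d: x -> 1 if x satisfies every condition, else 0."""
    def I_d(x):
        return int(all(OPS[op](x[m], v) for (m, op, v) in description))
    return I_d

# Example description: "price >= 500 AND category == 'laptop'"
I_d = make_characteristic([(0, '>=', 500), (1, '==', 'laptop')])
print(I_d([650, 'laptop']))  # 1: instance covered by d
print(I_d([120, 'phone']))   # 0: instance not covered
```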
We now discuss how to formalise the description language using logical rules as the language scheme, which requires the following to be specified:

– the space of each feature (attribute)
– the operation between the feature and its values
– the logical connectives among feature-value pairs

Given the variety of choices available for each of them, here we only discuss the standard options, as well as the ones adopted in this thesis. Specifying the space of each feature (attribute) is non-trivial, as it is closely related to the interpretation of the corresponding feature, as well as to the available operations on the feature and the complexity of the description space. So far, we have used the notation $\mathbb{X}$ to represent the (assumed known) overall feature space. In practice, however, the feature space is constructed by specifying the space of each individual feature, so that $\mathbb{X} = \mathbb{X}^{(1)} \times \mathbb{X}^{(2)} \times ... \times \mathbb{X}^{(M)}$, where $\mathbb{X}^{(m)}$ stands for the space of the $m$-th feature. In general, the construction falls into two scenarios, depending on whether the corresponding feature is discrete (binary, categorical, ordinal) or continuous. Specifying the space of a discrete feature is straightforward: the space $\mathbb{X}^{(m)} = \{v_1, ..., v_L\}$ simply contains all possible values of $x^{(m)}$, with $L$ being the number of possible values of the $m$-th feature. On the other hand, obtaining the space of a continuous feature might require further preprocessing, so that the resulting space has a finite and tractable number of possible values. The typical preprocessing approach is to discretise the continuous values into a number of distinct edges once the number of possible values of a continuous feature exceeds a given threshold. There are two major options for the discretisation: one is to discretise the feature into edges with equal-width bins, the other into edges with equal-frequency bins. The choice between the two depends on whether the feature tends to be uniformly distributed: equal-width bins should be adopted for more uniformly distributed features, and equal-frequency bins otherwise. In this thesis, we use equal-width bins by default (a short sketch of both options appears at the end of this subsection).

The operations on the attribute values are also linked to the type of the corresponding attribute. In general, the options available for binary and categorical features are $=$ and $\neq$, while $>$, $\geq$, $<$ and $\leq$ are additionally available for ordinal and numeric features.

A valid quality measure $\Phi$ should satisfy: for all $d_1, d_2$ with $\upsilon_{d_1} = \upsilon_{d_2} \wedge \omega_{d_1} > \omega_{d_2}$, we have $\Phi(d_1) \geq \Phi(d_2)$; and for all $d_1, d_2$ with $\upsilon_{d_1} > \upsilon_{d_2} \wedge \omega_{d_1} = \omega_{d_2}$, we have $\Phi(d_1) \geq \Phi(d_2)$. That is, a valid quality measure should meet the following requirement: when two subgroups have the same size, the more deviated one is never quantified as less interesting; and when two subgroups have the same statistical deviation, the larger one is never quantified as less interesting.
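The sketch referenced above: a hedged example (ours) of the two discretisation options using numpy; the skewed sample feature and the number of bins are purely illustrative.

```python
import numpy as np

def equal_width_edges(values, n_bins):
    """Bin edges of equal width, suited to roughly uniform features."""
    return np.linspace(np.min(values), np.max(values), n_bins + 1)

def equal_frequency_edges(values, n_bins):
    """Bin edges holding (roughly) equal numbers of instances."""
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

x = np.random.exponential(size=1000)   # a skewed, non-uniform feature
print(equal_width_edges(x, 5))         # wide, sparsely populated tail bins
print(equal_frequency_edges(x, 5))     # narrow bins where the data is dense
```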
2.2.4 Quality Measure: Traditional Subgroup Discovery
In this section, we review a few widely adopted measures from traditional subgroup discovery. As previously discussed, in early approaches in the field of subgroup discovery, the statistical model class is generally formalised as the class distribution of a discrete univariate target variable. This setting is partly due to the relationship between subgroup discovery and predictive rule induction. Since we sometimes only deal with one class of a discrete target variable, to keep the notation clear we represent a discrete target as a bit vector in this thesis. That is, if the target is binary (e.g. $y_{\text{raw}} \in \{v_1, v_2\}$), we denote it as $y = [y^{(1)}, y^{(2)}]$; $y = [1, 0]$ hence implies $y_{\text{raw}} = v_1$ and $y = [0, 1]$ implies $y_{\text{raw}} = v_2$. Similarly, for a multi-class target $y_{\text{raw}} \in \{v_1, ..., v_K\}$ we represent it with a $K$-bit vector $y = [y^{(1)}, ..., y^{(K)}]$, with $y^{(k)} = 1$ implying $y_{\text{raw}} = v_k$.

We can now represent the class distribution of the overall dataset as a vector $\pi = [\pi^{(1)}, ..., \pi^{(K)}]$, with $\pi^{(k)}$ estimated as the mean value of $y^{(k)}$:

$$\pi^{(k)} = \Theta_{\text{mean}}(I_U) = \frac{\sum_{i=1}^{N} y_i^{(k)}}{N} \qquad (2.7)$$

The class distribution under a description $d$ is denoted by $\rho_d = [\rho_d^{(1)}, ..., \rho_d^{(K)}]$, with $\rho_d^{(k)}$ estimated as the mean value of $y^{(k)}$ within the subgroup:

$$\rho_d^{(k)} = \Theta_{\text{mean}}(I_d) = \frac{\sum_{i=1}^{N} I_d(x_i) \cdot y_i^{(k)}}{\sum_{i=1}^{N} I_d(x_i)} \qquad (2.8)$$

One of the common statistical deviation measures based on the class distribution is the Relative Accuracy (RAcc) [40], defined as:

$$\Omega_{\text{RAcc}}^{(k)}(I_d, I_U) = \rho_d^{(k)} - \pi^{(k)} \qquad (2.9)$$

It can be seen that RAcc is a class-wise deviation measure, as it is calculated only on the $k$-th bit of the target, which means the result differs from class to class. In particular, in the binary setting it yields $\Omega_{\text{RAcc}}^{(1)}(I_d, I_U) + \Omega_{\text{RAcc}}^{(2)}(I_d, I_U) = 0$, given that both $\pi$ and $\rho_d$ are probability vectors and sum to 1. The natural interpretation of RAcc is hence the difference in the proportion of a particular class between the candidate subgroup and the overall dataset. Multiplying RAcc by the relative size measure gives the quality measure Class-wise Weighted Relative Accuracy (WRAcc) [12], defined as:

$$\begin{aligned}
\Phi_{\text{WRAcc}}^{(k)}(d) &= \Upsilon_{\text{rel}}(d) \cdot \Omega_{\text{RAcc}}^{(k)}(I_d, I_U) \\
&= \upsilon_d^{\text{rel}} \cdot (\rho_d^{(k)} - \pi^{(k)}) \\
&= \frac{\sum_{i=1}^{N} I_d(x_i)}{N} \cdot \left( \frac{\sum_{i=1}^{N} I_d(x_i) \cdot y_i^{(k)}}{\sum_{i=1}^{N} I_d(x_i)} - \frac{\sum_{i=1}^{N} y_i^{(k)}}{N} \right) \\
&= \frac{\sum_{i=1}^{N} I_d(x_i) \cdot y_i^{(k)}}{N} - \frac{\sum_{i=1}^{N} I_d(x_i)}{N} \cdot \frac{\sum_{i=1}^{N} y_i^{(k)}}{N}
\end{aligned} \qquad (2.10)$$

One natural interpretation of WRAcc is that, if we switch the class distribution within the subgroup from the overall one to the local one, the number of instances gained in the target class is given by $\sum_{i=1}^{N} I_d(x_i) \cdot (\rho_d^{(k)} - \pi^{(k)})$; WRAcc is then the same quantity normalised by $N$. On the other hand, if we treat $I_d(x_i)$ as a binary random variable, the term $\frac{1}{N}\sum_{i=1}^{N} I_d(x_i)$ estimates $E[I_d(x_i)]$. Similarly, if we treat $y^{(k)}$ as a random variable, the term $\frac{1}{N}\sum_{i=1}^{N} I_d(x_i) \cdot y_i^{(k)}$ estimates $E[I_d(x_i) \cdot y^{(k)}]$, and the term $\frac{\sum_{i=1}^{N} I_d(x_i)}{N} \cdot \frac{\sum_{i=1}^{N} y_i^{(k)}}{N}$ estimates $E[I_d(x_i)] \cdot E[y^{(k)}]$. This gives the alternative interpretation of WRAcc as the covariance between two Bernoulli variables:

$$\Phi_{\text{WRAcc}}^{(k)}(d) = E[I_d(x_i) \cdot y^{(k)}] - E[I_d(x_i)] \cdot E[y^{(k)}] = \mathrm{cov}[I_d(x_i), y^{(k)}] \qquad (2.11)$$

Since the training algorithm $\Theta_{\text{mean}}$ can be generalised to a univariate continuous target, WRAcc can also be generalised to measure subgroups with continuous target variables; the resulting measure is known as Continuous Weighted Relative Accuracy (CWRAcc) [41]. An alternative is to discretise the continuous target and still apply measures designed for discrete targets [42].

Building upon RAcc, WRAcc is also computed with respect to the interestingness of the $k$-th class of the target. It is therefore reasonable to extend the measure so that it can evaluate the interestingness on all $K$ classes simultaneously. Multi-class Weighted Relative Accuracy (MWRAcc) and Weighted Multi-class Weighted Relative Accuracy (WMWRAcc) [43] are proposed as measures that evaluate subgroups simultaneously on all classes:

$$\Phi_{\text{MWRAcc}}(d) = \frac{1}{K} \cdot \Upsilon_{\text{rel}}(d) \cdot \sum_{k=1}^{K} |\Omega_{\text{RAcc}}^{(k)}(I_d, I_U)| = \frac{1}{K} \cdot \upsilon_d^{\text{rel}} \cdot \sum_{k=1}^{K} |\rho_d^{(k)} - \pi^{(k)}| \qquad (2.12)$$

$$\Phi_{\text{WMWRAcc}}(d) = \sum_{k=1}^{K} \Upsilon_{\text{multi}}^{(k)}(d) \cdot |\Omega_{\text{RAcc}}^{(k)}(I_d, I_U)| = \upsilon_d^{\text{rel}} \cdot \sum_{k=1}^{K} \rho_d^{(k)} \cdot |\rho_d^{(k)} - \pi^{(k)}| \qquad (2.13)$$

Clearly, MWRAcc is the product of the relative size and the sum of the absolute values of the class-wise RAcc, normalised by $K$. WMWRAcc weights the absolute value of the RAcc with the class distribution before summing, and then weights the sum with the relative size. The absolute value is applied here due to the fact that $\sum_{k=1}^{K} \Omega_{\text{RAcc}}^{(k)}(I_d, I_U) = 0$. While this makes MWRAcc lose the covariance interpretation, experiments have shown that the resulting subgroups can provide better features for decision tree learning [43].
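To make Equation (2.10) concrete, the following small sketch (ours) computes class-wise WRAcc directly from the covariance form (2.11); the bit vectors are toy data.

```python
import numpy as np

def wracc(covered, y_k):
    """Class-wise WRAcc, i.e. cov[I_d(x), y^(k)] over the dataset.
    `covered` holds I_d(x_i) in {0,1}; `y_k` holds the k-th target bit."""
    N = len(covered)
    return (covered * y_k).sum() / N - covered.sum() / N * y_k.sum() / N

covered = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # subgroup of 3 of 8 instances
y_k     = np.array([1, 1, 0, 0, 1, 0, 0, 0])  # class-k membership bits
print(wracc(covered, y_k))  # 2/8 - (3/8)*(3/8) = 0.109375
```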
2.2.5 Quality Measure: Exceptional Model Mining
As mentioned above, the statistical model classes used in traditional subgroup discovery are limited to computing the mean value of a univariate target. Exceptional Model Mining [14, 16] was hence proposed as an approach to extend subgroup discovery to support modelling with richer model classes and multivariate targets. For instance, if the statistical model class is set to the Pearson correlation, the estimated correlation coefficient for a set of instances is given as:

$$\Theta_{\text{Corr}}(I) = \frac{\sum_{y_i \in I} (y_i^{(1)} - \bar{y}_I^{(1)}) \cdot (y_i^{(2)} - \bar{y}_I^{(2)})}{\sqrt{\sum_{y_i \in I} (y_i^{(1)} - \bar{y}_I^{(1)})^2} \cdot \sqrt{\sum_{y_i \in I} (y_i^{(2)} - \bar{y}_I^{(2)})^2}} \qquad (2.14)$$

Here $\bar{y}_I^{(1)}$ and $\bar{y}_I^{(2)}$ represent the sample means of $y^{(1)}$ and $y^{(2)}$ within $I$ respectively. One of the statistical deviation measures proposed in [16] for the correlation model class is hence:

$$\Omega_{\text{corr}}(I_d, I_{\neg d}) = |\rho_d - \rho_{\neg d}| \qquad (2.15)$$

Here $\rho_{\neg d} = \Theta_{\text{Corr}}(I_{\neg d})$ is the correlation estimated from the complement of the description $d$. The corresponding quality measure, Entropy Weighted Absolute Correlation Difference (EMM-Corr), is given as:

$$\Phi_{\text{EMM-Corr}}(d) = \Upsilon_{\text{ent}}(d) \cdot \Omega_{\text{corr}}(I_d, I_{\neg d}) \qquad (2.16)$$

There are two points to be discussed with respect to $\Phi_{\text{EMM-Corr}}$. The first is that the statistical deviation is calculated between the candidate description and its complement, instead of between the candidate description and the overall dataset. In the original thesis of Duivesteijn [25], the author suggests that this choice should be made under the assumptions of the corresponding statistical model, considering other aspects such as computational complexity. Later, in Chapter 3, we discuss the same question in the context of information gain; we demonstrate that comparing to the population is preferable in the setting of Subgroup Discovery (SD), while comparing to the complement tends to be more suitable in the setting of predictive tasks like classification. The second point is the choice of the size factor. Here it is formalised as the Shannon entropy of the relative size and its complement. As the author suggests, this is mainly proposed to counter the over-fitting issue of the statistical deviation factor, and therefore does not come with a natural interpretation. The choice of size factor is further discussed in the rest of this thesis.

One of the recent developments in exceptional model mining considers the problem of finding deviating transition patterns in sequential datasets [17]. Using our notation above, a single target value in their work can be represented as $Y_i = [y_1^{[i]}, y_2^{[i]}]$, which represents a single transition from $y_1^{[i]}$ to $y_2^{[i]}$. Assuming we still use one-hot vectors to represent the $K$ possible states of both $y_1^{[i]}$ and $y_2^{[i]}$ (e.g. $y_1^{[i](a)} = 1$ implies $y_1^{[i]} = v_a$, $a \in \{1, ..., K\}$), the statistical model class is set to be the transition probability from state $a$ to state $b$:

$$\Theta_{\text{tran}}^{(a,b)}(I) = \frac{\sum_{Y_i \in I} y_1^{[i](a)} \cdot y_2^{[i](b)}}{\sum_{Y_i \in I} y_1^{[i](a)}} \qquad (2.17)$$

The statistical deviation measure Total Variation distance (TV) is then given as:

$$\Omega_{\text{tv}}^{(a)}(I_d, I_U) = \sum_{b=1}^{K} \left| \Theta_{\text{tran}}^{(a,b)}(I_d) - \Theta_{\text{tran}}^{(a,b)}(I_U) \right| \qquad (2.18)$$

The quality measure Weighted Total Variation distance (EMM-TV) is defined as:

$$\Phi_{\text{EMM-TV}}(d) = \sum_{a=1}^{K} \Upsilon_{\text{multi}}^{(a)}(d) \cdot \Omega_{\text{tv}}^{(a)}(I_d, I_U) \qquad (2.19)$$

Note that here $\Upsilon_{\text{multi}}^{(a)}(d) = \sum_{Y_i \in I_d} y_1^{[i](a)}$ is a special setting of the multi-class size measure defined previously. As the authors suggest, the measure EMM-TV is an extension of MWRAcc to the sequential setting.
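As an illustration, here is a sketch (ours) of the EMM-Corr quality from Equations (2.15) and (2.16), assuming the size factor is the Shannon entropy of the subgroup's relative size and its complement, as described above; the data are synthetic.

```python
import numpy as np

def emm_corr_quality(covered, y1, y2):
    """EMM-Corr: H(n/N) * |corr in subgroup - corr in complement|,
    with H the binary Shannon entropy of the relative subgroup size."""
    inside = covered.astype(bool)
    rho_d    = np.corrcoef(y1[inside],  y2[inside])[0, 1]
    rho_nond = np.corrcoef(y1[~inside], y2[~inside])[0, 1]
    p = inside.mean()                     # relative size of the subgroup
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return entropy * abs(rho_d - rho_nond)

rng = np.random.default_rng(0)
y1 = rng.normal(size=200)
y2 = y1 + rng.normal(size=200)            # two correlated target variables
covered = np.zeros(200, dtype=int)
covered[:50] = 1                          # a candidate subgroup
print(emm_corr_quality(covered, y1, y2))
```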
2.2.6 Related Work
Historically, subgroup discovery is a supervised learning approach to descriptive modelling, which extends rule-based models [44, 45] and is closely related to association rule mining, proposed by Agrawal et al. in 1993 [46, 47]. In 1995, the Inductive Query Language was introduced by Siebes [24], which aims to look for interesting subsets of a dataset. In 1996, the term subgroup discovery was first proposed by Klösgen [10], with the algorithm Explora, which can find subgroups based on a nominal target variable. The work was extended to support multi-relational databases by Wrobel in 1997 [11], and the resulting algorithm MiDOS was further investigated in 2001 to draw a link to inductive logic programming [48, 38]. For a general survey of the field, please refer to [13] by Herrera et al. and [49] by Atzmuller.

As mentioned previously, SD comprises three main elements: description language, quality measure and search algorithm. Some of these elements are shared with general rule induction approaches, which include areas like Emerging Pattern Mining [50], Contrast Set Mining [51] and Redescription Mining [52]. A detailed introduction and comparison with similar tasks can be found in [53]. Regarding the subgroup description language, most existing work defines it through logical operations on attribute values. Fuzzy rule induction has also been introduced into the field in [54]. In [55] the authors present an approach for constructing more informative descriptions on numeric and nominal attributes in linear time; the proposed algorithm can find the optimal interval for numeric attributes and the optimal set of values for nominal attributes, and the results show improvements in the quality of the obtained subgroups compared to traditional descriptions. Concerning the search algorithm, both exhaustive [56] and greedy [36] search algorithms have been widely adopted in existing implementations. In [57] the authors introduce a pruning method for descriptions on numerical attributes. Recent work in [34] presents a fast exhaustive search strategy for numerical target concepts: the authors propose and illustrate novel bounds on different types of quality measures, so that exhaustive search can be performed efficiently via additional pruning techniques.

Regarding quality measures, recent work has focused on the extension of traditional measures with improved statistical modelling. In [16, 14] EMM was introduced as a framework to support improved target concepts with different model classes. For example, if linear regression models are trained on the whole data set and on different candidate subgroups, the quality of subgroups can be evaluated by comparing the regression coefficients between the global model and the local subgroup model. The structure of a Bayesian network can also be used to find unusual dependencies among multiple targets [15]. In [39] the authors extend the framework to support predictive statistical information; this extension further allows subgroups to be found where a scoring classifier’s performance deviates from its overall performance. In [17], the EMM framework is also used to find exceptional transition behaviour in sequential data.

SD has been widely adopted for applications in fields like business and medicine. In [58, 59], the authors demonstrate how the framework can be used to detect groups of at-risk patients. [60] presents some work on sonography. [61] gives some case studies in marketing, including a direct mailing campaign and an advertising campaign. Beyond specific implementations of SD algorithms under particular measures, there are also a few general-purpose tools in the field, like Cortana [62] and Vikamine [63]. The general Python data mining library Orange [64] also contains a toolkit for SD.
2.3 Proper Scoring Rules
As in this thesis we investigate generic, evaluation-driven quality measures for the task of subgroup discovery, in this section we introduce Proper Scoring Rules (PSRs), together with a few common concepts in the field of model evaluation. PSRs refers to a family of functions that are widely adopted in machine learning and statistics to assess the goodness of probability estimates on discrete targets [65, 66, 67, 68, 69]. We first give the definition of a (discrete) probability estimate:

Definition 16. A probability estimate for a target variable with $K$ classes is denoted as $s = [s^{(1)}, ..., s^{(K)}]$, where $s^{(k)} \in [0, 1]$ and $\sum_{k=1}^{K} s^{(k)} = 1$.

A scoring rule is then defined as:

Definition 17. A scoring rule is a function $\psi_i^{(s)} = \Psi(s, y_i)$, where $\psi_i^{(s)}$ is a non-negative value measuring the loss of a probability estimate $s$ on an instance $y_i$.

Two of the most commonly adopted scoring rules are the Log Loss (LL) and the Brier Score (BS), which are defined as:

$$\Psi_{LL}(s, y_i) = -\log(s^{(k')}) \qquad (2.20)$$

$$\Psi_{BS}(s, y_i) = \sum_{k=1}^{K} (s^{(k)} - y_i^{(k)})^2 \qquad (2.21)$$

Here $k'$ represents the class of $y_i$, so that $y_i^{(k')} = 1$. As indicated by the definitions above, PSRs measure the loss of assigning a probability estimate to each instance. For every scoring rule $\Psi$ there is a corresponding divergence measure $\Delta$ which evaluates how much one probability estimate diverges from another. Formally, a divergence measure can be constructed as follows. Assume we have two probability estimates $p$ and $q$, and a random variable $y$ that is drawn from $q$; the expected losses of assigning $p$ and $q$ to $y$ can then be given as $E_{y \sim q}[\Psi(p, y)]$ and $E_{y \sim q}[\Psi(q, y)]$ respectively.

Definition 18. A divergence is a function $\Delta$ that evaluates the difference between two probability estimates $p$ and $q$ by comparing their expected losses under a scoring rule $\Psi$, given as $\Delta(p, q) = E_{y \sim q}[\Psi(p, y)] - E_{y \sim q}[\Psi(q, y)]$.

A divergence $\Delta$ can be used to determine whether a scoring rule $\Psi$ is proper, that is:

Definition 19. A scoring rule $\Psi$ is proper if $\Delta(p, q) = 0$ implies $p = q$.

The divergences corresponding to BS and LL are the squared error and the Kullback-Leibler (KL) divergence respectively. The derivations are given in the following lemmas.

Lemma 1. The divergence of Log Loss is equal to the Kullback-Leibler (KL) divergence:

$$\Delta_{LL}(p, q) = -\sum_{k=1}^{K} q^{(k)} \cdot \log\frac{p^{(k)}}{q^{(k)}}$$

Proof.
$$\begin{aligned}
\Delta_{LL}(p, q) &= E_{y \sim q}[\Psi_{LL}(p, y)] - E_{y \sim q}[\Psi_{LL}(q, y)] \\
&= -\sum_{k=1}^{K} q^{(k)} \cdot \log(p^{(k)}) + \sum_{k=1}^{K} q^{(k)} \cdot \log(q^{(k)}) \\
&= -\sum_{k=1}^{K} q^{(k)} \cdot \log\frac{p^{(k)}}{q^{(k)}}
\end{aligned}$$

Lemma 2. The divergence of the Brier Score is equal to the squared error:

$$\Delta_{BS}(p, q) = \sum_{k=1}^{K} (p^{(k)} - q^{(k)})^2$$

Proof.
$$\begin{aligned}
\Delta_{BS}(p, q) &= E_{y \sim q}[\Psi_{BS}(p, y)] - E_{y \sim q}[\Psi_{BS}(q, y)] \\
&= \sum_{k=1}^{K} q^{(k)} \cdot \Big[(1 - p^{(k)})^2 + \sum_{j \neq k} (p^{(j)})^2\Big] - \sum_{k=1}^{K} q^{(k)} \cdot \Big[(1 - q^{(k)})^2 + \sum_{j \neq k} (q^{(j)})^2\Big] \\
&= \sum_{k=1}^{K} \Big[q^{(k)} - 2 \cdot q^{(k)} \cdot p^{(k)} + q^{(k)} \cdot \sum_{j=1}^{K} (p^{(j)})^2\Big] - \sum_{k=1}^{K} \Big[q^{(k)} - 2 \cdot (q^{(k)})^2 + q^{(k)} \cdot \sum_{j=1}^{K} (q^{(j)})^2\Big] \\
&= \Big[1 - 2 \cdot \sum_{k=1}^{K} q^{(k)} \cdot p^{(k)} + \sum_{j=1}^{K} (p^{(j)})^2\Big] - \Big[1 - 2 \cdot \sum_{k=1}^{K} (q^{(k)})^2 + \sum_{j=1}^{K} (q^{(j)})^2\Big] \\
&= \sum_{k=1}^{K} \Big[(p^{(k)})^2 - 2 \cdot p^{(k)} \cdot q^{(k)} + (q^{(k)})^2\Big] \\
&= \sum_{k=1}^{K} (p^{(k)} - q^{(k)})^2
\end{aligned}$$

The term $E_{y \sim q}[\Psi(q, y)]$ is also known as the generalised entropy. If the corresponding scoring rule is proper and the true probability $q$ of $y$ is known, the generalised entropy is the minimal loss under the given scoring rule. The reason it is referred to as generalised entropy is that, when LL is used as the scoring rule, $E_{y \sim q}[\Psi(q, y)]$ equals the well-known Shannon entropy [70, 69]:

$$E_{y \sim q}[\Psi_{LL}(q, y)] = -\sum_{k=1}^{K} q^{(k)} \cdot \log(q^{(k)}) \qquad (2.22)$$

And when BS is used as the scoring rule, it gives the Gini index [70, 69]:

$$E_{y \sim q}[\Psi_{BS}(q, y)] = 1 - \sum_{k=1}^{K} (q^{(k)})^2 = \sum_{k=1}^{K} q^{(k)} \cdot (1 - q^{(k)}) \qquad (2.23)$$

The fundamental property of a scoring rule essentially comes from its convexity. For such convex functions, the properness of their divergence can be proved through the well-known Jensen's inequality [70]. However, there are also other interpretations for these entropy and divergence measures, like the Tsallis entropy [71] and the f-divergence [70]. For instance, if we now use $p(y)$ and $q(y)$ to represent two continuous probability distributions, the f-divergence can be defined as:

$$\Delta_f[p(y), q(y)] = \int_{\mathbb{Y}} f\!\left(\frac{p'(y)}{q'(y)}\right) q'(y) \, dy \qquad (2.24)$$

Here $p'(y)$ and $q'(y)$ represent the derivatives of $p(y)$ and $q(y)$ respectively. Denoting $\frac{p'(y)}{q'(y)}$ as $r$, other than the KL divergence, common choices of $f$ are the Hellinger distance ($f(r) = (\sqrt{r} - 1)^2$) and the Pearson $\chi^2$ ($f(r) = (r - 1)^2$). While proper scoring rules are commonly used to evaluate models with discrete random variables, and the f-divergence is mainly applied to continuous variables, the KL divergence in both forms is widely adopted in other areas like Variational Inference in Bayesian learning [72, 2, 73, 4].
2.4
Hidden Markov Model
HMM [18, 19, 4] is one of the most widely adopted sequential model class, particularly within areas like natural language processing and computation finance. Given a sequence of T instances (Y, Z) = (y1:T , z1:T ), with Y represents the observations and Z represents the hidden states, a HMM decomposes its joint probability via a few steps. The first step is to simply adopt the chain rule to reformat the joint probability of the whole sequence into a product of conditional probabilities on each location within the sequence: P(y1:T , z1:T ) =P(z1 ) · P(y1 | z1 ) · P(z2 | y1 , z1 ) · P(y2 | y1 , z1:2 ) · ... · P(zT | y1:T −1 , z1:T −1 ) · P(yT | y1:T −1 , z1:T ) =P(z1 ) · P(y1 | z1 ) · P(z2 | z1 ) · P(y2 | y1 , z1:2 ) · ... · P(zT | zT −1 ) · P(yT | y1:T −1 , z1:T ) T
=P(z1 ) · P(y1 | z1 ) · ∏ P(zt | zt−1 ) · P(yt | zt ) t=2
30
(2.25)
The second step is the main reason that the approach is named as HMM, which is to apply the (commonly first order) Markov assumption that zt is conditional independent from all other random variables given zt−1 . The third step is to assume further that each observation yt is conditional independent from all other random variables given the hidden label zt at the same location. This results in a decomposition of the joint probability into a Markov chain on the hidden labels z1:T , with the observations y1:T generated conditional independently on each zt .
Such independence assumptions are commonly seen in machine learning to obtain computationally tractable models. For instance, Naive Bayes assume all the features are conditionally independent given the target variable. In the case of HMM, while the previous decomposition has reduced the number of parameters to linear on the sequence length instead of exponential, the number of parameters can still be intractable while the length of the sequences is large. On the other hand, the fitted parameters can also be hard to align when the training data are multiple sequences with different lengths. Therefore, in practice it is also common to assume both the transition probabilities P(zt | zt−1 ) and the emission probabilities P(yt | zt ) (including P(y1 | z1 )) are respectively drawn from fixed models disregarding of t: ( j)
[ j]
zt | zt−1 = 1
∼
Cat[µtransition ]
( j)
∼
f [θemission ]
yt | zt = 1
[ j]
(2.26)
Here j ∈ {1, ..., L} with L as the total number of hidden states (we represent z as a ran[ j] dom vector with one-hot encoding). Cat[µtransition ] denotes a categorical distribution with [ j] [ j] µtransition as the mean vector. On the other hand, f [θemission ] can be set as any probabilis[ j] tic model class with the corresponding parameter θemission . The common choices include categorical distribution (Discrete HMM) and Gaussian distribution (Gaussian HMM). As a result, the only conditional probability left undefined in the decomposition is the initial probability P(z1 ), which can be defined through another categorical distribution: z1
∼
Cat[µinitial ]
(2.27)
Therefore, the parameters of a common HMM can be denoted as a tuple: [ j]
[ j]
θHMM = hµinitial , (µtransition )Lj=1 , (θemission )Lj=1 i While all these quantities are model parameters, for convenience, we still refer them as initial probability, transition probabilities, emission model and emission probabilities in the following texts. 31
2.4.1
Inference and Prediction
In the context of Bayesian learning, inference refers the task of estimating the probability (distribution) of hidden variables and parameters under the assumed model class, given a number of observed data points. Therefore, to predict the hidden states (labels) given the observations (features) is a task of inference, assume we have known the parameters of a HMM. As the aim of this section is to give a general introduction to the mechanism of HMM, here we only cover the major inference techniques. For the details of different inference approaches, please refer [4]. To infer hidden states under a HMM is to estimate the following probability: P(zt | y1:T )
(2.28)
That is, we want to estimate the probability distribution of the hidden state zt at time point t, given the whole sequence of observations y1:T . The most widely adopted approach for this task is the Forward-Backward algorithm. The basic idea is to decompose the quantity above as following: P(zt , y1:T ) P(y1:T ) ) · P(zt | y1:t ) · P(y1:t ) P(yt+1:T | zt , y1:t = P(y1:T ) P(yt+1:T | zt ) · P(zt | y1:t ) · P(y1:t ) = P(y1:T )
P(zt | y1:T ) =
(2.29)
) is due to the Markov assumption that the observaHere the cancel out in P(y1:T | zt , y1:t tions yt+1:T after t are conditional independent from y1:t , given zt . While there are four quantities left in the final equation, only P(zt | y1:t ) and P(yt+1:T | zt ) need to be calculated. This is because P(y1:t ) and P(y1:T ) take the same value for all the possible values of zt . Therefore, P(zt | y1:T ) can be obtained by simply normalising the product between P(zt | y1:t ) and P(yt+1:T | zt ).
We first look at the calculation of P(zt | y1:t ). While to calculate it directly with any t is not straightforward, we can start from P(z1 | y1 ): P(z1 | y1 ) =
P(y1 | z1 ) · P(z1 ) (k)
(k)
∑Lk=1 P(y1 | z1 = 1) · P(z1 = 1)
(2.30)
Here both P(y1 | z1 ) and P(z1 ) are known, as previously defined with the emission and 32
initial probabilities. The next step is to look at P(z2 | y1:2 ): P(z2 , y1:2 ) P(y1:2 ) 1 P(y2 | z2 , y P(y ) 1 ) · P(z2 | y1 ) · = 1 ) P(y P(y2 | y1 ) ·
P(z2 | y1:2 ) =
(k)
=
(k)
( j)
( j)
∑Lj=1 P(y2 | z2 = 1) · P(z2 = 1 | y1 ) (k)
=
(2.31)
P(y2 | z2 ) · [∑Lk=1 P(z2 | z1 = 1) · P(z1 = 1 | y1 )] (k)
P(y2 | z2 ) · [∑Lk=1 P(z2 | z1 = 1) · P(z1 = 1 | y1 )] ( j)
( j)
(k)
(k)
∑Lj=1 P(y2 | z2 = 1) · [∑Lk=1 P(z2 = 1 | z1 = 1) · P(z1 = 1 | y1 )]
As shown by the decomposition above, now we can obtain P(z2 | y1:2 ) through P(y2 | z2 ) (emission probabilities), P(z2 | z1 ) (transition probabilities), and P(z1 | y1 ), which has been computed previously. Following the same decomposition, for any t, P(zt | y1:t ) can be calculated as: P(zt | y1:t ) = (k)
(k)
P(yt | zt ) · [∑Lk=1 P(zt | zt−1 = 1) · P(zt−1 = 1 | yt−1 )] ( j)
( j)
(k)
(2.32)
(k)
∑Lj=1 P(yt | zt = 1) · [∑Lk=1 P(zt = 1 | zt−1 = 1) · P(zt−1 = 1 | y1:t−1 )] As shown by the equations above, P(zt | y1:t ) can be computed by starting from P(z1 | y1 ) and move one step recursively to the end of the sequence. This procedure is hence known as the forward algorithm.
While the major aim of the forward algorithm is to compute P(zt | y1:t ), it can also be used to calculate the marginal likelihood of the observations: P(y1:T ) = P(y1 ) · P(y2 | y1 ) · P(y3 | y1:2 ) · ... · P(yT | y1:T −1 )
(2.33)
where we can compute each term as: P(yt+i | yt:t+i−1 ) = L
( j)
L
( j)
(k)
(k)
∑ P(yt+i | zt+i = 1) · [ ∑ P(zt+i = 1 | zt+i−1 = 1) · P(zt+i−1 | yt:t+i−1)]
j=1
(2.34)
k=1
Finally, the only step left before obtaining P(zt | y1:T ) is to calculate the conditional 33
probability P(yt+1:T | zt ). This can be done as follows: P(yt+1:T | zt ) L
=
k=1 L
=
(k)
(k)
∑ P(yt+1:T | zt+1 = 1) · P(zt+1 = 1 | zt ) (k)
(2.35)
(k)
(k)
∑ P(yt+2:T | zt+1 = 1) · P(yt+1 | zt+1 = 1) · P(zt+1 = 1 | zt ) k=1
As shown by the equation, P(yt+1:T | zt ) is now depending on P(yt+2:T | zt+1 ). Hence, we can start from P(yT | zT −1 ), which is: L
P(yT | zT −1 ) =
(k)
∑ P(yT | zT
(k)
= 1) · P(zT = 1 | zT −1 )
(2.36)
k=1
and compute recursively back to P(yt+1:T | zt ). This procedure is then known as the backward algorithm. While here we only cover the case of the forward-backward algorithm under HMM, it can also be applied to perform inference on other chain-like structured models like the Conditional Random Fields [74].
2.4.2
Training
One advantage of HMM is that the training can be done under both supervised and unsupervised setting. The supervised setting is straightforward. Assume we have a dataset {(Zi ,Yi )}N i=1 with N sequences. Since the hidden state z is known, all the related parameters in θHMM can be directly obtained through maximum likelihood estimation: [i]( j)
( j) µˆ initial
∑N z = i=1 1 N
(2.37) [i]( j)
[ j](k) µˆ transition
=
T −1 ∑N i=1 ∑t=1 zt
[ j] θˆemission = argmaxθ
[i](k)
· zt+1
(2.38)
[i]( j)
T −1 ∑N i=1 ∑t=1 zt N
T
[i]( j)
∑ ∑ zt
[i]
· log f (yt ; θ )
(2.39)
i=1 t=1
Here we leave the estimation of the emission parameters to be abstract as the emission probabilities can be modelled with any probabilistic model class. The unsupervised setting, on the other hand, can also be trained through the Expectation Maximisation (EM) algorithm [75]. For HMM, the E-step is to fixed the old parameter θHMM and infer the distribution over the hidden state z for all the observations, which is done by using the Forward-Backward algorithm introduced in the previous sec34
tion. The M-step is then to estimate the updated parameter θˆHMM via a similar manner as in the supervised setting, instead the computation is operated on the distributions over z: [i]( j)
( j) µˆ initial
=
∑N i=1 P(z1
=
T −1 ∑N i=1 ∑t=1 P(zt
[ j] θˆemission = argmaxθ
[i]
[i]( j)
T
[i](k)
[i]
= 1 | y1:T ) · P(zt+1 = 1 | y1:T )
T −1 ∑N i=1 ∑t=1 P(zt N
(2.40)
N [i]( j)
[ j](k) µˆ transition
[i]
= 1 | y1:T )
[i]( j)
∑ ∑ P(zt
[i]
(2.41)
= 1 | y1:T ) [i]
[i]
= 1 | y1:T ) · log( f (yt ); θ )
(2.42)
i=1 t=1
2.5
Chapter Summary
In this chapter, we provided a general description for the preliminaries required for reading the rest of this thesis. We first introduced the overall mathematical notations through the thesis. Then we presented the general concept of the area of subgroup discovery. This was done by introducing existing approaches concerning description language, search algorithm, and the quality measure. Regarding quality measure, we gave a general definition of traditional subgroup discovery and exceptional model mining respectively, together with a few existing measures. We formally defined the idea of parameter-driven, evaluation-driven, and semi-evaluation-driven quality measures. Some related work was also reviewed at the end of that section. For the following section, we introduced the area of PSRs, which refers to a family of functions that are commonly used for probabilistic evaluation. The PSRs are going to be the fundamental elements of our proposed measures in this thesis. Some related generalised evaluation methods and applications were also discussed in this section. Finally, we reviewed the hidden Markov models for sequential modelling, with a detailed description of its training and inference algorithms.
35
36
Chapter 3 Subgroup Discovery with Proper Scoring Rules In this chapter, we investigate generic, evaluation-driven quality measures in the context of traditional subgroup discovery. While many measures have been proposed in this context, they are all parameter-driven measures and therefore cannot be generalised to other model classes. For instance, Class-wise Weighted Relative Accuracy and Multi-class Weighted Relative Accuracy are designed for binary and multi-class targets respectively. As they are both extended from Relative Accuracy, these two measures quantify the statistical deviations of subgroups by computing the difference between two class distributions. However, by estimating a class distribution, either from the population or a candidate subgroup, it is essentially equivalent to summarising the class distribution as a statistical model of the corresponding data points. Therefore, it would be reasonable to ask for a measure that can evaluate subgroups directly on the statistical summary. Such measures can then be generalised to other target concepts as they rely on the evaluation of the statistical summaries, instead of the obtained parameters. Given this requirement, the proposed quality measures are formalised as follows. We first estimate a class distribution from the overall dataset, this gives the dataset a default statistical summary. After that, when a subgroup is selected, it adds a local summary to the default summary, we can then quantify the subgroup’s quality by evaluating the improvement of the alternative summary over the default summary. We then propose to use the proper scoring rules, a well-known family of functions for assessing the goodness of probabilistic models, to obtain theoretically well-founded quality measures for subgroup discovery. From this perspective, one subgroup is better than another if on average it has a lower divergence of target probability estimates from the actual labels. Further, we derive a novel method to directly optimise for the proposed evaluation measure, from first principles. The method is based on a generative probabilistic model, which allows us to prove the validity of the method formally. 37
3.1
Statistical Summary
We start with introducing the notation for a statistical summary. A statistical summary on (θ ) = a discrete target variable y over a dataset IU = {(xi , yi )}N i=1 is denoted a multi-set S (θ ) (θ ) = fθ (y) is the conditional probability mass distribution of y under a {si }N i=1 . Here si model with parameter θ . Therefore, if we estimate the class distribution π over the overall dataset (as in Equation 2.7), this gives the default summary to the overall dataset: (π)
(π)
= π}N i=1
SU = {si
(3.1)
Then for a subgroup with description d, we can estimate the local class distribution ρd (as in Equation 2.8), this gives the local summary within the subgroup: (ρd )
Sd
(ρd )
= {si
= ρd }i:Id (xi )=1
(3.2)
Adding the local summary to the default summary gives the alternative summary: (d,ρd ,π)
SU
(d,ρd ,π)
= {si
= Id (xi ) · ρd + [1 − Id (xi )] · π}N i=1
(3.3)
That is, if a instance is covered by the subgroup description d, the statistical model switches to the local class distribution ρd , and if it is not covered by d, the global class distribution π is kept as the global class distribution π.
3.2
Generalised Information Gain under Proper Scoring Rules
As introduced in the previous chapter, PSRs can provide a non-negative loss between an (θ ) instance and a probability estimate on it. Therefore, the loss by assigning a summary SU to the overall set of labels {yi }N i=1 is: (θ )
N
(θ )
Ψ0 (SU , y) = ∑ Ψ(si , yi )
(3.4)
i=1
(π)
With the default summary SU , the loss is given as: N
(π)
Ψ0 (SU , y) = ∑ Ψ(π, yi ) i=1
38
(3.5)
(d,ρd ,π)
And with the alternative summary SU (d,ρd ,π)
Ψ0 (SU
, the loss is calculated as:
N
, y) = ∑ Id (xi ) · Ψ(ρd , yi ) + [1 − Id (xi )] · Ψ(π, yi )
(3.6)
i=1
We can now define the quality of a subgroup description d as the gain in ψ 0 of the (π) (d,ρ ,π) summary SU d over the default summary SU , this gives the quality measure of Generalised Information Gain (IG): (π)
(d,ρd ,π)
ΦIG (d) = ψ 0 (SU , y) − ψ 0 (SU
, y)
(3.7)
(d,s,π)
with any other class In principle, instead of ρd , we could consider to summarise SU distribution s. However, the summary with ρd is unique among these, as it is maximising (π) the gain over the summary SU due to properness of the scoring rule. This claim is stated in the following theorem: Lemma 3. Let Ψ be a proper scoring rule and ∆ its respective divergence measure. If s, s0 are random vectors representing two sets of class probability estimates for a random variable t representing the actual class, then: Es,s0 ,t [Ψ(s,t) − Ψ(s0 ,t)] = Es,s0 ,t [∆(s,t) − ∆(s0 ,t)] = Es,s0 [∆(s, Et [t]) − ∆(s0 , Et [t])]
(3.8)
Proof. With t being the random variable of the actual class, we have: Es,t [Ψ(s,t)] = Es,t [∆(s,t)]
(3.9)
Following Lemma 1 from the supplementary of [69], we get the decomposition: Es,t [∆(s,t)] =Es,t [Ey∼t [Ψ(s, y)] − Ey∼t [Ψ(t, y)]] =Es,t [Ey∼t [Ψ(s, y)] − Ey∼t [Ψ(Et [t], y)]] + Es,t [Ey∼t [Ψ(Et [t], y)] − Ey∼t [Ψ(t, y)]] =Es [Ey∼Et [t] [Ψ(s, y)] − Ey∼Et [t] [Ψ(Et [t], y)]]
(3.10)
+ Et [Ey∼t [Ψ(Et [t], y)] − Ey∼t [Ψ(t, y)]] =Es [∆(s, Et [t])] + Et [∆(Et [t],t)] The analogous decomposition can be obtained with s0 . The second term is shared and hence when subtracting it cancels, yielding the required result. Theorem 1. Let Ψ, Ψ0 , ∆ be a proper scoring rule, its sum across the dataset with a summary, and its corresponding divergence measure, respectively. Then for any given 39
subgroup description d the following holds: (d,s,π) (π) , y) = ρd argmaxs Ψ0 (SU , y) − ψ 0 (SU
(3.11)
where ρd denotes the class distribution estimated within the subgroup description d. The maximum value achieved is ϒabs (d) · ∆(π, ρd ) where ϒabs (d) is the size of the subgroup with description d. Proof. For simplicity of notation, let’s assume that the training labels within description d are y1 , ..., yC (the labels of first C = ϒabs (d) instances). Consider a random variable t obtaining its value by uniformly choosing one yi that belongs to d among y1 , ..., yC . (π) (d,ρ ,π) (π) The summaries SU and SU d are equal for instances C + 1, ..., N, hence Ψ0 (SU , y) − (d,ρ ,π) Ψ0 (SU d , y) = C · Et [Ψ(π,t) − Ψ(ρd ,t)]. Using Lemma 1 this is in turn equal to C · Eπ [∆(π, Et [t]) − C · Eρd [∆(ρd , Et [t])]. However, since Et [t] = ρd then the second term is zero and the first is C · ∆(π, ρd ), which is the required result. This theorem implies that Eq.(3.7) can be rewritten as follows: ΦIG (d) = ϒabs (d) · ∆(π, ρd )
(3.12)
In words, this quality measure multiplies the size of the subgroup by the divergence of the overall class distribution from the distribution within the subgroup description1 . If we consider Log-loss as the proper scoring rule, then the corresponding generalised information gain measure is: ΦIG-LL (d) = ϒabs (d) · ∆KL (π, ρd ) K π (k) ∑N (k) i=1 Id (xi ) · − ∑ ρd · log (k) = N ρ k=1
!
(3.13)
d
(k)
where ∆KL is the KL-divergence, again ρd and π (k) are the proportion of the k-th class in the subgroup description d and in the population respectively. For Brier Score the corresponding measure is the quadratic error multiplied by the absolute size: ΦIG-BS (d) = ϒabs (d) · ∆BS (pi, ρd ) =
K ∑N (k) i=1 Id (xi ) · ∑ (π (k) − ρd )2 N k=1
(3.14)
These measures are referred as generalised information gain due to their relationships to other forms of information gain applied in the field of predictive machine learning. For example, in decision tree learning, information gain measures the decrease in impurity 1 In
general, divergence measures are not symmetric, so ∆(π, ρd ) is different from ∆(ρd , π).
40
when splitting a parent node into two children nodes. Under proper scoring rules, this information gain is equivalent to the improvement by using the alternative summary with the local class distributions both in the subgroup and its complement: Impurity(d) = ψ 0 (S(π) , y) − ψ 0 (S(d,ρ,τ) , y) = ϒabs (d) · ∆(π, ρd ) + (N − ϒabs (d)) · ∆(π, τ)
(3.15)
Here we use the notation S(d,ρ,τ) to denote the summary with si = ρ if Id (xi ) = 1, and si = τ if Id (xi ) = 0. τ is the class distribution estimated from the complement of the subgroup description d. The term information gain comes from the field of information theory, where Shannon entropy measures the amount of information carried by a signal in bits. Entropy hence is seen as a quantity that describes the unpredictability of a random variable. Therefore, as in decision tree learning, when we use additional features to create conditional random variables (e.g. create a split), the unpredictability of the random variable can be reduced. The reduced amount is hence known as information gain. Proper scoring rules are hence one way of generalising Shannon entropy into a family of entropy functions, which further gives a family of information gain measures. For instance, if we measure impurity by Shannon entropy this leads to Quinlan’s information gain splitting criterion [76]; and if we measure impurity by the Gini index, we obtain Gini-split [77]. Again, for the generalised information gain measure defined in Equation 3.12, we only care about the improvement by adding the subgroup into the summary, it results in a slightly different measure compared to the impurity measure as introduced above. However, as it can still be seen as computed the difference between the amount of information, we refer to it as generalised information gain. Another interesting aspect here is to consider the difference between the impurity measure and our generalised information under the context of designing a quality measure. Recall that we previously discussed whether a subgroup should be compared to its complement or the population. While by definition both measures here are comparing to the original summary for the population, it can be seen that the impurity measure is calculated according to both the local distribution within the subgroup (description) and its complement. As it turned out, such measures are commonly used to find a split to the dataset, instead of evaluating the gain by only adding the corresponding subgroup. Therefore, at least with our case under PSRs, it would make more sense to compare to the population without considering the complement of the subgroup. 41
3.3
Generative Modelling for Improved Estimation
As mentioned in the introduction, the general context of data mining is where one observes a set of data points that belong to a particular domain and the task is to extract information from the data. Such information can then be adopted to improve the performance of corresponding applications. Therefore, it is desirable that the subgroups as the representation of obtained knowledge would generalise to future data observed in the same domain. Two problems need addressing when generalising to future data. First, the class distribution ρd might be calculated on a small sample size and therefore can be a poor estimate of the actual distribution in the future. Second, it is not certain whether the actual distribution within the subgroup description is indeed different from the overall distribution π. To capture these aspects, we employ a generative model to simulate how instances with label y are produced at the first place, under the subgroup description d. We assume that the observed (training) instances under description d were generated according to the same model, which is defined as follows.
3.3.1
The Generative Model
First, we fix the default K-class distribution π. We then decide whether the distribution under a subgroup description d is different from the default (z = 1) or the same as default (z = 0), with some hidden variable z drawn from a Bernoulli distribution: z ∼ Ber(γ)
(3.16)
where Ber denotes the Bernoulli distribution, γ is the mean of the distribution and our prior belief that E[z] = 1. If z = 1 then we assume the class distribution q within the subgroup description d is generated according to a Dirichlet Prior: q ∼ Dir(β )
(3.17)
where Dir(β ) is the K-dimensional Dirichlet distribution with parameter vector β . Finally, we assume that all the instances are drawn from the subgroup description d are all Independent and Identically Distributed (i.i.d.). For simplicity of notation, let us again assume that the training instances within d are the first C instances y1 , . . . , yC . The distribution of y1 , . . . , ym and the test label variable y are as follows: y, y1 , . . . , yC ∼ Cat[z · q + (1 − z) · π]
(3.18)
where Cat is the categorical distribution with the given class probabilities. In the experiments reported later we used non-informative priors for z and q, with γ = 0.5 and 42
β = [1, ..., 1] respectively. While some literature defines a generative model as a model that captures the conditional distribution of the features given the target variable [78, 70], the model we described above is only built on the target variable. We still treat this model as a generative model as it can provide random samples for the target variable, which hasn’t been addressed in subgroup discovery according to the reviewed literature.
3.3.2
Improved Quality Measures
The above model can be used to generate instances within a subgroup description d. We will now exploit this model to derive two subgroup quality measures, the first one of which takes into account the uncertainty about the true class distribution in the subgroup, while the second one also models our uncertainty whether it is different from the default distribution. Therefore, we consider the task of choosing ρ which would maximise the expected gain in ψ 0 on the future instances. The following theorem solves this task, conditioning on the observed class distribution within the subgroup and on the assumption that this subgroup is different from background (z = 1). Theorem 2. Consider a subgroup as generated with the model above. Denote the random vector of the counts of each class in the training set of this subgroup by H and the observed counts by h = ∑Ci=1 yi . Then: argmaxs E[Ψ0 (π, y) − Ψ0 (s, y)|H = h, z = 1] =
h+β (k) (k) ∑K k=1 h + β
(3.19)
Denoting this quantity by ρˆ d , the achieved maximum is ∆(π, ρˆ d ), where ∆ is the divergence measure corresponding to Ψ. Proof. Consider a random variable t obtaining its value by uniformly choosing one yi that belongs to d among y1 , ..., yC . Then: E[Ψ0 (π, y) − Ψ0 (s, y)|H = h, z = 1] = E[Ψ(π,t) − Ψ(s,t)|H = h, z = 1]
(3.20)
Using Lemma 1 this in turn is equal to: ∆(π, E[t|H = h, z = 1]) − ∆(s, E[t|H = h, z = 1])
(3.21)
Since the first term does not depend on s this quantity is maximised by minimising the second divergence. As with any divergence, the minimal value is zero and it is obtained if the two terms are equal, i.e., s = ρˆ d = E[t|H = h, z = 1]. It remains to prove that E[t|H = h, z = 1] = K h+β . This holds because it is a posterior distribution under the h(k) +β (k) ∑k=1
Dirichlet prior Dir(β ) after observing h(1) , ..., h(k) of the classes 1, ..., K, respectively. 43
In the experiments we use β = (1, . . . , 1) and hence the gain is maximised when predicting the Laplace-corrected probabilities, i.e., adding 1 to all counts and then normalising. According to this theorem we propose a novel quality measure which takes into account the uncertainty about the class distribution: ΦIG ˆ (d) = ϒabs (d) · ∆(π, ρˆ d )
(3.22)
where ϒabs (d) is again the absolute size of the subgroup. Now consider the case where z = 1 is no longer assumed, this leads to the following theorem. Hence, it additionally takes into account the uncertainty about whether the distribution within the subgroup is different from the background. Theorem 3. Consider a subgroup as generated with the model above and denote H and h as above. Then: argmaxs E[Ψ0 (π, y) − Ψ0 (ρ, y)|H = h] = a ·
h+β (k) (k) ∑K k=1 h + β
+ (1 − a) · π
(3.23)
where a = P[z = 1|H = h]. Denote this quantity by ρˆˆ d . Then the achieved maximum value is ∆(π, ρˆˆ d ), where ∆ is the divergence measure corresponding to ψ. Proof. Consider a random variable t obtaining its value by uniformly choosing one yi that belongs to d among y1 , ..., yC . Then: E[Ψ0 (π, y) − Ψ0 (s, y)|H = h] = E[Ψ(π,t) − Ψ(s,t)|H = h]
(3.24)
Using Lemma 1 this is in turn equal to: ∆(π, E[t|H = h]) − ∆(s, E[t|H = h])
(3.25)
Since the first term does not depend on s this quantity is maximised by minimising the second divergence. As with any divergence, the minimum value is zero, and it is obtained if the two terms are equal, i.e., ρ = E[t|H = h]. It remains to prove that E[t|H = h] = a · ρˆ d + (1 − a) · π where ρˆ d is defined in Theorem 2: E[t|H = h] =P[z = 1|H = h] · E[t|H = h, z = 1] +P[z = 0|H = h] · E[t|H = h, z = 0]
(3.26)
=a · ρˆ d + (1 − a) · π Here E[t|H = h, z = 0] = π due to y (and therefore t) drawn from Bernoulli (Categorical) with the mean z · q + (1 − z) · π. The achieved maximum is ∆(π, ρˆˆ d ). Following this theorem we propose another novel quality measure, which takes into 44
account both the uncertainty about the class distribution and about whether it is different from the background distribution: ΦPSR (d) = ϒabs (d) · ∆(π, ρˆˆ d )
(3.27)
where ϒabs (d) is the absolute size of the subgroup. To calculate the value of a = P[z = 1|H = h] we have the following theorem: Theorem 4. Consider a subgroup as generated with the model above and denote H and h as above. Then the following equalities hold: γ · P[H = h | z = 1] γ · P[H = h | z = 1] + (1 − γ) · P[H = h | z = 0] β (k) ) ∏K Γ(h(k) + β (k) ) C Γ(∑K P[H = h | z = 1] = · k k=1 (k) · k=1 (k) h ∏k=1 Γ(β ) Γ(C + ∑K k=1 β ) K (k) C · ∏ (π (k) )h P[H = h | z = 0] = h k=1 P[z = 1|H = h] =
(3.28)
where C = ϒabs (d) is the absolute size as introduced above. Proof. Due to P[z = 1] = γ, we can obtain the first result from the Bayes formula with P[z = 1|H = h] = P[H=h|z=1]·P[z=1] . To obtain the second result we note that in the subP[H=h] group z = 1 the class distribution is drawn from Dir(β ), therefore the distribution of H follows the Dirichlet-Multinomial distribution. The stated result represents simply the probability distribution function of the Dirichlet-Multinomial with Dir(β ) and multinomial of size C. The third result is simply the probability distribution function of the Multinomial Distribution.
3.4
Isometric Analysis for Proposed Measures
Receiver Operating Characteristic (ROC) isometric is a common tool to visualise tradeoffs between positive and negative examples under a given binary measure. In this section, we hence adopt the ROC isometric to demonstrate the properties and differences among our proposed measures. We start by introducing the ROC analysis [79, 80, 3] under the context of subgroup discovery. A ROC space is a two-dimensional space with two unit axes, with each axis representing the ratio between the local count of one class and the global count of the same class. As a count can be obtained for any subgroup description d, the subgroup description can, therefore, be uniquely visualised as a point in the ROC space. For the simplicity of notations, let us denote the x-axis with a as the ratio of y(1) = 1 between the 45
subgroup description d and the overall dataset: ρ (1) ad = ϒrel (d) · (1) π ϒabs (d) ρ (1) = · (1) N π (1) N ∑ Id (xi ) · yi = i=1 (1) ∑N i=1 yi
(3.29)
And denote the y-axis with b s the ratio of y(2) = 1 between the subgroup description d and the overall dataset:: ρ (2) π (2) ϒabs (d) ρ (2) · (2) = N π (2) ∑N i=1 Id (xi ) · yi = (2) ∑N i=1 yi
bd = ϒrel (d) ·
(3.30)
Now assume that we have some quality measure Φ and an infinite description space that contains descriptions with all possible values of ad and bd . With a given quality value q, we can therefore solve the set of points that are corresponding to q as (ad , bd ) : Φ(d) = q. This set of points typically results in a line in the ROC space. Such lines are known as isometric [81] in ROC analysis. In the context of subgroup discovery, ROC isometric can visually provide us with at least the following information: 1. The effect of the relative size of the quality measure 2. The effect of the local class distribution within the subgroup description 3. The relationship between a subgroup description and its complement Before introducing the isometrics for our proposed measures, we first present the isometrics for the local class distribution ρd and the relative size ϒrel (d). Figure 3.1 shows the isometrics for the local class distributions, as well as isometrics for the relative size ϒrel (d), under different global class distributions: As shown, the isometrics for the local class distribution are all straight lines starting from the origin, disregarding the overall class distribution in the population. In fact, these isometrics are equivalent to the isometrics of precision, since precision is the mass of a class distribution on the positive (1) examples (Precision(d) = ρd ). On the other hand, the isometrics for the relative size are all parallel lines with the slope determined by the global class distribution. 46
1 0.9
0.8
0.8
0.7
0.7
0.6
0.6
b d (TPr)
b d (TPr)
1 0.9
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
0.1
0.1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0.1
0.2
0.3
0.4
a d (FPr)
1
1
0.9
0.9
0.8
0.8
0.7
0.7
0.6
0.6
0.5
0.4
0.3
0.3
0.2
0.2
0.1
0.1
0.3
0.4
0.7
0.8
0.9
1
0.6
0.7
0.8
0.9
1
0.5
0.4
0.2
0.6
(b)
b d (TPr)
b d (TPr)
(a)
0.1
0.5
a d (FPr)
0.5
0.6
0.7
0.8
0.9
1
a d (FPr)
0.1
0.2
0.3
0.4
0.5
a d (FPr)
(c)
(d)
Figure 3.1: Isometrics for (a) local class distribution with π (1) /π (2) = 1, (b) local class distribution with π (1) /π (2) = 3, (c) relative size with π (1) /π (2) = 1, (d) relative size with π (1) /π (2) = 3. We start with the isometric of the generalised information gain ΦIG-BS with BS, together with the isometric of ΦW RAcc for comparison. These isometrics are shown in Figure 3.2. By looking at the top row of the figure, it can be observed that the isometrics of ΦW RAcc are parallel lines. This observation indicates that, by adding random instances to a subgroup according to the global class distribution, the quality of the subgroup will stay the same. On the other hand, the same operation will lead to a lower quality of the measure ΦIG−BS . The interpretation is hence that ΦIG−BS considers subgroups that are both large and close to the global class distribution less interesting, compared to ΦW RAcc . In the right column, isometrics for both measures are produced under a unbalanced global class ratio (e.g. π (1) /π (2) = 3). While the isometrics with ΦW RAcc still keep being parallel and symmetric, isometrics with ΦIG−BS become asymmetric. The isometrics close to the bottom right of the ROC have a higher density comparing to the ones close to the top left. This appearance means that subgroups can gain a rapid change of quality by containing more instances from one class instead of the other class. Given that here the high-density isometrics appear around the bottom right of the ROC space, the corre47
1
1
2 0.
0.9
1 0.
15 0.
0
0.8
0.7
5 .0
-0
05 0.
1
0.
0
b d (TPr)
0.6
.1
-0
0.5 0.4
0.6
5
1
0.
0.5
. -0
05
0.
5 .1
0.3
-0
0
0.2
1
. -0
0.1
0.2
0.3
0.4
0.5
0.6
5 .0
.2 -0
0.7
0.8
0.9
-0
0
0.2
0.1
0.1
1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
25 0.9
1
a d (FPr)
(a)
(b)
0.9
0.7
0.6
0.6
5
b d (TPr)
0.8
0.7
10
0.8
0.4
5
15
5
10
1 20
1
b d (TPr)
5 .1 -0
.1
-0
a d (FPr)
0.5
.0 -0
0
05 0.
0.4
05
0.3
0.9
0
05
0.
0.8
0.7
b d (TPr)
1
0.
15
0.
0.9
0.5 0.4
0.3
0.3
5
10
0.2
5
0.2
10 0.1
0.1 5
0.1
0.2
0.3
15
10 0.4
0.5
0.6
0.7
0.8
20 0.9
5
1
a d (FPr)
0.1
0.2
0.3
10 0.4
15
0.5
0.6
0.7
20 0.8
a d (FPr)
(c)
(d)
Figure 3.2: Isometrics for (a) W RAcc with π (1) /π (2) = 1, (b) W RAcc with π (1) /π (2) = 3, (c) IG − BS with π (1) /π (2) = 1, (d) IG − BS with π (1) /π (2) = 3. sponding class is hence the negative class (e.g. y(2) = 1), which is the rare class according to the global class ratio. This asymmetry is a nice property when dealing with highly unbalanced dataset without a given cost for each class. Figure 3.3 and Figure 3.4 provide the isometrics for ΦIG-BS and ΦPSR-BS respectively. ˆ Since both measures involve the Laplace correction with the Dirichlet prior, isometrics for them can have different shapes under different sizes of the overall dataset. As a result, both measures generate different isometrics comparing to ΦIG−BS for the case N = 10, the isometrics with N = 1000 are close to the ones of ΦIG−BS . As shown by Figure 3.3 (a), isometrics of ΦIG−BS with N = 10 are close to parallel lines. It can be explained as the Laplace correction is more effective when the dataset is small and hence reduces the quality of subgroups close to the bottom left. Concerning Figure 3.3 (d), the effect of Laplace correction is even stronger with an unbalanced class ratio in the population, which gives a convex shape for the isometrics around the bottom right of the ROC space. Regarding isometrics of the measure ΦPSR in Figure 3.4, considering that the estimated class distribution within this measure is a linear combination of the Laplace corrected class distribution and the overall class distribution, the isometrics also appear to be a combination of the 48
2 0.
0.4
0.5 0.4
0.2
0.4
0.3
0.2 0.1
0.2
0.3
0.4
0.4 0.5
0.6
0.7
0.8
0.8
1
0.9
1
2 0.2
0.1
4
6
0.3
8
0.4
10
10 0.5
12
0.6
14
0.7
16
18
0.8
20 0.9
5
0.9
0.7
0.6
0.6
0.6
0.2
0.4
0.5 0.4
0.4
0.3
0.3
0.2
0.2 2
0.6
4 0.
0.4
0.5
a d (FPr)
(d)
0.6
0.7
0.8
0.8
b d (TPr)
0.8
0.7
b d (TPr)
0.8
0.3
50
0.3
150
0.6
0.7
0.8
200 0.9
1
0.5
0.3
5
50
0.2
10
100
0.1
1
5 1
100 0.5
0.4
0.1
0.9
0.4
(c)
0.7
0.2
0.2
1
0.9
0.5
100
a d (FPr)
0.8
0.1
0 0.1
1 2 0.
0.
200
50
1
(b)
1
0.1
50
0.2
a d (FPr)
(a)
b d (TPr)
8
0.1
a d (FPr)
0.9
6
0.1
0.6
0.5
0.3
4
2
0.2
0.1
0.6
0.4
0.3
0.2
15
8
6 4
2
50
0.5
0.7
0.6 2
4 0.
4
0.6
0.8
10 0
0.7
0.9
50
0.8
0.7
2
b d (TPr)
0.8
1
8 6
6
0. 8
0.
0.9
b d (TPr)
b d (TPr)
2 0.
20 18 16 14 12 10
1 1
1 0.9
0.1
0.2
0.3
15
10 0.5
0.4
0.6
0.7
0.8
20 0.9
50 1
a d (FPr)
0.1
0.2
0.3
100 0.4 0.5
150 0.6
0.7
200 0.8
0.9
1
a d (FPr)
(e)
(f)
ˆ − BS with (a) π (1) /π (2) = 1, N = 10, (b) π (1) /π (2) = 1, Figure 3.3: Isometrics for IG N = 100, (c) π (1) /π (2) = 1, N = 1000, (d) π (1) /π (2) = 3, N = 10, (e) π (1) /π (2) = 3, N = 100, (f) π (1) /π (2) = 3, N = 1000. ones in Figure 3.2 and Figure 3.3. For instance, both Figure 3.4 (a) and Figure 3.4 (d) are more close to the original shape as in Figure 3.3. This observation further demonstrates our claim that the measure ΦPSR is capable of dealing with the uncertainty of whether the candidate subgroup is drawn from a local distribution, which further reduces over-fitting on same subgroups.
3.5
Experimental Evaluation
In this section we experimentally investigate the performance of our proposed measures. The experiments are separated into two parts. For the first part, we generated synthetic data, such that we know the true subgroup. In the second part, we applied our methods to UCI data to investigate summarisation performance. For our proposed measures, we adopt the generalised divergences of BS and LL as given in Lemma 1 and Lemma 2. Plugging these into Equation 3.22 and 3.27 we obtain ˆ ˆ four novel measures IG-BS, IG-LL, PSR-BS and PSR-LL. We compare these proposals against the generalised information gain IG-LL (Equation 3.13), IG-BS (Equation 3.14), and a range of subgroup evaluation measures used in the literature: Class-wise Weighted Relative Accuracy (WRAcc), as well as the χ 2 statistic, which is defined as follows: (k)
(ρd − π (k) )2 π (k) k=1 K
ΦChi2 (d) = ϒabs (d) · ∑ 49
(3.31)
2
8 6
4
0.
0.1
0.2
0.1
4
6
2
0.3 0.4 0.5 .6 0
4
6
2
(a)
8
10
12
14
16
100
2
50
0 15
50
200
π(1) /π(2) =1, N = 1000
2
8
6
4
0. 1
20 18 16 14 12 10
π(1) /π(2) =1, N = 100
1
0 0.5.6 0.4 0.3 0.2
π(1) /π(2) =1, N = 10
50
8
10
18
20
100
50
(b)
200
(c)
π(1) /π(2) =3, N = 100
π(1) /π(2) =3, N = 1000
50
5
π(1) /π(2) =3, N = 10
150
100
0.1
0.2
5
50
10
0.1
0.2
0.3 0.4 0.5 .6 0 0.7 0.8
1
(d)
5
10
(e)
15
20
100
50
100
150
200
(f)
Figure 3.4: Isometrics for PSR − BS with (a) π (1) /π (2) = 1, N = 10, (b) π (1) /π (2) = 1, N = 100, (c) π (1) /π (2) = 1, N = 1000, (d) π (1) /π (2) = 3, N = 10, (e) π (1) /π (2) = 3, N = 100, (f) π (1) /π (2) = 3, N = 1000.
3.5.1
Synthetic Data
In the experiments on the synthetic data, we evaluate how good the methods are in revealing the true subgroup used in generating the data, as well as in producing good summaries of the data. To provide a more intuitive illustration, we construct our data set according to a reallife scenario. Suppose one has been using a wearable device to record whether daily exercises were performed or not, for a whole year. As it turned out, there were 146 out of 365 days when the exercises were performed, which gives a probability about 2/5 that the exercises were performed on a random day. According to the website of the wearable device, the same statistics are about 1/3 for the general population. It is possible that the overall exercise frequency was different, but perhaps a more plausible explanation might be that more exercises were performed during a particular period only. Subgroup discovery can hence be applied to recognise the period of more intensive exercise and summarise the corresponding exercise frequency. Following this scenario, the feature space consists of the 52 weeks of the year, hence X = {1, ..., 52}. We define the subgroup language as the set of all intervals of weeks of length from 2 to 8 weeks. The data set is assumed to contain a single year from January to December. This setting allows us to perform an exhaustive search on the subgroup language. As here we aim to compare the performance among different quality measures, applying exhaustive search can avoid the bias introduced by other greedy search algorithms. The way to generate the data is then as described in the previous section. Given the 50
π (1) .1 .2 .3 .4 .5
PSR − BS .744 .636 .587 .558 .567
PSR − LL W RAcc Chi2 IG − BS .736 .597 .526 .030 .638 .510 .436 .089 .589 .480 .403 .218 .564 .454 .390 .372 .569 .458 .410 .561
IG − LL .029 .091 .223 .379 .565
ˆ − BS IG .742 .628 .581 .550 .561
ˆ − LL IG .716 .631 .585 .559 .565
Table 3.1: Micro-averaged F-scores on the artificial data, for different class distributions (π (1) ). The best results for each row are shown in bold. default class distribution π, the subgroup class distribution q is sampled from a Dirichlet prior, and a true subgroup is selected uniformly within the language. Therefore, all the seven days of each week can be distributed either according to π or according to q. We evaluate each subgroup quality measure by comparing the obtained subgroup against the true subgroup. This evaluation is done by measuring the similarity of the respective indicator functions between z and zˆ. For similarity, we use the F-score as we are not interested in the ‘true negatives’ (instances in the complements of both true and discovered subgroups). There instances are not interesting as they are not directly described by the obtained subgroups. For the cases where such instances are indeed interesting, it would be more reasonable to design quality measures that can discover them directly as resulted subgroups. The F-score for this case can be computed as (zi and zˆi are used to represent whether an instance belongs to the true subgroup and the obtained subgroup respectively): F1 =
2 · ∑N i=1 I(zi = 1, zˆi = 1) N ∑i=1 (2 · I(zi = 1, zˆi = 1) + I(zi = 1, zˆi = 0) + I(zi = 0, zˆi = 1))
(3.32)
The results are given in Table 3.1 as the micro-averaged F-scores from 5 000 synthetic sequences, for different values of π (1) (the first component of the class distribution vector). We can see that the PSRs-based approaches outperform existing measures, with a slight advantage for Log-loss over Brier score. The information gain based methods perform particularly poorly, as they have a preference for pure subgroups, whereas for skewed π it would be advantageous to look for subgroups with a more uniform class distribution. As π tends to be uniform, the ‘true’ subgroup becomes more likely to be distributed close to π and harder to identify. Therefore, all methods are expected to perform poorly for π (1) ≈ 0.5. The variance is quite high across all methods, probably because the data set is quite small (52 · 7 = 364 instances). Since a better statistical summary is essentially our aim, the results are also evaluated according to their overall loss on a test set (also of length one year) drawn from the same distribution. For each quality measure, a subgroup is obtained from the training fold together with the local statistical summary (πˆˆ for ΦPSR , πˆ for other quality measures). The loss for the obtained summarisation can then be calculated as in Eq.(3.4). The corre51
π (1) .1 .2 .3 .4 .5
PSR − BS .195 ± .03 .326 ± .03 .419 ± .02 .475 ± .02 .494 ± .02
PSR − LL .195 ± .03 .326 ± .03 .419 ± .02 .475 ± .02 .494 ± .02
W RAcc .207 ± .03 .334 ± .03 .424 ± .02 .479 ± .02 .497 ± .01
Chi2 .212 ± .03 .337 ± .03 .426 ± .02 .480 ± .01 .498 ± .01
IG − BS IG − LL .231 ± .04 .231 ± .04 .350 ± .04 .350 ± .04 .430 ± .03 .430 ± .03 .478 ± .02 .478 ± .02 .494 ± .02 .495 ± .02
ˆ − BS IG .195 ± .03 .326 ± .03 .420 ± .02 .476 ± .02 .494 ± .02
ˆ − LL IG .195 ± .03 .326 ± .03 .420 ± .02 .476 ± .02 .494 ± .02
Table 3.2: Average Brier scores on the artificial data. The best results are shown in bold. π (1) .1 .2 .3 .4 .5
PSR − BS .344 ± .04 .507 ± .03 .610 ± .03 .668 ± .02 .687 ± .02
PSR − LL .344 ± .04 .507 ± .03 .610 ± .03 .668 ± .02 .686 ± .02
W RAcc .359 ± .04 .517 ± .03 .616 ± .02 .673 ± .02 .690 ± .01
Chi2 .368 ± .04 .520 ± .03 .618 ± .02 .674 ± .02 .691 ± .01
IG − BS IG − LL .406 ± .06 .407 ± .06 .539 ± .05 .540 ± .05 .624 ± .03 .624 ± .03 .671 ± .02 .671 ± .02 .688 ± .02 .687 ± .02
ˆ − BS IG .344 ± .04 .508 ± .03 .611 ± .03 .670 ± .02 .688 ± .02
ˆ − LL IG .347 ± .04 .509 ± .03 .611 ± .03 .669 ± .02 .687 ± .02
Table 3.3: Average Log-loss on the artificial data. The best results are shown in bold. sponding results are given in Tables 3.2-3.3 for both Brier score and Log-loss. We see a similar pattern as with the F-score results.
3.5.2
UCI Data
We proceed to compare our method with existing approaches on UCI data sets [82]. We selected the same 20 UCI datasets as described in [43]. The information regarding the number of attributes and instances are provided in Table 3.4: The subgroup language we used here is conjunctive normal form, with disjunctions (only) between values of the same feature, and conjunctions among disjunctions involving different features. All features are treated as nominal. If the original feature is numeric and contains more than 100 values, it is discretised into 16 bins. Since exhaustive search is intractable for most data sets in this experiment, we perform beam search instead. The beam width is set to be 32 (i.e., 32 candidate subgroups are kept to be refined in the next round). The number of refinement rounds is set to 8. The resulting average Brier scores and Log-loss are given in Tables 3.5-3.6. All the results are obtained by 10-fold cross-validation. As in the previous experiment, a subgroup is learned on the training folds, and the class distribution estimated on the test fold is then used to compute the corresponding loss. Given these results, it can be seen that our proposed measures outperform WRAcc, Chi2 and both versions of information gain. The PSR measures (first two columns) are never outperformed by the generalised divergence (last two columns), so we recommend using the former unless simplicity of implementation is an issue (as the latter don’t need the estimation of a). Regarding the choice between (BS, LL), this is still an ongoing debate in the community. Here we used both to demonstrate that our novel measure can apply either of the two most well-known proper scoring rules. 52
Name Abalone Balance-scale Car Contraceptive Contact-lenses Credit Dermatology Glass Haberman Hayes-roth House-votes Ionosphere Iris Labor Mushroom Pima-indians Soybean Tic-Tac-Toe Breast Cancer Zoo
# instances 4 176 624 1 727 1 472 24 589 365 213 305 131 434 350 150 57 8 123 767 683 957 197 100
# features 9 5 7 10 5 16 35 11 4 5 17 34 5 17 23 9 36 10 34 18
# classes 3 3 4 3 3 2 6 6 2 3 2 2 3 2 2 2 19 2 2 7
Table 3.4: The 20 UCI data sets used in the experiments.
3.6
Chapter Summary
In this chapter, we investigated how to discover subgroups that are optimal in the sense of maximally improving the global statistical summary of a given data set. By assuming that the (discrete) statistical summary is to be evaluated by the proper scoring rules, we derived the corresponding quality measures from first principles. We also proposed a generative model to consider the optimal statistical summary for any candidate subgroup. The proposed measures are visualised through the ROC isometrics under a binary setting, together with a comparison to the measure WRAcc. By performing experiments on both synthetic data and UCI data, we showed that our measures provide better summaries in comparison with existing methods. The major advantage of adopting our generative model is that it prevents finding small subgroups with extreme distributions, compared to the measures without generative modelling. This result can be thought of as obtained from applying a regularisation on the class distribution, similar to performing Laplace smoothing in decision tree learning. Given the experiments, we can observe that the novel measures tend to perform better on small data sets (e.g. Contact-lenses, Labor).
53
data set Abalone Balance-scale Car Contraceptive Contact-lens Credit Dermatology Glass Haberman Hayes-roth House-votes Ionosphere Iris Labor Mushroom Pima-indians Soybean Tic-Tac-Toe Breast Cancer Zoo
PSR − BS .872 ± .005 .539 ± .043 .379 ± .023 .618 ± .019 .624 ± .283 .351 ± .047 .633 ± .073 .698 ± .050 .427 ± .083 .634 ± .029 .269 ± .041 .389 ± .061 .460 ± .077 .478 ± .237 .253 ± .010 .416 ± .029 .826 ± .046 .395 ± .019 .274 ± .035 .582 ± .135
PSR − LL .874 ± .005 .572 ± .027 .380 ± .032 .647 ± .013 .651 ± .285 .351 ± .047 .708 ± .027 .698 ± .051 .387 ± .092 .625 ± .040 .271 ± .037 .389 ± .062 .460 ± .077 .466 ± .249 .253 ± .010 .458 ± .044 .882 ± .019 .455 ± .039 .306 ± .053 .684 ± .052
W RAcc .879 ± .006 .527 ± .047 .381 ± .030 .638 ± .015 .579 ± .226 .351 ± .047 .721 ± .026 .725 ± .065 .391 ± .096 .632 ± .046 .309 ± .061 .411 ± .115 .460 ± .077 .500 ± .338 .279 ± .012 .422 ± .062 .882 ± .018 .434 ± .053 .325 ± .051 .675 ± .058
Chi2 .897 ± .004 .578 ± .024 .466 ± .031 .650 ± .012 .611 ± .151 .500 ± .012 .806 ± .026 .745 ± .046 .398 ± .068 .659 ± .028 .482 ± .027 .470 ± .054 .675 ± .005 .491 ± .152 .505 ± .001 .462 ± .035 .920 ± .011 .460 ± .034 .459 ± .030 .781 ± .077
IG − BS .878 ± .01 .561 ± .032 .406 ± .036 .619 ± .021 .461 ± .438 .351 ± .047 .633 ± .073 .716 ± .068 .394 ± .094 .608 ± .048 .306 ± .055 .401 ± .114 .460 ± .077 .397 ± .328 .279 ± .012 .425 ± .058 .826 ± .046 .424 ± .051 .318 ± .050 .582 ± .135
IG − LL .884 ± .006 .562 ± .032 .406 ± .036 .616 ± .021 .461 ± .438 .351 ± .047 .635 ± .077 .719 ± .048 .394 ± .094 .602 ± .044 .306 ± .055 .398 ± .112 .460 ± .077 .397 ± .328 .253 ± .010 .427 ± .060 .861 ± .026 .403 ± .046 .306 ± .053 .582 ± .135
ˆ − BS IG .872 ± .005 .539 ± .043 .379 ± .024 .618 ± .019 .627 ± .284 .351 ± .047 .633 ± .073 .698 ± .050 .430 ± .082 .634 ± .029 .269 ± .041 .389 ± .061 .460 ± .077 .478 ± .237 .253 ± .010 .416 ± .029 .826 ± .046 .395 ± .019 .274 ± .035 .582 ± .135
ˆ − LL IG .874 ± .005 .572 ± .027 .380 ± .032 .647 ± .013 .655 ± .287 .351 ± .047 .708 ± .027 .698 ± .051 .387 ± .092 .625 ± .040 .271 ± .037 .389 ± .062 .460 ± .077 .467 ± .249 .253 ± .010 .458 ± .044 .882 ± .019 .455 ± .039 .306 ± .053 .684 ± .052
Table 3.5: Average BS for the UCI data sets. The best results are shown in bold.
data set Abalone Balance-scale Car Contraceptive Contact-lens Credit Dermatology Glass Haberman Hayes-roth House-votes Ionosphere Iris Labor Mushroom Pima-indians Soybean Tic-Tac-Toe Breast Cancer Zoo
PSR − BS 2.430 ± .055 .958 ± .077 .766 ± .037 1.119 ± .031 1.166 ± .483 .563 ± .069 1.459 ± .178 1.479 ± .130 .695 ± .104 1.142 ± .050 .491 ± .074 .667 ± .098 .836 ± .132 .775 ± .332 .408 ± .016 .688 ± .034 2.579 ± .157 .660 ± .022 .507 ± .048 1.435 ± .329
PSR − LL 2.436 ± .057 .918 ± .064 .764 ± .047 1.068 ± .021 1.212 ± .485 .563 ± .069 1.424 ± .075 1.477 ± .131 .601 ± .111 1.054 ± .116 .476 ± .071 .670 ± .102 .836 ± .132 .747 ± .359 .408 ± .016 .659 ± .060 2.447 ± .079 .647 ± .040 .455 ± .087 1.439 ± .118
W RAcc 2.450 ± .062 .918 ± .084 .766 ± .052 1.089 ± .022 1.042 ± .336 .563 ± .069 1.443 ± .077 1.478 ± .211 .617 ± .121 1.045 ± .103 .476 ± .101 .629 ± .139 .836 ± .132 .787 ± .482 .455 ± .019 .655 ± .077 2.452 ± .083 .663 ± .061 .508 ± .078 1.447 ± .139
Chi2 2.608 ± .051 1.026 ± .064 .946 ± .056 1.173 ± .021 1.076 ± .239 .794 ± .014 1.807 ± .084 1.635 ± .154 .686 ± .083 1.180 ± .051 .774 ± .029 .763 ± .062 1.210 ± .008 .785 ± .176 .798 ± .001 .754 ± .041 2.810 ± .103 .752 ± .040 .751 ± .035 1.825 ± .228
IG − BS 2.504 ± .061 .986 ± .067 .797 ± .066 1.122 ± .035 .884 ± .735 .563 ± .069 1.459 ± .178 1.552 ± .188 .623 ± .117 .968 ± .116 .467 ± .088 .620 ± .147 .836 ± .132 .622 ± .470 .455 ± .019 .669 ± .076 2.579 ± .157 .669 ± .067 .491 ± .077 1.435 ± .329
IG − LL 2.511 ± .061 .993 ± .067 .797 ± .066 1.115 ± .036 .884 ± .735 .563 ± .069 1.464 ± .185 1.493 ± .192 .622 ± .117 .953 ± .108 .467 ± .088 .616 ± .145 .836 ± .132 .622 ± .470 .408 ± .016 .669 ± .076 2.455 ± .172 .641 ± .061 .456 ± .086 1.435 ± .329
ˆ − BS IG 2.430 ± .055 .958 ± .077 .766 ± .037 1.119 ± .031 1.175 ± .488 .563 ± .069 1.459 ± .178 1.479 ± .130 .693 ± .105 1.142 ± .050 .491 ± .074 .667 ± .098 .836 ± .132 .775 ± .333 .408 ± .016 .688 ± .034 2.579 ± .157 .660 ± .022 .507 ± .048 1.435 ± .329
ˆ − LL IG 2.436 ± .057 .918 ± .064 .764 ± .047 1.068 ± .021 1.223 ± .492 .563 ± .069 1.424 ± .075 1.478 ± .131 .601 ± .111 1.054 ± .116 .476 ± .071 .670 ± .102 .836 ± .132 .747 ± .359 .408 ± .016 .659 ± .060 2.447 ± .079 .647 ± .040 .455 ± .087 1.439 ± .118
Table 3.6: Average LL for the UCI data sets. The best results are shown in bold.
54
Chapter 4 Model-Based Subgroup Discovery
In the previous chapter, we demonstrated how evaluation-driven quality measures could be derived in the context of a statistical summary of proper scoring rules. To establish a link to tradition subgroup discovery, we initially investigated the proposed measures under the traditional setting, where the target variable is uni-variate, and the statistical models are capturing the class distributions. Hence, the subgroups found with the derived measures are based on a binomial or multinomial model class, as well as the proposed generative model class.
In this chapter, we further develop the idea of using generic quality measures to find subgroups within different probabilistic model classes. We first investigate how the previous measures with proper scoring rules can be applied to any probabilistic model class. By definition, to calculate the score of any instance under a particular model (distribution), Brier score requires the estimation of the distribution over the whole domain of the random variable. When the random variable is discrete, the calculated Brier score is equal to the sum of the squared difference between two probability mass distributions. The first is the estimated distribution and the second is a distribution with probability mass 1 on the observed values and probability 0 elsewhere. On the other hand, with log-loss, it only requires the estimation of the probability mass of the observed value to obtain the corresponding score. Therefore, we focus on log-loss and use it to develop both evaluation-driven and semi-evaluation generic measures further. An experiment with synthetic data is performed to illustrate the statistical information captured by traditional subgroup discovery, exceptional model mining and model-based subgroup discovery altogether. Finally, we present two case studies on real-life applications to demonstrate the vitality and flexibility of model-based subgroup discovery. 55
4.1
Probabilistic Model Class and the Weighted Divergence (π)
As introduced in the previous chapter, we use the notation SU = {si = π}N i=1 to represent (d,ρd ,π) (d,ρd ,π) = Id (xi ) · ρd + [1 − = {si the default summary on the overall dataset, and SU N Id (xi )] · π}i=1 to represent the alternative summary with a subgroup description d, where π and ρd are the estimated class distributions from the overall dataset and the subgroup respectively, over the target space Y. In this chapter, we generalise the notation above to denote the default and alternative statistical summaries with any probabilistic model class: (π)
SU = {si = fπ }N i=1 (d,ρd ,π)
SU
(d,ρd ,π)
= {si
(4.1)
= Id (xi ) · fρd + [1 − Id (xi )] · fπ }N i=1
(4.2)
Here f is a probabilistic model class, hence $\pi = \Theta(I_U)$ and $\rho_d = \Theta(I_d)$ now denote the estimated parameters (parameter sets) under f with training algorithm $\Theta$, from the overall dataset and the candidate subgroup respectively. We then use the notation $f_\theta : \mathcal{Y} \to \mathbb{R}$ to denote the probability distribution function of a model class f with parameter $\theta$. By definition, depending on whether $\mathcal{Y}$ is discrete, $f_\theta$ can be either a Probability Mass Function (pmf) or a Probability Density Function (pdf). For simplicity, for the rest of this chapter we only adopt the notation for the pdf case. The generalised information gain can then be applied to evaluate subgroups with any probabilistic model class:

$$\Phi_{IG}(d) = \psi_0(S_U^{(\pi)}, y) - \psi_0(S_U^{(d,\rho_d,\pi)}, y) = \Upsilon_{rel}(d) \cdot \mathbb{E}_{y \sim f_{\rho_d}}[\Psi(f_\pi, y) - \Psi(f_{\rho_d}, y)] = \Upsilon_{rel}(d) \cdot \Delta(f_\pi, f_{\rho_d}) \quad (4.3)$$

Unlike the previous case, where the target could only be discrete, we now assume that the target variable y can take any form as long as there is a corresponding probabilistic model class f. While the equation above can take any proper scoring rule, the calculation of the measure can become intractable under certain target variables and model classes, given that the calculation of the expectation requires integration over the target space $\mathcal{Y}$. As our original motivation is to investigate a generic approach for subgroup discovery, we would like the corresponding measure not only to work across different model classes but also to avoid additional derivation to obtain the analytical form of the measure under a model class f. We hence propose to use the empirical form of proper scoring rules to approximate the measure:

$$\hat{\Phi}_{IG}(d) = \sum_{i : I_d(x_i) = 1} [\Psi(f_\pi, y_i) - \Psi(f_{\rho_d}, y_i)] \quad (4.4)$$
As shown by the equation, the empirical form $\hat{\Phi}_{IG}$ is obtained by summing the difference between the scores of the global model and the local model over the instances covered by the subgroup description d. This calculation avoids the integral required by the expectation, and the computational complexity now only depends on the selected proper scoring rule. To further simplify the calculation, we suggest that log-loss is preferable over the other proper scoring rules, for the following reason. To calculate the score of each instance, log-loss only requires the point estimate of the probability density $f_\theta(y_i)$ at that particular instance $y_i$; the obtained score is equivalent to the log-likelihood of the instance under the estimated probabilistic model. On the other hand, proper scoring rules like the Brier score require the estimated density over the whole instance space $\mathcal{Y}$. This requirement can be easily seen in the definition of the discrete version of BS, where the score is calculated as the sum of the squared differences over all the values within $\mathcal{Y}$; it also brings back the requirement of integrating over $\mathcal{Y}$, which we are trying to avoid for the reason described above. Therefore, the first generic measure we propose is the empirical approximation of the generalised information gain with log-loss, which equals the Kullback-Leibler divergence between the global model $f_\pi(y)$ and the local model $f_{\rho_d}(y)$, weighted by the absolute size. The measure is hence named Weighted Divergence (WD). For simplicity, we denote it as $\Phi_{WD}$ instead of $\hat{\Phi}_{IG\text{-}LL}$.

Definition 20. Given a probabilistic model class f, a subgroup description d, a local model $f_{\rho_d}$, and a global model $f_\pi$, the quality measure Weighted Divergence (WD) is defined as:

$$\Phi_{WD}(d) = \sum_{i : I_d(x_i) = 1} \left( \log f_{\rho_d}(y_i) - \log f_\pi(y_i) \right) \quad (4.5)$$
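The argument above for preferring log-loss can be made concrete in code. The following minimal sketch (our own illustrative functions, not part of any library) scores a single instance of a discrete target: log-loss touches only the probability of the observed value, while the Brier score iterates over the whole domain.

```python
import numpy as np

def log_loss(p_hat, y):
    # Only the estimated probability of the observed value y is needed.
    return -np.log(p_hat[y])

def brier_score(p_hat, y):
    # The whole estimated distribution is needed: sum of squared differences
    # against the one-hot distribution of the observed value.
    one_hot = np.zeros_like(p_hat)
    one_hot[y] = 1.0
    return np.sum((p_hat - one_hot) ** 2)
```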
As discussed previously, the major interpretation of this measure is the gain in score obtained by replacing the global model with the local model on the instances covered by the subgroup description d, which makes it an evaluation-driven measure. Following the properties of proper scoring rules and the generalised divergence, the obtained value is always larger than or equal to 0, and it is equal to 0 if and only if the local model is estimated to be the same as the global model (e.g. $\rho_d = \pi$). While this measure is calculated on all the instances within the subgroup and hence includes the size factor, we can still obtain the approximated statistical deviation by dividing the measure by the absolute size.
Definition 21. Given a subgroup description d and its weighted divergence $\Phi_{WD}(d)$, the statistical deviation of d is calculated as:

$$\Omega_{KL}(d) = \frac{\Phi_{WD}(d)}{\Upsilon_{abs}(d)} = \frac{\sum_{i : I_d(x_i) = 1} (\log f_{\rho_d}(y_i) - \log f_\pi(y_i))}{\sum_{i=1}^{N} I_d(x_i)} \quad (4.6)$$
This statistical deviation corresponds to the approximated Kullback-Leibler divergence between the global model and the local model, which can be interpreted as the expected log-likelihood ratio between the two models on a random instance drawn from the local model. Later in the experiments we also use the term $\exp(\omega)$ to denote the (expected) instance-wise likelihood ratio between the two models, which gives a more intuitive comparison.
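As an illustration, the following minimal sketch computes $\Phi_{WD}$, $\Omega_{KL}$ and $\exp(\omega)$ for a univariate Gaussian model class with maximum-likelihood estimates; the function name and the boolean mask `covered` (playing the role of $I_d$) are our own illustrative choices, not the thesis implementation.

```python
import numpy as np
from scipy.stats import norm

def weighted_divergence(y, covered):
    """Phi_WD (eq. 4.5), Omega_KL (eq. 4.6) and exp(omega) for a 1D Gaussian
    model class; `covered` is a boolean mask encoding I_d(x_i)."""
    mu_pi, sd_pi = y.mean(), y.std()          # global model, Theta(I_U)
    y_d = y[covered]
    mu_rho, sd_rho = y_d.mean(), y_d.std()    # local model, Theta(I_d)
    # Gain in log-likelihood when the local model replaces the global one.
    gains = norm.logpdf(y_d, mu_rho, sd_rho) - norm.logpdf(y_d, mu_pi, sd_pi)
    phi_wd = gains.sum()
    omega_kl = phi_wd / covered.sum()
    return phi_wd, omega_kl, np.exp(omega_kl)
```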
4.2 Weighted Relative Likelihood
While the quality measure WD above can be directly applied to any probabilistic model class, it requires the estimation of $f_{\rho_d}$ and therefore a local model to be trained within each candidate subgroup. This is necessary given that we are interested in the improvement obtained by summarising with a local model. In this section, we look at another kind of interestingness, which only concerns how the subgroups relate to the global model. We first write equation (4.3) with log-loss as:

$$\begin{aligned} \Phi_{IG\text{-}LL}(d) &= \Upsilon_{rel}(d) \cdot \mathbb{E}_{y \sim f_{\rho_d}}[\Psi_{LL}(f_\pi, y) - \Psi_{LL}(f_{\rho_d}, y)] \\ &= \Upsilon_{rel}(d) \cdot \left( \mathbb{E}_{y \sim f_{\rho_d}}[\Psi_{LL}(f_\pi, y)] - \mathbb{E}_{y \sim f_{\rho_d}}[\Psi_{LL}(f_{\rho_d}, y)] \right) \\ &= \Upsilon_{rel}(d) \cdot \left( \mathbb{E}_{y \sim f_{\rho_d}}[-\log f_\pi(y)] - \mathbb{E}_{y \sim f_{\rho_d}}[-\log f_{\rho_d}(y)] \right) \\ &= \Upsilon_{rel}(d) \cdot \left( \mathbb{E}_{y \sim f_{\rho_d}}[\log f_{\rho_d}(y)] - \mathbb{E}_{y \sim f_{\rho_d}}[\log f_\pi(y)] \right) \end{aligned} \quad (4.7)$$
The equation has the two terms $\mathbb{E}_{y \sim f_{\rho_d}}[\log f_{\rho_d}(y)]$ and $\mathbb{E}_{y \sim f_{\rho_d}}[\log f_\pi(y)]$, representing respectively the expected log-likelihood of the local model and of the global model within the subgroup. We can replace the first term with the expected log-likelihood of the global model within the subgroup ($\mathbb{E}_{y \sim f_{\rho_d}}[\log f_\pi(y)]$), and the second term with the expected log-likelihood of the global model on the overall dataset ($\mathbb{E}_{y \sim f_\pi}[\log f_\pi(y)]$). This gives the following quantity:

$$\Upsilon_{rel}(d) \cdot \left( \mathbb{E}_{y \sim f_{\rho_d}}[\log f_\pi(y)] - \mathbb{E}_{y \sim f_\pi}[\log f_\pi(y)] \right) \quad (4.8)$$
As shown, this quantity measures the expected log-likelihood of the global model within the subgroup against that on the overall dataset. In practice, it can again be calculated through the empirical form. We name the corresponding measure Weighted Relative Likelihood (WRL).

Definition 22. Given a probabilistic model class f, a global model $f_\pi$, and a subgroup description d, the quality measure Weighted Relative Likelihood (WRL$^+$) is defined as:

$$\Phi_{WRL}^{+}(d) = \Upsilon_{abs}(d) \cdot \left( \frac{\sum_{i : I_d(x_i) = 1} \log f_\pi(y_i)}{\Upsilon_{abs}(d)} - \frac{\sum_{i=1}^{N} \log f_\pi(y_i)}{N} \right) \quad (4.9)$$
As the equation shows, WRL$^+$ is a semi-evaluation-driven measure: it compares the empirical mean of the log-likelihood under the same global model between the subgroup and the overall dataset. Maximising WRL$^+$ finds subgroups that have a higher average log-likelihood than the overall population. Alternatively, we may be interested in subgroups with a lower than expected log-likelihood, which can be found by maximising its negation:

$$\Phi_{WRL}^{-}(d) = -\Phi_{WRL}^{+}(d) \quad (4.10)$$

Although both WD and WRL are defined in terms of the log-likelihood within the subgroup, there are a few differences between them. While WD is the total gain in log-likelihood from switching from the global model to the local model, WRL$^+$ and WRL$^-$ compute the difference between the empirical sum of the log-likelihood and the expected sum of the log-likelihood of the global model within the subgroup. This difference further leads to a different interpretation of the subgroups obtained with each quality measure. With respect to WD, it is guaranteed that an improved local model can be trained in any obtained subgroup. However, the subgroups found by WRL don't necessarily lead to better local models, as they simply contain a higher or lower proportion of more or less likely observations. For instance, if the model class is a 1D Gaussian distribution, then WD will find subgroups where a local Gaussian distribution describes the data better than the global one. WRL$^+$ will generate subgroups that contain more samples around the mean of the global Gaussian, and WRL$^-$ will identify subgroups that tend to contain more values from its tails. However, when the dataset is drawn from a mixture of two distributions (e.g. a mixture of two Gaussians), WD and WRL$^-$ can end up with similar subgroups, where the corresponding subgroups provide better local models as well as a higher proportion of less likely observations. This claim is demonstrated in detail in the following synthetic experiments. Regarding the running time, as WRL requires no training on local subgroups, it avoids the time consumption $O(|D| \cdot \bar{t})$ of local training in WD and in the measures of exceptional model mining, where $|D|$ represents the total number of candidate subgroups and $\bar{t}$ is the average training time in each subgroup. This difference is non-trivial on large datasets with models with a large number of parameters, like the Latent Dirichlet Allocation [83] in hierarchical Bayesian modelling.
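The computational argument can be sketched as follows: the global model is fitted once and its per-instance log-likelihoods can be precomputed, after which every candidate subgroup only costs a masked mean. A minimal sketch for a 1D Gaussian global model (illustrative names, not the thesis implementation):

```python
import numpy as np
from scipy.stats import norm

def wrl_scores(y, covered):
    """WRL+ and WRL- (eqs. 4.9 and 4.10); only the global model is used,
    so no per-subgroup training is needed."""
    ll = norm.logpdf(y, y.mean(), y.std())   # global log-likelihoods, computed once
    n_d = covered.sum()                      # absolute size of the subgroup
    phi_plus = n_d * (ll[covered].mean() - ll.mean())
    return phi_plus, -phi_plus
```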
4.3 Evaluation-Driven and Parameter-Driven
So far, we have seen measures defined via both evaluation-driven and parameter-driven approaches, as well as semi-evaluation-driven measures. In this section, we provide a top-level comparison among them, which aims to give a clear view of which approach to adopt for future development of quality measures in subgroup discovery. We address the differences among these approaches through three features: generic, interestingness, and local training. The first feature is straightforward and is the major motivation for proposing evaluation-driven and semi-evaluation-driven measures: a measure is said to be generic if it can switch between a set of model classes without any modification to the measure itself. The second feature is the corresponding interestingness of the quality measures. As parameter-driven measures are defined according to a given and fixed parameter, the obtained subgroups are interesting directly on the selected parameter. On the other hand, evaluation-driven and semi-evaluation-driven measures are calculated according to the evaluation of the overall model, which may depend on multiple parameters for some model classes. This difference provides the main trade-off between evaluation-driven and parameter-driven quality measures. If the user is interested in the statistical deviation of a particular parameter of a model class, parameter-driven measures can ensure that the obtained subgroups are interesting exactly on that parameter. In contrast, while the interestingness is not fixed on a single parameter, evaluation-driven measures can provide subgroups that are simultaneously interesting over multiple parameters. However, this also implies that evaluation-driven measures don't specify the main interestingness of the obtained subgroup in terms of parameters, which then requires an extra step to investigate. The last feature is whether the calculation of a measure requires the training of local models within candidate subgroups. As discussed previously, while evaluation-driven and parameter-driven measures require local training, the main advantage of semi-evaluation-driven measures is that they avoid this computational cost. A summary of the features of the different approaches is provided in Table 4.1.

                          Generic   Interestingness       Local training
Evaluation-driven         Yes       Multiple parameters   Yes
Semi-evaluation-driven    Yes       Multiple parameters   No
Parameter-driven          No        Single parameter      Yes

Table 4.1: A comparison among evaluation-driven, semi-evaluation-driven and parameter-driven quality measures.
4.4 Experiment with Synthetic Dataset
While model-based subgroup discovery is proposed as a generic, evaluation-driven alternative to traditional subgroup discovery and exceptional model mining, it is still worth investigating the differences among the subgroups obtained by the three approaches. As the quality measures within each approach have been designed under different statistical models and hence focus on different kinds of interestingness, it is in general difficult to present a fair comparison between measures. However, there is a particular case that links both traditional subgroup discovery and exceptional model mining to the model-based approach: the bivariate Gaussian distribution. As introduced previously, the measure CWRAcc is proposed to find subgroups that have a higher / lower mean value on a continuous target variable. On the other hand, the quality measure EMM-Corr in exceptional model mining is designed to find subgroups with different correlation coefficients between two continuous target variables. As both the mean value and the correlation can be seen as parts of the parameter set of a bivariate Gaussian distribution, we use it as a base model to illustrate the differences between the subgroups obtained by the three approaches. We use synthetic datasets for this task, which gives better control over the true distribution behind the observed instances. We consider a scenario where the dataset is drawn from a mixture of two Gaussian distributions; a measure can then be evaluated by its ability to find one of the two components. The research hypothesis is that, given that evaluation-driven measures can capture statistical information simultaneously on multiple parameters, they should give a better performance than parameter-driven measures on this task, as a Gaussian distribution is defined by both the mean vector and the covariance matrix. A single synthetic dataset is generated as follows. We assume a simple univariate feature $x_i \in X = \{1, 2, ..., 10\}$, thus allowing each measure to be applied to all candidate subgroups and removing the effect of greedy search. The subgroup language is any subset of X. To break the symmetry between the two mixture components, we first generate the true description for the minority Gaussian component. We vary the relative sizes of the mixture components in the experiments, parametrised by $C \in \{1, 2, 3, 4, 5\}$: on average, C out of 10 items will belong to the minority component and $10 - C$ to the majority component. Denoting the true description of the minority component as $D_1$, it is generated by uniformly choosing C values from X, so that $D_1 \subset X$ and $|D_1| = C$. The true description for the majority Gaussian component is then $D_2 = X \setminus D_1$. An instance in the dataset is then generated as:

$$x_i \sim \text{Unif}(X)$$
$$y_i \mid x_i \in D_1 \sim \text{Norm}(\mu_1, \Sigma_1)$$
$$y_i \mid x_i \in D_2 \sim \text{Norm}(\mu_2, \Sigma_2)$$
Figure 4.1: Some examples of the synthetic datasets. Red points and blue points are the random targets drawn from the two Gaussian distributions respectively. The ellipses show the 95% confidence region of the corresponding Gaussian.
Here Unif and Norm denote the uniform distribution and the Gaussian distribution respectively. The instances covered by description $D_1$ hence take a proportion of $C/10$ of the whole dataset. The parameter sets $\mu_1, \Sigma_1$ and $\mu_2, \Sigma_2$ are fixed for a single dataset, and randomly generated across different datasets as follows. The mean vectors $\mu_1$ and $\mu_2$ are drawn from the standard bivariate Gaussian distribution, with mean vector $[0, 0]$ and covariance equal to the identity matrix. The covariance matrices $\Sigma_1$ and $\Sigma_2$ both have the value 1 on the descending diagonal, and the values on the ascending diagonal are uniformly drawn from $[-1, 1]$ (this value equals both the covariance and the Pearson correlation, as the variance of both dimensions is 1). Finally, for each synthetic dataset we generate 1 000 random instances, and for each C value 100 synthetic datasets are generated. For instance, C = 1 implies that description $D_1$ contains one feature value and covers 100 instances, and $D_2$ contains nine feature values and covers 900 instances. Figure 4.1 shows three examples of the synthetic target values with C = 3.
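A minimal sketch of this generation procedure (our own illustrative code, assuming NumPy's Generator API):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_synthetic(C, n=1000):
    """Generate one synthetic dataset as described above."""
    X_vals = np.arange(1, 11)
    D1 = rng.choice(X_vals, size=C, replace=False)   # minority description
    mu1, mu2 = rng.standard_normal(2), rng.standard_normal(2)
    def random_cov():
        r = rng.uniform(-1, 1)   # equals covariance and Pearson correlation
        return np.array([[1.0, r], [r, 1.0]])
    S1, S2 = random_cov(), random_cov()
    x = rng.choice(X_vals, size=n)                   # uniform feature
    in_d1 = np.isin(x, D1)
    # Draw each target from the Gaussian of the component its feature belongs to.
    y = np.where(in_d1[:, None],
                 rng.multivariate_normal(mu1, S1, size=n),
                 rng.multivariate_normal(mu2, S2, size=n))
    return x, y, set(D1.tolist())
```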
(a) Averaged precision (D1)
Measure      C = 1          C = 2          C = 3          C = 4          C = 5          All
WD           0.913 ± 0.281  0.814 ± 0.389  0.712 ± 0.453  0.631 ± 0.483  0.587 ± 0.492  0.731 ± 0.443
WRL+         0.000 ± 0.000  0.016 ± 0.068  0.059 ± 0.166  0.205 ± 0.281  0.546 ± 0.291  0.166 ± 0.284
WRL-         0.687 ± 0.341  0.819 ± 0.269  0.834 ± 0.277  0.680 ± 0.328  0.446 ± 0.326  0.693 ± 0.340
CWRAcc+_1    0.239 ± 0.352  0.426 ± 0.457  0.502 ± 0.467  0.489 ± 0.473  0.535 ± 0.487  0.438 ± 0.462
CWRAcc-_1    0.359 ± 0.406  0.399 ± 0.426  0.419 ± 0.464  0.481 ± 0.476  0.469 ± 0.485  0.425 ± 0.455
CWRAcc+_2    0.271 ± 0.341  0.452 ± 0.466  0.571 ± 0.475  0.402 ± 0.464  0.482 ± 0.459  0.436 ± 0.455
CWRAcc-_2    0.252 ± 0.351  0.412 ± 0.445  0.362 ± 0.448  0.556 ± 0.459  0.521 ± 0.457  0.420 ± 0.448
EMM-Corr     0.612 ± 0.417  0.595 ± 0.484  0.524 ± 0.497  0.586 ± 0.490  0.529 ± 0.499  0.569 ± 0.480
Random       0.104 ± 0.097  0.207 ± 0.162  0.325 ± 0.165  0.415 ± 0.200  0.510 ± 0.191  0.313 ± 0.221

(b) Averaged recall (D1)
Measure      C = 1          C = 2          C = 3          C = 4          C = 5          All
WD           0.913 ± 0.281  0.804 ± 0.391  0.712 ± 0.453  0.626 ± 0.481  0.587 ± 0.492  0.728 ± 0.443
WRL+         0.000 ± 0.000  0.044 ± 0.173  0.093 ± 0.238  0.267 ± 0.344  0.571 ± 0.314  0.195 ± 0.323
WRL-         1.000 ± 0.000  0.956 ± 0.173  0.907 ± 0.238  0.733 ± 0.344  0.429 ± 0.314  0.805 ± 0.323
CWRAcc+_1    0.423 ± 0.494  0.490 ± 0.485  0.542 ± 0.485  0.505 ± 0.481  0.531 ± 0.484  0.498 ± 0.488
CWRAcc-_1    0.577 ± 0.494  0.510 ± 0.485  0.458 ± 0.485  0.495 ± 0.481  0.469 ± 0.484  0.502 ± 0.488
CWRAcc+_2    0.538 ± 0.499  0.505 ± 0.488  0.596 ± 0.482  0.420 ± 0.472  0.485 ± 0.457  0.509 ± 0.483
CWRAcc-_2    0.462 ± 0.499  0.495 ± 0.488  0.404 ± 0.482  0.580 ± 0.472  0.515 ± 0.457  0.491 ± 0.483
EMM-Corr     0.750 ± 0.433  0.598 ± 0.485  0.513 ± 0.491  0.553 ± 0.477  0.463 ± 0.449  0.575 ± 0.478
Random       0.567 ± 0.495  0.505 ± 0.347  0.538 ± 0.286  0.515 ± 0.263  0.498 ± 0.220  0.525 ± 0.337

(c) Averaged rank of F-score (D1)
Measure      C = 1  C = 2  C = 3  C = 4  C = 5  All
WD           2.279  3.123  3.712  4.131  4.486  3.546
WRL+         7.615  7.505  7.226  6.427  4.861  6.724
WRL-         3.005  3.176  3.192  4.131  5.423  3.787
CWRAcc+_1    5.870  5.309  4.966  5.078  4.764  5.197
CWRAcc-_1    5.038  5.358  5.409  5.053  5.159  5.203
CWRAcc+_2    5.471  5.186  4.649  5.495  5.106  5.181
CWRAcc-_2    5.760  5.230  5.635  4.743  4.837  5.242
EMM-Corr     3.668  4.338  4.798  4.641  5.240  4.538
Random       6.293  5.775  5.413  5.301  5.125  5.581

Table 4.2: Averaged precision, recall, and the rank of F-score over the 100 datasets for each C value, computed with the true description D1, which corresponds to the minority Gaussian component in the synthetic dataset. The Friedman test p-values are 1.53e-119, 1.32e-93, and 8.54e-118 respectively.

We apply the following quality measures on the synthetic datasets for comparison, together with a random baseline where all the feature values are selected uniformly to form the final subgroup description:

– Weighted divergence (WD)
– Weighted relative likelihood (WRL+, WRL-)
– Continuous weighted accuracy on $y^{(1)}$ (CWRAcc+_1, CWRAcc-_1)
– Continuous weighted accuracy on $y^{(2)}$ (CWRAcc+_2, CWRAcc-_2)
– Entropy weighted absolute correlation difference (EMM-Corr)
– Random baseline

For each measure, we select the top-quality subgroup and compare it with the true descriptions D1 and D2 respectively, evaluated by precision, recall and F-score. Table 4.2 shows the averaged precision, recall and rank of F-score for each C value, computed with the true description D1, which corresponds to the minority Gaussian component in the synthetic dataset. Table 4.3 shows the averaged precision and recall for each C value, computed with the true description D2, which corresponds to the majority Gaussian component in the synthetic dataset.
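Since the feature is univariate with ten values, the search can be carried out exhaustively. The sketch below (illustrative names; `quality` stands for any of the measures above applied to a coverage mask) enumerates all candidate descriptions, returns the top-quality one, and scores it against a true description:

```python
from itertools import combinations
import numpy as np

def best_subgroup(x, quality):
    """Score every non-trivial subset of X = {1,...,10} with the given
    quality measure and return the top one."""
    best, best_q = None, -np.inf
    for k in range(1, 10):
        for subset in combinations(range(1, 11), k):
            covered = np.isin(x, subset)
            q = quality(covered)
            if q > best_q:
                best, best_q = set(subset), q
    return best, best_q

def precision_recall(found, true_desc):
    """Precision and recall of a found description against a true one."""
    tp = len(found & true_desc)
    return tp / len(found), tp / len(true_desc)
```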
(a) Averaged precision (D2)
Measure      C = 1          C = 2          C = 3          C = 4          C = 5          All
WD           0.087 ± 0.281  0.186 ± 0.389  0.288 ± 0.453  0.369 ± 0.483  0.413 ± 0.492  0.269 ± 0.443
WRL+         1.000 ± 0.000  0.984 ± 0.068  0.941 ± 0.166  0.795 ± 0.281  0.454 ± 0.291  0.834 ± 0.284
WRL-         0.313 ± 0.341  0.181 ± 0.269  0.166 ± 0.277  0.320 ± 0.328  0.554 ± 0.326  0.307 ± 0.340
CWRAcc+_1    0.761 ± 0.352  0.574 ± 0.457  0.498 ± 0.467  0.511 ± 0.473  0.465 ± 0.487  0.562 ± 0.462
CWRAcc-_1    0.641 ± 0.406  0.601 ± 0.426  0.581 ± 0.464  0.519 ± 0.476  0.531 ± 0.485  0.575 ± 0.455
CWRAcc+_2    0.729 ± 0.341  0.548 ± 0.466  0.429 ± 0.475  0.598 ± 0.464  0.518 ± 0.459  0.564 ± 0.455
CWRAcc-_2    0.748 ± 0.351  0.588 ± 0.445  0.638 ± 0.448  0.444 ± 0.459  0.479 ± 0.457  0.580 ± 0.448
EMM-Corr     0.388 ± 0.417  0.405 ± 0.484  0.476 ± 0.497  0.414 ± 0.490  0.471 ± 0.499  0.431 ± 0.480
Random       0.896 ± 0.097  0.793 ± 0.162  0.675 ± 0.165  0.585 ± 0.200  0.490 ± 0.191  0.687 ± 0.221

(b) Averaged recall (D2)
Measure      C = 1          C = 2          C = 3          C = 4          C = 5          All
WD           0.087 ± 0.281  0.186 ± 0.389  0.288 ± 0.453  0.369 ± 0.483  0.413 ± 0.492  0.269 ± 0.443
WRL+         0.875 ± 0.163  0.907 ± 0.155  0.883 ± 0.212  0.744 ± 0.279  0.475 ± 0.310  0.776 ± 0.283
WRL-         0.125 ± 0.163  0.093 ± 0.155  0.117 ± 0.212  0.256 ± 0.279  0.525 ± 0.310  0.224 ± 0.283
CWRAcc+_1    0.562 ± 0.371  0.483 ± 0.431  0.456 ± 0.455  0.497 ± 0.470  0.465 ± 0.487  0.493 ± 0.446
CWRAcc-_1    0.438 ± 0.371  0.517 ± 0.431  0.544 ± 0.455  0.503 ± 0.470  0.535 ± 0.487  0.507 ± 0.446
CWRAcc+_2    0.483 ± 0.350  0.477 ± 0.447  0.393 ± 0.451  0.576 ± 0.453  0.525 ± 0.460  0.491 ± 0.438
CWRAcc-_2    0.517 ± 0.350  0.523 ± 0.447  0.607 ± 0.451  0.424 ± 0.453  0.475 ± 0.460  0.509 ± 0.438
EMM-Corr     0.140 ± 0.180  0.212 ± 0.263  0.304 ± 0.326  0.301 ± 0.371  0.387 ± 0.421  0.269 ± 0.334
Random       0.527 ± 0.166  0.509 ± 0.182  0.490 ± 0.201  0.482 ± 0.208  0.508 ± 0.232  0.503 ± 0.200

(c) Averaged rank of F-score (D2)
Measure      C = 1  C = 2  C = 3  C = 4  C = 5  All
WD           7.678  6.750  6.101  5.646  5.264  6.287
WRL+         1.981  2.250  2.558  3.578  5.082  3.092
WRL-         6.909  6.833  6.755  5.893  4.740  6.224
CWRAcc+_1    3.899  4.627  4.909  4.704  5.010  4.630
CWRAcc-_1    4.889  4.436  4.404  4.777  4.668  4.635
CWRAcc+_2    4.519  4.598  5.260  4.350  4.702  4.687
CWRAcc-_2    4.192  4.534  4.111  5.092  5.014  4.588
EMM-Corr     6.851  6.490  6.139  6.199  5.567  6.249
Random       4.082  4.480  4.764  4.762  4.952  4.608

Table 4.3: Averaged precision, recall, and the rank of F-score over the 100 datasets for each C value, computed with the true description D2, which corresponds to the majority Gaussian component in the synthetic dataset. The Friedman test p-values are 7.27e-120, 4.92e-148, and 5.45e-141 respectively.
(c) Averaged rank of F-score (D2 ) Table 4.3: Averaged precision, recall, and the rank of F-score of the 100 datasets for each C value, computed with the true description D2 , which corresponds to the majority Gaussian component in the synthetic dataset. The top value for each C is shown in bold. The Friedman test p-values are 7.27e-120, 4.92e-148, and 5.45e-141 respectively. With respect to the F-score, the results are demonstrated via the critical difference diagrams [84], shown in Figure 4.2 and 4.3 for the true description D1 and D2 respectively. The diagram shows the averaged ranking of the selected measures on the F-score obtained with each synthetic dataset. The non-significant differences are covered with the bold horizontal line, where the significance value of 0.05 is applied under the Nemenyi statistics [84]. As shown by the results, W D and W RL− have a better performance on locating the D1, which is the relatively small Gaussian component. W RL+ , on the other hand, outperforms on finding the larger Gaussian component in the mixture distribution. As mentioned previously, while the actual interestingness is different, W D and W RL− have a similar performance in this experiment, which suggests they both lead to one of the two Gaussian distributions configured in the dataset. This result can be interpreted as follows. As the dataset is generated as a mixture of two Gaussian distributions, the estimated global Gaussian distribution is dominated by the Gaussian component with larger mixture weight. This dominance indicates the observations from the smaller components receive a lower likelihood from the global model, which also gives them a large gain in likelihood while switching to a local model. In this case, subgroups that contain these observations are interesting to both the W D and W RL− measure. This also explains why W RL+ have the 64
Figure 4.2: The critical difference diagram with the averaged ranking of the selected quality measures, on the F-score computed with the true description D1, which corresponds to the minority Gaussian component in the synthetic dataset.
Figure 4.3: The critical difference diagram with the averaged ranking of the selected quality measures, on the F-score computed with the true description D2, which corresponds to the majority Gaussian component in the synthetic dataset.

In general, this experiment shows that the evaluation-driven measures perform better than parameter-driven measures at finding subgroups whose interestingness spans multiple parameters, when the dataset is indeed generated from a mixture of two distributions. Besides, while in this case the data is two-dimensional, WD and WRL can be directly applied to higher-dimensional data, whereas EMM-Corr requires a selection of two dimensions and CWRAcc can only work on a univariate target.
4.5 Experiment with the SPHERE Challenge Dataset
While in the previous experiment we used synthetic data to compare the model-based approach with existing measures, in this experiment we investigate the advantages of the model-based approach on a real-life dataset, which contains acceleration data collected from a wearable device in a smart home. Smart homes are receiving
rapidly growing interest in the areas of e-Health and the Internet of Things (IoT) [85, 20]. In this experiment we illustrate that, while subgroup discovery can be applied to discover the locations and activities of users within their smart homes, the model-based approach performs better by providing a higher information gain with the obtained subgroups, while also requiring less prior knowledge for selecting the axes of the acceleration data. As mentioned above, the data considered in this experiment contains the acceleration data collected by a user wearing an on-body device and performing a set of activities within the smart house, together with the Received Signal Strength Indicator (RSSI) values received by multiple wireless access points in the house. Our assumption is that the users perform different sets of activities at different locations within the house. For instance, while the users are in the living room, they might have a higher probability of sitting on the sofa, while the access point in that room also receives a higher RSSI from their wearable device. We therefore further assume that the on-body device produces both different acceleration and different RSSI reading patterns at such locations [86], compared with the readings from the overall house. We then propose to apply subgroup discovery to describe these locations, using the RSSI readings as features and the acceleration readings as targets. As the acceleration data contains readings from three axes, the target variable is a three-dimensional vector. As no measure in traditional subgroup discovery or exceptional model mining can simultaneously discover interestingness over all three dimensions of the target, measures like CWRAcc and EMM-Corr have to be applied to a single axis or a combination of two axes respectively. In contrast, with the model-based approach we can adopt a 3D multivariate Gaussian distribution as the model class for such scenarios. In this case, the assumption is that the 3D Gaussian distribution can loosely represent the general activity pattern under each set of observations, in terms of both the estimated means and the covariance matrices. The subgroups discovered can then be seen as sharing a different activity pattern compared to the population, which in turn points to different functional areas (e.g. locations for certain activities) in the house. The experiment is performed on the SPHERE Challenge dataset [87], as mentioned in Section 1.3. The dataset is collected in a residential house equipped with passive environmental sensors such as proximity and temperature sensors, with RGB cameras, and with users wearing on-body accelerometers. To obtain the dataset, the user is asked to perform a set of scripted activities, which are annotated in a multi-label format from video recorded with a head-mounted camera. In total, 10 scripted sequences are provided in the dataset, and we use these 10 sequences to perform the experiments. We select the RSSI data (all in decibels, hence a larger value indicates a stronger signal) from all four access points (kitchen, lounge, upstairs, study) available to construct our subgroup language. However, in some of the scripted datasets one of the four access points is turned off, which leaves the description space to be constructed on the RSSI data
of the remaining three access points. All RSSI values are discretised with eight equal-size bins. As in the previous experiment, we also apply the quality measures CWRAcc and EMM-Corr to provide a comparison with the proposed measures in model-based subgroup discovery. CWRAcc is applied to the readings of each axis individually, and EMM-Corr to each combination of two axes.

Measure       Description                                                                                           Relative size
WD            (-98.6 <= RSSI_Kitchen <= -66.4) ∧ (-105.0 <= RSSI_Lounge <= -63.0) ∧ (-105.0 <= RSSI_Upstairs <= -75.9)   0.26
WRL+          (-98.6 <= RSSI_Kitchen <= -60.0) ∧ (-105.0 <= RSSI_Lounge <= -63.0) ∧ (-105.0 <= RSSI_Upstairs <= -54.0)   0.29
WRL-          (RSSI_Kitchen = NA) ∧ (-105.0 <= RSSI_Lounge <= -69.0) ∧ (-105.0 <= RSSI_Upstairs <= -83.1)                0.49
EMM-Corr_1,2  (-98.6 <= RSSI_Kitchen <= -72.9) ∧ (-105.0 <= RSSI_Lounge <= -63.0) ∧ (-105.0 <= RSSI_Upstairs <= -75.9)   0.17
EMM-Corr_2,3  (RSSI_Kitchen <= -66.4) ∧ (RSSI_Lounge = NA) ∧ (-105.0 <= RSSI_Upstairs <= -75.9)                          0.32
EMM-Corr_1,3  (RSSI_Kitchen <= -66.4) ∧ (-105.0 <= RSSI_Lounge <= -63.0) ∧ (-97.7 <= RSSI_Upstairs <= -75.9)             0.28
CWRAcc+_1     (RSSI_Kitchen <= -60.0) ∧ (-105.0 <= RSSI_Lounge <= -87.0) ∧ (-105.0 <= RSSI_Upstairs <= -75.9)            0.42
CWRAcc-_1     (RSSI_Kitchen <= -60.0) ∧ (-87.0 <= RSSI_Lounge <= -69.0) ∧ (-105.0 <= RSSI_Upstairs <= -54.0)             0.47
CWRAcc+_2     (RSSI_Kitchen <= -60.0) ∧ (-105.0 <= RSSI_Lounge <= -99.0) ∧ (-105.0 <= RSSI_Upstairs <= -61.3)            0.42
CWRAcc-_2     (RSSI_Kitchen <= -79.3) ∧ (-99.0 <= RSSI_Lounge <= -69.0) ∧ (-105.0 <= RSSI_Upstairs <= -75.9)             0.52
CWRAcc+_3     (RSSI_Kitchen <= -72.9) ∧ (-81.0 <= RSSI_Lounge <= -69.0) ∧ (-105.0 <= RSSI_Upstairs <= -75.9)             0.41
CWRAcc-_3     (RSSI_Kitchen <= -60.0) ∧ (-105.0 <= RSSI_Lounge <= -81.0) ∧ (-105.0 <= RSSI_Upstairs <= -61.3)            0.55

Table 4.4: The descriptions of the top-quality subgroups obtained with each measure, on the scripted dataset 01, with their relative sizes. NA indicates the wearable device is out of the range of the corresponding access point.
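The discretisation step can be sketched as follows, under the assumption that the eight bins are equal-width intervals between the observed minimum and maximum, with missing readings kept as a separate NA symbol (the function name is ours, for illustration):

```python
import numpy as np

def discretise_rssi(rssi, n_bins=8):
    """Cut one RSSI feature into n_bins equal-width intervals; NaN readings
    (device out of range) are encoded as a separate NA code (-1)."""
    valid = ~np.isnan(rssi)
    edges = np.linspace(rssi[valid].min(), rssi[valid].max(), n_bins + 1)
    codes = np.full(rssi.shape, -1, dtype=int)        # -1 encodes NA
    codes[valid] = np.digitize(rssi[valid], edges[1:-1])  # bins 0..n_bins-1
    return codes, edges
```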
Figure 4.4: The difference between the room distribution within the obtained subgroups and that of the overall dataset. Each subgroup is the top-quality subgroup discovered with the corresponding measure.
Figure 4.5: The difference between the activity distribution within the obtained subgroups and that of the overall dataset. Each subgroup is the top-quality subgroup discovered with the corresponding measure.
The top-quality subgroup description found by each measure on the scripted dataset 01 is given in Table 4.4. While a description by itself conveys limited information about the obtained subgroup, it can be observed that these descriptions show a general pattern in the signal strength between the device and the access points, which is related to the distance between the user and the room in which each access point is installed. To further evaluate the obtained subgroups, we compute the class distributions of the user's room locations and performed activities using the annotated ground truth provided in the dataset, both for the obtained subgroups and for the overall dataset. Figure 4.4 and Figure 4.5 show the difference between the local class distribution within the subgroups and the overall distribution of the dataset, on the annotated locations and activities respectively. Two clear patterns can be observed in the figures. Subgroups obtained by measures like WD, WRL+, EMM-Corr_1,2, EMM-Corr_2,3, CWRAcc+_1, CWRAcc-_2 and CWRAcc+_3 show a higher probability for the user to be in the kitchen, as well as a higher probability for the user to be standing, compared to the overall dataset. The remaining subgroups show a higher probability on multiple locations like the living room and bedroom 2, and on activities like sitting and lying. While these results show that subgroup discovery can indeed help find particular locations and activities of the user, we can tell that the model-based measures (WD, WRL+, WRL-) are capable of discovering subgroups with similar interestingness to those of CWRAcc and EMM-Corr, without requiring a selection of the dimensions of the target variables.
Measure       01           02           03           04           05           06           07           08           09           10           Avg. rank
WD            10181.4 (2)  9721.5 (3)   8744.3 (5)   9521.8 (3)   9516.9 (1)   4878.5 (5)   8600.8 (2)   9661.6 (5)   7434.6 (2)   3141.0 (11)  3.9
WRL+          11222.8 (1)  11078.9 (1)  9449.3 (3)   8237.3 (6)   6368.2 (11)  5002.5 (2.5) 7001.5 (5)   10862.3 (2)  5137.8 (7.5) 7251.4 (4)   4.3
WRL-          7117.4 (9)   5243.4 (12)  5751.8 (11)  5726.7 (11)  6493.8 (10)  2868.4 (12)  3066.3 (9)   5432.8 (11)  4426.0 (10)  4622.4 (9)   10.4
EMM-Corr_1,2  5766.5 (12)  7343.3 (9)   9373.3 (4)   3451.5 (12)  4781.0 (12)  4840.7 (6)   2320.7 (11)  9857.6 (4)   5137.8 (7.5) 3971.9 (10)  8.8
EMM-Corr_2,3  9446.5 (4)   7156.4 (10)  7032.8 (10)  10016.9 (1)  7400.5 (8)   4788.0 (7)   8822.0 (1)   9178.0 (6)   7828.2 (1)   9674.6 (1)   4.9
EMM-Corr_1,3  7051.7 (10)  8080.4 (6)   9916.6 (1)   6824.1 (9)   8618.3 (6)   4894.9 (4)   1152.1 (12)  6740.2 (9)   7426.3 (3)   9628.1 (2)   6.2
CWRAcc+_1     8084.6 (7)   9512.6 (4)   5194.3 (12)  7052.9 (8)   8359.6 (7)   3720.6 (9)   6687.8 (6)   7627.5 (7)   4385.7 (11)  5167.2 (7)   7.8
CWRAcc-_1     8300.7 (6)   8077.5 (7)   9730.0 (2)   8284.3 (5)   7190.3 (9)   5373.2 (1)   2927.1 (10)  5266.0 (12)  5032.5 (9)   7211.8 (5)   6.6
CWRAcc+_2     10150.8 (3)  9202.7 (5)   7949.6 (7)   9642.1 (2)   8685.7 (3.5) 3878.6 (8)   5733.3 (8)   6922.7 (8)   5437.8 (5)   4929.2 (8)   5.8
CWRAcc-_2     8749.3 (5)   9811.0 (2)   8164.2 (6)   7630.7 (7)   8704.6 (2)   5002.5 (2.5) 7970.9 (3)   10692.4 (3)  3187.4 (12)  8940.3 (3)   4.5
CWRAcc+_3     7740.0 (8)   7782.8 (8)   7679.8 (9)   6771.5 (10)  8636.0 (5)   3334.2 (11)  7789.9 (4)   6701.6 (10)  7294.5 (4)   5690.7 (6)   7.5
CWRAcc-_3     6844.0 (11)  5561.0 (11)  7833.2 (8)   8291.5 (4)   8685.7 (3.5) 3630.3 (10)  6580.8 (7)   10905.5 (1)  5390.7 (6)   3118.1 (12)  7.3

Table 4.5: The generalised information gain (log-loss) on the annotated locations for the top-quality subgroup found with each measure on each scripted dataset; the rank of each measure on the corresponding dataset is given in parentheses, and the last column is the average rank. The top value for each dataset is the rank-1 entry. The Friedman test p-value is 6.15e-4.

Measure       01           02           03           04           05           06           07           08           09           10           Avg. rank
WD            3927.3 (3)   3215.9 (5)   2666.0 (8)   3194.3 (3)   2813.9 (4)   934.1 (3)    3375.7 (3)   3081.5 (5)   2229.3 (2)   839.7 (10)   4.6
WRL+          4439.5 (1)   3784.5 (3)   3244.2 (2)   1939.8 (7)   859.0 (10)   847.7 (6.5)  1336.7 (9)   3654.9 (2)   355.2 (10.5) 952.1 (7)    5.8
WRL-          2114.3 (8)   504.4 (12)   1408.5 (11)  1368.8 (10)  1846.0 (8)   1181.5 (1)   1764.7 (4)   601.0 (12)   825.5 (7)    1204.9 (5)   7.8
EMM-Corr_1,2  2610.1 (7)   2562.3 (6)   3031.2 (4)   1142.8 (11)  684.5 (11)   777.8 (8)    1506.7 (6)   3265.2 (4)   355.2 (10.5) 920.5 (8)    7.5
EMM-Corr_2,3  3871.7 (4)   1486.7 (10)  2974.7 (5)   3204.4 (2)   1830.5 (9)   773.2 (9)    3940.7 (1)   2995.7 (6)   1804.3 (4)   2663.0 (3)   5.3
EMM-Corr_1,3  1148.4 (12)  3254.4 (4)   3410.8 (1)   1102.3 (12)  3042.6 (2)   856.5 (5)    284.6 (12)   1901.8 (7)   2234.3 (1)   2663.5 (2)   5.8
CWRAcc+_1     3463.4 (5)   3813.8 (2)   853.3 (12)   2200.7 (6)   596.5 (12)   392.4 (12)   1461.6 (7)   1773.8 (8)   601.8 (8)    719.0 (12)   8.4
CWRAcc-_1     1873.0 (9)   2306.9 (8)   3209.8 (3)   1911.3 (8)   2190.5 (7)   904.6 (4)    672.5 (11)   1331.6 (10)  350.6 (12)   805.4 (11)   8.3
CWRAcc+_2     4147.3 (2)   2410.3 (7)   2917.9 (6)   3372.7 (1)   2365.2 (5.5) 1014.0 (2)   1624.4 (5)   1477.9 (9)   1788.4 (5)   1073.6 (6)   4.8
CWRAcc-_2     1827.7 (10)  4072.9 (1)   2280.9 (9)   2241.7 (5)   3007.0 (3)   847.7 (6.5)  3902.8 (2)   3580.3 (3)   516.5 (9)    2821.5 (1)   5.0
CWRAcc+_3     1756.6 (11)  811.6 (11)   2673.0 (7)   1460.0 (9)   3346.5 (1)   757.2 (10)   1376.6 (8)   998.5 (11)   2196.8 (3)   1716.2 (4)   7.5
CWRAcc-_3     2827.7 (6)   1652.3 (9)   2040.8 (10)  2457.0 (4)   2365.2 (5.5) 524.5 (11)   769.1 (10)   3693.0 (1)   1058.4 (6)   918.9 (9)    7.2

Table 4.6: The generalised information gain (log-loss) on the annotated activities for the top-quality subgroup found with each measure on each scripted dataset; the rank of each measure on the corresponding dataset is given in parentheses, and the last column is the average rank. The top value for each dataset is the rank-1 entry. The Friedman test p-value is 1.05e-1.
To quantify the goodness of the discovered locations and activities, we propose to again calculate the generalised information gain of the obtained subgroups on the annotated locations and activities. The generalised information gain is a suitable metric because it rewards a different distribution over the locations and activities, as well as a large size of the obtained subgroups. That is, assuming two subgroups both have a higher probability for the kitchen location, the one with the larger size is better as it identifies a longer time of the user being in the kitchen. We apply the log-loss version of the generalised information gain, since it can be interpreted as the gain in Shannon entropy, and all the probability vectors are Laplace-corrected to avoid computing log-loss on a probability of 0. The results on all ten scripted datasets are given in Table 4.5 and Table 4.6 for the annotated locations and activities respectively. While the results on the activities are insignificant according to the Friedman test, WD and WRL+ obtain the top two average ranks on the generalised information gain of the annotated locations, and WD obtains the top average rank on the annotated activities. While WRL- does not have a leading performance here, as shown in the previous figures the subgroup found by it has a higher probability on multiple locations and activities, which limits its performance on information gain.
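A minimal sketch of this evaluation metric, with the Laplace correction described above (our own illustrative names):

```python
import numpy as np

def class_distribution(labels, n_classes):
    """Laplace-corrected class distribution over locations (or activities)."""
    counts = np.bincount(labels, minlength=n_classes) + 1.0
    return counts / counts.sum()

def info_gain_ll(labels, covered, n_classes):
    """Generalised information gain (log-loss version) of a subgroup against
    annotated ground truth: total gain in log-likelihood within the subgroup
    when the global distribution is replaced by the local one. It rewards
    both a deviating distribution and a large subgroup size."""
    p_global = class_distribution(labels, n_classes)
    p_local = class_distribution(labels[covered], n_classes)
    local_labels = labels[covered]
    return np.sum(np.log(p_local[local_labels]) - np.log(p_global[local_labels]))
```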
In general, while both WRL measures can lead to better local distributions under some scenarios, as in the previous synthetic datasets, their main advantage is the higher computational efficiency gained by not training local models, which we will demonstrate in the next experiment.
4.6 Experiment with the CASAS Dataset
One of the advantages of model-based subgroup discovery is the flexibility for knowledge discovery obtained by constructing different base models. In our third experiment we demonstrate that, in a similar smart-house environment, the model-based approach can be applied with a more elaborate base model to discover further knowledge, which is not supported by existing measures in either traditional subgroup discovery or exceptional model mining. In contrast with the previous experiment, which was performed on scripted data collected over short periods, in this experiment we look at long-term data collected from people living in the smart house. We propose to use a mixture model class to capture the spatio-temporal pattern within the house. As a result, subgroups detected under this model class represent unusual time and location information compared to the population, which can be further explained to the users for different purposes (e.g. health care, security).
4.6.1 Data Set and Experiment Settings
We select the twor2009 data set from the CASAS project [88]. It contains the data recorded from a smart house with two residents within the period from 01/Feb to 03/May of the year 2009. We focus on 51 motion sensors, the layout of which is provided in Figure 4.6. The target variable constitutes a two-dimensional vector $y = [y^{(1)}, y^{(2)}]$. Each instance represents a particular time point at which one of the motion sensors is triggered: $y^{(1)} \in \{0, ..., 23\}$ represents the discrete hour of the day, and $y^{(2)} \in \{1, ..., 51\}$ indicates which sensor is active at the corresponding time point. We specify the subgroup language as all the dates within the data set. The month is treated as a nominal attribute, and the day of the month is used as a numeric attribute to generate different intervals of days. Therefore, an interesting subgroup will be a period of dates that is unusual with respect to the sensor observations, in terms of both location and time of the day. The way this location and time information is modelled is specified by the following mixture model class. The structure of our mixture model class is as follows. The hidden binary variable z represents two major unknown sets of Activities of Daily Life (ADLs) occurring within the house; each set of ADLs will generate a different set of sensor observations together with a range of timestamps. For instance, daytime activities will trigger the
Figure 4.6: The floor plan of the smart house in the twor2009 data set. Each small circle with the character M represents the location of one of the 51 PIR sensors.
sensors around the living room, and activities such as sleep will be reflected by the sensors in the bedroom during the night hours. To further specify the details of the mixture model, we define the variables as:

$$z \sim \text{Ber}(w)$$
$$y^{(1)} \mid z \sim \text{Cat}(\mu_{time}^{(z)})$$
$$y^{(2)} \mid z \sim \text{Cat}(\mu_{location}^{(z)})$$

Here Ber and Cat are respectively the Bernoulli and the Categorical distribution, with w and $\mu$ representing the parameters (here all parameters represent the mean of the distribution). That is, w is the mixture weight of the two values of z (e.g. the proportion of each set of Activities of Daily Life), $\mu_{time}^{(z)}$ models the probability of observing each hour given a particular set of Activities of Daily Life, and $\mu_{location}^{(z)}$ stands for the probability that a given sensor is triggered, also given the set of ADLs specified by z. The corresponding factor graph is provided in Figure 4.7. The nodes represent the variables, where a shaded node means the corresponding variable is observed, and each arrow implies a conditional dependency between two variables.
Figure 4.7: The factor graph of the base model. Both the sensor readings $y^{(2)}$ and the time stamps $y^{(1)}$ are treated as observations given the hidden activities z. N is the total number of training observations.
4.6.2 Results
The model is trained via the EM algorithm by iteratively updating the parameters and inferring the distribution of the hidden variable z. As EM is in general a non-deterministic algorithm and is sensitive to initial values, in our WD implementation we provide the parameters of the global model as initial values to each local model. This procedure can be seen as a special case of online learning (e.g. first trained on the population, and then trained on the subgroup). The estimated parameters on the population are depicted in Figure 4.8 (a) and Figure 4.9 (a) respectively. Figure 4.8 (a) gives the estimated hour-of-day distribution $\mu_{time}^{(z)}$ for z = 0 (left) and z = 1 (right). Figure 4.9 (a) shows the estimated $\mu_{location}^{(z)}$ for the motion sensors via a visualisation on the floor plan with the sensor layout, again with z = 0 on the left. Our mixture model has done well at capturing the spatio-temporal relationships within the house. For z = 1 the time parameter has a peak around night time, with a small peak around the early morning. As shown by the sensor layout in Figure 4.6, motion sensors 1 to 26 are on the ground floor (living room, kitchen), and sensors 27 to 51 on the second floor (bedrooms, shower and toilet); hence the location parameter in Figure 4.9 also shows higher values around the second floor, where the bedrooms are. On the other hand, z = 0 covers mainly the morning and afternoon periods in the time parameter, with a higher probability of triggering sensors on the ground floor. Given the global model shown above, the top subgroups discovered with each measure are shown in Table 4.7. While both WRL+ and WRL- find large subgroups, the subgroup detected by WD has small coverage (three days in May). Besides, the subgroups found by WRL+ and WRL- each contain an interval of days regardless of the month. The parameters estimated in the subgroups are given in Figure 4.8 and Figure 4.9 for each measure respectively.
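A minimal EM sketch for this mixture model class, with the warm-start from the global parameters made explicit (our own illustrative code; it assumes 0-based integer codes for hours and sensors, and a small smoothing constant to guard against zero probabilities):

```python
import numpy as np

def fit_mixture(hour, sensor, init, n_iter=50, eps=1e-6):
    """EM for the two-component Bernoulli/Categorical mixture defined above.
    `hour` in {0..23} and `sensor` in {0..50} are integer arrays; `init`
    holds (w, mu_time, mu_loc) so a local model can be warm-started from
    the global parameters, as described in the text."""
    w, mu_time, mu_loc = (np.array(a, dtype=float) for a in init)
    for _ in range(n_iter):
        # E-step: responsibilities r[z, i] of each component for each event.
        log_r = (np.log(w)[:, None]
                 + np.log(mu_time[:, hour])
                 + np.log(mu_loc[:, sensor]))
        log_r -= log_r.max(axis=0, keepdims=True)   # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=0, keepdims=True)
        # M-step: re-estimate the mixture weight and both categorical means.
        w = r.mean(axis=1)
        for z in (0, 1):
            mu_time[z] = np.bincount(hour, weights=r[z], minlength=24) + eps
            mu_time[z] /= mu_time[z].sum()
            mu_loc[z] = np.bincount(sensor, weights=r[z], minlength=51) + eps
            mu_loc[z] /= mu_loc[z].sum()
    return w, mu_time, mu_loc
```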
Figure 4.8: The estimated parameter $\mu_{time}$ with both z values, of the global model and of the local models within the top-quality subgroup found by each measure. (a) Global model; (b) local model within the top-quality subgroup found by WD; (c) local model within the top-quality subgroup found by WRL+; (d) local model within the top-quality subgroup found by WRL-.

Measure  Subgroup Description              Relative size  Quality      e^ω
WD       (Month = May) ∧ (01 ≤ Day ≤ 03)   0.030          2.28 · 10^4  6.41
WRL+     4 ≤ Day ≤ 19                      0.480          3.83 · 10^4  1.10
WRL-     20 ≤ Day ≤ 29                     0.348          2.56 · 10^4  1.09

Table 4.7: Top subgroups discovered with the quality measures WD, WRL+ and WRL- respectively.
While its relative size is small, the subgroup found with WD provides a quite different set of parameters compared to the population; $e^\omega$ suggests that this local model is on average about 6.41 times more likely than the global model. The z = 1 component tends to have a stronger mass distributed after midnight, with more sensors activated around the ground floor and fewer sensors activated around the bedrooms. z = 0 also shows a higher distribution around noon time, with the location parameter slightly changed from the population. Therefore, this subgroup reflects an abnormal behaviour pattern within the house, namely more activities being performed around midnight.
Figure 4.9: The estimated parameter $\mu_{location}$ with both z values, of the global model and of the local models within the top-quality subgroup found by each measure. (a) Global model; (b) local model within the top-quality subgroup found by WD; (c) local model within the top-quality subgroup found by WRL+; (d) local model within the top-quality subgroup found by WRL-.
With respect to the subgroups found by WRL+ and WRL-, while in general the parameters are similar to those of the population, it can be seen that there is a clear difference with
the motion sensor 31 (in bedroom 1). The subgroup found with WRL- has a lower probability of triggering this sensor, and the one found with WRL+ a higher probability, given z = 0. This result can be explained by the mechanism of the quality measures WRL- and WRL+, whose resulting subgroups respectively contain a lower and a higher proportion of high-likelihood observations. Here one of these observations turns out to be the trigger event of sensor 31: as it is in the bedroom, it is one of the most commonly triggered sensors around evening and night. As a result, both subgroups indicate a potential sleep deviation from the population within the covered periods. While in the previous experiments the model class was a Gaussian distribution, which requires little time for parameter estimation, in this experiment we use a mixture model class that requires the EM algorithm. This results in a much longer running time for the subgroup discovery task with the WD measure. Figure 4.10 shows the per-subgroup computation time against the relative size of the corresponding subgroup. The hardware environment of the experiment was a mobile workstation with common specifications as of the year 2017 (e.g. the processor is a quad-core i7-7700HQ from Intel), and the software environment was Python 3.5 with the latest NumPy and SciPy installed as the computational backend. As the figure suggests, with our mixture model class WRL takes about one or two seconds to evaluate each subgroup regardless of its size, while the computation time of WD increases approximately linearly with the size of the subgroup and can exceed 20 seconds for large subgroups. While the computation time of the WD measure can vary further with different stopping criteria for the EM algorithm, WRL shows a significant advantage in terms of computational efficiency here. To conclude, in this experiment we illustrated that the model-based approach can be adopted together with a mixture model to detect abnormal events or life patterns within a smart house. Since we don't have access to the ground truth of the activities and events within the house, it is not possible to further justify the obtained subgroups. However, this is precisely the aim of subgroup discovery: to find subgroups that can be of interest to domain experts (here, the households). The difference between the computation times of the WD and WRL measures is also experimentally illustrated. As this experiment only involves long-term data from one house, the base model is only applied to find periods as subgroups. One potential future research direction is to apply a hierarchical model (e.g. with Bayesian hyper-priors as in [89, 90]) to a set of smart houses, which would enable finding subgroups on a larger scale.
4.7 Chapter Summary
In this chapter we generalised the measures with proper scoring rules to work with any probabilistic model class. The main motivation is to provide a generic, evaluation-driven
Figure 4.10: The computation time for evaluating a single subgroup with the measures WD and WRL.
approach to find subgroups based on a variety of model classes. The adoption of probabilistic models also increases the capability of modelling the uncertainty within the provided data [91, 92]. As a result, the major difference between model-based subgroup discovery and exceptional model mining is the way statistical unusualness is defined. Exceptional model mining defines it explicitly according to a certain parameter of the model class; it therefore requires a certain amount of prior knowledge about the data set and the model class, and it takes effort to design the quality measure and investigate its interpretation. On the other hand, model-based subgroup discovery defines the statistical unusualness through the evaluation of the base model. With the measures derived from proper scoring rules, the model-based approach only requires a suitable probabilistic model for the target attributes, which can be easier to specify for machine learning and data mining researchers. The quality of subgroups is also evaluated according to the overall model instead of a certain parameter, which can cover a larger range of statistical unusualness for some model classes. Additionally, semi-evaluation-driven measures provide the option not to train the local models, which makes them more time-efficient with more elaborate model classes (e.g. mixture models, as in our third experiment). To conclude, when the user has a particular interest in a certain statistical parameter and knows how to quantify its unusualness, parameter-driven measures like the ones in exceptional model mining can be used directly. When the user is interested in the general unusualness of some model class, or the model takes a long time to train, the model-based approach can be the better choice.
A remaining question is which of the three measures (WD, WRL+ and WRL-) to use in the context of Model-Based Subgroup Discovery (MBSD). This can be summarised as follows: if the task of interest is related to obtaining a better local statistical model within the subgroup, WD should be used; if the local model is not the primary interest and the training of the base model class is expensive, WRL+ and WRL- can be applied to find subgroups more efficiently.
Chapter 5 HMM-Based Subgroup Discovery for Sequential Datasets

In the previous chapter, we introduced the model-based approach, which is designed as a generic method to find subgroups with any probabilistic model class. In the experiment on the SPHERE challenge data, we adopted a multivariate Gaussian distribution as the model class to capture the statistical information in the accelerometer data. This choice means we assume that each reading $y_i$ from the 3-axis accelerometer is i.i.d. according to the multivariate Gaussian distribution:

$$y_i \sim f_\theta^{MVG} \quad (5.1)$$

Now further assume that we have a set of T consecutive readings from an accelerometer, denoted as $\{y_1, ..., y_T\}$. The i.i.d. assumption above gives:

$$P(y_1, ..., y_T \mid f_\theta^{MVG}) = \prod_{i=1}^{T} f_\theta^{MVG}(y_i) \quad (5.2)$$
While the i.i.d. assumption is computationally convenient, it also brings a major disadvantage for modelling sequential data. For instance, in the accelerometer case, while the person wearing the device is sitting still on a sofa, one would expect the accelerometer to keep sending the same readings; when the person is walking, the accelerometer will instead produce readings that fluctuate periodically. It is more reasonable to assume these readings are correlated. In the case of sitting, the probability of a tiny acceleration should be much higher given that neighbouring readings are also small in value; in the case of walking, given the neighbouring readings, the probability density should be concentrated around the same value as in the previous periods. It is therefore desirable to make our base model capable of capturing sequential dependencies in datasets such as the accelerometer data. Although at first glance this can be achieved simply by adopting a sequential base model, there are still a few
Date            Jan 01   Jan 02   Jan 03   Jan 04   Jan 05   ...   Dec 31
Food calories   high     middle   middle   low      high     ...   low
Hours of sleep  3        6        8        7        4        ...   5

Table 5.1: A toy data set that contains a person's food / sleep record for the duration of one year.
questions to be answered. As modelling sequential data generally requires more parameters to be estimated, the fundamental question is whether we need the sequential models in the first place, and to what extent we can obtain extra knowledge compared to existing approaches. Also, most sequential datasets need to be pre-processed to be amenable to modelling. In this thesis, we assume all the sequences can be expressed as a feature (target) vector obtained by concatenating T feature (target) vectors, which can be satisfied for most datasets by applying methods such as a sliding window. In this chapter, we will adopt the HMM as a base model to capture the sequential statistical information and lead to interesting subgroups on such information. After introducing the concept of HMM-based subgroup discovery, we perform two sets of experiments. The first set is performed on synthetic datasets, where we compare the HMM-based measures to existing measures in terms of their ability to locate the true subgroups. The second set focuses on the application side: with the SPHERE challenge dataset, we further demonstrate the applied aspects of the HMM-based approach and the corresponding interestingness.
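A minimal sketch of this sliding-window pre-processing (our own illustrative code, assuming the raw sequence is an array with one M-dimensional vector per time point):

```python
import numpy as np

def sliding_windows(seq, T=3, step=1):
    """Concatenate T consecutive feature (or target) vectors into one
    MT-dimensional vector per window; `seq` has shape (length, M)."""
    seq = np.asarray(seq)
    starts = range(0, len(seq) - T + 1, step)
    return np.stack([np.concatenate(seq[s:s + T]) for s in starts])
```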
5.1 Model-Based Subgroup Discovery with Sequential Data
In this section, we first define the task of discovering subgroups with sequential data, covering how we describe the sequential subgroups and how we model the sequential information within the data. Given a sequential dataset $\{X_i, Y_i\}_{i=1}^{N}$, we start with the sequential features. As each $x_t^{[i]}$ is an M-dimensional vector, $X_i = (x_1^{[i]}, ..., x_T^{[i]})$ can be formalised as a vector with MT dimensions, resulting in a feature space $\mathcal{X}^T$. Hence, assuming the same description language is used, the obtained description space will be $\mathcal{D}^T$ instead of the original $\mathcal{D}$. Therefore, the number of candidate subgroups increases exponentially with the window length T. However, for certain datasets this $\mathcal{D}^T$ can be redundant. We now use two examples to demonstrate scenarios where this increased candidate space can or cannot be reduced. For the first scenario, where the sequential description can be simplified, consider the toy dataset shown in Table 5.1. Assume for now that we only use the date as a feature, with a sliding window of 3 days and a step size of 1 day (i.e. the next window moves 1 day further than the current window). The feature vector of the first sequence is X1 = [Jan 01, Jan 02, Jan 03], and the consecutive sequences have feature values like
X2 = [Jan 02, Jan 03, Jan 04], X3 = [Jan 03, Jan 04, Jan 05], etc. While the feature vector of each sequence has a length of 3, one can observe that they are always three consecutive dates. Therefore, the whole sequence can be described via only one of the three features in the vector. For instance, denote the three features in the vector as date 1, date 2 and date 3 respectively, and assume the operators ≤ and ≥ represent being before or after a date; the logical expression (date 1 ≤ Feb 5) explicitly implies (date 2 ≤ Feb 6) and (date 3 ≤ Feb 7). As shown by this example, if the features in the sequential data are tied by some known deterministic function g, that is:

$$x_{t'}^{[i]} = g(x_t^{[i]}) \quad (5.3)$$

the description of the resulting sequence can simply be built on the original feature space $\mathcal{X}$ instead of $\mathcal{X}^T$, without increasing the number of candidate subgroups to be searched and evaluated. Now we look at the other scenario, where the sequential feature vector is not redundant. Consider the same toy data given in Table 5.1, with food calories as the original feature and the window length and step size as before. This gives the first three sequential feature vectors as X1 = [high, middle, middle], X2 = [middle, middle, low], X3 = [middle, low, high]. One can observe that, compared to the previous case, the whole sequential vector can no longer be specified by describing one of the features in the vector: both the first and the second feature vector take the value middle at their second position. If we assume these sequential vectors can reach all of the possible states, nine combinations of the three possible values will appear in the sequential vectors. This means the feature value at a given dimension cannot be inferred from its neighbouring dimensions. It is then necessary to keep all the dimensions within the description, as they might contain further information for the obtained subgroups. For instance, the description [high, middle, low] will cover all the sequential instances where the person kept reducing his / her food intake, which might reveal interesting patterns in other attributes, like the hours of sleep in the table. In practice, however, it is common that the feature vectors lie between the two scenarios above: the values in the same feature vector are close or related to each other, but cannot be inferred from each other deterministically. In such situations, to maintain a reasonable computational time, additional pre-processing can be applied, such as using the average or median value of the feature vector to construct the description language. For the target variable, we also have a set of sequences of vectors, with each sequence denoted as $Y_i = [y_1^{[i]}, ..., y_T^{[i]}]$. As illustrated at the beginning of this chapter, the major motivation of sequential subgroup discovery is to capture the sequential information among different instances, which cannot be modelled with the previous approaches under the i.i.d. setting. As a result, we assume that all the T instances are jointly drawn from a sequential
model class:

$$Y_i \sim f(Y_i; \theta), \quad \text{i.e.} \quad [y_1^{[i]}, \dots, y_T^{[i]}] \sim f(y_1^{[i]}, \dots, y_T^{[i]}; \theta) \qquad (5.4)$$
In the field of random processes, this assumption is equivalent to assuming that the process is stationary, which states that the joint distribution of consecutive random variables does not change when shifted along the sequence.
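As an illustration of the windowing and the pre-processing options discussed above (keeping the full window in $\mathcal{D}^T$, or summarising it back into the original space $\mathcal{D}$), here is a minimal sketch; all names and values are hypothetical:

```python
import numpy as np

def sliding_windows(x, length, step):
    """Cut a 1-D sequence into windows of the given length and step size."""
    starts = range(0, len(x) - length + 1, step)
    return np.stack([x[s:s + length] for s in starts])

# Toy feature sequence, e.g. daily calorie levels coded as 1=low, 2=middle, 3=high.
x = np.array([3, 2, 2, 1, 3, 2, 1, 1, 2, 3])
windows = sliding_windows(x, length=3, step=1)            # shape (8, 3)

# Option 1: keep all T dimensions, giving the enlarged description space D^T.
full_features = windows

# Option 2: summarise each window (rounded mean), staying in the original D.
summary_features = np.round(windows.mean(axis=1)).astype(int)
```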
5.2 HMM-Based Subgroup Discovery
One of the major reasons that the HMM has been so widely adopted for modelling sequences is the interpretability of its parameters. The transition probabilities directly show whether a state tends to stay the same or change into another state. The emission probabilities can also be easily interpreted under certain model classes. For instance, a discrete emission model can be understood in the same way as the transition probabilities, where the likely and unlikely observations are specified by the corresponding parameters. A Gaussian emission model can simply be visualised via its mean and standard deviation in the 1-D case. This makes the HMM suitable as a base model within the context of subgroup discovery. While the model-based approach is designed as a generic approach that can find subgroups according to a set of different parameters, the interpretability of parameters is still important, as eventually it helps us understand why a subgroup is interesting. Besides the interpretability of parameters, there are a few other advantages to finding subgroups with HMMs. As illustrated in Chapter 2, with an HMM the model-based likelihood of an entire sequence can be efficiently calculated via the forward algorithm. Since the hidden states are summed over to give the marginal likelihood at each time point within the forward algorithm, the model-based likelihood of an HMM can also be obtained on data points where there is no ground truth for the hidden states, once the parameters have been trained. For similar reasons, the generative nature of the HMM enables the possibility of unsupervised learning, which is non-trivial for datasets that don't come with well-defined labels. Furthermore, as the HMM has been well studied under different modelling contexts, there exist many extended HMMs. This ensures our approach remains open to improved modelling for finding subgroups under different datasets and contexts. As in the general case of the model-based approach described in the previous chapter, there are also two settings for HMM-based subgroup discovery: discovery with locally-trained models and discovery without locally-trained models. For the former, we denote the local model and the global model as $f_\rho^{HMM}$ and $f_\pi^{HMM}$ respectively. If we want to find subgroups where the local HMM summarises the data better than the global one, the
corresponding quality can then be calculated as:

$$\Phi_{WD}(d) = \sum_{i:\, I_d(X_i)=1} \left[ \log f_\rho^{HMM}(Y_i) - \log f_\pi^{HMM}(Y_i) \right] \qquad (5.5)$$
Notice that, instead of calculating the model-based likelihood on $y_t^{[i]}$ at the instance level, here the measure WD is calculated at the sequence level with $Y_i$. This is reasonable as we are now modelling the joint likelihood of a sequence of instances. The equation also implies that the size of a subgroup is calculated as the number of covered sequences. As a result, to obtain the statistical unusualness $\Omega_{KL}$ from $\Phi_{WD}$, it should also be done by dividing $\Phi_{WD}$ by the number of covered sequences:

$$\Omega_{KL}(d) = \frac{\Phi_{WD}(d)}{\sum_{i=1}^{N} I_d(X_i)} \qquad (5.6)$$
To find subgroups without training local models, the corresponding WRL measure is calculated as:

$$\Phi_{WRL^+}(d) = \sum_{i:\, I_d(X_i)=1} \log f_\pi^{HMM}(Y_i) - \frac{\sum_{i=1}^{N} I_d(X_i)}{N} \cdot \sum_{i=1}^{N} \log f_\pi^{HMM}(Y_i) \qquad (5.7)$$
As in the previous case, the $\Phi_{WRL^+}$ measure will lead to subgroups that contain a higher proportion of observations with high likelihood values according to the global HMM. We can again use $\Phi_{WRL^-}(d) = -\Phi_{WRL^+}(d)$ to locate subgroups that contain a higher proportion of low-likelihood observations.
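For concreteness, Equations 5.5-5.7 can be computed directly from per-sequence log-likelihoods, as in the following sketch; the log-likelihood arrays and the coverage mask are hypothetical inputs (e.g. obtained by scoring each $Y_i$ under the fitted local and global HMMs):

```python
import numpy as np

def phi_wd(loglik_local, loglik_global, covered):
    """Eq. 5.5: summed log-likelihood ratio over the covered sequences."""
    c = covered.astype(bool)
    return np.sum(loglik_local[c] - loglik_global[c])

def omega_kl(loglik_local, loglik_global, covered):
    """Eq. 5.6: Phi_WD normalised by the number of covered sequences."""
    return phi_wd(loglik_local, loglik_global, covered) / covered.sum()

def phi_wrl_plus(loglik_global, covered):
    """Eq. 5.7: requires only the global HMM, no locally trained model."""
    c = covered.astype(bool)
    n = len(loglik_global)
    return loglik_global[c].sum() - covered.sum() / n * loglik_global.sum()

# Example with hypothetical per-sequence scores, e.g. hmm.score(Y_i) per Y_i:
loglik_local = np.array([-10.2, -9.8, -30.1, -29.5])
loglik_global = np.array([-12.0, -11.5, -28.0, -27.9])
covered = np.array([1, 1, 0, 0])
print(phi_wd(loglik_local, loglik_global, covered))   # positive: local fits better
```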
Quality measures for HMM-based subgroup discovery are similar to those for the non-sequential case. This is expected, as model-based subgroup discovery was designed as a generic approach in the first place, which is mainly achieved by the probabilistic model assumption and the likelihood-driven quality measures. For an HMM, the statistical information it captures is represented through three sets of parameters: the initial probabilities, the transition probabilities, and the emission probabilities. If the aim is to discover subgroups that are only interesting according to a single set of parameters, existing approaches can be directly adopted. For instance, to find subgroups with different initial probabilities, one can simply treat them as a special form of class distribution and use measures from traditional subgroup discovery to evaluate the subgroups. Exceptional model mining can also be applied to find subgroups that are interesting on either the transition probabilities or the emission probabilities; the transition-driven quality measure proposed in [17] is one example. The advantage of HMM-based subgroup discovery is then its ability to find subgroups that are potentially interesting according to differences on any set or combination of the parameters. We will further demonstrate this point through the synthetic experiment described in the next section.
5.3 Experiment on Synthetic Datasets
The primary objective of this experiment is to show the advantages of HMM-based subgroup discovery over existing approaches and other non-sequential variations of model-based subgroup discovery. We again adopt synthetic datasets for this purpose, in a setup similar to the one in Chapter 4. Each dataset is generated from a mixture of two HMMs. Given a feature space $X = \{1, 2, \dots, 10\}$ and an integer $C \in \{1, 2, 3, 4, 5\}$, we uniformly choose $C$ values from $X$ to form the true description $D_1$, and use the remaining values in $X$ to generate another true description $D_2$. A sequential instance $(X_i = [x_i],\; Y_i = [y_1^{[i]}, \dots, y_T^{[i]}],\; Z_i = [z_1^{[i]}, \dots, z_T^{[i]}])$ in the dataset is then generated as:

$$x_i \sim \mathrm{Unif}_X$$
$$([y_1^{[i]}, \dots, y_T^{[i]}], [z_1^{[i]}, \dots, z_T^{[i]}]) \mid x_i \in D_1 \sim \mathrm{HMM}(\mu_{\text{initial-1}}, \mu_{\text{transition-1}}, \theta_{\text{emission-1}})$$
$$([y_1^{[i]}, \dots, y_T^{[i]}], [z_1^{[i]}, \dots, z_T^{[i]}]) \mid x_i \in D_2 \sim \mathrm{HMM}(\mu_{\text{initial-2}}, \mu_{\text{transition-2}}, \theta_{\text{emission-2}})$$
Here Unif denotes the uniform distribution, and $X_i$, $Y_i$, $Z_i$ denote the feature, observations, and hidden states of the sequence respectively. We use a non-sequential feature here (i.e. $X_i$ contains a single value $x_i$) to simplify the description space. We also include the hidden states $Z$ in the dataset, as later we will compare against a measure that looks at the transitional behaviour of the sequences. To provide a fair comparison, the HMMs in this experiment are also trained with $Z$ (i.e. supervised learning). Again, the parameter sets $\mu_{initial}$, $\mu_{transition}$, $\theta_{emission}$ are fixed within a single dataset, and randomly generated across different datasets. We set the HMMs to have two states ($z_t^{[i]} \in \{0, 1\}$) and the emission model class to be a 1-D Gaussian distribution. The parameter sets are then generated as follows. For the first value of the initial probability and the first value of the transition probability of state $j \in \{1, 2\}$, we draw them uniformly from the open interval (0, 1), that is:

$$\mu_{initial}^{(1)} \sim \mathrm{Unif}(0, 1)$$
$$\mu_{transition}^{[j](1)} \sim \mathrm{Unif}(0, 1)$$

The second value of each parameter is then set as one minus the first value (i.e. $\mu^{(2)} = 1 - \mu^{(1)}$). For the Gaussian emission models, the means and variances for the two hidden states are set as:

$$\theta_\mu^{[1]} \sim \mathrm{Norm}(\mu = 0, \sigma = 1)$$
$$\theta_\mu^{[2]} \sim \mathrm{Norm}(\mu = 0, \sigma = 1)$$
$$\theta_{\sigma^2}^{[1]} = 1$$
$$\theta_{\sigma^2}^{[2]} = 1$$
Here Norm denotes the Gaussian distribution, which is used to randomly draw the means of the Gaussian emission models. Given that an HMM has three sets of parameters, the HMMs generating $D_1$ and $D_2$ can differ on some subsets of the parameters, as well as on all parameter sets. Therefore, in our experiment we consider four different settings in which the two models differ on either a single set of parameters or on all of them, to check whether a measure is capable of locating different kinds of interestingness. For simplicity, we refer to these four settings as I, T, E and ITE. The first three settings indicate that the two HMMs in the dataset differ only on their initial parameters, transition parameters, or emission parameters respectively, and the last setting ITE means all three sets of parameters differ between the two models. Figure 5.1 provides some random sequences generated under the different settings. In our experiment, these four settings are selected uniformly for each synthetic dataset. Finally, for each synthetic dataset we generate 100 sequences, each of length 10. For each C value, a number of 1000 datasets are generated. As in the previous experiment, we select the top-quality subgroup to compare with the true descriptions $D_1$ and $D_2$ respectively, evaluated by precision, recall and F-scores. The measures we compare against our HMM-based approach include measures from both traditional subgroup discovery and exceptional model mining, as well as WD and WRL with a Gaussian model class. Namely, the nine measures used in this experiment are:

– Continuous weighted accuracy (CWRAcc+, CWRAcc−)
– Weighted total variation distance (EMM-TV)
– Gaussian-based weighted divergence (Norm-WD)
– Gaussian-based relative likelihood (Norm-WRL+, Norm-WRL−)
– HMM-based weighted divergence (HMM-WD)
– HMM-based relative likelihood (HMM-WRL+, HMM-WRL−)

Table 5.2 shows the averaged precision, recall and rank of F-scores for each C value, computed with the true description $D_1$, which corresponds to the HMM with the relatively smaller mixture weight in the synthetic dataset. Table 5.3 shows the same quantities for the true description $D_2$, which corresponds to the HMM with the relatively larger mixture weight. The critical difference diagrams for the F-scores on both $D_1$ and $D_2$ are given in Figure 5.2 and Figure 5.3.
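To make the generation process concrete, the following is a minimal sketch of how such a synthetic dataset could be sampled; all names are illustrative, and for brevity it only shows the ITE setting, where the two parameter sets are drawn independently (for the I, T and E settings, the shared parameter sets would be copied between the two models):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_hmm_params():
    """Two-state HMM with 1-D Gaussian emissions, drawn as described above."""
    p0 = rng.uniform()
    initial = np.array([p0, 1 - p0])
    r = rng.uniform(size=2)
    transition = np.array([[r[0], 1 - r[0]], [r[1], 1 - r[1]]])
    means = rng.normal(0.0, 1.0, size=2)   # emission means; variances fixed to 1
    return initial, transition, means

def sample_sequence(params, T=10):
    initial, transition, means = params
    z = np.empty(T, dtype=int)
    y = np.empty(T)
    z[0] = rng.choice(2, p=initial)
    y[0] = rng.normal(means[z[0]], 1.0)
    for t in range(1, T):
        z[t] = rng.choice(2, p=transition[z[t - 1]])
        y[t] = rng.normal(means[z[t]], 1.0)
    return y, z

# Mixture of two HMMs keyed on the non-sequential feature x_i (ITE setting).
X_space = np.arange(1, 11)
D1 = set(rng.choice(X_space, size=3, replace=False))   # C = 3
hmm1, hmm2 = random_hmm_params(), random_hmm_params()

dataset = [(x, *sample_sequence(hmm1 if x in D1 else hmm2))
           for x in rng.choice(X_space, size=100)]
```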
[Figure 5.1 appears here; panels: (a) Setting I, (b) Setting T, (c) Setting E, (d) Setting ITE.]
Figure 5.1: Examples of random sequences generated from two HMMs. Each pair of plots on the left shows the observations and hidden states drawn from one HMM, and each pair on the right shows sequences drawn from the other HMM. The settings are given in the caption of each group of plots: the character I indicates the two HMMs have different initial probabilities, T different transition probabilities, E different emission probabilities, and ITE means all three sets of parameters differ between the two models.

As the results show, HMM-WD gives the top performance at finding the true description $D_1$. While performing closely to Norm-WRL+, HMM-WRL+ gives the best results at finding the true description $D_2$. This agrees with our previous conclusion that, given that the overall dataset is a mixture of two models, WD and WRL− tend to find the distribution with the relatively smaller mixture weight, and WRL+ tends to find the one with the relatively larger mixture weight. Although all the improvements of the HMM-based measures over the Gaussian-based measures are seen as non-significant by the critical difference diagram, HMM-WD still shows a larger gap to Norm-WD compared to the WRL cases. One potential reason for the close performance between the HMM-based measures and the Gaussian-based measures is that both deviating initial probabilities and deviating transition probabilities can affect the marginal distribution of the observations. In this case, such differences can also be partly captured by the Gaussian-based measures, even though they do not directly model the sequential dependency among the observations. To summarise, in this experiment we have demonstrated that, by incorporating the sequential information with hidden Markov models, the model-based measures are able to show improvements over the Gaussian-based versions, as well as outperform parameter-driven measures from both traditional subgroup discovery and exceptional model mining. However, the performance of the WRL measures based on HMMs is close to that of the ones based on Gaussian distributions, which indicates that the Gaussian versions can be used for simplicity when there is no strong belief about the sequential information within the dataset. HMM-WD is still a preferable choice for sequential datasets, considering
| C | HMM-WD | HMM-WRL+ | HMM-WRL− | Norm-WD | Norm-WRL+ | Norm-WRL− | CWRAcc+ | CWRAcc− | EMM-TV | Random |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.620 ± 0.448 | 0.060 ± 0.094 | 0.249 ± 0.276 | 0.561 ± 0.465 | 0.070 ± 0.101 | 0.235 ± 0.271 | 0.170 ± 0.215 | 0.175 ± 0.269 | 0.127 ± 0.216 | 0.099 ± 0.116 |
| 2 | 0.673 ± 0.444 | 0.113 ± 0.144 | 0.372 ± 0.263 | 0.633 ± 0.434 | 0.138 ± 0.163 | 0.336 ± 0.276 | 0.307 ± 0.343 | 0.302 ± 0.334 | 0.288 ± 0.333 | 0.174 ± 0.139 |
| 3 | 0.756 ± 0.382 | 0.227 ± 0.197 | 0.435 ± 0.297 | 0.738 ± 0.378 | 0.235 ± 0.208 | 0.443 ± 0.303 | 0.368 ± 0.368 | 0.389 ± 0.369 | 0.392 ± 0.366 | 0.307 ± 0.192 |
| 4 | 0.663 ± 0.427 | 0.368 ± 0.263 | 0.468 ± 0.297 | 0.628 ± 0.422 | 0.369 ± 0.284 | 0.467 ± 0.310 | 0.458 ± 0.388 | 0.424 ± 0.381 | 0.420 ± 0.355 | 0.413 ± 0.192 |
| 5 | 0.552 ± 0.470 | 0.511 ± 0.266 | 0.494 ± 0.278 | 0.541 ± 0.450 | 0.503 ± 0.262 | 0.504 ± 0.280 | 0.505 ± 0.389 | 0.498 ± 0.382 | 0.500 ± 0.381 | 0.498 ± 0.198 |
| All | 0.653 ± 0.440 | 0.256 ± 0.263 | 0.404 ± 0.296 | 0.620 ± 0.436 | 0.263 ± 0.265 | 0.397 ± 0.305 | 0.362 ± 0.366 | 0.357 ± 0.367 | 0.345 ± 0.359 | 0.298 ± 0.225 |

(a) Averaged precision (D1)

| C | HMM-WD | HMM-WRL+ | HMM-WRL− | Norm-WD | Norm-WRL+ | Norm-WRL− | CWRAcc+ | CWRAcc− | EMM-TV | Random |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.710 ± 0.454 | 0.300 ± 0.458 | 0.700 ± 0.458 | 0.640 ± 0.480 | 0.340 ± 0.474 | 0.660 ± 0.474 | 0.530 ± 0.499 | 0.470 ± 0.499 | 0.410 ± 0.492 | 0.490 ± 0.500 |
| 2 | 0.655 ± 0.434 | 0.300 ± 0.367 | 0.700 ± 0.367 | 0.645 ± 0.432 | 0.360 ± 0.407 | 0.640 ± 0.407 | 0.500 ± 0.447 | 0.500 ± 0.447 | 0.480 ± 0.441 | 0.465 ± 0.362 |
| 3 | 0.697 ± 0.389 | 0.383 ± 0.335 | 0.617 ± 0.335 | 0.660 ± 0.386 | 0.387 ± 0.332 | 0.613 ± 0.332 | 0.480 ± 0.425 | 0.520 ± 0.425 | 0.517 ± 0.407 | 0.503 ± 0.289 |
| 4 | 0.578 ± 0.419 | 0.465 ± 0.316 | 0.535 ± 0.316 | 0.505 ± 0.394 | 0.463 ± 0.338 | 0.537 ± 0.338 | 0.520 ± 0.410 | 0.480 ± 0.410 | 0.477 ± 0.368 | 0.505 ± 0.218 |
| 5 | 0.394 ± 0.394 | 0.520 ± 0.271 | 0.480 ± 0.271 | 0.352 ± 0.369 | 0.520 ± 0.271 | 0.480 ± 0.271 | 0.496 ± 0.386 | 0.504 ± 0.386 | 0.502 ± 0.385 | 0.510 ± 0.205 |
| All | 0.607 ± 0.435 | 0.394 ± 0.366 | 0.606 ± 0.366 | 0.560 ± 0.431 | 0.414 ± 0.377 | 0.586 ± 0.377 | 0.505 ± 0.436 | 0.495 ± 0.436 | 0.477 ± 0.422 | 0.495 ± 0.333 |

(b) Averaged recall (D1)

| C | HMM-WD | HMM-WRL+ | HMM-WRL− | Norm-WD | Norm-WRL+ | Norm-WRL− | CWRAcc+ | CWRAcc− | EMM-TV | Random |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3.365 | 6.825 | 4.980 | 3.870 | 6.640 | 5.215 | 5.545 | 5.850 | 6.380 | 6.330 |
| 2 | 3.785 | 7.175 | 4.780 | 3.800 | 6.845 | 5.185 | 5.695 | 5.535 | 5.705 | 6.495 |
| 3 | 3.650 | 6.550 | 5.505 | 3.855 | 6.495 | 5.500 | 5.800 | 5.660 | 5.765 | 6.220 |
| 4 | 4.735 | 5.965 | 5.420 | 5.145 | 5.910 | 5.515 | 5.380 | 5.680 | 5.490 | 5.760 |
| 5 | 5.840 | 5.170 | 5.415 | 6.390 | 5.340 | 5.345 | 5.285 | 5.290 | 5.370 | 5.555 |
| All | 4.275 | 6.337 | 5.220 | 4.612 | 6.246 | 5.352 | 5.541 | 5.603 | 5.742 | 6.072 |

(c) Averaged rank of F-score (D1)

Table 5.2: Averaged precision, recall, and the rank of F-score of the 100 datasets for each C value, computed with the true description $D_1$, which corresponds to the HMM with the relatively smaller mixture weight in the synthetic dataset. The top value for each C is shown in bold. The Friedman test p-values are 2.97e-81, 3.92e-13, and 1.22e-46 respectively.
both its performance and the knowledge obtained from training the multiple local parameters.
5.4 Experiments on the SPHERE Challenge Dataset
In the previous experiment, we demonstrated the difference between HMM-based subgroup discovery and existing approaches with respect to their ability to locate the true subgroup in synthetic datasets. In this experiment, we instead look at the application side. The motivation is to illustrate what kinds of subgroups can be obtained when HMM-based subgroup discovery is applied to a set of sequences from real life, and how to interpret them. This experiment is again performed on the SPHERE challenge dataset, which was introduced in the previous chapter. The experiment is organised into two settings, in which we look at different tasks of discovering subgroups among sequences. We name these two settings intra-sequence discovery and inter-sequence discovery. Although from the modelling perspective these two settings are basically the same, the difference comes with the interpretation of the obtained subgroups. We will illustrate these points through the following experiments and discussions.
| C | HMM-WD | HMM-WRL+ | HMM-WRL− | Norm-WD | Norm-WRL+ | Norm-WRL− | CWRAcc+ | CWRAcc− | EMM-TV | Random |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.380 ± 0.448 | 0.940 ± 0.094 | 0.751 ± 0.276 | 0.439 ± 0.465 | 0.930 ± 0.101 | 0.765 ± 0.271 | 0.830 ± 0.215 | 0.825 ± 0.269 | 0.873 ± 0.216 | 0.901 ± 0.116 |
| 2 | 0.327 ± 0.444 | 0.887 ± 0.144 | 0.628 ± 0.263 | 0.367 ± 0.434 | 0.862 ± 0.163 | 0.664 ± 0.276 | 0.693 ± 0.343 | 0.698 ± 0.334 | 0.712 ± 0.333 | 0.826 ± 0.139 |
| 3 | 0.244 ± 0.382 | 0.773 ± 0.197 | 0.565 ± 0.297 | 0.262 ± 0.378 | 0.765 ± 0.208 | 0.557 ± 0.303 | 0.632 ± 0.368 | 0.611 ± 0.369 | 0.608 ± 0.366 | 0.693 ± 0.192 |
| 4 | 0.337 ± 0.427 | 0.632 ± 0.263 | 0.532 ± 0.297 | 0.372 ± 0.422 | 0.631 ± 0.284 | 0.533 ± 0.310 | 0.542 ± 0.388 | 0.576 ± 0.381 | 0.580 ± 0.355 | 0.587 ± 0.192 |
| 5 | 0.448 ± 0.470 | 0.489 ± 0.266 | 0.506 ± 0.278 | 0.459 ± 0.450 | 0.497 ± 0.262 | 0.496 ± 0.280 | 0.495 ± 0.389 | 0.502 ± 0.382 | 0.500 ± 0.381 | 0.502 ± 0.198 |
| All | 0.347 ± 0.440 | 0.744 ± 0.263 | 0.596 ± 0.296 | 0.380 ± 0.436 | 0.737 ± 0.265 | 0.603 ± 0.305 | 0.638 ± 0.366 | 0.643 ± 0.367 | 0.655 ± 0.359 | 0.702 ± 0.225 |

(a) Averaged precision (D2)

| C | HMM-WD | HMM-WRL+ | HMM-WRL− | Norm-WD | Norm-WRL+ | Norm-WRL− | CWRAcc+ | CWRAcc− | EMM-TV | Random |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.113 ± 0.171 | 0.613 ± 0.190 | 0.387 ± 0.190 | 0.113 ± 0.141 | 0.602 ± 0.198 | 0.398 ± 0.198 | 0.483 ± 0.256 | 0.517 ± 0.256 | 0.549 ± 0.230 | 0.513 ± 0.166 |
| 2 | 0.135 ± 0.230 | 0.639 ± 0.180 | 0.361 ± 0.180 | 0.104 ± 0.136 | 0.610 ± 0.189 | 0.390 ± 0.189 | 0.501 ± 0.319 | 0.499 ± 0.319 | 0.505 ± 0.316 | 0.510 ± 0.179 |
| 3 | 0.120 ± 0.232 | 0.607 ± 0.227 | 0.393 ± 0.227 | 0.100 ± 0.167 | 0.604 ± 0.239 | 0.396 ± 0.239 | 0.501 ± 0.344 | 0.499 ± 0.344 | 0.491 ± 0.343 | 0.507 ± 0.216 |
| 4 | 0.172 ± 0.266 | 0.567 ± 0.264 | 0.433 ± 0.264 | 0.198 ± 0.275 | 0.562 ± 0.277 | 0.438 ± 0.277 | 0.485 ± 0.367 | 0.515 ± 0.367 | 0.527 ± 0.349 | 0.483 ± 0.209 |
| 5 | 0.328 ± 0.403 | 0.500 ± 0.281 | 0.500 ± 0.281 | 0.306 ± 0.369 | 0.518 ± 0.283 | 0.482 ± 0.283 | 0.488 ± 0.390 | 0.512 ± 0.390 | 0.498 ± 0.382 | 0.534 ± 0.250 |
| All | 0.174 ± 0.283 | 0.585 ± 0.237 | 0.415 ± 0.237 | 0.164 ± 0.249 | 0.579 ± 0.243 | 0.421 ± 0.243 | 0.492 ± 0.339 | 0.508 ± 0.339 | 0.514 ± 0.329 | 0.510 ± 0.207 |

(b) Averaged recall (D2)

| C | HMM-WD | HMM-WRL+ | HMM-WRL− | Norm-WD | Norm-WRL+ | Norm-WRL− | CWRAcc+ | CWRAcc− | EMM-TV | Random |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 8.965 | 3.890 | 5.600 | 8.990 | 4.050 | 5.400 | 5.180 | 4.345 | 4.005 | 4.575 |
| 2 | 8.280 | 3.460 | 6.035 | 8.835 | 3.760 | 5.665 | 4.550 | 4.975 | 4.870 | 4.570 |
| 3 | 8.225 | 4.020 | 5.445 | 8.575 | 4.030 | 5.425 | 4.780 | 4.985 | 4.965 | 4.550 |
| 4 | 7.605 | 4.405 | 5.335 | 7.645 | 4.525 | 5.235 | 5.075 | 4.875 | 5.025 | 5.275 |
| 5 | 6.580 | 5.280 | 5.135 | 6.760 | 4.980 | 5.360 | 5.295 | 5.170 | 5.285 | 5.155 |
| All | 7.931 | 4.211 | 5.510 | 8.161 | 4.269 | 5.417 | 4.976 | 4.870 | 4.830 | 4.825 |

(c) Averaged rank of F-score (D2)

Table 5.3: Averaged precision, recall, and the rank of F-score of the 100 datasets for each C value, computed with the true description $D_2$, which corresponds to the HMM with the relatively larger mixture weight in the synthetic dataset. The top value for each C is shown in bold. The Friedman test p-values are 2.97e-81, 2.37e-254, and 2.64e-213 respectively.
5.4.1 Intra-Sequence Discovery
We first look at the setting of intra-sequence discovery. Under this setting, we assume the raw dataset contains only a single sequence, and this single sequence is divided into sub-sequences by applying a sliding window. For instance, a single sequence can be the whole gene sequence of an individual, or the text of a single document. From the perspective of subgroup discovery, the purpose of this setting is hence to find statistically interesting chunks of a given sequence. For the SPHERE challenge dataset, one task of intra-sequence discovery is to find interesting chunks of the acceleration sequences generated by a set of activities performed in the smart home. The data mining task is therefore similar to the previous experiments with Gaussian distributions, which is to find different locations and activities using the RSSI readings as features and the acceleration readings as targets. The experiments are configured as follows. Since the acceleration data has a frequency of 20 Hz, we set the length of the sliding window to 1200 samples (about 60 seconds), with no overlap between two consecutive windows. We again select the RSSI data to construct the description space with the candidate subgroups. As we further assume that a set of consecutive RSSI readings contains limited extra information, we simplify the RSSI readings to the rounded average value within each window.
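As an illustration of this setup, the following sketch uses the third-party hmmlearn package (an assumption; the thesis does not name its implementation) to fit a global 24-state HMM on non-overlapping 60-second windows and score each window; the acceleration array is a placeholder:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # third-party package, assumed available

# Placeholder for the real 20 Hz tri-axial acceleration recording.
accel = np.random.randn(24000, 3)

window = 1200                                    # about 60 s at 20 Hz, no overlap
n_win = len(accel) // window
segments = accel[:n_win * window].reshape(n_win, window, 3)

# Global model: fit on all windows, with per-window lengths so that hmmlearn
# treats each window as a separate sequence.
global_hmm = GaussianHMM(n_components=24, covariance_type="full", n_iter=10)
global_hmm.fit(segments.reshape(-1, 3), lengths=[window] * n_win)

# Per-window log-likelihoods under the global model, as used by the WRL measures.
loglik_global = np.array([global_hmm.score(seg) for seg in segments])
```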
Figure 5.2: The critical difference diagram with the averaged rankings of the selected quality measures, on the F-score computed with the true description $D_1$, which corresponds to the HMM with the relatively smaller mixture weight in the synthetic dataset.

As the target variable is the 3-axis acceleration reading, we adopt the multivariate Gaussian distribution as the emission model class. The number of hidden states is set to 24. While this number may not be the best option to model the data, we select a number slightly higher than the number of annotated activities to provide extra capacity for modelling other activities. In the next section, we will also discuss the scenario where the HMM is trained with 20 hidden states from the labelled activities in the dataset. We apply the same set of measures as in the previous experiment, excluding the CWRAcc measures, as they were already compared with the Gaussian-based measures in the previous chapter. We first give the difference between the local class distribution within the subgroups and the overall distribution of script 01, on the annotated locations and activities, in Figure 5.4 and Figure 5.5 respectively. As the figures show, HMM-WD demonstrates a higher difference in the probability of the living room, compared to measures with similar subgroups (HMM-WRL+, Norm-WRL− and EMM-TV). HMM-WRL−, on the other hand, gives subgroups similar to those of Norm-WD and Norm-WRL+, showing a higher probability on both the bathroom and the kitchen. To quantify the interestingness of the obtained subgroups with respect to the labelled locations and activities, Table 5.4 and Table 5.5 provide the generalised information gain for each subgroup calculated on the annotations. On the locations, HMM-WD gives the best performance; HMM-WRL+ also outperforms Norm-WRL+, but HMM-WRL− shows worse results compared to Norm-WRL−. One potential reason is that, as WRL+ and WRL− are designed to find subgroups with inverse interestingness, the good performance of HMM-WRL+ at finding the room location makes the subgroup obtained by HMM-WRL− less interesting in the same respect. For the activities, Norm-WRL+ gives the top performance, while the HMM-based measures outperform the other measures. One possible explanation here is that, as HMMs also model the transitions among activities, the
Figure 5.3: The critical difference diagram with the averaged rankings of the selected quality measures, on the F-score computed with the true description $D_2$, which corresponds to the HMM with the relatively larger mixture weight in the synthetic dataset.

| Script | HMM-WD | HMM-WRL+ | HMM-WRL− | Norm-WD | Norm-WRL+ | Norm-WRL− | EMM-TV |
|---|---|---|---|---|---|---|---|
| 01 | 6892.9 (6.0) | 10071.1 (1.0) | 8456.0 (3.0) | 8174.5 (5.0) | 8402.9 (4.0) | 8945.5 (2.0) | 6466.8 (7.0) |
| 02 | 9223.9 (1.0) | 7371.7 (4.0) | 2802.0 (7.0) | 7878.2 (2.5) | 7878.2 (2.5) | 6561.1 (5.0) | 2804.4 (6.0) |
| 03 | 7369.7 (3.0) | 10041.8 (1.0) | 5464.5 (5.0) | 8527.7 (2.0) | 6431.8 (4.0) | 4835.6 (6.0) | 4548.7 (7.0) |
| 04 | 10312.8 (1.5) | 7954.0 (5.0) | 8503.4 (4.0) | 5579.7 (6.0) | 10312.8 (1.5) | 8538.9 (3.0) | 4456.3 (7.0) |
| 05 | 7119.1 (3.0) | 8329.5 (1.0) | 4958.4 (5.0) | 7382.3 (2.0) | 882.3 (7.0) | 6389.5 (4.0) | 4450.6 (6.0) |
| 06 | 4742.9 (3.0) | 2104.6 (7.0) | 4238.7 (5.0) | 5393.0 (2.0) | 6425.0 (1.0) | 4524.8 (4.0) | 4040.2 (6.0) |
| 07 | 7546.0 (1.5) | 5335.3 (5.0) | 2480.5 (7.0) | 7546.0 (1.5) | 6661.7 (4.0) | 6705.6 (3.0) | 3345.4 (6.0) |
| 08 | 8814.5 (1.0) | 5081.2 (5.0) | 2089.3 (6.0) | 6245.1 (3.0) | 8396.5 (2.0) | 6193.4 (4.0) | 1623.2 (7.0) |
| 09 | 6498.9 (3.5) | 7636.8 (1.0) | 4961.5 (6.0) | 6498.9 (3.5) | 5420.3 (5.0) | 6633.7 (2.0) | 4470.4 (7.0) |
| 10 | 9938.8 (1.0) | 7576.9 (2.0) | 2445.4 (7.0) | 3293.9 (5.0) | 4583.1 (4.0) | 5217.4 (3.0) | 3169.0 (6.0) |
| Avg.Rank | 2.5 | 3.2 | 5.5 | 3.2 | 3.5 | 3.6 | 6.5 |

Table 5.4: The generalised information gain (log-loss) on the annotated locations for the top-quality subgroup found with each measure on each scripted dataset, shown as value (rank). The top value for each row is shown in bold. The Friedman test p-value is 1.44e-4.
subgroups obtained by the HMM-based measures tend to include a set of different activities to encode the transitional information, which makes them less optimal at finding a single activity. To summarise, in this experiment we demonstrated how HMM-based subgroup discovery can be applied to a set of sub-sequences obtained from a single raw sequence with sliding windows. The dataset we used contains the acceleration data from a participant performing a set of activities within a residential environment. The results suggest that, by incorporating the sequential information with the HMM, the model-based measures provide better results at finding a room location within the smart home. Although in the case of activities the HMM-based measures do not achieve the top performance on information gain, this is because the HMMs also tend to find deviating statistics in the transitional behaviour of the hidden activities. We will further demonstrate this point in the next experiment.
Figure 5.4: The difference between the room distribution within the obtained subgroups and that of the overall dataset (script 01). Each subgroup is the top-quality subgroup discovered with the corresponding measure.
5.4.2 Inter-Sequence Discovery
In contrast to intra-sequence discovery, the main difference in inter-sequence discovery is that we now have multiple raw sequences instead of the single sequence of the intra-sequence case, which in turn leads to different interpretations of the obtained subgroups. As mentioned previously, inter-sequence discovery does not differ from the intra-sequence case from the perspective of modelling. That is, in both settings the raw sequences are broken into a set of equal-length sub-sequences, and the global and local HMMs are trained on these sub-sequences in the same way. The difference in interpretation hence comes from the language side. For instance, when a set of raw sequences is provided, it is likely that there exist meta-level features that describe each of the raw sequences. Even when there are no meta-level features on the raw sequences, one can always code them in a nominal or categorical manner to create a meta-level description, if the aim is to find subgroups of raw sequences that deviate from the overall dataset. And within each raw sequence, there might still be features that characterise each instance within the raw sequence. Depending on whether the users are interested in describing subgroups within the sequences, they can decide either to disregard such features for a smaller search space and hence less computation time, or to keep them for better insights at the sub-sequence level.
Figure 5.5: The difference between the activity distribution within the obtained subgroups and that of the overall dataset (script 01). Each subgroup is the top-quality subgroup discovered with the corresponding measure.

As introduced in the previous section, the SPHERE challenge contains ten scripted sequences, with different people performing the same script in a home equipped with various sensors. We hence focus on the task of learning subgroups representing people that behaved differently on their acceleration data compared to the overall population. Therefore, as discussed above, we adopt a nominal coding here as the meta-level feature, and build our subgroup language purely on this single meta feature. While in the previous setting we trained the HMMs in an unsupervised manner with 24 hidden states, in this experiment we instead adopt the 20 labelled activities to train the HMM through supervised learning. For the rest of the experimental setup (e.g. window length), we simply adopt the same settings as in the previous experiments. Figure 5.6 highlights the subgroup obtained with the measure WD, with the log-likelihood of both the local model and the global model. The obtained subgroup clearly has a significantly higher likelihood under the local model compared to the global model, which is precisely the interestingness defined by the measure WD. The next step is hence to investigate whether this subgroup is evaluated as the most interesting through the base HMMs.
| Script | HMM-WD | HMM-WRL+ | HMM-WRL− | Norm-WD | Norm-WRL+ | Norm-WRL− | EMM-TV |
|---|---|---|---|---|---|---|---|
| 01 | 2620.0 (6.0) | 3464.0 (5.0) | 5521.9 (3.0) | 5955.0 (2.0) | 5972.4 (1.0) | 3612.8 (4.0) | 843.6 (7.0) |
| 02 | 3364.1 (1.0) | 3135.2 (2.0) | 1233.5 (5.0) | 2903.7 (3.5) | 2903.7 (3.5) | 1107.0 (6.0) | 486.9 (7.0) |
| 03 | 2715.9 (2.0) | 1543.9 (4.0) | 930.0 (6.0) | 2570.3 (3.0) | 3056.2 (1.0) | 1045.3 (5.0) | 705.5 (7.0) |
| 04 | 2804.0 (2.5) | 2574.1 (4.0) | 3758.8 (1.0) | 1573.5 (6.0) | 2804.0 (2.5) | 1940.8 (5.0) | 824.1 (7.0) |
| 05 | 1468.1 (6.0) | 3272.7 (2.0) | 3489.8 (1.0) | 2004.8 (4.0) | 1525.1 (5.0) | 2524.6 (3.0) | 345.6 (7.0) |
| 06 | 980.8 (1.0) | 181.3 (7.0) | 669.1 (2.0) | 521.5 (4.0) | 604.4 (3.0) | 496.8 (5.0) | 349.1 (6.0) |
| 07 | 1717.5 (4.5) | 3357.7 (1.0) | 2348.9 (3.0) | 1717.5 (4.5) | 3128.9 (2.0) | 765.0 (7.0) | 1024.6 (6.0) |
| 08 | 852.7 (5.0) | 417.8 (6.0) | 1478.0 (3.0) | 2127.8 (2.0) | 3016.9 (1.0) | 963.4 (4.0) | 98.8 (7.0) |
| 09 | 817.2 (4.5) | 967.6 (2.0) | 867.9 (3.0) | 817.2 (4.5) | 682.4 (6.0) | 1182.1 (1.0) | 555.0 (7.0) |
| 10 | 1670.2 (1.0) | 861.7 (4.0) | 403.3 (7.0) | 715.3 (6.0) | 1058.1 (3.0) | 1126.9 (2.0) | 771.0 (5.0) |
| Avg.Rank | 3.4 | 3.7 | 3.4 | 4.0 | 2.8 | 4.2 | 6.6 |

Table 5.5: The generalised information gain (log-loss) on the annotated activities for the top-quality subgroup found with each measure on each scripted dataset, shown as value (rank). The top value for each row is shown in bold. The Friedman test p-value is 3.16e-3.
Figure 5.6: The log-likelihood of the global model (blue) on all 10 scripts in the dataset, and the log-likelihood of the local model (red) within the top-quality subgroup (highlighted) obtained with the measure WD.

Unlike the unsupervised setting of the last experiment, we can explicitly link the corresponding states between the local model and the global model, since they are trained from the labelled activities. This enables us to directly compare the estimated parameters of the two HMMs. Figure 5.7 shows the differences between the local model and the global model, respectively on their estimated initial probabilities and transition probabilities. The differences in the initial probabilities show that the local model has clearly higher estimates for the activities walk and sit, and lower estimates mainly for the activities ascend, lie and bend. As the initial probabilities are obtained by normalising the counts of each activity at the first instance of all the sub-sequences, this result suggests that the activities mentioned above have a correspondingly higher or lower chance of appearing in the sequences covered by the subgroup. Note that this result also depends on the setting of the sliding windows: if the length and overlap of the windows are both small, the initial probabilities get closer to the marginal distributions of the labels. The transition probabilities also show a recognisable pattern along the diagonal, namely that most activities have a higher probability of being kept up at the next time point.
Figure 5.7: The differences between the local model and the global model on their initial probabilities and transition probabilities (i.e. the global parameters subtracted from the local parameters).
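For illustration, the count-based estimates described above could be computed as follows; the function and variable names are hypothetical, and the commented lines show how the difference maps of Figure 5.7 (local minus global) would then be obtained:

```python
import numpy as np

def estimate_init_trans(label_seqs, n_states):
    """Count-based initial and transition probabilities from labelled windows."""
    init = np.zeros(n_states)
    trans = np.zeros((n_states, n_states))
    for seq in label_seqs:
        init[seq[0]] += 1                      # first instance of each window
        for a, b in zip(seq[:-1], seq[1:]):    # neighbouring instance pairs
            trans[a, b] += 1
    init /= init.sum()
    rows = trans.sum(axis=1, keepdims=True)
    trans = np.divide(trans, rows, out=np.zeros_like(trans), where=rows > 0)
    return init, trans

# Difference maps as in Figure 5.7 (local parameters minus global parameters):
# init_loc, trans_loc = estimate_init_trans(covered_seqs, n_states=20)
# init_glo, trans_glo = estimate_init_trans(all_seqs, n_states=20)
# diff_init, diff_trans = init_loc - init_glo, trans_loc - trans_glo
```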
Since the transition probabilities are calculated by averaging over all the neighbouring instance pairs within the sub-sequences, this result might be interpreted as the participants within the subgroup taking a longer time to finish each activity. This can also be seen in a few blue blocks in the figure. For instance, the local model has a lower probability of transiting from stand-kneel to kneel, which might suggest that the participants in the subgroup require a longer time to finish the transition. The differences in the estimated Gaussian emission models are shown in Figure 5.8, in terms of the mean and covariance of each state. The general pattern in these two sets of parameters can also be easily observed in the figure. The local model has a higher mean value on the x-axis and a lower mean value for most activities. Lower variances also appear on the x-axis and y-axis, and higher covariances are seen between the x-axis and y-axis in the covariance matrix for most activities. As the differences in both parameters are roughly the same across different activities, it is unlikely that the participants within the subgroup perform the activities differently from the population. One possible explanation is that the differences are caused by the accelerometers, as different accelerometers might not be calibrated to the same scale, and the location of wearing might also introduce extra biases into the acceleration data. However, the question still needs to be further investigated by the corresponding domain experts, as the task of subgroup discovery is to provide such interpretable interestingness to domain experts in the first place. Figure 5.9 shows the top-quality subgroup obtained with the measure WRL+, with both the log-likelihood and the averaged log-likelihood drawn for each script. As the figure shows, the mean log-likelihoods within the subgroup are generally higher than those of the
Figure 5.8: The differences between the local model and the global model on the mean and covariance of each state (i.e. the global parameters subtracted from the local parameters).
scripts in the complement (hence the overall population), which meets the interestingness required by the WRL+ measure. Without a local model being trained, one approach to investigating the interestingness of the obtained subgroups is to look at the labels again. The differences between the obtained subgroup and the population are visualised in Figure 5.10, with respect to the marginal distributions of the labelled activities. The sequences within the subgroup have an evidently larger probability mass on still activities like sit, bent and kneel. As discussed in the previous experiment, such activities might have a higher likelihood under the global HMM, as most transition probabilities have a higher value for transiting to the same state. They might also have a higher likelihood according to the emission model, as the corresponding Gaussians have lower variances and covariances. This explains why such a subgroup is evaluated as the top one by the measure WRL+.
Figure 5.9: The log-likelihood of the global model (blue) on all 10 scripts in the dataset, with the mean log-likelihood (dashed) within each script. The subgroup obtained with the measure WRL+ is highlighted, with its mean log-likelihood drawn in red.
Figure 5.10: The differences between the subgroup obtained with the measure WRL+ and the population on the marginal distributions of the annotated activities. Here the bar chart is drawn in black to distinguish it from the previous case of differences on the initial probabilities.

The subgroup obtained with the measure WRL− is shown in Figure 5.11. This subgroup turns out to be the complement of the subgroup obtained with WRL+. This is expected, as the search space here is relatively small and an exhaustive search was performed, in which case using different signs on measures like WRL and CWRAcc will always locate subgroups that are complements of each other. Therefore, the differences between the marginal distributions of activities in Figure 5.12 are the inverse of those in Figure 5.10, indicating that the participants were more often found performing non-still activities like walk and turn. To conclude, in this section we performed experiments under the inter-sequence discovery setting with supervised training of the HMMs. With the measure WD, the resulting subgroup shows a general difference in all three sets of parameters. This fulfils our purpose of discovering generic interestingness on different statistical parameters on sequences and time series.
Figure 5.11: The log-likelihood of the global model (blue) on all 10 scripts in the dataset, with the mean log-likelihood (dashed) within each script. The subgroup obtained with the measure WRL− is highlighted, with its mean log-likelihood drawn in red.
Figure 5.12: The differences between the subgroup obtained with the measure WRL− and the population on the marginal distributions of the annotated activities. Here the bar chart is drawn in black to distinguish it from the previous case of differences on the initial probabilities.

Further investigation of both the local model and the global model shows that, on the SPHERE challenge dataset, our approach is capable of discovering abnormal transition behaviour among the activities, as well as deviating emissions from the activities to the acceleration readings of the wearable devices. Regarding the measure WRL, the obtained subgroups reflect the proportions of the hidden states within the sequences. That is, WRL+ leads to subgroups containing a higher proportion of states that are more likely to be transited into and more likely to generate the observed target values.
5.5 Chapter Summary
In this chapter, we investigated how to extend the MBSD approach to find subgroups on sequential datasets and time series. We proposed a solution based on the HMM, a widely adopted probabilistic model class for capturing sequential statistical information. The major advantage of our approach is that the obtained subgroups are captured simultaneously through deviating estimates of all three sets of parameters, which cannot be achieved by existing approaches in either traditional subgroup discovery or exceptional model mining. With a set of synthetic sequences, we showed that the proposed approach generally has a better performance at locating the true subgroup, where the true subgroup refers to the sequences generated from a single HMM with a relatively small mixture weight, under a dataset generated from a mixture of two HMMs. With the SPHERE dataset, we demonstrated how the proposed approach can be applied in two different settings. For the intra-sequence discovery with unsupervised training of the HMMs, the experiments show that the WD measure can locate the subgroup where the participant is more active within a residential environment, excluding locations like the kitchen and living room, where the participant only performs activities like sitting and standing still. The WRL measures can also lead to subgroups similar to those of the WD measure, benefiting from a shorter computation time at the cost of being less discriminative on the activities and locations. For the inter-sequence discovery with supervised training of the HMMs, we demonstrated how HMM-based subgroup discovery can be used on a dataset that contains multiple raw sequences. The results mainly benefit from the availability of the provided labels, where the interestingness of the obtained subgroups can be directly inspected by comparing the parameters of the local model and the global model, as with the WD measure, or investigated through the marginal distribution of the labels, as with the WRL measures.
Chapter 6

Applications for Predictive Machine Learning: Model Reuse with Subgroup Discovery

So far, we have investigated generic approaches to perform subgroup discovery with various statistical models, for the purposes of data mining and knowledge discovery. In this chapter, we study subgroup discovery from a different perspective, which is to adopt the idea of model-based subgroup discovery for the task of predictive machine learning, where the purpose is to provide predictions that are as accurate as possible on future, unseen data. Some early work in this direction can be found in [39], where the authors demonstrated how exceptional model mining can be used to detect subgroups where a soft classifier works better or worse than its general performance on the overall dataset. Here we address the following research question: after we have obtained subgroups where the base model has a deviating performance, how can we make use of them to form a better predictive model? As the tasks of predictive machine learning differ in many ways, we consider this research question in a given context, where the task is to reuse pre-trained models in some new deployment contexts. The general idea of model reuse is to use models trained in other operating contexts under a new operating context. For instance, while an activity classifier is trained with the sensor data from a set of houses, it can be reused on the sensor data from some flats, instead of training new models from scratch. Such a strategy can bring benefits in two aspects. First, it can reduce the model training time in the new operating contexts and take advantage of the computational power already spent on the pre-trained models. Second, if the new operating context only has limited data, model reuse can extend the scale of the training data by adding training data from other operating contexts, which helps to improve the predictive performance. One of the major challenges of model reuse is that the patterns behind the data can
vary over different operating contexts [93, 94, 95]. This makes it difficult to directly apply trained models from the training context to a new context. For instance, predicting the activities of the households from the sensor readings is one of the leading applications of a smart home. However, as both the households and the layout of the house vary between different houses, it is hard to directly deploy a model trained in one house to another house. Therefore, recognising and dealing with such variations across different operating contexts is a non-trivial research task for model reuse [96]. This chapter proposes to adopt the concept of model-based subgroup discovery to solve the problem of model reuse across different contexts. We make the assumption that, where one trained model (denoted as the base-model) deviates from its average performance within a subgroup of data points, another trained model might work better within this particular subgroup. Under this assumption, we design measures to locate these subgroups and further select a number of sub-models that have better performance. The overall prediction can then be obtained by averaging the predictions from both the base-models and the sub-models, with some averaging strategy. As the experiments show, our proposed approach can significantly outperform a single selected pre-trained model, as well as the average of a set of pre-trained models. The structure of this chapter is as follows. Section 6.1 gives the motivation and context of our model reuse approach. The proposed quality measures are presented in Section 6.2, and the description language and search strategy are discussed in Section 6.3. Section 6.4 presents our proposed strategy for model reuse, together with two baseline approaches later used in the experiments. Section 6.5 demonstrates our approach with experimental results. As the method we propose in this chapter is designed for regression models, in Section 6.6 we also discuss the possibility of extending our approach to classification tasks, and conclude the chapter.
6.1 Motivation
The main motivation of this work is the machine learning challenge MoReBikeS, which was organised by the 2nd International Workshop on Learning over Multiple Contexts (LMCE 2015), within the conference ECML-PKDD 2015. As the name suggests, the purpose of the workshop was to investigate and discuss approaches for model generalisation and reuse. The task of the MoReBikeS challenge is to predict the number of bikes available at a particular rental station 3 hours later, given historical data. In detail, the overall dataset is obtained from 275 bike rental stations located in Valencia, Spain. All participants get access to the data for all 275 stations during October 2014 and use these data as training data. Six trained linear regression models are also provided for each station from station 1 to station 200. These linear regression models were trained with data covering the whole of 2014. The task of the challenge is, by reusing these
trained models and the limited training data, to predict the number of available bikes at some new bike stations (station 201 to station 275).
6.2 Quality Measures
The first step is to design corresponding measures to locate deviating performance of the base-models. Suppose the prediction for the $i$-th instance made by the base-model is denoted as $\hat{y}_i = f_{base}(x_i)$. The absolute error of the prediction is denoted as $z_i^{AE} = |\hat{y}_i - y_i|$. The first proposed quality measure, Weighted Relative Mean Absolute Error (WRMAE), is based on the absolute error of the base-model and is designed to find subgroups with large coverage and a relatively higher Mean Absolute Error (MAE) than the population:

$$\Phi_{WRMAE^+}(d) = \Upsilon_{rel}(d) \cdot \left( \frac{\sum_{i=1}^{N} I_d(x_i) \cdot z_i^{AE}}{\sum_{i=1}^{N} I_d(x_i)} - \frac{\sum_{i=1}^{N} z_i^{AE}}{N} \right) \qquad (6.1)$$
This measure is an extension of the CWRAcc measure, as it switches the target variable to the prediction error of the base-model instead of the raw target provided in the dataset. This makes it a semi-evaluation-driven measure. By changing the sign of WRMAE we can find subgroups where the base-model tends to have a lower MAE than the population:

$$\Phi_{WRMAE^-}(d) = -\Phi_{WRMAE^+}(d) \qquad (6.2)$$
While WRMAE is built on the MAE of the predictions made by the base-model, another notion of interestingness is to separate the cases where the base-model over-estimated or under-estimated the target variable. We hence define the over-estimated error as:

$$z_i^{OE} = \begin{cases} \hat{y}_i - y_i & \text{if } \hat{y}_i \ge y_i \\ 0 & \text{otherwise} \end{cases} \qquad (6.3)$$
Notice that here the under-estimation errors ($\hat{y}_i < y_i$) are set to zero, so that the quality of a subgroup is not affected by having both high over-estimated errors and high under-estimated errors. The quality measure Weighted Relative Mean Over-Estimated Error (WRMOE) is then given as:

$$\Phi_{WRMOE}(d) = \Upsilon_{rel}(d) \cdot \left( \frac{\sum_{i=1}^{N} I_d(x_i) \cdot z_i^{OE}}{\sum_{i=1}^{N} I_d(x_i)} - \frac{\sum_{i=1}^{N} z_i^{OE}}{N} \right) \qquad (6.4)$$
Like WRMOE, the under-estimated error and the corresponding quality measure Weighted Relative Mean Under-Estimated Error (WRMUE) can be defined as:

$$z_i^{UE} = \begin{cases} y_i - \hat{y}_i & \text{if } y_i \ge \hat{y}_i \\ 0 & \text{otherwise} \end{cases} \qquad (6.5)$$

$$\Phi_{WRMUE}(d) = \Upsilon_{rel}(d) \cdot \left( \frac{\sum_{i=1}^{N} I_d(x_i) \cdot z_i^{UE}}{\sum_{i=1}^{N} I_d(x_i)} - \frac{\sum_{i=1}^{N} z_i^{UE}}{N} \right) \qquad (6.6)$$
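As an illustration, the three measures might be computed as below; this is a sketch under the assumption that $\Upsilon_{rel}(d)$ is the relative coverage $\sum_{i=1}^{N} I_d(x_i)/N$, as in the earlier chapters, and all array names are hypothetical:

```python
import numpy as np

def upsilon_rel(covered):
    """Relative coverage of the subgroup (assumed form of Upsilon_rel)."""
    return covered.mean()

def wrmae_plus(covered, y_true, y_pred):
    """Eq. 6.1: relative coverage times (local MAE minus global MAE)."""
    c = covered.astype(bool)
    z = np.abs(y_pred - y_true)                      # absolute errors z^AE
    return upsilon_rel(c) * (z[c].mean() - z.mean())

def wrmoe(covered, y_true, y_pred):
    """Eq. 6.4: over-estimated errors only; under-estimations set to zero."""
    c = covered.astype(bool)
    z = np.where(y_pred >= y_true, y_pred - y_true, 0.0)
    return upsilon_rel(c) * (z[c].mean() - z.mean())

def wrmue(covered, y_true, y_pred):
    """Eq. 6.6: the symmetric measure for under-estimated errors."""
    c = covered.astype(bool)
    z = np.where(y_true >= y_pred, y_true - y_pred, 0.0)
    return upsilon_rel(c) * (z[c].mean() - z.mean())

# Eq. 6.2: wrmae_minus(covered, y, yhat) == -wrmae_plus(covered, y, yhat).
```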
6.3 Description Language and Search Strategy
While in the previous chapters we constructed the description space by applying conjunctions between attributes and disjunctions between the values within the same attribute, for the model reuse task we only use disjunctions of values within a single attribute. The adoption of this language scheme is mainly motivated by computational cost: under this setting, the description space is reduced significantly for each single subgroup discovery task. As shown later, our model reuse approach creates a number of subgroup discovery tasks that is linear in the total number of trained models, as well as in the number of deployment contexts. Furthermore, for numerical attributes, a pre-processing step divides each numerical attribute into equal-size bins that are treated as nominal values, which means the order information among them is ignored. As only a single attribute is used to describe each subgroup, the search strategy can be seen as a refinement process that adds different values of the corresponding attribute. Here we further use a greedy search algorithm to increase the search speed and reduce the memory usage. The algorithm proceeds as follows: the bin with the highest mean value of the target variable (e.g. MAE) is added to the description first, and the bin with the next highest mean value is added at each following step. The algorithm terminates once the quality measure becomes smaller than at the previous step. This search algorithm is similar to a beam search with the beam width fixed at 1, due to the fact that the refinement is done within the same attribute. Finally, for each attribute, only the top-quality subgroup described by that attribute is selected and sent to the final subgroup list. Therefore, the top-q subgroups in the final list can be seen as subgroups described by q different attributes respectively.
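A sketch of this greedy, single-attribute search is given below; bin_ids, z and the quality function are hypothetical inputs (quality here is any measure computed from a coverage mask and the per-instance errors, such as a wrapped WRMAE+):

```python
import numpy as np

def greedy_subgroup(bin_ids, z, quality):
    """Greedy disjunction search over the bins of one attribute.

    bin_ids: per-instance bin index of the attribute; z: per-instance target
    (e.g. the base-model's absolute errors); quality: a measure computed from
    a coverage mask and z, such as a wrapped WRMAE+.
    """
    bins = np.unique(bin_ids)
    # Rank the bins by the mean target value, highest first.
    order = sorted(bins, key=lambda b: z[bin_ids == b].mean(), reverse=True)
    selected, best_q = [], -np.inf
    for b in order:
        covered = np.isin(bin_ids, selected + [b])
        q = quality(covered, z)
        if q < best_q:                 # terminate once the quality drops
            break
        selected, best_q = selected + [b], q
    return selected, best_q
```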
6.4 Model Reuse Strategy
In this section we introduce how to reuse trained models with the previously discovered subgroups, together with two baseline methods for comparison. The general idea can be described as follows: for each deployment context, we select a number of trained models with top performance, as evaluated by MAE for our task. Then, with each quality measure above, we obtain a ranked list of subgroups for each model. Sub-models are then ranked and selected according to their performance within each subgroup. The final prediction is obtained by averaging the predictions from the base-models and sub-models. We start by introducing the baseline methods.
6.4.1 Baseline Method 1
The first baseline method is, for each deployment context, to simply select the single model among all trained models that has the lowest MAE on the deployment context.
6.4.2 Baseline Method 2
The second baseline method is, for each deployment context, to rank all of the trained models according to their MAE on this context. The final prediction is the average of the predictions of the top-n models (the selected models are later referred to as base-models for the subgroup discovery task):
$$\hat{y}_i^{(n)} = \frac{\sum_{j=1}^{n} f_{base}^{(j)}(x_i)}{n} \qquad (6.7)$$
Baseline method 2 can be seen as related to Bootstrap aggregating (bagging), as each pre-trained model can be treated as a bootstrap model that was trained in a different context.
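As a minimal sketch (with hypothetical names), baseline method 2 amounts to ranking callable models by MAE and averaging the best n of them, as in Equation 6.7:

```python
import numpy as np

def baseline2(models, X, y, n):
    """Eq. 6.7: average the n pre-trained models with lowest MAE on (X, y)."""
    maes = [np.mean(np.abs(m(X) - y)) for m in models]
    top = np.argsort(maes)[:n]                     # indices of the base-models
    return np.mean([models[j](X) for j in top], axis=0), top
```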
6.4.3 MBSD-reuse method
The proposed method is to use the measures proposed in Section 6.2 to find the top-q subgroups (the top-quality subgroup described by each of q attributes) for each base-model in baseline method 2. A sub-model is then selected from all pre-trained models according to the MAE within the subgroup. For the j-th base-model, denoting a pre-trained model as $f$, this can be formalised as:

$$f_{sub}^{(j)} = \arg\min_f \frac{\sum_{i=1}^{N} I_{d_j}(x_i) \cdot |y_i - f(x_i)|}{\sum_{i=1}^{N} I_{d_j}(x_i)} \qquad (6.8)$$
To combine the predictions from both the base-models and the sub-models, the strategy here is to use the base-models for the data points that are not covered by the subgroups, and the average of base-models and sub-models for the data points within the subgroups. For the j-th base-model, the mixture model can be given as:

$$f_{mix}^{(j)}(x_i) = \frac{f_{base}^{(j)}(x_i) + I_{d_j}(x_i) \cdot f_{sub}^{(j)}(x_i)}{1 + I_{d_j}(x_i)} \qquad (6.9)$$
For the case where there are multiple subgroups (hence multiple sub-models) for each base-model (with different ranks or different quality measures), the mixture model (with K different subgroups) can be given as:

$$f_{mix}^{(j)}(x_i) = \frac{f_{base}^{(j)}(x_i) + \sum_{k=1}^{K} I_{d_{j,k}}(x_i) \cdot f_{sub}^{(j,k)}(x_i)}{1 + \sum_{k=1}^{K} I_{d_{j,k}}(x_i)} \qquad (6.10)$$
Finally, we obtain the final prediction by averaging the top-n mixture models:

$$\hat{y}_i^{(n)} = \frac{\sum_{j=1}^{n} f_{mix}^{(j)}(x_i)}{n} \qquad (6.11)$$
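The mixing scheme of Equations 6.9-6.11 can be sketched as follows; base_models, sub_models and the coverage indicators are hypothetical callables, and the snippet assumes one sub-model per discovered subgroup, as selected by Equation 6.8:

```python
import numpy as np

def mix_predict(base_models, sub_models, indicators, X):
    """Eqs. 6.9-6.11: blend each base-model with its subgroup sub-models.

    base_models: n callables; sub_models[j] and indicators[j]: the K sub-models
    and coverage functions I_{d_{j,k}} discovered for the j-th base-model.
    """
    mixes = []
    for f_base, subs, inds in zip(base_models, sub_models, indicators):
        num = f_base(X).astype(float)
        den = np.ones(len(X))
        for f_sub, I_d in zip(subs, inds):
            c = I_d(X).astype(float)       # 1 inside the subgroup, 0 outside
            num += c * f_sub(X)
            den += c
        mixes.append(num / den)            # Eq. 6.10 (Eq. 6.9 when K = 1)
    return np.mean(mixes, axis=0)           # Eq. 6.11: average top-n mixtures
```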
6.5 Experiments with the MoReBikeS data
In this section, we experimentally demonstrate the proposed approach with the MoReBikeS dataset. The section is organised into two parts. In the first part, we provide a general description of the dataset, followed by a presentation of the subgroups obtained with each proposed measure. In the second part, we show the performance of our reuse strategy with the proposed quality measures.
6.5.1 Model-Based Subgroup Discovery with a Single Bike Station
In the MoReBikeS challenge, there are 25 attributes in total in the raw dataset. Table 6.1 summarises the information for each feature in the provided one-month data, such as the name, type (binary, nominal, numerical), number of values, and number of bins configured in the subgroup discovery task (only for numerical attributes). Although all 25 attributes could be used to construct the description space and hence the candidate subgroups, as the attribute bikes is the variable to be predicted, it is removed from the descriptions. Also, because the subgroup discovery task is performed within each individual station during October 2014, the attributes station, latitude, longitude, year, month, and timestamp can be further excluded. As introduced previously, there are six pre-trained models provided for each station from station 1 to station 200. For simplicity, from now on we use model i-j to refer to the j-th pre-trained model of station i (j = 1 for short, j = 2 for short_temp, j = 3 for full, j = 4 for full_temp, j = 5 for short_full, j = 6 for short_full_temp). The task is to reuse the pre-trained models on station 201 to station 275. For these deployment stations, training data is provided for a length of one month (October 2014). Therefore, the subgroup discovery tasks are performed on the prediction errors of the pre-trained models on this one-month bike data. For instance, Figure 6.1 (top) shows the prediction on station 201 using the model 1-1, together with the ground truth. Figure 6.1 (bottom) gives the
| attribute | type | number of values | number of bins |
|---|---|---|---|
| station | nominal | 275 | NA |
| latitude | numerical | 275 | 275 |
| longitude | numerical | 275 | 275 |
| numDocks | numerical | 19 | 19 |
| timestamp | numerical | 745 | 32 |
| year | numerical | 1 | 1 |
| month | numerical | 1 | 1 |
| day | numerical | 31 | 31 |
| hour | numerical | 24 | 24 |
| weekday | numerical | 7 | 7 |
| weekhour | numerical | 168 | 168 |
| isHoliday | binary | 2 | NA |
| windMaxSpeed.m.s | numerical | 28 | 28 |
| windMeanSpeed.m.s | numerical | 16 | 16 |
| windDirection.grades | numerical | 17 | 17 |
| temperature.C | numerical | 142 | 16 |
| relHumidity.HR | numerical | 72 | 8 |
| airPressure.mb | numerical | 283 | 32 |
| precipitation.l.m2 | numerical | 1 | 1 |
| bikes 3h ago | numerical | 41 | 41 |
| full profile 3h diff bikes | numerical | 17304 | 32 |
| full profile bikes | numerical | 17632 | 32 |
| short profile 3h diff bikes | numerical | 419 | 32 |
| short profile bikes | numerical | 231 | 32 |
| bikes | numerical | 41 | NA |

Table 6.1: The 25 attributes in the October data-set and their properties.
empirical distribution of the prediction errors. As the figure shows, the pre-trained model 1-1 over-estimated and under-estimated the number of bikes around different locations. Notably, many prediction errors occur because the pre-trained model outputs values below 0, which is invalid as the target is a number of bikes and should be no less than 0. While such errors could easily be filtered out by setting all negative outputs to 0, here we keep them for the subgroup discovery process, since they presumably contain useful information regarding the prediction pattern of the base-model. The next step is then to perform subgroup discovery based on this pre-trained model 1-1. We first apply the measure WRMAE+; as the results show, the top-quality subgroup is found by describing disjunctions of values of the feature weekhour. The corresponding feature values are shown in Figure 6.2 (top). Since we treat this numerical attribute as a nominal attribute (i.e. the data points covered by a candidate subgroup can contain any combination of feature values), the values found within the subgroup description look sparse.
[Plot: "Station 201 with model 1-1, October"; top panel: ground truth and predicted number of bikes over the days of the month; bottom panel: empirical distribution of prediction errors, roughly -15 to 10.]
Figure 6.1: The prediction for station 201 from pre-trained model 1-1 (top), and the empirical distribution of its errors (bottom).

However, there are still some patterns that can be seen in the figure. One observation is that most of the covered values are located around the middle of each day (e.g. 12, 36). Given that the quality measure is WRMAE+, this suggests that the base-model has larger prediction errors around the afternoon of each day. Intuitively this makes sense, as the afternoon contains the busy hours for bike rental, especially in a tourist city like Valencia, which makes the number of bikes at a station more unpredictable. However, a different pattern appears at the weekend: the covered values of the description shift towards midnight (e.g. weekhour from 132 to 168) instead of noon as during the weekdays. While this observation suggests that a different bike usage pattern occurs during the weekend, it also explains why weekhour provides the top-quality subgroup while the feature hour (of the day) does not. Figure 6.2 (bottom) gives the empirical distribution of the prediction errors within the subgroup. Compared to Figure 6.1, the distribution of errors here has a higher variance, which indicates a higher MAE. As the numbers show, the MAE of the base-model within this subgroup is about 4.553, while the population gives 2.751. It can therefore be concluded that the measure WRMAE+ works in the way it was designed to, leading to a subgroup of sufficient size and a larger base-model MAE, as well as an interpretable subgroup description.
[Plot: "Station 201 with model 1-1, WRMAE+"; top panel: subgroup description (true/false) over weekhour 12-168; bottom panel: empirical error distributions for population and subgroup.]
Figure 6.2: The best subgroup found with the quality measure WRMAE+ (top), and the empirical distribution of errors within the subgroup (bottom).
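The following sketch illustrates WRMAE+ together with the three companion measures applied below (WRMAE-, WRMOE and WRMUE), under the assumption that each takes the usual weighted-relative form, i.e. the relative subgroup size times the deviation of a local error statistic from its population value; the precise definitions are given earlier in the chapter.

import numpy as np

def wrmae_family(errors, covered):
    """Sketch of the four measures, assuming the weighted-relative form:
    relative subgroup size times the deviation of a local error statistic
    from its population value. `errors` are signed prediction errors
    (predicted minus actual), `covered` is a boolean cover mask."""
    errors = np.asarray(errors, dtype=float)
    covered = np.asarray(covered, dtype=bool)
    sub = errors[covered]
    cov = sub.size / errors.size

    mae_pop, mae_sub = np.abs(errors).mean(), np.abs(sub).mean()
    moe_pop, moe_sub = np.maximum(errors, 0).mean(), np.maximum(sub, 0).mean()
    mue_pop, mue_sub = np.maximum(-errors, 0).mean(), np.maximum(-sub, 0).mean()

    return {
        "WRMAE+": cov * (mae_sub - mae_pop),  # larger absolute errors inside
        "WRMAE-": cov * (mae_pop - mae_sub),  # smaller absolute errors inside
        "WRMOE":  cov * (moe_sub - moe_pop),  # more over-estimation inside
        "WRMUE":  cov * (mue_sub - mue_pop),  # more under-estimation inside
    }

For the weekhour subgroup above, for instance, WRMAE+ would weight the MAE gap (4.553 against 2.751) by the subgroup's relative size.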
The next measure we apply is WRMAE−. Figure 6.3 (top) shows the top-quality subgroup obtained with WRMAE−. Since WRMAE− is the inverse of WRMAE+, it is expected that a subgroup obtained with WRMAE− is approximately the complement of the one in the WRMAE+ case. The observed pattern is hence that the same base-model has a lower MAE on its predictions during the early morning, late afternoon and evening on weekdays, plus a larger range around midday at the weekend. With respect to the empirical distribution of prediction errors, as shown in Figure 6.3 (bottom), the local errors within the subgroup clearly show a higher mass around 0 and a lower mass in both tails, which satisfies our requirement for WRMAE−.
We now look at the measures WRMOE and WRMUE together. The obtained subgroup descriptions and the empirical distributions for both measures are given in Figure 6.4 and Figure 6.5 respectively. As it turns out, the top-quality subgroups for both measures are again expressed by the feature weekhour, which further confirms the importance of this feature in capturing the prediction pattern of the base-model.
[Plot: "Station 201 with model 1-1, WRMAE-"; top panel: subgroup description (true/false) over weekhour 12-168; bottom panel: empirical error distributions for population and subgroup.]
Figure 6.3: The best subgroup found with the quality measure WRMAE− (top), and the empirical distribution of errors within the subgroup (bottom).

The over-estimated errors are mainly seen in the afternoon-to-evening interval, where the base-model predicts a higher number of bikes than are actually available at the station. The under-estimated errors occur mostly around noon, which is similar to the pattern in the WRMAE+ case. This can be explained by looking at the error distribution in Figure 6.5 (bottom): the errors below 0 have a more flattened shape, while there are also considerable positive errors. This is because the positive errors are ignored in the design of the measure. As the results show, the model tends to make more negative errors in this subgroup while still making positive ones. If we look at Figure 6.4 (bottom), however, the error distribution shows a peak in the positive region, which indicates that the base-model does sometimes make more over-estimated predictions without making more under-estimations.
Above we showed the top-quality subgroup obtained with each measure, all of which are described through the feature weekhour. In Table 6.2 we show the description features for the top-5 subgroups obtained with each measure. While the top-quality subgroup is described by the same feature across the different measures, a variety of features are selected for the remaining subgroups. Given the relationship between the two WRMAE measures, they end up with the same ranking of features in their subgroups. The WRMAE measures and the WRMUE measure picked the same set of features but with different rankings for the corresponding subgroups.
[Plot: "Station 201 with model 1-1, WRMOE"; top panel: subgroup description (true/false) over weekhour 12-168; bottom panel: empirical error distributions for population and subgroup.]
Figure 6.4: The best subgroup found with the quality measure WRMOE (top), and the empirical distribution of errors within the subgroup (bottom).
Rank   WRMAE+ (WRMAE-)              WRMOE                        WRMUE
1      weekhour                     weekhour                     weekhour
2      hour                         full_profile_3h_diff_bikes   hour
3      full_profile_3h_diff_bikes   windMaxSpeed                 day
4      day                          hour                         full_profile_3h_diff_bikes
5      full_profile_bikes           windDirection                full_profile_bikes

Table 6.2: The description features for the top-5 subgroups with each quality measure.
WRMOE, on the other hand, contains subgroups described by features like windMaxSpeed and windDirection, which are not included in the top-5 subgroups of the other measures. These results illustrate how our proposed measures can make use of different features to describe a set of interesting prediction patterns of one base-model under a given deployment context.
[Plot: "Station 201 with model 1-1, WRMUE"; top panel: subgroup description (true/false) over weekhour 12-168; bottom panel: empirical error distributions for population and subgroup.]
Figure 6.5: The best subgroup found with the quality measure WRMUE (top), and the empirical distribution of errors within the subgroup (bottom).
6.5.2 Model Reuse
In this section we demonstrate the proposed model reuse strategy, which is built on a set of subgroup discovery tasks as demonstrated in the previous section. The experiments are organised as follows. The training process involves the data of all 275 stations during October 2014, as well as the 1200 pre-trained models provided in the dataset. For testing, the full-year data of station 1 to station 10 and the three-month data of station 226 to station 275 are used. The experiments are then performed in two different setups. In the first setup, each single station is seen as a deployment context, so both subgroup discovery and model reuse are performed at the level of individual stations. We refer to this setting as Station-Oriented (SO). In the second setup, all the test stations are seen as a single context, referred to as Non-Station-Oriented (NSO); subgroup discovery and model reuse are hence performed at the level of the set of stations. For instance, if the test set is the full-year data of station 1 to station 10, both the base-models and the subgroups are selected on the data from all ten stations.
In both experiments, the performances of nine methods are compared:
1. Base method 1
2. Base method 2
3. WRMAE+ reuse
4. WRMAE− reuse
5. WRMOE reuse
6. WRMUE reuse
7. 3-mixture reuse (with WRMAE+, WRMOE, WRMUE)
8. 3-mixture reuse (with WRMAE−, WRMOE, WRMUE)
9. 4-mixture reuse (with all four measures)

Up to the top q = 16 subgroups (ranked by the quality measures) are discovered and used in the prediction, and up to 512 base-models (ranked by MAE) are selected and averaged for each deployment context. The station-oriented error curves for station 1 to station 10 and station 226 to station 275 are given in Figure 6.6 and Figure 6.7 respectively. It can be seen that baseline method 2 generally beats baseline method 1. This indicates that, when the training data of the deployment context is limited, selecting a set of pre-trained models and averaging their outputs can help reduce the prediction errors. As previously discussed, in this scenario each station can approximately be thought of as a bootstrap sample; the baseline method is hence similar to a bagging strategy in terms of the averaging. However, an important issue for this approach is deciding the number of models to be averaged, as it tends to over-fit quickly once that number gets large. This is because the pre-trained models are introduced in the order of their MAE rank on the corresponding station, which makes it difficult to know at which rank we have gone through all the "good" models and begin to add the "bad" ones.
For the proposed methods, the figures show that the WRMAE− method achieves the best performance except in one case, where only the top-quality subgroup is used to predict the test set of station 1 to 10. The reason behind the good performance of WRMAE− reuse can be linked to the error distributions given in the previous section. As Figure 6.3 (bottom) shows, with the quality measure WRMAE− the error distribution remains close to a Gaussian with mean 0, but with less variance than in the population. The subgroup can hence be seen as representing a less noisy context, which helps to select another pre-trained model that specifically fits this subgroup. On the other hand, especially with large q, the proposed methods tend to reduce the effect of over-fitting seen with baseline method 2.
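To make the mechanics of the comparison concrete, here is a minimal sketch of the reuse prediction step, under the assumption that each discovered subgroup is paired with the sub-model that performed best on its covered training points, and that instances covered by no subgroup fall back to the averaged base prediction; all callables below are hypothetical stand-ins.

import numpy as np

def reuse_predict(x, base_model, subgroup_rules):
    """Sketch of the reuse step: subgroups are checked in quality order
    (resolving overlaps this way is an assumption); uncovered instances
    fall back to the base prediction."""
    for covers, sub_model in subgroup_rules:
        if covers(x):
            return sub_model(x)
    return base_model(x)

# Toy usage with stand-in callables; the real ones come from the 1200
# pre-trained models.
top_models = [lambda x: 8.0, lambda x: 9.0, lambda x: 10.0]
base_model = lambda x: np.mean([m(x) for m in top_models])    # averaged prediction
rules = [(lambda x: 12 <= x["weekhour"] % 24 <= 18,           # an afternoon subgroup
          lambda x: 6.5)]                                     # its selected sub-model
print(reuse_predict({"weekhour": 15}, base_model, rules))     # -> 6.5
print(reuse_predict({"weekhour": 3}, base_model, rules))      # -> 9.0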
[Plot: four panels "Station 1 to 10, SO" for q = 1, 5, 9, 16; MAE (roughly 2.69-2.76) against the number of models averaged (100-500), with curves for Base 1, Base 2, MAE+, MAE-, MOE, MUE, 3-mix-MAE+, 3-mix-MAE-, and full-mix.]
Figure 6.6: The MAE for all the 9 methods, tested on the full-year data of station 1 to station 10. (Setting SO, with q = 1, 5, 9, 16)

This is mainly because these methods are designed to fit a better model for the data points that show deviating prediction patterns under the base-models. The effect therefore becomes more significant as q gets larger, since more sub-models are involved in the prediction. This makes the choice of the number of averaged models less problematic than in the case of baseline method 2.
The non-station-oriented error curves for the two groups of stations are shown in Figure 6.8 and Figure 6.9 respectively. The first observation is that, for both groups, the MAE of the baseline methods is significantly lower than under the station-oriented setting. This indicates that treating a set of stations (deployment contexts) as a single deployment context can help achieve better performance. It also means that the feature station might not contribute as a description that separates different deployment contexts. The second observation is that baseline method 2 in the non-station-oriented setting generally has a higher MAE than baseline method 1. One possible reason is that, since the training data now mixes different stations and has a larger size, it is more representative of the subsequent test sets. As a result, simply selecting and averaging a number of base-models according to their MAE can cause over-fitting and hence lower the performance of the averaged prediction. Since all proposed methods are essentially based on baseline method 2, although they mostly perform better than baseline method 2, their MAE is still higher than that of baseline method 1 in many cases. However, for q = 16, both WRMAE− reuse and the 3-mixture reuse (with WRMAE−, WRMOE, WRMUE) can still reach an MAE lower than baseline method 1 and achieve the top performance.
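A minimal sketch of baseline method 2 (with hypothetical callables standing in for the pre-trained models) makes the over-fitting risk easy to see: the models are averaged in MAE order, so there is no obvious point at which the "good" models end.

import numpy as np

def baseline_2_predict(models, X_dep, y_dep, X_test, k):
    """Sketch of baseline method 2: rank all pre-trained models by their
    MAE on the one-month deployment data, then average the predictions of
    the top k of them on the test data."""
    maes = [np.abs(m(X_dep) - y_dep).mean() for m in models]
    top_k = [models[i] for i in np.argsort(maes)[:k]]
    return np.mean([m(X_test) for m in top_k], axis=0)

Sweeping k from 1 up to 512 traces out error curves like those in the figures.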
[Plot: four panels "Station 226 to 275, SO" for q = 1, 5, 9, 16; MAE (roughly 2.02-2.11) against the number of models averaged (100-500), same nine methods.]
Figure 6.7: The MAE for all the 9 methods, tested on the three-month data of station 226 to station 275. (Setting SO, with q = 1, 5, 9, 16)
6.6 Chapter Summary
In this chapter we demonstrated the idea of model-based subgroup discovery from another perspective. Instead of aiming only at obtaining subgroups and the corresponding knowledge from the data, we moved to the area of predictive machine learning. Specifically, we considered the question of how to adapt subgroup discovery to the task of reusing pre-trained linear regression models. As the notion of interestingness changed, we proposed four additional measures designed for discovering deviating absolute errors, over-estimated errors, and under-estimated errors of the base regression model. A model reuse strategy was then designed to make use of the obtained subgroups by selecting sub-models that give the top performance for the data points in the corresponding subgroup. The experiments further demonstrated the performance of our reuse strategy with different measures and different combinations of measures. On the MoReBikeS dataset, our proposed approaches outperformed both the pre-trained model with the lowest MAE on the training set and the averaged prediction from a set of top-performing models.
While our proposed method is designed to reuse linear regression models, it can potentially be generalised to reuse classification models that output scores to quantify their beliefs about the labels. As such scores also suffer from over-estimation and under-estimation, adopting our methods can potentially lead to better-calibrated scores being output by reusing the pre-trained models in a classification setting.
[Plot: four panels "Station 1 to 10, NSO" for q = 1, 5, 9, 16; MAE (roughly 2.69-2.75) against the number of models averaged (100-500), same nine methods.]
Figure 6.8: The MAE for all the 9 methods, tested on the full-year data of station 1 to station 10. (Setting NSO)
[Plot: four panels "Station 226 to 275, NSO" for q = 1, 5, 9, 16; MAE (roughly 1.98-2.10) against the number of models averaged (100-500), same nine methods.]
Figure 6.9: The MAE for all the 9 methods, tested on the three-month data of station 226 to station 275. (Setting NSO)
Chapter 7

Conclusions and Future Work

In this thesis, we proposed and investigated a generic, evaluation-driven approach to find a broad range of statistically deviating subgroups in a given dataset. We named this approach Model-Based Subgroup Discovery and designed it as an alternative to the existing frameworks of Traditional Subgroup Discovery and Exceptional Model Mining. In this chapter, we provide an overall summary of the thesis and discuss some potential directions for future work.
7.1 Thesis Summary
While traditional subgroup discovery quantifies interestingness by comparing the average value of a univariate target variable, the exceptional model mining framework extends this to support mining on multivariate targets as well as a larger range of statistical models. Both traditional subgroup discovery and exceptional model mining are parameter-driven approaches. While in such approaches the interestingness can be explicitly defined through a given parameter of the corresponding statistical model, this comes with two disadvantages. The first is that a new quality measure must be designed for each model class to capture the interestingness of the specified parameter; this requirement increases the development cost of the whole subgroup discovery process. The second is that the user has to select a parameter of interest in the first place, which is not straightforward for models with multiple or even a large number of parameters: users might be interested in models with any deviating parameter, or with a combination of deviating parameters.
To overcome these disadvantages, we proposed the idea of using evaluation-driven measures to quantify the interestingness of candidate subgroups by evaluating the goodness of local models. In Chapter 3, we showed that proper scoring rules, a family of functions widely used to evaluate probability estimates, can be adopted to construct quality measures in the context of traditional subgroup discovery. The interestingness is evaluated according to whether a subgroup provides a better statistical summary of the overall dataset. With the further development of a generative model, the proposed measures outperform existing measures regarding the generalisation of the obtained subgroups and their local class distributions.
In Chapter 4, we investigated how the measures derived from proper scoring rules can be generalised to an arbitrary probabilistic model class. Given that the log-loss of a probability estimate is the negative log-likelihood of a probabilistic model on a given instance, we selected it to design two generic evaluation-driven measures. The first measure, Weighted Divergence, can locate subgroups where the local model has a higher log-likelihood compared to the global model. The second measure, Weighted Relative Likelihood, uses only the global model to find subgroups where the data points are either more likely or less likely to be observed according to the global model. This makes weighted relative likelihood a computationally efficient measure, especially for model classes that require a longer training time. With synthetic data, we compared the proposed measures with existing parameter-driven measures and demonstrated the differences in their interestingness. Two experiments on different smart home datasets further showed that the proposed measures can be generalised with different base models for various purposes.
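A minimal sketch of the two measures, assuming they take the same coverage-weighted form as the other weighted-relative measures in this thesis (the exact definitions are in Chapter 4), with per-instance log-likelihood arrays and a boolean cover mask:

import numpy as np

def weighted_divergence(ll_local, ll_global, covered):
    """Sketch of Weighted Divergence: the coverage-weighted gain in mean
    log-likelihood of the local model over the global model on the
    covered instances (the exact form is an assumption here)."""
    covered = np.asarray(covered, dtype=bool)
    return covered.mean() * (ll_local[covered].mean() - ll_global[covered].mean())

def weighted_relative_likelihood(ll_global, covered):
    """Sketch of Weighted Relative Likelihood: needs only the global
    model, measuring how much more (or less) likely the covered instances
    are under it than the dataset as a whole."""
    covered = np.asarray(covered, dtype=bool)
    return covered.mean() * (ll_global[covered].mean() - ll_global.mean())

The sketch makes the efficiency argument visible: weighted relative likelihood only reads off log-likelihoods of the already-trained global model, while weighted divergence needs a local model fitted on every candidate subgroup.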
In Chapter 5, we further demonstrated the flexibility of the model-based approach by applying it to sequential datasets. We selected the Hidden Markov Model as the base model class, given its high interpretability. A synthetic dataset was again created to demonstrate the advantages of HMM-based subgroup discovery, compared to parameter-driven measures as well as model-based approaches with a non-sequential base model class. The experiments gave a case study of the HMM-based approach in two different settings: intra-sequence discovery aims to find deviating sub-sequences within an overall sequence, and inter-sequence discovery aims to find an interesting set of sequences among all the sequences in a dataset.
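For the inter-sequence setting, a small sketch (using hmmlearn as an assumed stand-in for the thesis' own implementation) shows how a global HMM can score whole sequences so that deviating ones can be grouped:

import numpy as np
from hmmlearn import hmm

# Stand-in sequences; in the thesis these would be the sensor sequences.
sequences = [np.random.randn(50, 2) for _ in range(10)]
lengths = [len(s) for s in sequences]

# Fit one global HMM on all sequences, then compute each sequence's mean
# per-step log-likelihood under it; sequences scoring unusually high or
# low are the candidates for interesting subgroups.
global_hmm = hmm.GaussianHMM(n_components=3, random_state=0)
global_hmm.fit(np.vstack(sequences), lengths)
scores = [global_hmm.score(s) / len(s) for s in sequences]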
Finally, in Chapter 6 we investigated how the model-based approach can be used to reuse trained models in a different context, which is a predictive machine learning task. As the task shifted to a different area, we proposed four semi-evaluation-driven measures to find deviating prediction patterns of linear regression models. A model reuse strategy was also introduced to make use of the obtained subgroups, by selecting sub-models that have the top performance within the subgroups. While the proposed approach outperforms the baseline approaches, the experiments also provided a detailed illustration of the mechanism of the whole model reuse system.
7.2 Future Work
We divide the future work of this thesis into three directions. The first direction is to develop model classes and associated algorithms for further probabilistic modelling, which contains potential work on both the theoretical and the application side. On the theoretical side, one line of work is to involve further Bayesian modelling in the base model classes. As shown in Chapter 3, by associating the Dirichlet prior with the generative structure, we can increase the generalisation of the obtained local models. Therefore, while in Chapter 4 and Chapter 5 we only adopted maximum likelihood estimation for simplicity, the assumption is that the estimated parameters can also be more accurate if an appropriate prior is involved in the modelling. Further research is also required to increase the computational efficiency for model classes with a large number of parameters, as such models typically require a longer time to train, as is commonly seen in hierarchical Bayesian modelling; the framework of topic models [83, 97] is one example. Many methods have been proposed in the field of variational inference to train such probabilistic models more efficiently [98, 72, 73, 99]. Investigating how to implement such fast inference methods is another future research task to improve the model-based approach in the probabilistic direction.
While in this thesis the covering algorithm was set to remove the discovered instances after each iteration, the CN2-SD algorithm has shown that a weighted covering algorithm can improve the overall interestingness of the obtained subgroups. However, the weighted covering algorithm in CN2-SD counts class values in the context of traditional subgroup discovery (e.g. a univariate discrete target), and it cannot be directly generalised to the model-based approach. Another piece of theoretical work is therefore to investigate weighted covering algorithms for the proposed evaluation-driven measures with probabilistic model classes.
On the application side, we have shown that simple probabilistic models can be directly applied to find subgroups for indoor localisation and abnormality detection within a smart home. These models can be further developed to incorporate more types of sensors, as well as prior beliefs on the parameters, to achieve a better modelling of the statistical information within the house, which makes the model more accurate for tasks like indoor localisation. Developing more sequential probabilistic model classes is also a valuable option. Regarding Hidden Markov Models, many variations have been proposed, such as coupled hidden Markov models [100], factorial hidden Markov models [101], and hidden semi-Markov models [102]; supporting these variations would increase the capacity of the model-based approach on datasets where the Markov assumption is valid. Gaussian Processes [27] have recently received great interest as a non-parametric Bayesian probabilistic approach.
Given their advantages in modelling the uncertainty of time series and sequential data [103, 104, 105], Gaussian processes are also a potential base model class for the future development of the model-based approach.
The second direction, which concentrates more on the theoretical side, is to develop further evaluation-driven measures. As each model evaluation approach comes with its own notion of interestingness, the contribution of this direction is to create new kinds of interestingness for a potentially large set of base model classes. While in this thesis we mainly derived our measures from proper scoring rules, other divergence measures like the f-divergence can also provide measures with different interpretations [70]. Non-probabilistic evaluation measures are also worth further research in specific contexts. For instance, the F-measure has been widely used for information retrieval and for evaluating classifiers on unbalanced data [106], which makes it suitable for assessing subgroups in similar tasks and contexts.
The last direction is to investigate the role of the model-based approach in the context of predictive machine learning. As demonstrated in this thesis, it can potentially lead to approaches that help build stronger model ensembles. Some recent research in the field of classifier calibration [107, 108, 109, 110, 111] has focused on improving the probabilistic outputs of a classifier. As discussed in Chapter 6, developing our model reuse approach to support scoring classifiers is another important piece of further work for this thesis, which aims to extend the proposed regression approaches to provide better calibrated probabilities for scoring classifiers.
Bibliography
[1] Rob Kitchin. Big data, new epistemologies and paradigm shifts. Big Data & Society, 1(1):2053951714528481, 2014.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[3] Peter Flach. Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press, 2012.
[4] Kevin P Murphy. Machine learning: A probabilistic perspective. MIT Press, 2012.
[5] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37, 1996.
[6] Usama M Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy. Advances in knowledge discovery and data mining, volume 21. AAAI Press, Menlo Park, 1996.
[7] Jiawei Han, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier, 2011.
[8] Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.
[9] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs: From multi-relational link prediction to automated knowledge graph construction. arXiv preprint arXiv:1503.00759, 2015.
[10] Willi Klösgen. Explora: A multipattern and multistrategy discovery assistant. In Advances in Knowledge Discovery and Data Mining, pages 249–271. American Association for Artificial Intelligence, 1996.
[11] Stefan Wrobel. An algorithm for multi-relational discovery of subgroups. In Principles of Data Mining and Knowledge Discovery, pages 78–87. Springer, 1997.
[12] Nada Lavrač, Branko Kavšek, Peter Flach, and Ljupčo Todorovski. Subgroup discovery with CN2-SD. The Journal of Machine Learning Research, 5:153–188, 2004.
[13] Francisco Herrera, Cristóbal José Carmona, Pedro González, and María José del Jesus. An overview on subgroup discovery: Foundations and applications. Knowledge and Information Systems, 29(3):495–525, 2011.
[14] Dennis Leman, Ad J Feelders, and Arno Knobbe. Exceptional model mining. In Machine Learning and Knowledge Discovery in Databases, pages 1–16. Springer, 2008.
[15] Wouter Duivesteijn, Arno Knobbe, Ad Feelders, and Matthijs van Leeuwen. Subgroup discovery meets Bayesian networks – an exceptional model mining approach. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 158–167. IEEE, 2010.
[16] Wouter Duivesteijn, Ad J Feelders, and Arno Knobbe. Exceptional model mining. Data Mining and Knowledge Discovery, 30(1):47–98, 2016.
[17] Florian Lemmerich, Martin Becker, Philipp Singer, Denis Helic, Andreas Hotho, and Markus Strohmaier. Mining subgroups with exceptional transition behavior. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 965–974. ACM, 2016.
[18] Lawrence R Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, February 1989.
[19] Thomas G Dietterich. Machine learning for sequential data: A review. In Structural, Syntactic, and Statistical Pattern Recognition, pages 15–30. Springer, 2002.
[20] Ni Zhu, Tom Diethe, Massimo Camplani, Lili Tao, Alison Burrows, Niall Twomey, Dritan Kaleshi, Majid Mirmehdi, Peter Flach, and Ian Craddock. Bridging e-health and the internet of things: The SPHERE project. Intelligent Systems, IEEE, 30(4):39–46, 2015.
[21] Hao Song, Peter Flach, and Georgios Kalogridis. Dataset shift detection with model-based subgroup discovery. 2nd International Workshop on Learning over Multiple Contexts (LMCE 2015), 2015.
[22] Hao Song and Peter Flach. Model reuse with subgroup discovery. In Proceedings of the 2015th International Conference on ECML PKDD Discovery Challenge – Volume 1526, ECMLPKDDDC'15, pages 5–18, Aachen, Germany, 2015. CEUR-WS.org.
[23] Hao Song, Meelis Kull, Peter Flach, and Georgios Kalogridis. Subgroup discovery with proper scoring rules. In Paolo Frasconi, Niels Landwehr, Giuseppe Manco, and Jilles Vreeken, editors, Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part II, pages 492–510, Cham, 2016. Springer International Publishing.
[24] Arno Siebes. Data surveying: Foundations of an inductive query language. In KDD, pages 269–274, 1995.
[25] Wouter Duivesteijn. Exceptional model mining. PhD thesis, Leiden Institute of Advanced Computer Science (LIACS), Faculty of Science, Leiden University, 2013.
[26] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines: And Other Kernel-based Learning Methods. Cambridge University Press, New York, NY, USA, 2000.
[27] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning, volume 1. MIT Press, Cambridge, 2006.
[28] Marc G Genton. Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research, 2(Dec):299–312, 2001.
[29] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[30] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[31] Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
[32] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
[33] Laurent Hyafil and Ronald L Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976.
[34] Florian Lemmerich, Martin Atzmueller, and Frank Puppe. Fast exhaustive subgroup discovery with numerical target concepts. Data Mining and Knowledge Discovery, 30(3):711–762, 2016.
[35] Yuehua Xu and Alan Fern. On learning linear ranking functions for beam search. In Proceedings of the 24th International Conference on Machine Learning, pages 1047–1054. ACM, 2007.
[36] Cristóbal Romero, Pedro González, Sebastián Ventura, María José del Jesús, and Francisco Herrera. Evolutionary algorithms for subgroup discovery in e-learning: A practical application using Moodle data. Expert Systems with Applications, 36(2):1632–1644, 2009.
[37] Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261–283, 1989.
[38] Stefan Wrobel. Inductive logic programming for knowledge discovery in databases. In Relational Data Mining, pages 74–101. Springer, 2001.
[39] Wouter Duivesteijn and Julia Thaele. Understanding where your classifier does (not) work – the SCaPE model class for EMM. In Data Mining (ICDM), 2014 IEEE International Conference on, pages 809–814. IEEE, 2014.
[40] Nada Lavrač, Peter Flach, and Blaz Zupan. Rule evaluation measures: A unifying view. In International Conference on Inductive Logic Programming, pages 174–185. Springer Berlin Heidelberg, 1999.
[41] Martin Atzmueller and Florian Lemmerich. Fast subgroup discovery for continuous target concepts. In Foundations of Intelligent Systems, pages 35–44. Springer, 2009.
[42] Katherine Moreland and Klaus Truemper. Discretization of target attributes for subgroup discovery. In Machine Learning and Data Mining in Pattern Recognition, pages 44–52. Springer, 2009.
[43] Tarek Abudawood and Peter Flach. Evaluation measures for multi-class subgroup discovery. In Machine Learning and Knowledge Discovery in Databases, pages 35–50. Springer, 2009.
[44] Ryszard S Michalski. Pattern recognition as rule-guided inductive inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-2(4):349–361, July 1980.
[45] Peter Clark and Robin Boswell. Rule induction with CN2: Some recent improvements. In Machine Learning – EWSL 91, pages 151–163. Springer, 1991.
[46] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD '93, pages 207–216, New York, NY, USA, 1993. ACM.
[47] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, A Inkeri Verkamo, et al. Fast discovery of association rules. Advances in Knowledge Discovery and Data Mining, 12(1):307–328, 1996.
[48] William W Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123, 1995.
[49] Martin Atzmueller. Subgroup discovery. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(1):35–49, 2015.
[50] Guozhu Dong and Jinyan Li. Efficient mining of emerging patterns: Discovering trends and differences. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 43–52. ACM, 1999.
[51] Stephen D Bay and Michael J Pazzani. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 5(3):213–246, 2001.
[52] Esther Galbrun and Pauli Miettinen. From black and white to full color: Extending redescription mining outside the boolean world. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(4):284–303, 2012.
[53] Petra Kralj Novak, Nada Lavrač, and Geoffrey I Webb. Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. The Journal of Machine Learning Research, 10:377–403, 2009.
[54] María José Del Jesus, Pedro González, Francisco Herrera, and Mikel Mesonero. Evolutionary fuzzy rule induction process for subgroup discovery: A case study in marketing. Fuzzy Systems, IEEE Transactions on, 15(4):578–592, 2007.
[55] Michael Mampaey, Siegfried Nijssen, Ad Feelders, and Arno Knobbe. Efficient algorithms for finding richer subgroup descriptions in numeric and nominal data. In IEEE International Conference on Data Mining, pages 499–508, 2012.
[56] Martin Atzmueller and Frank Puppe. SD-Map – a fast algorithm for exhaustive subgroup discovery. In Knowledge Discovery in Databases: PKDD 2006, pages 6–17. Springer, 2006.
[57] Henrik Grosskreutz and Stefan Rüping. On subgroup discovery in numerical domains. Data Mining and Knowledge Discovery, 19(2):210–226, 2009.
[58] Dragan Gamberger and Nada Lavrač. Expert-guided subgroup discovery: Methodology and application. Journal of Artificial Intelligence Research, 17(1):501–527, 2002.
[59] Dragan Gamberger, Nada Lavrač, and Goran Krstačić. Active subgroup mining: A case study in coronary heart disease risk group detection. Artificial Intelligence in Medicine, 28(1):27–57, 2003.
[60] Martin Atzmueller, Frank Puppe, and Hans-Peter Buscher. Towards knowledge-intensive subgroup discovery. In LWA, pages 111–117. Citeseer, 2004.
[61] Nada Lavrač, Bojan Cestnik, Dragan Gamberger, and Peter Flach. Decision support through subgroup discovery: Three case studies and the lessons learned. Machine Learning, 57(1-2):115–143, 2004.
[62] Marvin Meeng and Arno Knobbe. Flexible enrichment with Cortana – software demo. In Proceedings of BeneLearn, pages 117–119, 2011.
[63] Martin Atzmueller and Florian Lemmerich. Vikamine – open-source subgroup discovery, pattern mining, and analytics. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 842–845. Springer, 2012.
[64] Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, et al. Orange: data mining toolbox in Python. Journal of Machine Learning Research, 14(1):2349–2353, 2013.
[65] Robert L Winkler. Scoring rules and the evaluation of probability assessors. Journal of the American Statistical Association, 64(327):1073–1078, 1969.
[66] R. L. Winkler, Javier Muñoz, José L. Cervera, José M. Bernardo, Gail Blattenberger, Joseph B. Kadane, Dennis V. Lindley, Allan H. Murphy, Robert M. Oliver, and David Ríos-Insua. Scoring rules and the evaluation of probabilities. Test, 5(1):1–60, June 1996.
[67] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[68] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
[69] Meelis Kull and Peter Flach. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. In Machine Learning and Knowledge Discovery in Databases, pages 68–85. Springer International Publishing, 2015.
[70] Mark D Reid and Robert C Williamson. Information, divergence and risk for binary experiments. The Journal of Machine Learning Research, 12:731–817, 2011.
[71] Yisen Wang, Chaobing Song, and Shu-Tao Xia. Improving decision trees by Tsallis entropy information metric method. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 4729–4734, July 2016.
[72] Thomas P Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.
[73] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
[74] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[75] Chuong B Do and Serafim Batzoglou. What is the expectation maximization algorithm? Nature Biotechnology, 26(8):897–899, 2008.
[76] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[77] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and regression trees. CRC Press, 1984.
[78] A Ng and A Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14:841, 2002.
[79] Foster Provost and Tom Fawcett. Robust classification for imprecise environments. Machine Learning, 42(3):203–231, 2001.
[80] Johannes Fürnkranz and Peter A Flach. ROC 'n' rule learning: Towards a better understanding of covering algorithms. Machine Learning, 58(1):39–77, 2005.
[81] Peter A Flach. The geometry of ROC space: understanding machine learning metrics through ROC isometrics. In ICML, pages 194–201, 2003.
[82] Moshe Lichman. UCI machine learning repository, 2013.
[83] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
[84] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30, 2006.
[85] Bo Tan, Alison Burrows, Robert Piechocki, Ian Craddock, Qingchao Chen, Karl Woodbridge, and Kevin Chetty. Wi-Fi based passive human motion sensing for in-home healthcare applications. In Internet of Things (WF-IoT), 2015 IEEE 2nd World Forum on, pages 609–614. IEEE, 2015.
[86] Pierluigi Casale, Oriol Pujol, and Petia Radeva. Personalization and user verification in wearable systems using biometric walking patterns. Personal and Ubiquitous Computing, 16(5):563–580, 2012.
[87] Niall Twomey, Tom Diethe, Meelis Kull, Hao Song, Massimo Camplani, Sion Hannuna, Xenofon Fafoutis, Ni Zhu, Pete Woznowski, Peter Flach, et al. The SPHERE challenge. arXiv preprint arXiv:1603.00797, 2016.
[88] Diane J Cook, M Schmitter-Edgecombe, et al. Assessing the quality of activities in a smart environment. Methods Inf Med, 48(5):480–485, 2009.
[89] Tom Diethe, Niall Twomey, and Peter Flach. Bayesian modelling of the temporal aspects of smart home activity with circular statistics. In Machine Learning and Knowledge Discovery in Databases, pages 279–294. Springer, 2015.
[90] Yu Chen, Tom Diethe, and Peter Flach. ADLTM: A topic model for discovery of activities of daily living in a smart home. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 1404–1410. AAAI Press, 2016.
[91] Stuart Russell. Unifying logic and probability. Communications of the ACM, 58(7):88–97, 2015.
[92] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459, 2015.
[93] Jose G Moreno-Torres, Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V Chawla, and Francisco Herrera. A unifying view on dataset shift in classification. Pattern Recognition, 45(1):521–530, 2012.
[94] José Hernández-Orallo, Nicolas Lachiche, and Adolfo Martínez-Usó. Predictive models for multidimensional data when the resolution context changes. In LMCE 2014, First International Workshop on Learning over Multiple Contexts, 2014.
[95] Meelis Kull and Peter A Flach. Patterns of dataset shift. In LMCE 2014, First International Workshop on Learning over Multiple Contexts, 2014.
[96] Niall Twomey and Peter A Flach. Context modulation of sensor data applied to activity recognition in smart homes. In LMCE 2014, First International Workshop on Learning over Multiple Contexts, 2014.
[97] David M Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
[98] Zoubin Ghahramani and Matthew J Beal. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems, pages 449–455, 2000.
[99] Wittawat Jitkrittum, Arthur Gretton, S. M. Ali Eslami, Balaji Lakshminarayanan, Dino Sejdinovic, and Zoltán Szabó. Just-in-time kernel regression for expectation propagation. In International Conference on Machine Learning (ICML) – Large-Scale Kernel Learning: Challenges and New Opportunities workshop, 2015.
[100] M. Brand, N. Oliver, and A. Pentland. Coupled hidden Markov models for complex action recognition. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 994–999, June 1997.
[101] Zoubin Ghahramani and Michael I Jordan. Factorial hidden Markov models. Machine Learning, 29(2-3):245–273, 1997.
[102] Thi V Duong, Hung Hai Bui, Dinh Q Phung, and Svetha Venkatesh. Activity recognition and abnormality detection with the switching hidden semi-Markov model. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 838–845. IEEE, 2005.
[103] Carl Henrik Ek, Philip HS Torr, and Neil D Lawrence. Gaussian process latent variable models for human pose estimation. In International Workshop on Machine Learning for Multimodal Interaction, pages 132–143. Springer, Berlin, Heidelberg, 2007.
[104] James Hensman, Neil D Lawrence, and Magnus Rattray. Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters. BMC Bioinformatics, 14(1):252, 2013.
[105] Stephen Roberts, M Osborne, M Ebden, Steven Reece, N Gibson, and S Aigrain. Gaussian processes for time-series modelling. Phil. Trans. R. Soc. A, 371(1984):20110550, 2013.
[106] Peter Flach and Meelis Kull. Precision-recall-gain curves: PR analysis done right. In C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 838–846. Curran Associates, Inc., 2015.
[107] Terry Windeatt and Reza Ghaderi. Coding and decoding strategies for multi-class learning problems. Information Fusion, 4(1):11–21, 2003.
[108] José Hernández-Orallo, Peter Flach, and Cesar Ferri. A unified view of performance metrics: Translating threshold choice into expected classification loss. The Journal of Machine Learning Research, 13(1):2813–2869, 2012.
[109] Peter A Flach. Classifier calibration. Encyclopedia of Machine Learning and Data Mining, pages 1–8, 2016.
[110] Meelis Kull and Peter A Flach. Reliability maps: A tool to enhance probability estimates and improve classification accuracy. In Machine Learning and Knowledge Discovery in Databases, pages 18–33. Springer, 2014.
[111] Meelis Kull, Telmo Silva Filho, and Peter Flach. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 623–631, Fort Lauderdale, FL, USA, 20–22 April 2017. PMLR.
List of Figures
1.1 An example bivariate data set with two subgroups (shaded areas) defined on the discretised features, both capturing an area of statistical deviation in comparison to the overall population. The subgroup on the left is preferred by a commonly used evaluation measure (WRAcc) while the right subgroup is preferred by one of the measures we propose in this paper.
3.1 Isometrics for (a) local class distribution with π(1)/π(2) = 1, (b) local class distribution with π(1)/π(2) = 3, (c) relative size with π(1)/π(2) = 1, (d) relative size with π(1)/π(2) = 3.
3.2 Isometrics for (a) WRAcc with π(1)/π(2) = 1, (b) WRAcc with π(1)/π(2) = 3, (c) IG-BS with π(1)/π(2) = 1, (d) IG-BS with π(1)/π(2) = 3.
3.3 Isometrics for ÎG-BS with (a) π(1)/π(2) = 1, N = 10, (b) π(1)/π(2) = 1, N = 100, (c) π(1)/π(2) = 1, N = 1000, (d) π(1)/π(2) = 3, N = 10, (e) π(1)/π(2) = 3, N = 100, (f) π(1)/π(2) = 3, N = 1000.
3.4 Isometrics for PSR-BS with (a) π(1)/π(2) = 1, N = 10, (b) π(1)/π(2) = 1, N = 100, (c) π(1)/π(2) = 1, N = 1000, (d) π(1)/π(2) = 3, N = 10, (e) π(1)/π(2) = 3, N = 100, (f) π(1)/π(2) = 3, N = 1000.
4.1 Some examples of the synthetic datasets. Red points and blue points are the random targets drawn from the two Gaussian distributions respectively. The ellipses are the 95% confidence levels of the corresponding Gaussians.
4.2 The critical difference diagram with the averaged ranking of the selected quality measures, on the F-score computed with the true description D1, which corresponds to the minority Gaussian component in the synthetic dataset.
4.3 The critical difference diagram with the averaged ranking of the selected quality measures, on the F-score computed with the true description D2, which corresponds to the majority Gaussian component in the synthetic dataset.
4.4 The difference between the room distribution within the obtained subgroups and that of the overall dataset. Each subgroup is the top-quality subgroup discovered with the corresponding measure.
4.5 The difference between the activity distribution within the obtained subgroups and that of the overall dataset. Each subgroup is the top-quality subgroup discovered with the corresponding measure.
4.6 The floor plan of the smart house in the twor2009 data set. Each small circle with the character M represents the location of one of the 51 PIR sensors.
4.7 The factor graph of the base model. Both the sensor readings y(2) and time stamps y(1) are treated as observations given the hidden activities z. N is the total number of training observations.
4.8 The estimated parameter µtime with both z values, of the global model and local models within the top-quality subgroup found by each measure.
4.9 The estimated parameter µlocation with both z values, of the global model and local models within the top-quality subgroup found by each measure.
4.10 The computation time for evaluating a single subgroup with the measures WD and WRL.
5.1 Some examples of random sequences generated from two HMMs. Each pair of figures on the left shows the observations and hidden states drawn from one HMM, and the pair of figures on the right shows sequences drawn from another HMM. The settings are given in the caption of each group of figures. The character I indicates the two HMMs have different initial probabilities, T different transition probabilities, E different emission probabilities, and ITE means all three sets of parameters differ between the two models.
5.2 The critical difference diagram with the averaged ranking of the selected quality measures, on the F-score computed with the true description D1, which corresponds to the HMM with the relatively smaller mixture weight in the synthetic dataset.
5.3 The critical difference diagram with the averaged ranking of the selected quality measures, on the F-score computed with the true description D2, which corresponds to the HMM with the relatively larger mixture weight in the synthetic dataset.
5.4 The difference between the room distribution within the obtained subgroups and that of the overall dataset (script 01). Each subgroup is the top-quality subgroup discovered with the corresponding measure.
5.5 The difference between the activity distribution within the obtained subgroups and that of the overall dataset (script 01). Each subgroup is the top-quality subgroup discovered with the corresponding measure.
5.6 The log-likelihood of the global model (blue) on all the 10 scripts in the dataset, and the log-likelihood of the local model (red) within the top-quality subgroup (highlighted) obtained by the measure WD.
5.7 The differences between the local model and the global model on their initial probabilities and transition probabilities (i.e. global parameters subtracted from the local parameters).
5.8 The differences between the local model and the global model on the mean and covariance of each state (i.e. global parameters subtracted from the local parameters).
5.9 The log-likelihood of the global model (blue) on all the 10 scripts in the dataset, with the mean log-likelihood (dashed) within each script. The subgroup obtained with the measure WRL+ is highlighted, with its mean log-likelihood drawn in red.
5.10 The differences between the subgroup obtained with measure WRL+ and the population on the marginal distributions of the annotated activities. The bar chart is drawn in black to distinguish it from the previous case of differences on the initial probabilities.
5.11 The log-likelihood of the global model (blue) on all the 10 scripts in the dataset, with the mean log-likelihood (dashed) within each script. The subgroup obtained with the measure WRL− is highlighted, with its mean log-likelihood drawn in red.
5.12 The differences between the subgroup obtained with measure WRL− and the population on the marginal distributions of the annotated activities. The bar chart is drawn in black to distinguish it from the previous case of differences on the initial probabilities.
6.1 The prediction for station 201 from pre-trained model 1-1 (top), and the empirical distribution of its errors (bottom).
6.2 The best subgroup found with the quality measure WRMAE+ (top), and the empirical distribution of errors within the subgroup (bottom).
6.3 The best subgroup found with the quality measure WRMAE− (top), and the empirical distribution of errors within the subgroup (bottom).
6.4 The best subgroup found with the quality measure WRMOE (top), and the empirical distribution of errors within the subgroup (bottom).
6.5 The best subgroup found with the quality measure WRMUE (top), and the empirical distribution of errors within the subgroup (bottom).
6.6 The MAE for all the 9 methods, tested on the full-year data of station 1 to station 10. (Setting SO, with q = 1, 5, 9, 16)
6.7 The MAE for all the 9 methods, tested on the three-month data of station 226 to station 275. (Setting SO, with q = 1, 5, 9, 16)
6.8 The MAE for all the 9 methods, tested on the full-year data of station 1 to station 10. (Setting NSO)
6.9 The MAE for all the 9 methods, tested on the three-month data of station 226 to station 275. (Setting NSO)
List of Tables
3.1 Micro-averaged F-scores on the artificial data, for different class distributions (π(1)). The best results for each row are shown in bold.
3.2 Average Brier scores on the artificial data. The best results are shown in bold.
3.3 Average log-loss on the artificial data. The best results are shown in bold.
3.4 The 20 UCI data sets used in the experiments.
3.5 Average BS for the UCI data sets. The best results are shown in bold.
3.6 Average LL for the UCI data sets. The best results are shown in bold.
4.1 A comparison among evaluation-driven, semi-evaluation-driven and parameter-driven quality measures.
4.2 Averaged precision, recall, and the rank of F-score of the 100 datasets for each C value, computed with the true description D1, which corresponds to the minority Gaussian component in the synthetic dataset. The top value for each C is shown in bold. The Friedman test p-values are 1.53e-119, 1.32e-93, and 8.54e-118 respectively.
4.3 Averaged precision, recall, and the rank of F-score of the 100 datasets for each C value, computed with the true description D2, which corresponds to the majority Gaussian component in the synthetic dataset. The top value for each C is shown in bold. The Friedman test p-values are 7.27e-120, 4.92e-148, and 5.45e-141 respectively.
4.4 The descriptions of the top-quality subgroups obtained with each measure, on the scripted dataset 01, with their relative sizes. NA indicates the wearable device is out of the range of the corresponding access point.
4.5 The generalised information gain (log-loss) on the annotated locations for the top-quality subgroup found with each measure on each scripted dataset. The subscript shows the rank of each measure on the corresponding dataset. The top value for each dataset is shown in bold. The Friedman test p-value is 6.15e-4.
4.6 The generalised information gain (log-loss) on the annotated activities for the top-quality subgroup found with each measure on each scripted dataset. The subscript shows the rank of each measure on the corresponding dataset. The top value for each dataset is shown in bold. The Friedman test p-value is 1.05e-1.
4.7 Top subgroups discovered with the quality measures WD, WRL+ and WRL− respectively.
5.1 A toy data set that contains a person's food / sleep record for the duration of one year.
5.2 Averaged precision, recall, and the rank of F-score of the 100 datasets for each C value, computed with the true description D1, which corresponds to the HMM with the relatively smaller mixture weight in the synthetic dataset. The top value for each C is shown in bold. The Friedman test p-values are 2.97e-81, 3.92e-13, and 1.22e-46 respectively.
5.3 Averaged precision, recall, and the rank of F-score of the 100 datasets for each C value, computed with the true description D2, which corresponds to the HMM with the relatively larger mixture weight in the synthetic dataset. The top value for each C is shown in bold. The Friedman test p-values are 2.97e-81, 2.37e-254, and 2.64e-213 respectively.
5.4 The generalised information gain (log-loss) on the annotated locations for the top-quality subgroup found with each measure on each scripted dataset. The top value for each row is shown in bold. The Friedman test p-value is 1.44e-4.
5.5 The generalised information gain (log-loss) on the annotated activities for the top-quality subgroup found with each measure on each scripted dataset. The top value for each row is shown in bold. The Friedman test p-value is 3.16e-3.
6.1 The 25 attributes in the October dataset and their properties.
6.2 The description features for the top-5 subgroups with each quality measure.
List of Acronyms
SD        Subgroup Discovery
MBSD      Model-Based Subgroup Discovery
EMM       Exceptional Model Mining
RL        Relative Mean Log-Likelihood
RAcc      Relative Accuracy
WRAcc     Class-wise Weighted Relative Accuracy
MWRAcc    Multi-class Weighted Relative Accuracy
WMWRAcc   Weighted Multi-class Weighted Relative Accuracy
CWRAcc    Continuous Weighted Relative Accuracy
PSRs      Proper Scoring Rules
i.i.d.    Independent and Identically Distributed
EM        Expectation Maximisation
ADLs      Activities of Daily Life
BS        Brier Score
LL        Log Loss
ROC       Receiver Operating Characteristic
pmf       Probability Mass Function
pdf       Probability Density Function
RSSI      Received Signal Strength Indicator
IoT       Internet of Things
HMM       Hidden Markov Model
WD        Weighted Divergence
WRL       Weighted Relative Likelihood
WRMAE     Weighted Relative Mean Absolute Error
WRMOE     Weighted Relative Mean Over-Estimated Error
WRMUE     Weighted Relative Mean Under-Estimated Error
MAE       Mean Absolute Error