Learning from Ubiquitous Data Streams

Pedro Pereira Rodrigues

Learning from Ubiquitous Data Streams

Clustering Data and Data Sources

Porto 2010

Doctoral Programme in Computer Science

Departamento de Ciência de Computadores

GRAPHIC PRODUCTION
MEDISA - EDIÇÕES E DIVULGAÇÕES CIENTÍFICAS, LDA

EDITION
AUTHOR

ISBN
978-989-20-2129-4

PRINT RUN
15 COPIES

DATE
OCTOBER 2010

Dissertation submitted to the Faculty of Sciences of the University of Porto to obtain the degree of Doctor in Computer Science.

Author

Pedro Pereira Rodrigues, MSc
PhD Student, LIAAD - INESC Porto L.A. & University of Porto
Associate Member, CRACS - INESC Porto L.A. & University of Porto
Collaborator, CINTESIS - University of Porto
Invited Assistant, Biostatistics and Medical Informatics Department, Faculty of Medicine of the University of Porto

Advisers

João Manuel Portela da Gama, PhD
Researcher, LIAAD - INESC Porto L.A. & University of Porto
Assistant Professor, Mathematics and Informatics Department, Faculty of Economics of the University of Porto

Luís Miguel Barros Lopes, PhD
Researcher, CRACS - INESC Porto L.A. & University of Porto
Associate Professor, Computer Science Department, Faculty of Sciences of the University of Porto

For us,

Cláudia and Tomás

Acknowledgments


"The man of knowledge must be able not only to love his enemies but also to hate his friends."
Friedrich Nietzsche, German philosopher (1844-1900)

This work could not have been accomplished without the help and support of several institutions, which I hereby thank. First of all, I thank the Faculty of Sciences of the University of Porto (FCUP), for allowing me to pursue the doctoral programme. This work was financially supported by Fundação para a Ciência e a Tecnologia (FCT), with a PhD grant (SFRH/BD/29219/2006), through the POS_C and POPH programs, partly funded by the European Union through the European Social Fund, and by the Portuguese government. I deeply thank the support of the Laboratory of Artificial Intelligence and Decision Support (LIAAD), my host institution during this PhD work and ever since I started doing research in 2003. During the time period of this thesis, I was also integrated in two other research units, which helped me get in touch with some valuable research and researchers: the Center for Research on Advanced Computing Systems (CRACS) and the Center for Research in Health Technologies and Information Systems (CINTESIS). Moreover, the integration of LIAAD and CRACS in INESC Porto, L.A. certainly boosted the help that both institutions gave me to achieve my goals. I would also like to thank both the Computer Science Department (DCC) of FCUP, and the Biostatistics and Medical Informatics Department (SBIM) of the Faculty of Medicine of the University of Porto (FMUP), for my appointment as Invited Assistant during 2007/2008 and 2008/2010, respectively. This contact with both undergraduate and graduate students helped me grow a more mature vision about teaching and research. I acknowledge Telefónica - Investigación y Desarrollo, S.A. for receiving me on a complimentary internship in Madrid, Spain, during six weeks of 2009, which gave me new insights about current applied research in real-world organizations.

Also acknowledged are the Laboratory for Cognitive Modeling (LKM) of the University of Ljubljana, for receiving me as a ten-day visiting researcher in October 2007 and again in September 2010, and the European Coordinating Committee for Artificial Intelligence (ECCAI). The work developed in this PhD project was also partially included in the scope of some research projects: KDUDS (PTDC/EIA-EIA/98355/2008), CALLAS (PTDC/EIA/71462/2006), ALES II (POSC/EIA/55340/2004) and KDubiq (IST-6FP-021321).


As I am not a man of knowledge (or simply do not agree with Nietzsche), I certainly do not hate my friends. Several exceptional researchers and friends were present, one way or the other, in my life and in the path to the completion of this work. I intend to thank them properly with my friendship, but I couldn't let this go without a public word of appreciation. They were both a source of inspiration and a support of knowledge and learning. This thesis was jointly supervised by Prof. João Gama, Assistant Professor at the Faculty of Economics of the University of Porto (FEP), and Prof. Luís Lopes, Associate Professor at DCC-FCUP, to whom I am grateful for accepting this supervision. João Gama was the main propeller of my flight into the data stream mining and streaming machine learning areas of research. Since 2003, he has been much more than an adviser, a true research collaborator. His help and guidance are always the perfect partners, providing stable foundations for an exceptional work. For his supervising and collaborative endeavor, I thank him. Luís Lopes has been an exceptional adviser in sensor networks, enabling the proof-of-concept of some of the techniques proposed in this work and the assessment of their impact on ubiquitous systems. For his friendship and focused support, I thank him. I deeply thank the help of João Araújo in implementing the simulations of the L2GClust experiments in VisualSense, and of Cristina Santos for her help on statistical analysis and fruitful discussions on agreement theory. A special word is due to Raquel Sebastião, who has also been a close collaborator in the past years, in several different topics within this field of research. I acknowledge my colleagues at LIAAD and SBIM-FMUP for the exceptional atmosphere that surrounds me every day at work, with a special word of appreciation for the support given by the institutions' heads, Prof. Pavel Brazdil and Prof. Altamiro Costa-Pereira, respectively. For their unconditional help and friendly support in my academic trajectory, I deeply thank Luís Antunes and Ricardo Correia. I utterly thank all the support given by my parents and family throughout my studies. Furthermore, I thank my young siblings for all the energy they have induced in my interpretation of life as a whole. All of my friends know they are part of my life and their friendship is the foundation of my will to succeed and evolve. For all the friendship, I thank you all.

For being yourselves, with me every day, thank you Tomás and Cláudia.


Abstract

Background: Nowadays information is generated and gathered from distributed data sources, at a very high rate, stressing communications and computing infrastructures and making it hard to transmit, compute and store. Data streams differ from the conventional stored relation model in that they are potentially unbounded in size, with data elements arriving online, being processed on the fly and discarded afterwards, and with concepts that change through time. Knowledge discovery from ubiquitous data streams has become a major goal for all sorts of applications. When no other information is available, usual knowledge discovery approaches are based on unsupervised techniques, such as clustering. Then, two sub-problems exist: clustering streaming data examples and clustering streaming data sources. The former searches for dense regions of the data space, identifying hot-spots where data sources tend to produce data, while the latter finds groups of sources that behave similarly through time.

Aim: The aim of this work is three-fold. First, we aimed at reviewing existing scientific evidence of valuable distributed algorithms addressing either of the two previously enunciated sub-problems of clustering. Then, we addressed clustering of the data being heterogeneously produced by a sensor network, investigating whether local discretization and centralized clustering of representative data can achieve better results than simple centralized clustering. Finally, we addressed clustering of streaming data sources, to investigate the hypothesis that a fully distributed clustering algorithm can achieve a global clustering definition of the entire network without a centralized server.

Methods: First, with the goals of exposing scenarios where data is produced as ubiquitous data streams, identifying clustering algorithms for ubiquitous data streams, and pointing out advantages and disadvantages of distributed procedures, a systematic review was conducted, with clear selection criteria and analysis methodology. For the second aim, we proposed DGClust, a new distributed grid clustering algorithm for data produced on wide sensor networks. Its core is based on online local discretization of data, centralized frequent state monitoring, and centralized online partitional clustering. Comparison was performed with centralized streaming and batch clustering methodologies, focusing evaluation on validity, communication and processing load. Regarding the third aim, we defined a three-step methodology. We started by enunciating basic requirements for methods addressing clustering of ubiquitous streaming data sources. Then, the memoryless fading window model was defined for data stream processing, especially focused on the α-fading average computation. Finally, we proposed L2GClust, a simple local algorithm for clustering sources based on furthest-point clustering of the fading average of the data they produce, where each local node keeps an estimate of the global clustering of nodes in the network. The fading window model was theoretically supported, while comparison of L2GClust was performed with a centralized streaming approach which would gather all the fading averages from the local sites, focusing evaluation on the global, compactness and separability agreement on node-to-cluster assignment.

Findings: Seventeen papers were included in the review, most of which addressed clustering data (76%) rather than clustering data sources (29%). Main differences between the two areas include proposing the algorithm as a general method (54% vs 0%), addressing homogeneously distributed data (92% vs 20%), and aiming at the clustering problem per se (77% vs 20%). Most of the papers evaluated scalability and computation load (53%), followed by clustering validity and processing speed (47%). Regarding the second aim, evaluation of the DGClust algorithm showed that it is capable of reducing the overall communication in the network (up to 70% less in real sensor data), while also reducing the dimensionality of the clustering problem (only a fraction of 10% to 50% of the frequent states are used, and less than 15% of the data points triggered cluster updates) and keeping clustering validity competitive with centralized approaches. For the data stream processing, the α-fading average was shown to be a memoryless approximation of the α-weighted average computed for a window of size w, within a bounded error of 2εR, where R is the past range of the variable, if α = ε^(1/w). Finally, evaluation of the L2GClust algorithm revealed high agreement with the centralized streaming counterpart (always above 75% and most of the times above 90%), being especially robust in terms of separability agreement (all scenarios with statistically different compactness and separability agreement present better results for the latter).

Concluding Remarks: This thesis presents several contributions. Overall, distributed stream clustering methods improve communication ratios, processing speed and resource consumption, while achieving clustering validity similar to their centralized counterparts. When addressing clustering of heterogeneous data, the online discretization step executed by DGClust at each local site reduced the overall communication in the network, while the central frequent state monitoring reduced the dimensionality of the clustering problem. Stream processing methods should consider using the α-fading window model, as it gives a non-catastrophic forgetting of old data, while being adaptive to new data points and, most of all, a memoryless bounded approximation of a weighted window model. Finally, a fully distributed clustering algorithm such as L2GClust can achieve a global clustering definition of the entire network of data sources without a centralized server. As a wrapper for the entire set of contributions, we discuss sensitive topics, such as the implications of this work in terms of advantages for specific domains of application, including sensor network management and comprehension, and the evaluation of unsupervised learning algorithms.


Resumo

Problem: Nowadays, information is generated and gathered by distributed data sources, at high speed, straining communications and computing infrastructures and making it hard to transmit, compute and store. Continuous data streams differ from the conventional stored relational model in that they are potentially infinite, their elements arrive sequentially and must be processed and immediately discarded, possibly with the concept changing over time. Knowledge discovery from continuous and ubiquitous data streams is today a central goal for all kinds of applications. When no additional information exists, unsupervised learning techniques, such as clustering, are usually applied, either to cluster the data in the streams or to cluster the sources that produce those data. The first case tries to find dense regions of the data space where the sources tend to produce data, while the second tries to find groups of sources that behave similarly over time.

Aims: This work has three main objectives. First, to review the existing scientific evidence on distributed algorithms for the two types of clustering studied. Second, to investigate whether local discretization and clustering of representative points achieve better results than centralizing the whole clustering process, when data are produced heterogeneously in a sensor network. Finally, to study the hypothesis of performing global clustering of the data sources of a network using only local processing.

Methods: A systematic review was carried out, with well-defined criteria and analysis, with the goal of exposing the scenarios in which data are produced in a distributed way and at high speed, identifying clustering algorithms for those scenarios, and pointing out the advantages and disadvantages of distributed strategies. For the second objective, DGClust was developed, a new algorithm for clustering ubiquitous streams of sensor data. The core of the method lies in the local discretization of the data, the monitoring of the most frequent states, and centralized clustering. Comparison was carried out against fully centralized clustering, focusing on quality, communication and processing capacity. As for the third objective, the methodology was based on three phases. It begins by enunciating the requirements for distributed clustering of ubiquitous data sources. It then defines the fading time window model for data processing, especially useful for maintaining moving averages. Finally, it proposes L2GClust, a local algorithm for clustering data sources, based on furthest-point clustering of the moving averages of the data produced, where each node keeps an estimate of the global clustering of the network. The time window model is theoretically supported, while L2GClust is compared with a centralized version, focusing the evaluation on global, compactness and separability agreement with respect to the node-to-cluster assignment.

Results: Seventeen articles were included in the review, most of them dealing with clustering of data (76%) rather than clustering of sources (29%). The main differences between the two areas include proposing the algorithm as a general method (54% vs 0%), dealing with homogeneous data (92% vs 20%), and studying the clustering problem per se (77% vs 20%). Most articles evaluate scalability and processing (53%), followed by validity and speed (47%). As for the second objective, the evaluation showed that DGClust is capable of reducing communication in the network (up to 70% less on real sensor data) while also reducing dimensionality (only a fraction of 10% to 50% of the frequent states are used, and less than 15% of the points triggered clustering updates), keeping validity competitive with the centralized versions. As for the processing of continuous data streams, the α-fading average is shown to be a memoryless approximation of the α-weighted average for a window of size w, within an error bound of 2εR, where R is the past range of the variable, if α = ε^(1/w). Finally, the evaluation of L2GClust showed a high degree of agreement with the centralized version (always above 75% and often above 90%), being especially robust in terms of separability agreement (all scenarios where compactness and separability are statistically different present better results for the latter).

Conclusions: This thesis presents several contributions. In general, distributed methods for clustering continuous data streams improve communication, processing speed and resource consumption, while keeping validity results competitive with the centralized versions. When clustering heterogeneously distributed data, the local discretization of DGClust reduces network communication, while the monitoring of the most frequent states decreases the dimensionality of the clustering problem. Data stream processing methods should consider the α-fading window model, since it allows a non-catastrophic forgetting of old data while remaining adaptive to new data and, above all, approximating a weighted window model within a bounded error, without the need to keep the points in memory. Finally, it was shown that a fully distributed local algorithm for clustering data sources, such as L2GClust, can find the global clustering definition without a central server. As a final discussion, sensitive topics are addressed, such as the implications of this work in terms of advantages for specific application domains, including the management and comprehension of sensor networks, and the evaluation of unsupervised algorithms.

Contents

Acknowledgments
Abstract
Resumo
List of Tables
List of Figures

1 Introductory Note
  1.1 Context Note
  1.2 Thesis Contributions
    1.2.1 RQ1: Clustering from Ubiquitous Data Streams
    1.2.2 RQ2: Clustering Distributed Data Streams
    1.2.3 RQ3: Distributed Clustering of Streaming Sources
  1.3 Parallel Contributions
  1.4 Thesis Outline
    1.4.1 Rationale and Aim (Chapters 2 and 3)
    1.4.2 Contributed Research (Chapters 4, 5 and 6)
    1.4.3 Discussion and Remarks (Chapters 7 and 8)
    1.4.4 Addenda

2 Rationale
  2.1 Chapter Overview
  2.2 The Data Stream Paradigm
    2.2.1 Data Stream Models
    2.2.2 Data Streams Management Systems
    2.2.3 Learning from Data Streams
  2.3 Summarization of Streaming Data
    2.3.1 Elementary Statistics
    2.3.2 Sampling for Data Reduction
    2.3.3 Window Models
    2.3.4 Online Histograms
    2.3.5 Monitoring Frequent Items
  2.4 Cluster Analysis over Data Streams
    2.4.1 Clustering in Streaming Scenarios
    2.4.2 Clustering Data Streams
    2.4.3 Clustering Streaming Data Sources
  2.5 Issues in Learning from Data Streams
    2.5.1 Incremental and Decremental Learning
    2.5.2 Managing Model Instability
    2.5.3 Concept Drift and Novelty Detection
    2.5.4 Evaluation of Stream Learning Algorithms
  2.6 Issues in Learning from Ubiquitous Data Streams
    2.6.1 Ubiquitous Streaming Data Sources
    2.6.2 Ubiquitous Streaming Data Visualization
    2.6.3 Ubiquitous Streaming Data Quality
    2.6.4 Ubiquitous Data Clustering
  2.7 Desiderata for Ubiquitous Stream Learning

3 Research Questions and Aims
  3.1 Distributed Clustering of Ubiquitous Data Streams
  3.2 Clustering Distributed Sensor Streams
  3.3 Distributed Clustering of Streaming Data Sources

4 Distributed Clustering from Ubiquitous Data Streams
  4.1 Chapter Overview
  4.2 Motivation and Aim
  4.3 Review Methodology
    4.3.1 Eligibility Criteria
    4.3.2 Search Strategy and Study Selection
    4.3.3 Statistical Analysis
    4.3.4 Study Characteristics
    4.3.5 Outcome Measures
  4.4 Search Results
    4.4.1 Selection Process
    4.4.2 Overview of Included Studies
  4.5 Description of Included Studies
    4.5.1 Clustering Ubiquitous Streaming Data Points
    4.5.2 Clustering Ubiquitous Streaming Data Sources
  4.6 Discussion Remarks
    4.6.1 Further Research Needed
    4.6.2 References for Further Research

5 Clustering Distributed Data Streams
  5.1 Chapter Overview
  5.2 Motivation and Aim
  5.3 DGClust - Distributed Grid Clustering
    5.3.1 System Overview
    5.3.2 Notation and Formal Setup
    5.3.3 Local Adaptive Grid
    5.3.4 Centralized Frequent State Monitoring
    5.3.5 Centralized Online Clustering
    5.3.6 System Outcome
  5.4 DGClust Algorithm Analysis
    5.4.1 Time and Space
    5.4.2 Communication
  5.5 DGClust Experimental Evaluation
    5.5.1 Scalability and Parameter Sensitivity
    5.5.2 Inter-Parameter Dependencies
    5.5.3 Application to Physiological Sensor Data Streams
  5.6 Remarks and Future Work

6 Distributed Clustering of Streaming Data Sources
  6.1 Chapter Overview
  6.2 Motivation and Aim
  6.3 Centralized Clustering of Streaming Data Sources
    6.3.1 Requirements for Clustering Streaming Data Sources
    6.3.2 Compliance of Existing Approaches
  6.4 Distributed Clustering of Streaming Data Sources
    6.4.1 Issues in Distributed Settings
    6.4.2 Desiderata for Distributed Clustering of Streaming Sources
  6.5 Memoryless Fading Window for Stream Processing
    6.5.1 Weighted Sliding Windows
    6.5.2 Memoryless Fading Windows
  6.6 L2GClust - Local to Global Clustering
    6.6.1 Sketching Streaming Sensor Data using Fading Averages
    6.6.2 Local Approximations of the Global Clustering
    6.6.3 Algorithm Description
  6.7 L2GClust Experimental Evaluation
    6.7.1 Simulation Environment
    6.7.2 Data Description
    6.7.3 Network Description
    6.7.4 Studied Parameters
    6.7.5 Comparison
    6.7.6 Results
  6.8 Remarks and Future Work

7 Discussion
  7.1 On Advantages of Distributed Procedures
    7.1.1 Distributed Clustering of Data Streams
    7.1.2 Distributed Clustering of Streaming Data Sources
  7.2 On Sensor Network Comprehension
    7.2.1 Comprehension by Clustering Streaming Sensor Data
    7.2.2 Comprehension by Clustering Streaming Sensors
    7.2.3 Directions for Further Sensor Network Comprehension
  7.3 On Evaluating Unsupervised Stream Learning
    7.3.1 Learning Model Self-Evaluation
    7.3.2 Evaluation Strategies
  7.4 On Limitations
    7.4.1 Review on Distributed Stream Clustering
    7.4.2 The DGClust Algorithm
    7.4.3 The L2GClust Algorithm

8 Concluding Remarks
  8.1 Main Findings
    8.1.1 Systematic Review
    8.1.2 DGClust
    8.1.3 Requirements for Clustering Ubiquitous Streaming Sources
    8.1.4 Memoryless Fading Windows
    8.1.5 L2GClust
  8.2 Main Recommendations
    8.2.1 Distributed Clustering from Ubiquitous Data Streams
    8.2.2 Distributed Clustering of Streaming Data
    8.2.3 Data Stream Processing
    8.2.4 Distributed Clustering of Streaming Data Sources
  8.3 Final Thought

Bibliography

A DGClust Experimental Evaluation
  A.1 Sample Data Description
  A.2 Algorithm Parameters
  A.3 Detailed Results

B L2GClust Experimental Evaluation
  B.1 Sample Data Description
  B.2 Algorithm Parameters
  B.3 Detailed Results

List of Tables

2.1 Differences between traditional (batch) and streaming data processing
2.2 Differences between batch and streaming learning that may affect the way evaluation is performed
4.1 Complete query used in the search for articles indexed in ISI Web of Knowledge
4.2 Chronological list of retrieved articles
4.3 Aggregated overview of included studies' characteristics by area of clustering
4.4 Aggregated overview of included studies' approach by area of clustering
4.5 Aggregated overview of included studies' reported advantages and disadvantages by outcome and area of clustering
5.1 DGClust Evaluation: Parameter description and corresponding values
6.1 Compliance of existing centralized algorithms for clustering of streaming data sources with the enunciated requirements
6.2 Illustrative examples of the definition of the fading factor
B.1 L2GClust Evaluation: Clustering validity results, in terms of κ̂ statistic, global, positive and negative agreement, and the percentage of nodes with total agreement (k = {2, 3, 4})
B.2 L2GClust Evaluation: Clustering validity results, in terms of κ̂ statistic, global, positive and negative agreement, and the percentage of nodes with total agreement (k = {5, 6, 7})

List of Figures

2.1 Illustrative example of the comparison of error evolution as estimated by holdout and prequential strategies
2.2 Real-world examples of how human behavior directly influences sensor data time patterns
4.1 Articles selection process
5.1 Example of a mote sensor network and a plot of the data each univariate sensor is producing over time
5.2 Example of a 2-sensor network, with data distribution being sketched for each sensor
5.3 DGClust: problems, methods and solutions
5.4 DGClust: Online partitional clustering of most frequent states
5.5 DGClust: Example of final definition for 2 sensors data, with 5 clusters
5.6 DGClust Evaluation: Impact of the number of sensors on loss
5.7 DGClust Evaluation: Impact of the number of sensors on communication and processing clustering updates
5.8 DGClust Evaluation: Averaged loss for each combination of parameters and scenarios, for DGClust and Continuous K-Means
5.9 DGClust Evaluation: Average loss for each fixed parameter
5.10 DGClust Evaluation: Average loss comparison for different dimensions and number of clusters
5.11 DGClust Evaluation: Performance in terms of loss
5.12 DGClust Evaluation: Performance in terms of communicated values, evicted states, cluster centers adaptations and number of guaranteed top-m
5.13 DGClust: Main results achieved by applying this strategy
6.1 Example of a mote sensor network and a possible clustering definition of the series produced by each sensor
6.2 Comparison between traditional and weighted moving averages
6.3 Illustrative example of different window models: sliding, linearly weighted, exponentially weighted and fading
6.4 Illustrative example of the convergence of the ballast weight
6.5 Comparison between the α-weighted moving average and the α-fading average
6.6 Toy example of a sensor network
6.7 L2GClust: Procedure executed at each local node
6.8 L2GClust: Procedure executed at each local node
6.9 L2GClust Evaluation: Evolution of the average proportion of agreement between sensors and the global clustering definition
6.10 L2GClust Evaluation: Sensitivity of κ̂ to the number of sensors, according to different number and overlap of clusters
6.11 L2GClust Evaluation: Sensitivity of the proportion of agreement (validity) to the number of sensors, according to different number and overlap of clusters
6.12 L2GClust Evaluation: Sensitivity of the proportion of positive agreement (compactness) to the number of sensors, according to different number and overlap of clusters
6.13 L2GClust Evaluation: Sensitivity of the proportion of negative agreement (separability) to the number of sensors, according to different number and overlap of clusters
6.14 L2GClust Evaluation: Sensitivity of perfectness to the number of sensors, according to different number and overlap of clusters
A.1 DGClust Evaluation: Loss to the real centroids
A.2 DGClust Evaluation: Loss to the real centroids (2 sensors)
A.3 DGClust Evaluation: Loss to the real centroids (4 sensors)
A.4 DGClust Evaluation: Loss to the real centroids (8 sensors)
A.5 DGClust Evaluation: Loss to the real centroids (16 sensors)
A.6 DGClust Evaluation: Loss to the real centroids (32 sensors)
A.7 DGClust Evaluation: Loss to the real centroids (64 sensors)
A.8 DGClust Evaluation: Loss to the real centroids (128 sensors)
A.9 DGClust Evaluation: Transmitted values
A.10 DGClust Evaluation: Transmitted values (2 sensors)
A.11 DGClust Evaluation: Transmitted values (4 sensors)
A.12 DGClust Evaluation: Transmitted values (8 sensors)
A.13 DGClust Evaluation: Transmitted values (16 sensors)
A.14 DGClust Evaluation: Transmitted values (32 sensors)
A.15 DGClust Evaluation: Transmitted values (64 sensors)
A.16 DGClust Evaluation: Transmitted values (128 sensors)
A.17 DGClust Evaluation: Number of clustering updates
A.18 DGClust Evaluation: Number of clustering updates (2 sensors)
A.19 DGClust Evaluation: Number of clustering updates (4 sensors)
A.20 DGClust Evaluation: Number of clustering updates (8 sensors)
A.21 DGClust Evaluation: Number of clustering updates (16 sensors)
A.22 DGClust Evaluation: Number of clustering updates (32 sensors)
A.23 DGClust Evaluation: Number of clustering updates (64 sensors)
A.24 DGClust Evaluation: Number of clustering updates (128 sensors)
B.1 L2GClust Evaluation: Sanity (2 clusters)
B.2 L2GClust Evaluation: Sanity (3 clusters)
B.3 L2GClust Evaluation: Sanity (4 clusters)
B.4 L2GClust Evaluation: Sanity (5 clusters)
B.5 L2GClust Evaluation: Sanity (6 clusters)
B.6 L2GClust Evaluation: Sanity (7 clusters)
B.7 L2GClust Evaluation: Validity (2 clusters)
B.8 L2GClust Evaluation: Validity (3 clusters)
B.9 L2GClust Evaluation: Validity (4 clusters)
B.10 L2GClust Evaluation: Validity (5 clusters)
B.11 L2GClust Evaluation: Validity (6 clusters)
B.12 L2GClust Evaluation: Validity (7 clusters)
B.13 L2GClust Evaluation: Compactness (2 clusters)
B.14 L2GClust Evaluation: Compactness (3 clusters)
B.15 L2GClust Evaluation: Compactness (4 clusters)
B.16 L2GClust Evaluation: Compactness (5 clusters)
B.17 L2GClust Evaluation: Compactness (6 clusters)
B.18 L2GClust Evaluation: Compactness (7 clusters)
B.19 L2GClust Evaluation: Separability (2 clusters)
B.20 L2GClust Evaluation: Separability (3 clusters)
B.21 L2GClust Evaluation: Separability (4 clusters)
B.22 L2GClust Evaluation: Separability (5 clusters)
B.23 L2GClust Evaluation: Separability (6 clusters)
B.24 L2GClust Evaluation: Separability (7 clusters)
B.25 L2GClust Evaluation: Perfectness (2 clusters)
B.26 L2GClust Evaluation: Perfectness (3 clusters)
B.27 L2GClust Evaluation: Perfectness (4 clusters)
B.28 L2GClust Evaluation: Perfectness (5 clusters)
B.29 L2GClust Evaluation: Perfectness (6 clusters)
B.30 L2GClust Evaluation: Perfectness (7 clusters)

"Indeed, self-organization presupposes that the system knows that it knows something. This awareness may in turn be exploited to represent knowledge about knowledge. In fact, if machines can learn, they could potentially learn how to learn, and consequently teach more eectively."

Christophe Giraud-Carrier and Tony Martinez (1994)

1 Introductory Note

Thesis Context and Outline

"Books have the same enemies as people: fire, humidity, animals, weather, and their own content."
Paul Valery, French critic & poet (1871-1945)

In this chapter we present a high-level outline of the thesis, from the rationale to the outcomes, the context that led to this research, and the main and parallel contributions that resulted from the author's work during the time frame of the doctoral project.

1.1 Context Note

Research careers usually start during or after undergraduate degree completion. In 2003, the final assignment the author was required to complete towards finishing his undergraduate degree in Computer Science at the Faculty of Sciences of the University of Porto was an internship at EFACEC - Electronic Systems, a large company with high national and international impact on industrial areas such as electricity management and distribution. The aim of the project was to develop a neural network engine that would be included in their load forecast component, especially for one-hour-ahead electricity demand prediction. During the internship at EFACEC, the author started his research work in the machine learning field, which in time led him to collaborate with João Gama, one of the main worldwide researchers in the area of learning from data streams, today his main adviser. After the development of a working prototype, the forecast module was later evaluated on charge transfer scenarios (Rodrigues & Gama, 2004). This internship would become the starting point of the author's research career.

By the end of the internship, the author became involved in research project ALES, where research was being conducted in the areas of decision trees for data streams (Gama et al., 2005) and concept drift detection (Gama et al., 2004), and it was clear that the electricity demand problem was paradigmatic of a stream learning problem. Hence, two research problems arose which focused the interest of both the research and the industry communities: clustering load demand profiles from electricity demand data streams; and online learning from the streaming data for prediction using online neural networks. The two problems were the main subjects of RETINAE, a national consortium project between the University of Porto and EFACEC, during the years of 2005-2007. Meanwhile, in his Master's thesis (Rodrigues, 2005), the author developed a new algorithm for clustering streaming data sources, with a natural application to the electrical load demand profiling problem (Rodrigues et al., 2008d). The RETINAE project also aimed at predicting electrical load using online techniques over data streams, so that a prediction was available at any time. The development of this work led to the implementation of an entire system for analysis and prediction (Rodrigues & Gama, 2009). These works were the main research projects in which the author was involved between his master's and his doctoral thesis.

By the end of the RETINAE project, the author felt the need to expand his research into a new paradigm of learning, which would consider data streams being produced and processed in distributed scenarios such as sensor networks. At this point, a co-advisership started with Luís Lopes, who was working on programming languages and communication systems for sensor networks. This way, a proposal for a PhD grant was approved by FCT (Portuguese Science and Technology Foundation), which aimed exactly at expanding machine learning to ubiquitous scenarios. The fact that, during the course of this PhD project, the author's appointment as an Invited Assistant Lecturer moved from the Faculty of Sciences to the Faculty of Medicine of the University of Porto has widened his perspective on medical informatics, machine learning and its applications in Medicine. The drawback of having such a wide range of areas of interest was, undoubtedly, a certain lack of focus on the thesis main subjects, which might have been addressed less thoroughly. Hopefully future work will make up for this.

1.2 Thesis Contributions

Knowledge discovery from ubiquitous data streams has become a major goal for all sorts of applications. When no other information is available, usual knowledge discovery approaches are based on unsupervised techniques, such as clustering. Then, two sub-problems exist: clustering streaming data examples and clustering streaming data sources. The former searches for dense regions of the data space, identifying hot-spots where data sources tend to produce data, while the latter finds groups of sources that behave similarly through time. This thesis tries to answer three main research questions:

RQ1 Do distributed stream clustering algorithms improve data mining results when applied to ubiquitous data streams scenarios?

RQ2 Does local discretization and representative clustering improve validity, communication and computation loads when applied to distributed sensor data streams?

RQ3 Can a fully distributed clustering algorithm achieve a global clustering definition of the entire network without a centralized server?

Although directed towards three research questions, the thesis actually includes five main contributions, since three contributions emerged when searching for answers to the third research question:

RQ1 a description of the literature on distributed clustering of ubiquitous data streams;
RQ2 an algorithm for clustering distributed sensor data streams;
RQ3 the definition of requirements for algorithms willing to address distributed clustering of ubiquitous streaming data sources;
RQ3 a memoryless fading window model for streaming data processing; and
RQ3 an algorithm for distributed clustering of ubiquitous streaming data sources.

1.2.1 RQ1: Clustering from Ubiquitous Data Streams

A systematic review was designed to assess the existing scientific evidence of valuable distributed algorithms addressing clustering for ubiquitous data streams. A clear search strategy, eligibility criteria, and evaluation methodology were defined to reduce bias in the review. Analysis of the included studies focuses on the type of problem each addresses, clustering ubiquitous data streams or clustering ubiquitous streaming data sources, in order to extract differences between the two research areas. Studied variables are publication-related (author, year, country, venue, impact), proposal-related (setting, aim, distributed procedures, data partition) and evaluation-related (performed comparison and main outcomes evaluated).

1.2.2 RQ2: Clustering Distributed Data Streams

Regarding clustering of distributed data streams, we propose DGClust, a new distributed grid clustering algorithm for data produced on wide sensor networks. Its core is based on online discretization of data (to reduce communication), frequent state monitoring (to cut computation), and online partitional clustering (to keep high validity and adaptivity). Each local sensor receives a, possibly infinite, data stream from a given source, which is locally and incrementally discretized into a univariate adaptive grid. Each new data point triggers a cell in this grid, reflecting the current state of the data stream at the local site. Whenever a local site changes its state, that is, the triggered cell changes, the new state is communicated to a central site. Furthermore, the central site keeps the global state of the entire network, where each local site's state is the cell number of that site's grid. It is expected that only a small number of these global states are frequently triggered by the network. Thus, in parallel to the aggregation, the central site keeps a small list of counters of the most frequent global states. Finally, the current clustering definition is defined and maintained by a simple adaptive partitional clustering algorithm applied to the central points of the frequent states.
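To make the state-change protocol concrete, here is a minimal Python sketch of the idea, under stated assumptions: a fixed-width grid stands in for the thesis's adaptive grid, a plain counter stands in for the bounded top-m frequent-state monitor, and names such as LocalSite and send are illustrative only, not the thesis implementation.

```python
# Sketch of a DGClust-style local site: discretize the stream into a
# univariate grid and transmit only when the triggered cell changes.
from collections import Counter

class LocalSite:
    def __init__(self, send, width=1.0):
        self.send = send      # callback to the central site (assumed)
        self.width = width    # cell width (fixed here; adaptive in the thesis)
        self.state = None     # last communicated cell

    def update(self, value):
        cell = int(value // self.width)  # cell triggered by the new point
        if cell != self.state:           # state changed: notify the center
            self.state = cell
            self.send(cell)

class CentralSite:
    """Keeps the global state vector and counts of global states
    (a Counter simplifies the bounded frequent-state monitoring)."""
    def __init__(self, n_sites):
        self.global_state = [None] * n_sites
        self.freq = Counter()

    def receive(self, site_id, cell):
        self.global_state[site_id] = cell
        self.freq[tuple(self.global_state)] += 1

# Toy usage: the second reading falls in the same cell, so nothing is sent.
center = CentralSite(n_sites=1)
site = LocalSite(send=lambda cell: center.receive(0, cell), width=0.5)
for v in (20.1, 20.2, 20.9):
    site.update(v)
```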

1.2.3 RQ3: Distributed Clustering of Streaming Sources

Given the novelty of this problem, at least for ubiquitous scenarios, several contributions emerged from this research. First, basic requirements had to be defined. Then, new stream processing techniques were developed. Finally, a local algorithm was defined.

1.2.3.1 Requirements

The basic requirements usually presented for clustering data streams are that the system must include a compact representation of clusters, must process data in a fast and incremental way, and should clearly show changes in the clustering structure. Nevertheless, there are some conceptual differences when addressing multiple streaming sources. This way, requirements for clustering streaming sources, either centralized or distributed, are enunciated and discussed.

1.2.3.2 Memoryless Fading Windows

In most applications, recent data is the most relevant. Usually in streaming settings the concept generating the data evolves smoothly, so old data is less, but still, important. Even within a sliding window, the most recent data point is usually more important than the oldest one, which is about to be discarded. Given these particular characteristics, we propose to use an exponentially-weighted approach, where the weight of a data point decreases exponentially with time. Usual ubiquitous streaming algorithms run with limited memory and processing power. We propose a memoryless approximation of the weighted window model for data stream summarization, which does not need to keep any data points to approximate a measure that would otherwise require O(w) data points to be stored. We also present a precise way of defining the fading factor α so that our α-fading average is an approximation of the α-weighted moving average of size w, within a known ε-based bound.
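As a sketch of how such a memoryless estimate can be maintained (my reading of the summary above, not the thesis code), the α-fading average can be kept as a ratio of two faded accumulators, with the fading factor derived from the stated bound:

```python
# Hedged sketch of an alpha-fading average: a memoryless estimate kept as
# a ratio of faded accumulators, so no window of w points is ever stored.
# The update rule and helper names are illustrative assumptions.

def fading_factor(epsilon: float, w: int) -> float:
    # From the bound stated above: with alpha = epsilon**(1/w), the fading
    # average approximates the alpha-weighted moving average over a window
    # of size w within error 2*epsilon*R (R being the past range).
    return epsilon ** (1.0 / w)

class FadingAverage:
    def __init__(self, alpha: float):
        self.alpha = alpha
        self.faded_sum = 0.0    # faded sum of observed values
        self.faded_count = 0.0  # faded count; converges to 1/(1 - alpha)

    def update(self, x: float) -> float:
        self.faded_sum = x + self.alpha * self.faded_sum
        self.faded_count = 1.0 + self.alpha * self.faded_count
        return self.faded_sum / self.faded_count

# Example: approximating a window of ~100 points with epsilon = 0.01.
avg = FadingAverage(fading_factor(0.01, 100))
for x in (20.0, 20.5, 21.0):
    estimate = avg.update(x)
```

Note that the update touches only two scalars per point, which is what makes the model attractive for nodes with limited memory and processing power.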

1.3. PARALLEL CONTRIBUTIONS 1.2.3.3

L2GClust

A local algorithm is proposed to perform clustering of sensors on ubiquitous sensor networks, based on the moving average of each node's data over time. There are two main characteristics. On one hand, each sensor node keeps a sketch of its own data. On the other hand, communication is limited to direct neighbors, so clustering is computed at each node. The moving average of each node is approximated using a memoryless fading average, while clustering is based on the furthest-point algorithm applied to the centroids computed by the node's direct neighbors. Each sensor thus acts both as a data stream source and as a processing node, keeping a sketch of its own data and a definition of the clustering structure of the entire network of data sources.
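For illustration, the following Python sketch shows furthest-point (Gonzalez-style) clustering applied, as a node might do, to the fading averages reported by its direct neighbors; the univariate toy data, the value of k and the seeding choice are assumptions, not the thesis setup.

```python
# Hedged sketch of furthest-point clustering over neighbors' estimates.

def furthest_point_centers(points, k):
    """Greedy furthest-point selection (assumes k <= len(points)):
    start from an arbitrary point, then repeatedly add the point
    furthest from its nearest already-chosen center."""
    centers = [points[0]]
    while len(centers) < k:
        far = max(points, key=lambda p: min(abs(p - c) for c in centers))
        centers.append(far)
    return centers

# A node clusters the fading averages of its direct neighbors (toy values),
# yielding its local estimate of the global node-to-cluster assignment.
neighbor_avgs = [20.1, 20.4, 35.2, 34.9, 20.0]
centers = furthest_point_centers(neighbor_avgs, k=2)
labels = [min(range(len(centers)), key=lambda i: abs(x - centers[i]))
          for x in neighbor_avgs]
```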

1.3 Parallel Contributions

Usual work towards a PhD tends to be quite focused on the topic, and leads to precise contributions within the topic. However, during his PhD project, the author pursued several parallel research paths, which led to different contributions:

1. learning from ubiquitous data streams (the thesis main theme);
2. learning from data streams (consolidation and expansion of previous work);
3. learning from data streams (dissemination of basic knowledge);
4. other collaborations in learning from data streams.

The first area of contributions is probably the most relevant for this exposition. The author addressed three different problems: synthesizing evidence on distributed learning from ubiquitous data streams; clustering of distributed data streams; and distributed clustering of streaming data sources. These are detailed in Chapters 4, 5 and 6, but were already partially published as:

• Pedro Pereira Rodrigues, João Gama, João Araújo and Luís Lopes. Clustering Streaming Sensors. In Proceedings of the Fourth International Workshop on Knowledge Discovery from Sensor Data (SensorKDD'10), pages 35-44, Washington, DC, USA. July 2010.

• Pedro Pereira Rodrigues, João Gama and Raquel Sebastião. Network Comprehension by Memoryless Fading Windows in Ubiquitous Settings. In Proceedings of the First Ubiquitous Data Mining Workshop, pages 23-27, ECAI. Lisboa, Portugal, August 2010.

• Pedro Pereira Rodrigues, João Gama and Luís Lopes. Clustering Distributed Sensor Data Streams. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases - ECML PKDD 2008, Volume 5212 of Lecture Notes in Artificial Intelligence, pages 282-297, ISBN 978-3-540-87480-5, Springer-Verlag, Antwerp, Belgium. September 2008.

• Pedro Pereira Rodrigues, João Gama and Luís Lopes. Knowledge Discovery for Sensor Network Comprehension. Chapter 6 of Intelligent Techniques for Warehousing and Mining Sensor Network Data, ISBN 978-1-605-66328-9. IGI Global, January 2010.

• Pedro Pereira Rodrigues, João Gama and Luís Lopes. Requirements for Clustering Streaming Sensors. Chapter 4 of Knowledge Discovery from Sensor Data, Volume 6 of Industrial Innovation Series, pages 35-53. ISBN 978-1-42008-232-6, CRC Press, December 2008.

• Pedro Pereira Rodrigues, João Gama and Luís Lopes. Issues and Directions for Distributed Clustering of Streaming Sensors. In Cesar Analide, Paulo Novais and Pedro Henriques, editors, Simpósio Doutoral em Inteligência Artificial, ISBN 978-989-95618-1-6, pages 77-86. APPIA. Guimarães, Portugal. December 2007.

The second set of contributions is mainly related to the consolidation and expansion of the work developed during the author's Master's thesis and the RETINAE project. These are referred in Chapter 2, as a support for this thesis rationale, but were mainly published as:

• Pedro Pereira Rodrigues and João Gama. A system for analysis and prediction of electricity-load streams. Intelligent Data Analysis, 13(3):477-496, ISSN 1088-467X, June 2009.

• Pedro Pereira Rodrigues, João Gama and João Pedro Pedroso. Hierarchical Clustering of Time-Series Data Streams. IEEE Transactions on Knowledge and Data Engineering, 20(5):615-627, ISSN 1041-4347, May 2008.

• João Gama and Pedro Pereira Rodrigues. Electricity Load Forecast using Data Streams Techniques. Chapter 8 of Knowledge Discovery from Sensor Data, Volume 6 of Industrial Innovation Series, pages 131-147. ISBN 978-1-42008-232-6, CRC Press, December 2008.

• Pedro Pereira Rodrigues and João Gama. A Simple Dense Pixel Visualization for Mobile Sensor Data Mining. In Knowledge Discovery from Sensor Data, Volume 5840 of Lecture Notes in Computer Science, ISBN 978-3-642-12518-8, pages 175-189. Springer Verlag, 2009.

• Pedro Pereira Rodrigues and João Gama. Robust Division in Clustering of Streaming Time Series. In Proceedings of the 18th European Conference in Artificial Intelligence - ECAI 2008, Volume 178 of Frontiers in Artificial Intelligence and Applications, pages 172-176, ISBN 978-1-586-03891-5, IOS Press. Patras, Greece. July 2008.

• Pedro Pereira Rodrigues and João Gama. Semi-fuzzy Splitting in Online Divisive-Agglomerative Clustering. In Progress in Artificial Intelligence, 13th Portuguese Conference on Artificial Intelligence, EPIA 2007, Volume 4874 of Lecture Notes in Artificial Intelligence, pages 133-144, ISBN 978-3-540-77000-8, Springer-Verlag. Guimarães, Portugal. December 2007.

• João Gama and Pedro Pereira Rodrigues. Stream-Based Electricity Load Forecast. In Knowledge Discovery in Databases: PKDD 2007, 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, Volume 4702 of Lecture Notes in Artificial Intelligence, pages 446-453, ISBN 978-3-540-74975-2, Springer Verlag. Warsaw, Poland, September 2007.

The third area focused on disseminating basic knowledge on learning from data streams, which is also present in Chapter 2, mainly through the publication of state-of-the-art book chapters:

• João Gama and Pedro Pereira Rodrigues. An Overview on Mining Data Streams. Chapter 2 of Foundations of Computational Intelligence - Volume 6: Data Mining, Volume 206 of Studies in Computational Intelligence, pages 29-45. ISBN 978-3-642-01090-3, Springer Verlag, 2009.

• João Gama and Pedro Pereira Rodrigues. Data Streams. Chapter LXXXV of Encyclopedia of Data Warehousing and Mining, Second Edition, Volume II, pages 561-565. ISBN 978-1-60566-010-3, Information Science Reference, August 2008.

• João Gama and Pedro Pereira Rodrigues. Learning from Data Streams. Chapter CXLV of Encyclopedia of Data Warehousing and Mining, Second Edition, Volume III, pages 1137-1141. ISBN 978-1-60566-010-3, Information Science Reference, August 2008.

• Pedro Pereira Rodrigues and João Gama. Clustering Techniques in Sensor Networks. Chapter 9 of Learning from Data Streams: Processing Techniques in Sensor Networks, pages 125-142. ISBN 978-3-540-73678-3, Springer Verlag, September 2007.

• João Gama and Pedro Pereira Rodrigues. Data Stream Processing. Chapter 3 of Learning from Data Streams: Processing Techniques in Sensor Networks, pages 25-39. ISBN 978-3-540-73678-3, Springer Verlag, September 2007.

Finally, the author also addressed different research problems, some of which were mainly conducted by a co-author but to which the author nevertheless contributed. These focused on self-awareness of machine learning, which was also a topic of the initial PhD grant proposal, and include: data stream processing, monitoring distributions for concept drift detection, reliability of streaming predictions, and evaluation of stream learning algorithms.


• João Gama, Raquel Sebastião and Pedro Pereira Rodrigues. Issues in Evaluation of Stream Learning Algorithms. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 329-338, ISBN 978-1-60558-495-9, ACM. Paris, France, June-July 2009.

• João Gama, Pedro Pereira Rodrigues and Raquel Sebastião. Evaluating Algorithms that Learn from Data Streams. In Proceedings of the 24th Annual ACM Symposium on Applied Computing, pages 1496-1500, ISBN 978-1-60558-166-8, ACM. Honolulu, Hawaii, USA, March 2009.

• Raquel Sebastião, João Gama, Pedro Pereira Rodrigues and João Bernardes. Monitoring Incremental Histogram Distribution for Change Detection in Data Streams. In Knowledge Discovery from Sensor Data, Volume 5840 of Lecture Notes in Computer Science, ISBN 978-3-642-12518-8, pages 25-42. Springer Verlag, 2010.

• Raquel Sebastião, João Gama and Pedro Pereira Rodrigues. Change detection in climate data over the Iberian Peninsula. In Proceedings of the 9th IEEE International Conference on Data Mining Workshops - ICDM 2009, ISBN 978-0-7695-3902-7, pages 248-253. IEEE Computer Society Press. Miami, FL, USA, December 2009.

• Pedro Pereira Rodrigues, João Gama and Zoran Bosnić. Online Reliability Estimates for Individual Predictions in Data Streams. In Proceedings of the 8th IEEE International Conference on Data Mining Workshops - ICDM 2008, ISBN 978-0-7695-3503-6, pages 36-45. IEEE Computer Society Press. Pisa, Italy, December 2008.

The author's impact on this research field was also supported by co-chairing scientific events focused on learning from data streams, and by participating in the organization of several machine learning conferences and workshops, both as program committee member and as publicity chair.

1.4

Thesis Outline

The structure of this thesis tries to highlight the different phases of a research project. Starting by giving a rationale and defining the research question, the document then explores background knowledge to support the main contributed research, which is afterwards subject to discussion. Main findings and recommendations are pointed out at the end of the exposition.

1.4.1

Rationale and Aim (Chapters 2 and 3)

Setting the focus on the scenarios found in recent applications of machine learning, where data flows continuously from distributed sources and requires distributed procedures, Chapter 2 presents the rationale behind this research on learning from ubiquitous data streams. Although directed towards an overview of the main issues related to the main topic of the thesis, this chapter also supports itself on a set of side-step contributions the author has successfully published during the time frame of this doctoral project, which are presented as case reports in the appropriate sections. Based on this rationale, Chapter 3 clearly and succinctly states the research questions addressed by this work, the aims that it tries to achieve, and a basic summary of the methodology involved in the process of research, enabling a quick and focused reference for the topics addressed in the following chapters.

1.4.2

Contributed Research (Chapters 4, 5 and 6)

These are the main chapters of the thesis, in the sense that they expose the main contributions to the research field of learning from ubiquitous data streams. They concentrate the author's answers to the research questions enunciated in the previous chapter. Chapter 4 presents a systematic review on clustering from ubiquitous data streams. Chapter 5 presents a new algorithm to perform clustering of data streams produced distributedly on sensor networks. Chapter 6 addresses the problem of clustering distributed data stream sources, presenting both the requirements and a new algorithm designed to deal with such a demanding problem, including a new method to process streaming data in resource-restricted settings.

1.4.3

Discussion and Remarks (Chapters 7 and 8)

This research is not without limitations. In Chapter 7 we discuss the strengths and limitations of our work, focusing on the answers given to the proposed research questions. Also discussed are the implications of the contributions on specific domains of application, sensor network comprehension, and the evaluation of unsupervised learning algorithms. Chapter 8 concludes the exposition, clearly stating the main findings and recommendations extractable from it.

1.4.4

Addenda

Some experiments were only summarized in the main exposition of the thesis. Hence, Addendum A and Addendum B present detailed analyses of the results for DGClust and L2GClust, the two algorithms proposed in Chapter 5 and Chapter 6, respectively.

2

Rationale

From stream learning to ubiquitous learning

There are only two kinds of scholars; those who love ideas and those who hate them.
Émile Chartier, French philosopher, journalist, and pacifist (1868-1951)

Nowadays, applications include distributed data sources which produce infinite streams of data at high speed. In streaming settings, traditional data gathering processes would create databases of tendentiously infinite length. Moreover, data gathering and analysis have become ubiquitous, in the sense that our world is evolving into a setting where all devices, as small as they may be, will include sensing and processing ability. Thus, if data is to be gathered centrally, this scenario also points to databases of tendentiously infinite width. The usual workbench, where all data is available at all times, is no longer valid. The current setting of having a web of sensory and processing devices represents a new knowledge discovery environment, possibly not completely observable, that is much less controlled by both the human user and a common centralized control process. This ubiquitous scenario raises several obstacles to the usual knowledge discovery workflow, enforcing the need to develop new techniques, with different conceptualizations and adaptive decision-making, while being subject to the same interactions required by previous static and centralized applications.

2.1

Chapter Overview

The motivation for this thesis work grows from different problems spread across different research areas within machine learning. Hence, we must skim through the state-of-the-art literature, pointing out the main tasks, features and issues related to those areas. The next section introduces basic knowledge and issues related to scenarios where data is produced in a stream at high speed. Then, in Section 2.3, we investigate summarization techniques for data streams that enable the application of stream learning algorithms. The main focus of this dissertation is clustering from data streams. Hence, Section 2.4 presents current knowledge on this research area, focusing on the difference between clustering data streams and clustering streaming data sources. Then, Section 2.5 focuses on the issues of learning from data streams, including incremental and decremental learning, and how to monitor and evaluate the learning process. But current applications include distributed data sources: data is no longer just infinite in length; it may be tendentiously infinite in width too. Section 2.6 presents some examples of such scenarios, while Section 2.7 takes the final push forward towards the implementation of machine learning algorithms that learn from ubiquitous data streams, supporting the contribution of the following chapters. All sections in this chapter are supported not only on state-of-the-art literature, but also on case reports of original research which the author, in collaboration with other researchers, endured in parallel with the main track of this thesis.

2.2

The Data Stream Paradigm

Many sources produce data continuously. Examples include sensor networks, wireless networks, radio frequency identification (RFID), customer click streams, telephone records, multimedia data, scientific data, sets of retail chain transactions, etc. (Gama & Rodrigues, 2007). These sources are called data streams. A data stream is an ordered sequence of instances that can be read only once or a small number of times (Guha et al., 2003), using limited computing and storage capabilities. These sources of data are characterized by being open-ended, flowing at high speed, and generated by non-stationary distributions. What distinguishes current data from earlier data are automatic data feeds. We do not just have people entering information into a computer; instead, we have computers entering data into each other (Muthukrishnan, 2005). Thus, there are applications in which the data is modeled best not as persistent tables but rather as transient data streams.

2.2.1

Data Stream Models

A key idea about what makes data streams different from conventional relational models is that operating in the data stream model does not preclude the use of data in conventional stored relations. Some relevant differences include (Babcock et al., 2002):

• The data elements in the stream arrive online.

• The system has no control over the order in which data elements arrive, either within a data stream or across data streams.

• Data streams are potentially unbounded in size.

• Once an element from a data stream has been processed, it is discarded or archived. It cannot be retrieved easily unless it is explicitly stored in memory, which is small relative to the size of the data streams.

In the stream model (Muthukrishnan, 2005) the input elements a1, a2, ..., aj, ... arrive sequentially, item by item, and describe an underlying function A. Stream models differ on how each ai describes A, being distinguished among:

• insert-only or time series model: once an observation ai is produced, it cannot be changed;

• insert-delete or turnstile model: observations ai can be deleted or updated; or

• accumulative or cash-register model: each ai is an increment to A, i.e. Ai = Ai−1 + ai.

Examples of stream applications for the three models include time series generated by sensor networks (insert-only), radio frequency identification (insert-delete), and monitoring the maximum or minimum of the sum of quantities transactioned per entity (accumulative) (Gama & Rodrigues, 2007). Following the concept of Data Base Management Systems, systems that implement, monitor and manage data streams, irrespective of the implied model, are called Data Streams Management Systems (Muthukrishnan, 2005).

2.2.2

Data Streams Management Systems

A problem that clearly illustrates the issues in stream processing is to find the maximum or the minimum of the w most recent values in a data stream (Datar et al., 2002). When the available memory is higher than w, all the recent elements fit in memory, so the problem is trivial and we can find the exact solution. When w is greater than the available memory, there is no algorithm that provides an exact solution, because it is not possible to store all the data elements (Gama & Rodrigues, 2007). For example, suppose that the sequence is monotonically decreasing and we are monitoring the maximum value: the first element in the window is always the maximum. As new data arrives, the exact answer requires maintaining all the w elements in memory.

From the point of view of a data stream management system, several research issues emerge (Babcock et al., 2002). Blocking query operators are query operators that are unable to produce the first tuple of the output before they have seen the entire input. In the stream setting, continuous queries using blocking operators are problematic, so the semantics of such streaming operators is an active research area. This type of query requires techniques for storing summaries or synopsis information about previously seen data, assuming that there is a trade-off between the size of summaries and the ability to provide precise answers, especially because the time dimension is an important characteristic in streams. A summary of the differences between traditional (batch) and streaming data processing is presented in Table 2.1 (Gama & Rodrigues, 2007).

Table 2.1: Differences between traditional (batch) and streaming data processing.

                        Batch       Stream
Nr of passes on data    Multiple    One (or less)
Available time          Unlimited   Restricted
Available memory        Unlimited   Restricted
Results accuracy        Accurate    Approximate

2.2.3

Learning from Data Streams

In streaming scenarios, data flows at huge rates, reducing the ability to store and analyze it. The challenging problem in learning from data streams is the ability to permanently maintain an accurate decision model. This issue requires learning algorithms that can modify the current model whenever new data is available, at the rate of data arrival. Learning techniques which operate through fixed training sets and generate static models are obsolete in these contexts. Faster answers are usually required, keeping an anytime model of the data and enabling better decisions, possibly forgetting older information (Gama & Rodrigues, 2009).

As for their batch relatives, learning from data streams problems can also be segmented into supervised and unsupervised learning problems (we will not cover semi-supervised learning in this exposure). Supervised learning tasks are those where there is a set of variables that might be denoted as independent or inputs, which are measured or preset, and which might have some influence on one or more dependent variables or outputs (Hastie et al., 2001). Unsupervised learning tasks are those where there is a set of observations of a random p-vector X with a joint probability P(X), and the goal is to directly infer the properties of this probability density without a supervisor giving a degree-of-error for each observation (Hastie et al., 2001). Simply put, all variables are seen as factors for the joint probability, although they might not be independent among each other (Domingos & Pazzani, 1997).

Hulten et al. (2001) presented some desirable properties for data stream learning systems. Overall, they should process examples at the rate they arrive, use a single scan of data and fixed memory, maintain a decision model at any time, and be able to adapt the model to the most recent data. The sequences of data points are not independent, and are not generated by stationary distributions. We need dynamic models that evolve over time and are able to adapt to changes in the distribution generating examples. If the process is not strictly stationary (as in most real-world applications), the target concept may gradually change over time. Hence, data stream mining is an incremental task that requires incremental learning algorithms that take drift into account (Gama et al., 2004).

2.3

Summarization of Streaming Data

Data streams are unbounded in length. It is impractical to store all observations to execute queries that reference past data. High-quality approximate answers can be an acceptable solution. In such cases, queries are executed over a summary, a compact data structure that captures the data distribution. Certainly, large summaries provide more precise answers. Either way, there is a trade-off between the size of summaries, the overhead to update them, and the ability to give precise answers. In both cases, they must use data structures that can be maintained incrementally. In this section we shall focus on synopses and histograms, given their simple approximation capabilities and low resource requirements.

Several techniques have been developed for storing summaries or synopsis information about past data. Datar et al. (2002) presented simple counting problems that clearly illustrate the issues in data stream research, such as: given a stream of bits (0's and 1's), keep a count of the number of 1's in the last N elements seen in the stream; given a stream of elements that are positive integers within a known range, maintain at every time instant the sum of the last N elements seen in the stream; or find the number of distinct values in a stream of values with a known domain. All these problems have an exact solution if we have enough memory to store all the elements in the sliding window. Stream methods should offer approximate answers using reduced resources, which is especially useful if the associated error is within an admissible boundary.

Approximation methods in the data streams setting can be defined using the (ε, δ)-approximation schema (Gama & Rodrigues, 2007): given any positive numbers ε < 1 and δ < 1, compute an estimate that, with probability 1 − δ, is within relative error ≤ ε. The time and space required to compute an answer depend on ε and δ. Some results on tail inequalities provided by statistics are useful to accomplish this goal. The basic general bounds on the tail probability of a random variable (that is, the probability that a random variable deviates greatly from its expectation) include the Markov, Chebyshev and Chernoff inequalities (Motwani & Raghavan, 1995). However, usual stream data points are not produced at random according to a stable distribution, diminishing the possibility of probabilistic error bounds. Nevertheless, to achieve approximate results, a clear definition of how the exact result would be computed must be considered, to enable a validating comparison.

2.3.1

Elementary Statistics

One of the statistics with the highest usability/complexity ratio in statistics and machine learning is the sample mean of a variable X. In data streams, the recursive version after n observations is well-known:

$$\bar{X}_n = \frac{(n-1) \times \bar{X}_{n-1} + x_n}{n} \qquad (2.1)$$

In fact, to incrementally compute the mean of a variable, we only need to keep in memory the number of observations (n) and the sum of the values seen so far, Σxi. Some simple mathematics allow us to define an incremental version of the variance (hence, the standard deviation). In that case we need to store three quantities: the number of data points n; the sum of the n points, Σxi; and the sum of the squares of the n data points, Σxi².

$$\sigma_n^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n-1} \qquad (2.2)$$

Another useful measure that can be recursively computed is the correlation coefficient between two variables X and Y. In addition to the statistics of each variable, one needs to maintain the sum of the cross-products of each observation, Σi(xi yi). The exact correlation coefficient is:

$$\rho_n = \frac{\sum (x_i y_i) - \frac{\sum x_i \sum y_i}{n}}{\sqrt{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}\;\sqrt{\sum y_i^2 - \frac{(\sum y_i)^2}{n}}} \qquad (2.3)$$

The main interest of these formulas is that they allow the support of exact statistics over an eventually infinite sequence of numbers without storing all the numbers in memory. Although these statistics are used everywhere, they are of limited use in the stream problem setting, given their use of all previous data.
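These incremental formulas translate directly into a small amount of state per stream. The following is a minimal sketch (the class and its names are illustrative, not part of any published implementation) that maintains the mean, variance and correlation of two variables using only the six sufficient statistics mentioned above:

from math import sqrt

class IncrementalStats:
    """Keeps only n, sum(x), sum(x^2), sum(y), sum(y^2) and sum(x*y)."""
    def __init__(self):
        self.n = 0
        self.sx = self.sxx = 0.0   # sum of x and of x^2
        self.sy = self.syy = 0.0   # sum of y and of y^2
        self.sxy = 0.0             # sum of cross-products x*y

    def update(self, x, y):
        self.n += 1
        self.sx += x; self.sxx += x * x
        self.sy += y; self.syy += y * y
        self.sxy += x * y

    def mean_x(self):
        return self.sx / self.n

    def var_x(self):               # equation (2.2)
        return (self.sxx - self.sx ** 2 / self.n) / (self.n - 1)

    def corr(self):                # equation (2.3)
        num = self.sxy - self.sx * self.sy / self.n
        den = sqrt(self.sxx - self.sx ** 2 / self.n) * \
              sqrt(self.syy - self.sy ** 2 / self.n)
        return num / den

stats = IncrementalStats()
for x, y in [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]:   # any stream of pairs
    stats.update(x, y)
print(stats.mean_x(), stats.var_x(), stats.corr())

Each update is constant time and constant space, which is exactly what makes these statistics viable over unbounded streams.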

2.3.2

Sampling for Data Reduction

In streaming settings, we are in the presence of a potentially infinite number of examples. Sometimes, sampling a large set is enough to fulfil the task at hand, even if the task implies machine learning. Sampling involves loss of information: some tuples are selected for processing, while others are skipped. Instead of dealing with an entire data stream, we can sample instances at periodic intervals. If the rate of arrival of data in the stream is higher than the capacity to process it, sampling is used as a method to slow down data (Gama & Rodrigues, 2007). But the rationale for sampling must be clear, in order not to lose relevant information.

Traditional sampling algorithms require knowledge of the number of tuples; this way, they are not applicable to the streaming setting. Data streams require sequential random sampling. The reservoir sampling technique (Vitter, 1985) is the classic algorithm to maintain an online random sample. The basic idea consists of maintaining a sample of size s, called the reservoir. As the stream flows, every new element has a certain probability of replacing an old element in the reservoir. Extensions to maintain a sample of size k over a count-based sliding window of the n most recent data items from data streams appear in (Babcock et al., 2002). Sampling is a general method for solving problems with huge amounts of data, and has been used in most streaming problems. An active research area is the design of sampling-based algorithms that can produce approximate answers with guarantees on the error bound (Gama & Rodrigues, 2007). As a matter of fact, sampling works by providing a compact description of much larger data. Alternative ways to obtain compact descriptions include synopses, histograms and frequent item sets.
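A minimal sketch of the reservoir technique, in the classic formulation of Vitter's Algorithm R, is shown below (the function name is illustrative):

import random

def reservoir_sample(stream, s):
    """Keep a uniform random sample of size s from a stream of unknown length."""
    reservoir = []
    for i, element in enumerate(stream):
        if i < s:
            reservoir.append(element)    # fill the reservoir first
        else:
            # element i is kept with probability s/(i+1); if kept, it
            # replaces a uniformly chosen old element of the reservoir
            j = random.randint(0, i)
            if j < s:
                reservoir[j] = element
    return reservoir

print(reservoir_sample(range(10000), s=10))

At every moment, each element seen so far has the same probability of being in the reservoir, without knowing the stream length in advance.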

2.3.3

Window Models

In most streaming applications, recent data is the most relevant. To target this subset of data, a popular approach consists of defining a time window covering the most recent data. In fact, time windows are a commonly used approach to solve queries in open-ended data streams. Instead of computing an answer over the whole data stream, the query (or operator) is computed, eventually several times, over a finite subset of tuples. In this model, a time stamp is associated with each tuple; the time stamp defines when a specific tuple is valid (e.g., inside the window) or not. Several window models have been used in the literature. The most relevant are: landmark, sliding and time-biased windows.

2.3.3.1

Landmark Windows

Landmark windows (Gehrke et al., 2001) identify relevant points (the landmarks) in the data stream, and the aggregate operator uses all records seen so far after the landmark. Successive windows share some initial points and are of growing size. In some applications, the landmarks have a natural semantic. For example, in daily aggregates the beginning of the day is a landmark. Examples of the use of such windows appear in (Rodrigues & Gama, 2009; Rodrigues et al., 2008d), where this approach presented multiple advantages. In the former reference, the approach resulted not only in the aggregation of high-speed data in natural time windows, but also in the aggregation of data coming from multiple sources at different rates (Rodrigues & Gama, 2009). In the latter, the improvement was in two main characteristics of stream learning systems: update time and memory consumption. Both reduce whenever the hierarchical clustering structure grows, which is a major achievement accomplished by the used landmark windows, since only dissimilarities at the leaves must be computed (Rodrigues et al., 2008d). This way, every time the system grows it becomes faster, overcoming the bottleneck of having to compute all dissimilarities at root level, which is known to have quadratic complexity in the number of streams (Rodrigues & Gama, 2007).

2.3.3.2

Sliding Windows

Most of the time, we are only interested in computing statistics over the strictly recent past. The simplest approach uses sliding windows of fixed size w. This type of window is similar to a first in, first out data structure: whenever an element xi is observed and inserted in the window, another element xi−w is forgotten. This is probably the most common approach in algorithms focusing on evolving recent data. However, due to the need to forget old observations, we need to maintain in memory all the observations inside the window. Nevertheless, sliding window approaches have been successfully applied in summarizing recent data. For example, a recent work showed that, when dealing with the evaluation of stream learning algorithms, the window size does not matter too much: the prequential error estimated over a sliding window always converges fast to the holdout estimate, being on the other hand better suited for data streams (Gama et al., 2009b). Moreover, the use of sliding windows enables a smooth visualization and analysis of the quality of different measures and the evolution of that quality over time, for example, estimates of the reliability of single predictions (Rodrigues et al., 2008a).

2.3.3.3

Time-Biased Window Models

Previous window models use a catastrophic forgetting, that is, any past observation either is inside the window or it is not. Usually, in streaming settings, the concept generating data evolves smoothly, so old data is less, but still, important (Gama et al., 2004). A smoother approach uses tilted time windows, where the time scale is compressed. The most recent data is stored inside the window at the finest detail (granularity). Older information is stored at a coarser detail, in an aggregated way, with the level of granularity depending on the application. Tilted time windows can be designed in several ways. Han and Kamber (2001) present two possible variants: natural tilted time windows, and logarithmic tilted windows. In the first case, data is stored with granularity according to a natural time taxonomy: last hour at a granularity of fifteen minutes (4 points), last day in hours (24 points), last month in days (32 points) and last year in months (12 points). In the case of logarithmic tilted windows, given a maximum granularity with periods of t, the granularity decreases logarithmically as data gets older. As time goes by into the past, the window stores the last time period t, the one before that, and consecutive aggregates of lesser granularity (2 periods, 4 periods, 8 periods, etc.).

In data stream scenarios, recent data is usually more important than old data (Gama & Rodrigues, 2007). Even within a sliding window, the most recent data point is usually more important than the one which is about to be discarded. This way, a simple approach could consider giving weights to data points depending on their age within the sliding window. Several weighting models could apply: linear, log-linear, etc. Given its particular characteristics, a good approach for data streams uses an exponential schema, where the weight of a data point decreases exponentially with time. Let i be the number of observations of a given variable X from which we are monitoring a sliding window of size w. The α-weighted window is the set of points

$$\dot{X}_{\alpha,w}(i) = \left\{\, \alpha^{i-j} x_j \mid j \in \left]\,i-w,\ i\,\right],\ 0 < \alpha < 1 \,\right\}.$$

The main advantages of this window model are two-fold. First, compared to traditional sliding windows, more importance is given to recent data points, as the weight of each observation decreases exponentially with time. Second, compared to other weighting approaches, it can be maintained on the fly. At first, we are tempted to say that the set must be recomputed at every new observation i. However, given its exponential definition, we can compute it using the recursive form

$$\dot{X}_{\alpha,w}(i) = \{x_i\} \cup \left(\alpha \times \dot{X}_{\alpha,w}(i-1) \setminus \{\dot{x}_{i-w}\}\right).$$

The main feature of the weighted sliding window model is the use of the smooth forgetting recently proposed as a fading factor (Gama et al., 2009b). In fact, this approach was firstly proposed to deal with the evaluation of learning algorithms and the usage of all previous data. However, its application to window models presents advantages for the processing of data streams, as its recursive form enables the computation of weighted data points on the fly.
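Read as an update procedure, the recursive form yields the following minimal sketch (class and parameter names are illustrative). Note that this didactic version touches all w stored points per update; when only sums over the weighted points are needed, the next section shows that the aggregates themselves can be updated in constant time.

from collections import deque

class AlphaWeightedWindow:
    """Maintains the alpha-weighted window on the fly, per the recursive form."""
    def __init__(self, w, alpha):
        assert 0 < alpha < 1
        self.w, self.alpha = w, alpha
        self.points = deque()            # weighted points, oldest first

    def update(self, x):
        # multiply every stored weighted point by alpha ...
        for k in range(len(self.points)):
            self.points[k] *= self.alpha
        # ... forget the oldest point if the window is full ...
        if len(self.points) == self.w:
            self.points.popleft()
        # ... and insert the new point with weight alpha^0 = 1
        self.points.append(x)

window = AlphaWeightedWindow(w=100, alpha=0.97)
for x in [1.0, 2.0, 3.0, 4.0]:
    window.update(x)
print(list(window.points))   # [1*0.97^3, 2*0.97^2, 3*0.97, 4.0]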

2.3.3.4

Memoryless Fading Windows

To avoid keeping all data in the window when computing statistics based on sums of the data points, and in order to include a smooth forgetting of information, the previous approach can be applied to achieve an approximated value for the elementary statistics on a data stream. This is strongly related to a weighted sum of the points in the sliding window, with more weight being given to the most recent data points. The fading factors are memoryless, an important property in streaming scenarios (Gama et al., 2009b). Using the exponential weights introduced in the weighted window model, but applying them to all data points seen so far, similar statistics can be computed, which we call fading statistics.

The application of fading factors (which approximate the fading window model) has been used in recent works. For example, given the fact that the prequential error (Dawid, 1984) is based on the sum of errors along the stream, fading factors can be applied to achieve a memoryless approach to its computation over a sliding window. In a recent work, the authors have shown that the fading prequential error converges to the holdout estimate and is equivalent to the prequential error on a sliding window (Gama et al., 2009b). In the same work, the authors also embedded fading factor techniques in statistical tests for comparing stream classification problems and for change detection. Overall, the authors reported that the use of fading factors in the McNemar and Page-Hinkley tests gave results similar to the use of sliding windows (Gama et al., 2009b).
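As a minimal sketch of such fading statistics (names are illustrative), a fading mean can be maintained with just two aggregates, each decayed by the fading factor before every increment, so that old observations smoothly lose influence without any window being stored:

class FadingMean:
    """Memoryless fading mean: stores only a fading sum and a fading count."""
    def __init__(self, alpha=0.99):
        self.alpha = alpha
        self.s = 0.0    # fading sum of the values
        self.n = 0.0    # fading count of the observations

    def update(self, x):
        self.s = self.alpha * self.s + x
        self.n = self.alpha * self.n + 1.0
        return self.s / self.n   # current fading mean

fm = FadingMean(alpha=0.95)
for x in [10.0, 10.0, 10.0, 20.0]:
    print(fm.update(x))          # drifts towards the recent value 20

The same decay-then-increment pattern applies to any statistic built from sums, such as the fading prequential error mentioned above.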

2.3.4

Online Histograms

Histograms are summarization techniques that can be used to approximate the frequency distribution of element values in a data stream. A histogram is defined by a set of k non-overlapping intervals, each interval being defined by its boundaries and a frequency count. The basic algorithm to build histograms consists of sorting the values of the random variable and placing them into bins; then, it counts the number of data samples in each bin. The height of the bar drawn on top of each bin is proportional to the number of observed values in that bin. The most used histograms are either equal width, where the range of observed values is divided into k intervals of equal length (∀i, j : (bi − bi−1) = (bj − bj−1)), or equal frequency, where the range of observed values is divided into k bins such that the counts in all bins are equal (∀i, j : (fi = fj)). The reference technique is the V-Optimal histogram (Guha et al., 2006), which defines intervals that minimize the frequency variance within each interval. However, in the context of open-ended data streams it is not appropriate to use traditional histograms to construct a graphical representation of continuous data, because they require knowledge of all the data (Sebastião et al., 2010).

Discretization of continuous attributes is an important task for certain types of machine learning algorithms. Although discretization is a well-known topic in data analysis and machine learning, most of the works refer to batch discretization, where all the examples are available for discretization. Few works refer to incremental discretization. However, it is commonly seen as an essential tool for high-speed processing of data streams with limited memory resources (Gama & Rodrigues, 2007). For example, grid clustering algorithms operate on discrete cells to define dense regions of points (Wang et al., 1997). The Partition Incremental Discretization (PiD) algorithm (Gama & Pinto, 2006) is a recent example of a streaming algorithm that tries to achieve the aforementioned goal. It consists of two layers: the first layer simplifies and summarizes the data, while the second layer constructs the final grid. The process of updating the first layer works online, doing a single scan over the data stream, hence being able to process infinite sequences of data, processing each example in constant time and (almost) constant space. The update process of the second layer works online along with the first layer. For each new example xi, the system increments the counter in the second-layer cell that includes the triggered first-layer cell. The grid thus defined represents an approximated histogram of the variable (Sebastião & Gama, 2007).

Another valuable approach are exponential histograms, which are frequently used to solve counting problems (Datar et al., 2002). Consider the problem of counting the number of 1's in a sliding window of size w over a bit stream. The problem is trivial if the window can fit in memory. Datar et al. (2002) presented an exponential histogram strategy to solve this problem requiring only O(log(w)) space. The basic idea consists of keeping buckets of different sizes to hold the data, and monitoring the last bucket's size and the total size of the buckets. The main property of exponential histograms is that the bucket sizes grow exponentially, i.e. 2^0, 2^1, 2^2, ..., 2^h. To store w elements in the sliding window, only O(log(w)/ε) buckets are needed to maintain the moving sum and the error estimate. Datar et al. (2002) proved that the error of this approach is bounded within a relative error ε.
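The two-layer scheme behind PiD can be sketched roughly as follows. This is an illustrative reading of the idea, not the published PiD implementation: it omits PiD's adaptive splitting of overloaded first-layer cells and fixes both grids in advance, and all names and parameters are hypothetical.

import random

class TwoLayerHistogram:
    """Layer 1: fine equal-width grid, updated in constant time per example.
    Layer 2: coarser final histogram aggregating groups of layer-1 cells."""
    def __init__(self, lo, hi, n1=100, n2=10):
        self.lo, self.hi = lo, hi
        self.n1, self.n2 = n1, n2
        self.layer1 = [0] * n1           # fine counts (summary of the data)
        self.layer2 = [0] * n2           # coarse counts (final grid)

    def update(self, x):
        # constant-time update: find the triggered layer-1 cell ...
        c1 = min(self.n1 - 1, max(0, int((x - self.lo) /
                 (self.hi - self.lo) * self.n1)))
        self.layer1[c1] += 1
        # ... and increment the layer-2 cell that includes it
        self.layer2[c1 * self.n2 // self.n1] += 1

hist = TwoLayerHistogram(lo=0.0, hi=1.0)
for _ in range(10000):
    hist.update(random.random())
print(hist.layer2)   # approximately uniform counts for uniform data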

2.3.5

Monitoring Frequent Items

The problem of finding the most frequent items in a data stream S of size N is, roughly put, to find the elements ei whose relative frequency fi is higher than a user-specified support φN, with 0 ≤ φ ≤ 1. Given the space requirements that exact algorithms addressing this problem would need (Charikar et al., 2002), several algorithms have been proposed to find the top-k frequent elements, being roughly classified into counter-based and sketch-based (Metwally et al., 2005). Counter-based techniques keep counters for each individual element in the monitored set, which is usually a lot smaller than the entire set of elements. When an element which is not currently being monitored is seen, different algorithms take different actions in order to adapt the monitored set accordingly. Sketch-based techniques provide less rigid guarantees, but they do not monitor a subset of elements, providing frequency estimators for the entire set. For a deeper study of this subject, Cormode & Hadjieleftheriou (2010) presented an extensive review of methods for mining frequent items in data streams.

2.3.5.1

Counter-based Techniques

Simple counter-based algorithms such as Sticky Sampling and Lossy Counting were proposed in (Manku & Motwani, 2002), which process the stream in reduced size. Yet, they suffer from keeping a lot of irrelevant counters. Frequent (Demaine et al., 2002) keeps only k counters for monitoring k elements, incrementing the counter of each monitored element when it is observed, and decrementing all counters when an unmonitored element is observed. Zeroed-counter elements are replaced by new unmonitored elements. This strategy is similar to the one applied by Space-Saving (Metwally et al., 2005), which gives guarantees for the top-m most frequent elements.
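A minimal sketch of the Space-Saving counter update is shown below (the function and variable names are illustrative): at most m counters are kept, and an unmonitored element replaces the element with the minimum count, inheriting that count plus one.

def space_saving(stream, m):
    """Approximate the m most frequent elements of a stream with m counters."""
    counters = {}
    for e in stream:
        if e in counters:
            counters[e] += 1
        elif len(counters) < m:
            counters[e] = 1
        else:
            # replace the element with the minimum count; the newcomer
            # inherits that count plus one (an over-estimate of its frequency)
            e_min = min(counters, key=counters.get)
            counters[e] = counters.pop(e_min) + 1
    return counters

stream = list("aabacabadabacaab")
print(space_saving(stream, m=3))   # 'a' clearly dominates the counts

Inheriting the evicted count is what yields the algorithm's guarantee: each counter over-estimates the true frequency of its element by at most the minimum count.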

2.3.5.2

Sketch-based Techniques

Sketch-based algorithms usually focus on families of hash functions which project the counters into a new space, keeping frequency estimators for all elements. The guarantees are less strict, but all elements are monitored. The CountSketch algorithm (Charikar et al., 2002) solves the problem with a given success probability, estimating the frequency of an element by finding the median of its representative counters, which implies sorting the counters. Also, the GroupTest method (Cormode & Muthukrishnan, 2003) employs expensive probabilistic calculations to keep the majority elements within a given probability of error. Although generally accurate, its space requirements are large and no information is given about frequencies or ranking.

2.4

Cluster Analysis over Data Streams

The main focus of this thesis is clustering techniques for ubiquitous data streams. We shall explore the clustering problem in its streaming applications, keeping in mind its further application to ubiquitous scenarios. Clustering is probably the most frequently used data mining algorithm, used in exploratory data analysis (Halkidi et al., 2001), consisting of the process of partitioning data into groups, where elements in the same group are more similar than elements in different groups. It is known that solving a clustering problem is equivalent to finding the global optimal solution of a non-linear optimization problem, hence NP-hard, suggesting the use of optimization heuristics (Bern & Eppstein, 1996).

2.4.1

Clustering in Streaming Scenarios

The main problem in applying clustering to data streams is that systems should consider data evolution, being able to compress old information and adapt to new concepts. The range of clustering algorithms that operate online over data streams is wide, including partitional (Bradley et al., 1998; O'Callaghan et al., 2002), hierarchical (Zhang et al., 1997; Aggarwal et al., 2003), density-based (Ester et al., 1996; Sheikholeslami et al., 1998) and grid-based (Wang et al., 1997; Park & Lee, 2004) methods. A common connecting feature is the definition of unit cells or representative points, from which clustering can be obtained with less computational costs (Aggarwal et al., 2003).

However, two different clustering problems exist: clustering data streams and clustering streaming data sources. The next two sections present current knowledge on each of the two problems. There are other research issues still open in cluster analysis from data streams. For example, monitoring cluster transitions (Oliveira & Gama, 2010) is a valuable task that can give insights on the evolution of clusters (Spiliopoulou et al., 2006). Furthermore, the definition of clear strategies for monitoring and evaluating the unsupervised learning process, with valuable validity indices, is still uncharted territory, having been targeted only for supervised tasks (Gama et al., 2009b). Future research is required to address these issues.

2.4.1.1

Clustering Data Streams

Clustering data streams is the task of clustering data flowing from a continuous stream, based on the similarity of data points (Aggarwal et al., 2003), aiming to discover structures in data over time (Barbará, 2002; Guha et al., 2003). Algorithms usually search for dense regions of the data space, identifying hot-spots where streaming data sources tend to produce data (Rodrigues et al., 2010b). For example, in a sensor network, sensor 1 is at high values when sensor 2 is in mid-range values, and this happens more often than any other combination. Clustering streaming examples over time presents adaptivity issues that might be optimized by evolutionary clustering (Chakrabarti et al., 2006). However, the need to detect and track changes in clusters is not enough; it is also often required to provide some information about the nature of the changes (Spiliopoulou et al., 2006).

2.4.1.2

Clustering Streaming Data Sources

Clustering streaming data sources is the task of clustering different sources of data streams, based on the similarity of the data series (Rodrigues et al., 2008d). Algorithms aim to find groups of data sources that behave similarly through time. For example, in the same network, sensors 1 and 2 are highly correlated, in the sense that when one's values are increasing the other's are also increasing. This is highly related to whole clustering of time series, so most of the existent techniques can be successfully applied, but only if incremental versions are possible.

2.4.2

Clustering Data Streams

Assuming that available datasets for clustering problems are small, previous algorithms tend to maximize the quality of the solution. However, nowadays the existence of datasets of considerably larger size than the time and memory capabilities of those algorithms has forced the emergence of research focused on clustering algorithms that read each example only once. As the expansion of the Internet continues and ubiquitous computing becomes a reality, we can expect that such data volumes will become the rule rather than the exception (Domingos & Hulten, 2000). Having this in mind, a streaming clustering algorithm must observe certain characteristics (Barbará, 2002; Bradley et al., 1998):

• inclusion of a compact representation;

• quick and incremental processing of new examples;

• execution of only one pass (or less) over the dataset;

• anytime available response, with information on progress, time left, etc.;

• ability of suspension, stop and resume, for incremental processing;

• ability of incorporating new data in a previously defined model;

• running inside a limited RAM buffer.

One problem that usually arises with this sort of model is the definition of the minimum number of observations necessary to assure convergence. Techniques based on the Hoeffding bound (Hoeffding, 1963) can be applied to solve this problem, and have in fact been successfully used in online decision trees (Domingos & Hulten, 2000; Hulten et al., 2001; Gama et al., 2005).
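For reference, the Hoeffding bound guarantees that, after n independent observations of a real-valued random variable with range R, the sample mean deviates from the expected mean by more than

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

with probability at most δ. This allows an algorithm to decide, with confidence 1 − δ and independently of the distribution generating the data, that it has seen enough observations to commit to a decision.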

One of the first clustering systems developed, being both incremental and hierarchical, was COBWEB, a conceptual clustering system that executes a hill-climbing search on the space of hierarchical categorizations (Fisher, 1987). This method incrementally incorporates objects into a probabilistic categorization tree, where each node is a probabilistic concept representing a class of objects. The gathering of this information is made by means of the categorization process of the object down the tree, updating counts of sufficient statistics while descending the nodes, and executing one of several operations: classify an object according to an existent cluster, create a new cluster, combine two clusters, or divide one cluster into several ones.

In the following, we analyze clustering algorithms separately based on two discriminant characteristics: strategy and unit of analysis. Regarding strategy, we distinguish between partitional and hierarchical methods. With respect to the unit of analysis, we distinguish between point-based and grid-based clustering.

2.4.2.1

Partitioning Methods

Bradley et al. (Bradley et al., 1998) proposed the Single Pass K-Means, an algorithm that aims at increasing the capabilities of k-means for large datasets. The main idea is to use a buffer where points of the dataset are kept in a compressed way. Extensions to this algorithm appear in (Farnstrom et al., 2000) and (O'Callaghan et al., 2002). The STREAM system (O'Callaghan et al., 2002) has the goal of minimizing the sum of squared differences (as in k-means) keeping as restriction the use of available memory. STREAM processes data in batches of m points, which are stored in a buffer in main memory. After filling the buffer, STREAM clusters the buffer into k clusters. It then summarizes the points in the buffer by retaining only the k centroids along with the number of examples in each cluster. STREAM discards all the points but the centroids, weighted by the number of points assigned to them. The buffer is filled in with new points and the clustering process is repeated using all points in the buffer. This approach results in a one-pass, constant-factor approximation algorithm (O'Callaghan et al., 2002). The main problem is that STREAM never considers data evolution: the resulting clustering can become dominated by the older, outdated data of the stream. An interesting aspect of this algorithm is the ability to compress old information, a relevant issue in data stream processing.
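The buffering scheme can be made concrete with a rough sketch. The version below is an illustrative reading only: it assumes two-dimensional points and a stream with at least k points, and it replaces the constant-factor clustering routine of O'Callaghan et al. with plain weighted Lloyd iterations.

import random

def weighted_kmeans(points, weights, k, iters=10):
    """Lloyd iterations over weighted 2-D points; centers seeded at random."""
    centers = random.sample(points, k)
    for _ in range(iters):
        acc = [[0.0, 0.0, 0.0] for _ in range(k)]   # [sum_x, sum_y, weight]
        for (x, y), w in zip(points, weights):
            j = min(range(k), key=lambda c: (x - centers[c][0]) ** 2 +
                                            (y - centers[c][1]) ** 2)
            acc[j][0] += w * x
            acc[j][1] += w * y
            acc[j][2] += w
        centers = [(a[0] / a[2], a[1] / a[2]) if a[2] > 0 else centers[j]
                   for j, a in enumerate(acc)]
    weights = [a[2] for a in acc]   # examples represented by each centroid
    return centers, weights

def stream_cluster(stream, k=3, m=500):
    buffer, weights = [], []
    for point in stream:
        buffer.append(point)
        weights.append(1.0)
        if len(buffer) >= m:
            # buffer full: compress it into k centroids weighted by counts
            buffer, weights = weighted_kmeans(buffer, weights, k)
    return weighted_kmeans(buffer, weights, k)[0]

The key point reproduced here is the compression step: after each round, only k weighted centroids survive, so memory stays bounded while every point still influences the final result through the weights.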

2.4.2.2

Hierarchical Methods

One major achievement in this area of research was the Balanced Iterative Reducing and Clustering using Hierarchies system (Zhang et al., 1997). The BIRCH system builds a hierarchical structure of data, the CF-tree, a balanced tree where each node is a tuple (Clustering Feature). A clustering feature contains the sufficient statistics for a given cluster: the number of points, the sum of each feature's values, and the sum of the squares of each feature's values. Each clustering feature corresponds to a cluster. They are hierarchically organized in a CF-tree, where each non-leaf node aggregates the information gathered in its children nodes. This algorithm tries to find the best groups with respect to the available memory, while minimizing the amount of input and output. The CF-tree grows by aggregation, getting with only one pass over the data a result with complexity O(N). Another use of the CF-tree appears in (Aggarwal et al., 2003). More than an algorithm, CluStream is a complete system composed of two components, one online and another offline (Aggarwal et al., 2003). Structures called micro-clusters are locally kept, holding statistical information about the data. These structures are defined as a temporal extension of the clustering feature vectors presented in (Zhang et al., 1997), being kept as images through time, following a pyramidal form. This information is used by the offline component, which depends on a variety of user-defined parameters to perform the final clustering by an iterative procedure.

The CURE system (Guha et al., 1998), Clustering Using REpresentatives, performs a hierarchical procedure that assumes an intermediate approach between centroid-based and all-point-based techniques. In this method, each cluster is represented by a constant number of points well distributed within the cluster, which capture the extension and shape of the cluster. This process permits the identification of clusters with arbitrary shapes. The CURE system also differs from BIRCH in the sense that, instead of pre-aggregating all the points, it gathers a random sample of the dataset, using Chernoff bounds (Motwani & Raghavan, 1995) in order to obtain the minimum number of examples.

2.4.2.3

Point-based Clustering

Several algorithms operate over summaries or samples of the original stream. Bradley et al. (Bradley et al., 1998) proposed the Single Pass K-Means, increasing the capabilities of k-means for large datasets by using a buffer where points of the dataset are kept in a compressed way. The STREAM system (O'Callaghan et al., 2002) can be seen as an extension of (Bradley et al., 1998) which keeps the same goal but has as restriction the use of available memory. After filling the buffer, STREAM clusters the buffer into k clusters, retaining only the k centroids weighted by the number of examples in each cluster. The process is iteratively repeated with new points. The BIRCH hierarchical method (Zhang et al., 1997) uses Clustering Features to keep sufficient statistics for each cluster at the nodes of a balanced tree, the CF-tree. Given its hierarchical structure, each non-leaf node in the tree aggregates the information gathered in the descendant nodes. This algorithm tries to find the best groups with respect to the available memory, while minimizing the amount of input and output. Another use of the CF-tree appears in (Aggarwal et al., 2003). A different strategy is used in another hierarchical method, the CURE system (Guha et al., 1998), where each cluster is represented by a constant number of points well distributed within the cluster, which capture the extension and shape of the cluster. This process allows the identification of clusters with arbitrary shapes on a random sample of the dataset, using Chernoff bounds in order to obtain the minimum number of required examples. The same principle of error-bounded results was recently used in VFKM to apply consecutive runs of k-means, with an increasing number of examples, until the error bounds are satisfied (Domingos & Hulten, 2001). This strategy supports itself on the idea of guaranteeing that the clustering definition does not differ significantly from the one gathered with infinite data. Hence, it does not consider data evolution.

2.4.2.4

Grid-based Clustering

The main focus of grid-based algorithms is the so-called spatial data, which model the geometric structure of objects in space. These algorithms divide the data space into small units, defining a grid, and assign each object to one of those units, proceeding with divisive and agglomerative operations hierarchically. These features make this type of method similar to hierarchical algorithms, with the main difference of applying operations based on a parameter rather than on the dissimilarities between objects.

A sophisticated example of this type of algorithm is STING (Wang et al., 1997), where the space area is divided into cells with different levels of resolution, creating a layered structure. The main features and advantages of this algorithm include being incremental and capable of parallel execution. Also, the idea of dense units, usually present in density-based methods (Ester et al., 1996), has been successfully introduced in grid-based systems. The CLIQUE algorithm tries to identify sub-spaces of a large dimensional space which can allow a better clustering of the original data (Agrawal et al., 1998). It divides each dimension into the same number of equally ranged intervals, resulting in exclusive units. One unit is accepted as dense if the fraction of the total number of points within the unit is higher than a parameter value. A cluster is the largest set of contiguous dense units within a subspace. This technique's main advantage is the fact that it automatically finds subspaces of maximum dimensionality such that high-density clusters exist in those subspaces.

The Statistical Grid-based Clustering system (Park & Lee, 2004) was especially designed for data stream applications, where clusters are constituted by adjacent dense cells. It works by applying three different divisive methods, based on the statistics of objects belonging to each cell: µ-partition, which divides one cluster in two, setting the border at the mean of the parent group; σ-partition, which divides the group in two, one with 68% of the objects, belonging to [µ − σ, µ + σ] (assuming a normal distribution of objects), and another with the remaining tail objects; and a third method which combines the efficient features of the previous two, hybrid-partition. However, in distributed systems, the increase in communication given the need to keep sufficient statistics may be prejudicial.

2.4.3

Clustering Streaming Data Sources

The traditional knowledge discovery environment, where data and processing units are centralized in controlled laboratories and servers, has now been completely transformed into a web of sensorial devices, some of them enclosing processing ability. The task of clustering streaming data sources is not widely studied, so we should start by formally introducing it. Data streams usually consist of variables producing examples continuously over time. The basic idea behind clustering streaming data sources is to find groups of sources that behave similarly through time, which is usually measured in terms of distances between the streams.

Clustering data sources has already been studied in various fields of real-world applications. Many of them, however, could benefit from a data stream approach. For example:

• in electrical supply systems, clustering demand profiles (e.g., industrial or urban) decreases the computational cost of predicting each individual subnetwork load (Rodrigues & Gama, 2009);

• in medical systems, clustering medical sensor data (like ECG, EEG, etc.) is useful to determine correlations between signals (Sherrill et al., 2005);

• in financial markets, clustering stock price evolution helps on preventing bankruptcy (Mantegna, 1999).

All of these problems address data coming from a stream at a high rate. This way, data stream approaches should be considered to solve them. The goal of a clustering system for streaming data sources is to find (and make available at any time t) a partition P of those sources, where sources in the same cluster tend to be more alike than sources in different clusters. In partitional clustering, searching for k clusters, the result at time t should be a matrix P of n × k values, where each Pij is one if source xi belongs to cluster cj and zero otherwise. Specifically, we can inspect the partition of sources in a particular time window, from a starting time s until the current time t, which would give a temporal characteristic to the partition. In a hierarchical approach to the problem, the same possibilities apply, with the benefit of not having to previously define the target number of clusters, thus creating a structured output of the hierarchy of clusters.

2.4.3.1

Compare and Contrast

Clustering streaming data sources is in fact an emerging area of research that is closely connected to two other fields: clustering of time series, for its application in the variable domain; and clustering of streaming examples, for its application to data flowing from high-speed productive streams.

Clustering Time Series

Clustering time series can be seen as the batch parent of clustering streaming data sources, as it encloses the principle of comparing variables instead of examples. However, a lot of research has been done on clustering subsequences of time series, which has raised some controversy in the data mining community (Keogh et al., 2003; Idé, 2006). Nevertheless, clustering streaming data sources approaches whole clustering instead, so most of the existent techniques can be successfully applied, but only if incremental versions are possible. In fact, this is the major drawback of existent whole clustering techniques. Most of the work on clustering time series assumes the series are known in all their extent, failing to cope with the infinite number of observations usually inherent to streaming data. Incremental versions of this type of model are good indicators of what can be done to deal with the presented problem.

Clustering Streaming Examples

Clustering examples in a streaming environment is already widely spread in the data mining community as a technique used to discover structures in data over time (Barbará, 2002; Guha et al., 2003). This technique presents some proximity to our task, as both require high-speed processing of examples and compact representations of clusters. Many times, however, example clustering systems use representatives such as means or medoids in order to reduce dimensionality. This reduction is not so straightforward in systems developed to cluster streaming data sources, since a reduction in dimensionality based on representatives would have effect on the variables' dimension instead of the examples' dimension. Thus, to solve our problem, systems must define and maintain sufficient statistics used to compute (dis)similarities, in an incremental way, which is not an absolute requirement for the previous problem. Therefore, few of the previously proposed models can be adapted to this new task.

2.4.3.2

Existing Approaches

An area that is close to clustering streaming data sources is known as incremental clustering of time series. In fact, those approaches comply with almost all the features needed to address our problem. Their only restriction is their application to time series, assuming a fixed ordering of the examples. It is not unusual, in streaming environments, for examples to arrive without a specific ordering, creating a new drawback for time series related methods.

33

2. RATIONALE Composite Correlations

Wang and Wang introduced an ecient method for monitoring

composite correlations, i.e., conjunctions of highly correlated pairs of streams among multiple time series (Wang & Wang, 2003). They use a simple mechanism to predict the correlation values of relevant stream pairs at the next time position, using an incremental computation of the correlation, and rank the stream pairs carefully so that the pairs that are likely to have low correlation values are evaluated rst.

They have shown that the method signicantly

reduces the total number of pairs for which it is needed to compute the correlation values due to the conjunctive nature of the composites.

Nevertheless, this incremental approach

has some drawbacks when applied to data streams, and it is not a clustering procedure.
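The incremental computation of correlation referred to above can be supported by simple sufficient statistics; the following minimal sketch (a generic illustration, not Wang & Wang's actual procedure) maintains a Pearson correlation for one pair of synchronized streams:

    class IncrementalCorrelation:
        """Pearson correlation of two streams, kept from sufficient statistics
        (count, sums, sums of squares and sum of cross-products)."""

        def __init__(self):
            self.n = 0
            self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

        def update(self, x, y):
            self.n += 1
            self.sx += x; self.sy += y
            self.sxx += x * x; self.syy += y * y
            self.sxy += x * y

        def corr(self):
            # corr = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
            num = self.n * self.sxy - self.sx * self.sy
            den = ((self.n * self.sxx - self.sx ** 2) *
                   (self.n * self.syy - self.sy ** 2)) ** 0.5
            return num / den if den > 0 else 0.0

Since only five scalars are stored per pair, the statistic can be updated in constant time per example, which is what makes this kind of monitoring feasible over streams.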

Online K-Means

Although several versions of incremental, single-pass or online k-means may exist, this is a proposal that clearly aims at clustering streaming data sources (Beringer & Hüllermeier, 2006). The basic idea of Online K-Means is that the cluster centers computed at a given time are the initial cluster centers for the next iteration of K-Means. Some of the requirements are clearly met with this approach:



• an efficient preprocessing step is applied, which includes an incremental computation of the distance between data streams, using a Discrete Fourier Transform approximation of the original data; the K-Means procedure applied here is quadratic in the number of clusters, but linear in the number of elements of each block of data used at each iteration;

• at any time, the cluster centers are maintained, and a cluster definition might be extracted as a result;

• as a means of structural drift detection, the authors propose a fuzzy approach, which allows a smooth transition among clusters over time;

• the authors do not include ways to add new streams over time, which is the main drawback of this approach;

• the algorithm includes a procedure to dynamically update the optimal number of clusters at each iteration, by testing whether increasing or decreasing the number of clusters by one unit would produce better results according to a validation index.

This method's efficiency is mainly due to a scalable online transformation of the original data, which allows for a fast computation of approximate distances between streams.
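A minimal sketch of the warm-start strategy is shown below (a schematic illustration only: the DFT preprocessing and the adaptation of the number of clusters are omitted, and scikit-learn's KMeans is assumed to be available):

    import numpy as np
    from sklearn.cluster import KMeans

    def online_kmeans(blocks, k, seed=0):
        """Cluster a sequence of data blocks, warm-starting each K-Means run
        with the centers obtained on the previous block."""
        centers = None
        for block in blocks:                      # block: (n_streams, window) array
            init = centers if centers is not None else "k-means++"
            n_init = 1 if centers is not None else 10
            km = KMeans(n_clusters=k, init=init, n_init=n_init,
                        random_state=seed).fit(block)
            centers = km.cluster_centers_         # initial centers for next block
            yield km.labels_

Each block reuses the previous centers, so the partition evolves smoothly over time instead of being recomputed from scratch.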

COD

Clustering On Demand (COD) is a framework for clustering streaming data sources which performs one data scan for online statistics collection and keeps compact multi-resolution approximations, designed to address the time and space constraints of a data stream environment (Dai et al., 2006). It is divided into two phases: a first online maintenance phase providing an efficient algorithm to maintain summary hierarchies of the data streams and retrieve approximations of the sub-streams; and an offline clustering phase to define clustering structures of multiple streams with adaptive window sizes. Specifically:



• the system encloses an online summarization procedure that updates the statistics in constant time, based on wavelet-based fitting models;

• statistics are maintained at all times, creating a compact representation of the global similarities; this reduces the responsiveness of the system, as it implies a clustering query to extract the active clustering definition;

• the mechanism for detection of changes and trends is completely offline, and human-based, which clearly diminishes the adaptability of the procedure;

• the offline characteristic of the clustering process would allow the integration of new streams without much added complexity, yet no procedure was introduced to deal with this problem;

• given the offline query procedure, each query may ask for a different number k of objective clusters; nevertheless, this is a user-defined feature, reducing the adaptability of the process.

A good feature of the system is that it allows the user to query with different window sizes (different resolutions) without having to summarize the streams from scratch, and in time linear in the number of streams and data points.
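The multi-resolution summaries underlying such queries can be pictured with a simple pairwise-averaging (Haar-style) decomposition; the sketch below is a generic illustration of the idea, not COD's actual summary hierarchy:

    def haar_levels(values):
        """Return the stream summarized at successively coarser resolutions:
        level 0 is the raw window, each next level halves it by averaging pairs
        (a trailing unpaired value is simply dropped in this sketch)."""
        levels = [list(values)]
        while len(levels[-1]) > 1:
            prev = levels[-1]
            levels.append([(prev[i] + prev[i + 1]) / 2.0
                           for i in range(0, len(prev) - 1, 2)])
        return levels

    # Example: an 8-point window and its three coarser approximations.
    print(haar_levels([2, 4, 6, 8, 10, 12, 14, 16]))

Queries at a coarser resolution can then be answered from a higher level of the hierarchy, without revisiting the raw stream.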

ODAC

The Online Divisive-Agglomerative Clustering (ODAC) system is an incremental approach for clustering streaming data sources using a hierarchical procedure (Rodrigues et al., 2008d). It constructs a tree-like hierarchy of clusters of streams, using a top-down strategy based on the correlation between streams. The system also possesses an agglomerative phase to enhance a dynamic behavior capable of structural change detection. The splitting and agglomerative operators are based on the diameters of existing clusters and supported by a significance level given by the Hoeffding bound (Hoeffding, 1963). The main characteristics of the system are constant memory and constant time with respect to the number of examples. In ODAC, the system's space complexity is constant in the number of examples, even considering the infinite amount of examples usually present in data streams. An important feature of this algorithm is that every time a split is performed on a leaf with

n variables, the global number of dissimilarities that need to be computed at the next iteration diminishes by at least n−1 (worst-case scenario, a split into groups of 1 and n−1 variables) and at most (n/2)² (best-case scenario, a balanced split). The time complexity of each iteration of the system is constant given the number of examples, and decreases with every split occurrence, the system being therefore capable of addressing data streams. Regarding the usage of ODAC in scenarios where clustering of streaming sources is required, we see that:



• the update time and memory consumption do not depend on the number of examples, as the system gathers sufficient statistics to compute the correlations within each cluster; moreover, any time a split is reported, the system becomes faster, as fewer correlations must be computed;

• the system possesses an anytime compact representation, since a binary (Rodrigues et al., 2008d), fuzzy (Rodrigues & Gama, 2007) or ternary (Rodrigues & Gama, 2008) hierarchy of clusters is available at each time stamp, and it does not need to store anything more than the sufficient statistics and the last example to compute the first-order differences;

• an agglomerative phase is included to react to structural changes; these changes are detected by monitoring the diameters of existing clusters;

• this online system was not designed to include new streams along the execution; however, it could be easily extended to cope with this feature;

• given its hierarchical core, the system possesses an inherently adaptable configuration of clusters.

Overall, this is one of the systems clearly proposed to address clustering of streaming data sources. It copes with high-speed production of examples and reduced memory requirements, with constant update time. It also presents adaptability to new data, detecting and reacting to structural drift (Rodrigues & Gama, 2007).
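The Hoeffding-bound-supported splitting decision can be sketched as follows (a schematic rendering under the assumption of dissimilarities scaled to a known range R; the full ODAC criterion includes further details):

    import math

    def hoeffding_bound(R, delta, n):
        """Error margin after n observations of a variable with range R,
        guaranteed with probability 1 - delta (Hoeffding, 1963)."""
        return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

    def should_split(d1, d2, n, R=1.0, delta=0.05):
        """Suggest a split when the two largest intra-cluster dissimilarities
        d1 >= d2 differ by more than the bound, so the observed gap is
        unlikely to be a sampling artifact."""
        return (d1 - d2) > hoeffding_bound(R, delta, n)

The bound shrinks as more examples are observed, so decisions taken with little data are conservative and become sharper over time.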

2.5 Issues in Learning from Data Streams

Knowledge discovery systems are usually constrained by three limited resources: time, memory and sample size. In traditional applications, sample size limitation was proved to be dominant, since it often leads to overfitting (Gratch, 1996; Webb, 1995). Nowadays, time and memory seem to be the bottleneck for machine learning applications, mainly the latter. For this reason, traditional memory-based models with multiple passes over the data are virtually unusable for achieving high performance, failing to adapt to the high-speed production of examples (Domingos & Hulten, 2000).

The main features inherent to stream learning algorithms are that the system should be capable of processing data incrementally, evolving over time, while monitoring the evolution of its own learning process and self-diagnosing this process. Learning algorithms differ in the extent of self-awareness they offer in this diagnosis. For example, while supervised learning algorithms may have some snapshot of the truth, unsupervised learning algorithms sail on sight, with no feedback from the real world other than the actual data.

Hence, different levels of awareness exist. First, the fitted model (either supervised or unsupervised) is inherently unstable. Even when the model converges, there is uncertainty in the response of the model to data, so mechanisms that assess this instability of the model would give auxiliary information for self-diagnosis of the learning process (Bosnić & Kononenko, 2009). Secondly, the actual fitness of the model to the data is an obvious assessment of the quality of the model (Mitchell, 1997; Hastie et al., 2001). If real-world feedback is possible (supervised learning), then the learning system can inspect the actual fitness of the model in terms of some loss function. With respect to unsupervised learning, only internal validity can be used, although the target diagnosis is the same. Finally, and a research hot topic, when the actual fitness of the model to the data becomes unstable over time, the learning system can hypothesize that the concept generating the data is changing, detecting concept drift (Gama et al., 2004).

The next sections expose each of these moments where this self-diagnosis could take place.

2.5.1 Incremental and Decremental Learning

We refer to incremental learning when the algorithm has the ability to update the decision model whenever new information is available (Berthold & Hand, 1999). This is an indispensable characteristic that stream learning algorithms must exhibit (Gama & Rodrigues, 2009). However, this ability to incorporate new information must be complemented with the ability to forget past information (Kifer et al., 2004). We refer to this as decremental learning (Cauwenberghs & Poggio, 2000). Only by coupling incremental and decremental learning can models be considered up-to-date with respect to recent known data. The incremental and decremental features require permanent maintenance and updating of the decision model as new data becomes available. Of course, there is a trade-off between the cost of an update and the gain in performance we may obtain, and this trade-off might depend on learning algorithms exhibiting different profiles. Consider the task of learning within small sliding windows. Algorithms with strong variance management are quite efficient for small

training sets. Very simple models, such as naive Bayes (Domingos & Pazzani, 1997), using few free parameters, can be quite efficient in variance management and effective in incremental and decremental operations, being a natural choice in the sliding-window framework. The main problem with simple representation languages is the bound in generalization performance they can achieve, since they are limited by high bias. When dealing with methods addressing the complete data stream, large volumes of data require efficient bias management. Complex tasks requiring more complex models increase the search space and the cost of structural updating. These models require efficient control strategies for the trade-off between the gain in performance and the cost of updating with new examples (Gama & Rodrigues, 2009). As for their batch relatives, this trade-off is also dependent on the type of learning task. The next sections describe issues related to both supervised and unsupervised learning from data streams.

Unsupervised learning tasks are those where there is a set of observations of a random p-vector X with a joint probability P(X), and the goal is to directly infer the properties of this probability density without a supervisor giving a degree of error for each observation (Hastie et al., 2001). Simply put, all variables are seen as factors of the joint probability, although they might not be independent of each other (Domingos & Pazzani, 1997).

Examples of unsupervised learning systems that operate over data streams are frequent items monitoring (Cormode & Muthukrishnan, 2003; Metwally et al., 2005), association rule mining (Webb, 2006), centralized data clustering (Zhang et al., 1997; Aggarwal et al., 2003), distributed data clustering (Kargupta et al., 2001; Gaber & Yu, 2006; Datta et al., 2006; Cormode et al., 2007), centralized clustering of streaming data sources (Beringer & Hüllermeier, 2006; Dai et al., 2006; Rodrigues et al., 2008d), and distributed clustering of streaming data sources (Klusch et al., 2003; Bandyopadhyay et al., 2006; Yin & Gaber, 2008).

Clustering is possibly the most popular unsupervised learning task in data mining (Han & Kamber, 2001). Given the application to data streams, the output of clustering systems is usually given by k centers, rather than an assignment of clusters to the data. Hence, evolving clustering systems must possess a compact representation of clusters and be capable of incorporating (or deleting) data in a previously defined model, being therefore capable of incremental processing (Barbará, 2002; Bradley et al., 1998).

The previously mentioned Online Divisive-Agglomerative Clustering (ODAC) is a clear example of the incremental and decremental issues in clustering from data streams (Rodrigues et al., 2008d). It gathers sufficient statistics to incrementally compute the correlations among data sources within each cluster, possessing an anytime compact representation of clusters, since a binary hierarchy of clusters is available at any time. Hence, with new data being available, clusters are incrementally updated, possibly being refined (by splitting existing clusters). The system includes structural change detection (Rodrigues & Gama, 2007), these changes being detected by monitoring the diameters of existing clusters. Regarding the decremental features of online learning, we should point out that, after each split or aggregation, new clusters start new sufficient statistics, hence forgetting old information. The incremental and decremental features of ODAC enable it to keep an up-to-date hierarchical definition of clusters, presenting adaptability to new data, while detecting and reacting to structural drift. Furthermore, this strategy allows it to cope with high-speed production of examples and reduced memory requirements, with constant update time (Rodrigues et al., 2008d).

2.5.2 Managing Model Instability

Several predictive systems are nowadays vital for operations and decision support. The quality of these systems is most of the time defined by their average accuracy, which carries little or no information about the estimated error of each individual prediction. The generalization ability of learning methods, based on global average accuracy, usually yields the drawback of uncertain and, therefore, worse individual predictions. The expected error of a prediction is a very relevant point in many sensitive applications, such as medical diagnosis or industrial applications for forecast and control, where learning algorithms must present predictions to a user. Users seem more comfortable with both a prediction and an error estimate for that prediction. The flexibility of the representational power of online learning systems implies error variance. In stationary data streams the variance shrinks as the number of examples goes to infinity. In a dynamic environment where the target function changes smoothly, and where even abrupt changes can occur, the variance of predictions is problematic. Given this, one way to assess the reliability of a prediction is to test the instability of the model, by means of its sensitivity (i.e., the effect on the prediction) when small changes are applied in the predictive setting (a small sensitivity-check sketch follows the list below). Four uncertainty-enforcing properties of data streams are relevant to discuss here, exposing the features that stream learning algorithms must possess in order to be able to cope with them (Rodrigues et al., 2008a):

Input Data Uncertainty

There is uncertainty in the information provided by the previously known input data (e.g., if a sensor reads 100, most of the time the real value is around 100: it could be 99 or 101); a reliable predictor should be robust to slight changes in the input vector.

Output Data Uncertainty

The same uncertainty in the data can also exist in the objective output value; a reliable predictor should be robust to slight changes in the learned output value.

Model Parameters Uncertainty

As online systems learn over time, in dynamic settings, and without strict convergence measures, predictions may be produced by unstable models; a reliable predictor should be robust to slight changes in the model's parameters.

Data Subspace Uncertainty

The amount of data in different subspaces (a subspace refers to the subset of learning examples which are related by locality) might induce uncertainty in the learning model for each dense region; a reliable predictor should be robust to slight changes within the same subspace.
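As announced above, a minimal sketch of such a sensitivity check follows (assuming a fitted scikit-learn-style regressor with a predict method; the perturbation scale and repetition count are hypothetical parameters):

    import numpy as np

    def prediction_reliability(model, x, scale=0.01, n_perturb=30, rng=None):
        """Estimate the instability of a single prediction by perturbing the
        input vector and measuring the spread of the resulting predictions."""
        rng = rng or np.random.default_rng(0)
        x = np.asarray(x, dtype=float)
        noise = rng.normal(0.0, scale * (np.abs(x) + 1e-9), (n_perturb, x.size))
        preds = np.array([model.predict(xi.reshape(1, -1))[0]
                          for xi in x + noise])
        return preds.mean(), preds.std()   # aggregated prediction and its spread

A large spread signals an unstable prediction, which can be reported to the user alongside the prediction itself.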

Research has followed the path of a group of approaches that generate perturbations of the initial learning set to improve the accuracy of the final aggregated prediction. Bagging (Breiman, 1996) and boosting (Drucker, 1997) are well known and possibly the most popular in this field. They have been shown to improve generalization performance compared to individual models. While bagging works by learning different models in different regions of the input space (by sampling the original data set), boosting focuses on those regions which are not so well covered by the learned model. These techniques perturb the entire learning data set that is fed to individual models, thus operating by creating different learning models. When changing the model is not applicable, performing several simple perturbations of an individual example, obtaining their predictions using the learned model, and aggregating these predictions has been shown to improve results by reducing the variance of predictions (Geurts, 2001). Online versions of bagging and boosting have been proposed (Oza & Russell, 2001) and fulfill some requirements to address streaming data.

Since prediction quality can be evaluated through its different quality aspects (accuracy, error, availability, computational complexity, etc.), we use the term reliability to say that the system evaluates a prediction quality aspect which is related to its accuracy. The term prediction reliability can therefore stand for the estimation of prediction accuracy or prediction error (Bosnić & Kononenko, 2009).

This conforms with the definition of reliability as the ability to perform certain tasks conforming to required quality standards (Crowder et al., 1991). Usually, this standard is the prediction accuracy. The prediction-reliability pair is seen as a more robust output with deeper information about the prediction. Instead of estimating the aggregated accuracy of the whole predictive model, the individual reliability estimates enable the user to make a distinction between better and worse predictions (Bosnić & Kononenko, 2007). A major advantage of this approach is that the calculation of individual predictions' reliability estimates does not require true label values, as opposed to averaged accuracy estimates for the whole model, which require an independent set of test examples for their computation. In streaming scenarios, the estimation of prediction reliability takes on increased importance, as faster decisions are many times required (Rodrigues et al., 2008a).

2.5.3 Concept Drift and Novelty Detection

Whenever data flows over time, it is highly improbable that the examples are generated at random according to a stationary probability distribution (Basseville & Nikiforov, 1987). At least in complex systems and for large time periods, we should expect changes (smooth or abrupt) in the distribution of the examples. A natural approach for these incremental tasks is adaptive learning algorithms: incremental learning algorithms that take concept drift into account. Concept drift (Klinkenberg, 2004) means that the concept about which data is being collected may shift from time to time, each time after some minimum permanence. Changes may occur in the context of learning, due to changes in hidden variables, or in the characteristic properties of the observed variables (Gama & Rodrigues, 2009).

Most learning algorithms use blind methods that adapt the decision model at regular intervals without considering whether changes have really occurred. Much more interesting are explicit change detection mechanisms. The advantage is that they can provide a meaningful description (indicating change points or small time windows where the change occurs) and quantify the degree of change. Basically, they may follow two different approaches: either monitoring the evolution of performance indicators, adapting techniques used in statistical process control (Gama et al., 2004), or monitoring the evolution of a distance function between two distributions: data in a reference window and in a current window of the most recent data points (Kifer et al., 2004; Sebastião & Gama, 2007). The main research issue is to develop methods with fast and accurate detection rates and few false alarms (Ikonomovska et al., 2009). Also, the level of granularity of decision models is a relevant property (Gaber et al., 2004a), because it can allow partial, fast and efficient updating of the decision model instead of rebuilding a complete new model whenever a change is detected. More than detecting change, detecting novelty is a major issue in stream learning, as concepts evolve and we must distinguish between changing concepts and new concepts (Spinosa et al., 2007). Finally, the ability to recognize seasonal and re-occurring patterns is an open issue (Gama & Kosina, 2009).

The notion of concept drift applied to clustering analysis is not directly derived from concept drift on the variables' domain, as the clustering structure may not be affected by a variable's dynamics (Rodrigues & Gama, 2007). Detecting concept drift as usually conceived for one time/order-varying variable is not the same as detecting concept drift on the clustering structure of several time/order-varying variables. These are usually points in the stream of data where the clustering structure gathered with previous data is no longer valid, since it no longer represents the new relations of dissimilarity between the streams. In unsupervised learning tasks, such as clustering, usual change detection techniques monitor the distribution of data in different subsets of data (or sliding windows). However, change can also be detected by monitoring the evolution of performance indicators. Data is collected over time, and so the correlation structure among points or variables evolves. The clustering structure must adapt to this type of change. A recent example is the Online Divisive-Agglomerative Clustering system (Rodrigues et al., 2008d).

2.5.4 Evaluation of Stream Learning Algorithms

In the recent past, machine learning has mostly centered on one-shot data analysis from homogeneous and stationary data, and on centralized algorithms. Moreover, most machine learning and data mining approaches assume that examples are independent, identically distributed and generated from a stationary distribution. Furthermore, a large number of learning algorithms assume that computational resources, such as memory or time, are unlimited. In that context, standard data mining techniques use finite training sets and generate static models (Gama et al., 2009b).

Nowadays we are faced with tremendous amounts of distributed data generated by an ever increasing number of smart devices. In most cases, this data is transient, and may not even be stored permanently. Most recent supervised learning algorithms (Domingos & Hulten, 2000; Hulten et al., 2001; Babcock et al., 2002; Gama et al., 2003; Gama et al., 2005; Rodrigues & Gama, 2009) maintain a decision model that continuously evolves over time, taking into account that the environment is non-stationary and computational resources are limited. Likewise, recent unsupervised learning algorithms (Zhang et al., 1997; Barbará, 2002; Aggarwal et al., 2003; Cormode et al., 2007; Rodrigues et al., 2008d) have traversed the same pathway into the stream learning paradigm. One important issue, not yet conveniently addressed, is the design of experimental work to evaluate and compare decision models that evolve over time. There are no golden standards for assessing performance in non-stationary environments. The main differences in evaluating stream learning algorithms as opposed to batch learning algorithms are sketched in Table 2.2.

Learning systems generate compact representations of what is being observed. They should be able to improve with experience and continuously self-modify their internal state.

As their representation of the world is approximate, evaluation is used in two contexts: inside the learning system to assess hypotheses, and as a wrapper over the learning system to estimate the power of a particular algorithm on a given problem.

Table 2.2: Differences between batch and streaming learning that affect the way evaluation is performed. While batch learners build static models from finite, static, i.i.d. data sets, stream learners need to build models that evolve over time, being therefore dependent on the order of examples, which are generated from a continuous non-stationary flow of non-i.i.d. data.

                              Batch              Stream
    Data Size                 Finite data set    Continuous flow
    Data Distribution         i.i.d.             Non-i.i.d.
    Data Evolution            Static             Non-stationary
    Model Building            Batch              Incremental
    Model Stability           Static             Evolving
    Order of Observations     Independent        Dependent

For predictive learning tasks (classification and regression), the rationale is that output variable values y are functions of the input x, i.e., $y = f(x)$. The learning goal is then to induce a function $\hat{f}$, where $\hat{y} = \hat{f}(x)$, so that $\hat{f} \sim f$. In this setting, the most relevant dimension is the generalization error. It is an estimator of the difference between $\hat{f}$ and the unknown $f$, and an estimate of the loss $L_i = g(\hat{f}(x_i), f(x_i))$ that can be expected when applying the model to future examples.

Usual machine learning theory states that the classification error should be computed on a holdout data set, in order to avoid overfitting and improve generalization of the learned model (Mitchell, 1997; Hastie et al., 2001), given that it is an unbiased estimate of the error. However, in data streams the definition of holdout sets is dubious. Furthermore, the computation of the average error in evolving data streams does not fit the streaming dynamics (Gama & Rodrigues, 2007). The fact that decision models evolve over time has strong implications for the evaluation techniques assessing the effectiveness of the learning process. Another relevant aspect is the resilience to overfitting, as each example is processed only once. The literature presents two viable alternatives to evaluate a stream learning model:

Holdout

Apply the current decision model to an independent test set kept aside, at regular time intervals (or sets of examples). The loss estimated in the holdout is an unbiased estimator; this is the most common approach to evaluate stream learners (Domingos & Hulten, 2000; Hulten et al., 2001; Gama et al., 2003; Gama et al., 2005).

Prequential

Predictive Sequential (Dawid, 1984), where the error of a model is computed from the sequence of examples; a recent approach that can improve the evaluation of stream learning algorithms by evolving with the process itself (Castillo & Gama, 2005; Gama et al., 2009b).



Figure 2.1: Illustrative example of the comparison of error evolution as estimated by holdout and prequential strategies, in a stationary stream (Waveform data set). The learning algorithm is VFDT (Domingos & Hulten, 2000).

The prequential framework evaluates learning from the sequence of examples (Dawid, 1984). For each example $x_i$ in the stream, the actual model makes a prediction $\hat{y}_i$ based only on the example attribute-values. The prequential error is computed as an accumulated sum of a loss function between the predicted and observed values: $S(i) = \sum_{j=1}^{i} L(y_j, \hat{y}_j)$. A main discriminating characteristic is that, in the prequential framework, we do not need to know the true value $y_i$ for all points in the stream. The framework can be used in situations of limited feedback, by computing the loss $L(i)$ and $S(i)$ only for the points where $y_i$ is known. The mean loss after seeing $n$ examples is $\bar{S}_n(i) = S(i)/n$.

Prequential evaluation provides a learning curve that monitors the evolution of learning as a process. Using holdout evaluation, we can obtain a similar curve by applying, at regular time intervals, the current model to the holdout set. However, both estimates can be affected by the order of the examples. Moreover, it is known that the prequential estimator is pessimistic: under the same conditions it will report somewhat higher errors (see Figure 2.1), which is a good characteristic, as it will always give an upper bound on the estimate of the generalization error. The prequential error estimated over the whole stream might be strongly influenced by the first part of the error sequence, when few examples have been used to train the classifier. Hence, a recent work has proposed the following strategy: to compute the prequential error using a forgetting mechanism, like windows of the most recent observed errors (Section 2.3.3.2) or fading factors that weight previous errors using a decay factor (Section 6.5.2). In this work, the authors observe that (Gama et al., 2009b):

• computing the prequential error $S_w(i)$ of one model on a sliding window of size $w$ converges to the holdout estimate;

• computing the prequential error $S_\alpha(i)$ of one model using fading factors with factor $\alpha$ converges to the holdout estimate;

• from the previous two, the generalization error could be estimated using the prequential error either on recent data, $S_w(i)$, or using fading factors, $S_\alpha(i)$;

• the comparison between two classifiers $A$ and $B$ should be assessed by analysing the curve $Q_w^{(A,B)}(i) = \log\left(S_w^A(i)/S_w^B(i)\right)$ on recent data, or applying fading factors, $Q_\alpha^{(A,B)}(i) = \log\left(S_\alpha^A(i)/S_\alpha^B(i)\right)$;

• the comparison between two general models $A$ and $B$ under the 0-1 loss function could be assessed using the McNemar statistical test on the most recent data (or using fading factors), since the relevant information exists only in the summative quantities $n_{0,1}$ and $n_{1,0}$, the number of examples misclassified by A and not by B, and the number of examples misclassified by B and not by A;

• concept drift could be detected using the Page-Hinkley test with fading factors as smoothing factor.

The main positive feature of fading factors is that they are memoryless, hence improving their applicability in the streaming setting, especially when we need to keep track of the most recent data. Using illustrative examples and experimental work with sensitivity analysis to parameters, the authors have concluded that:



• the generalization error should be estimated using the prequential error with fading factors, $S_\alpha(i)$, to cut the effect of long-term data in the prequential error;

• the performance of two classifiers should be assessed using the Q statistic with fading factors, as this gives a clear sign of the comparative assessment between the two over time;

• the performance of two classifiers under the 0-1 loss could be assessed using the McNemar statistic with fading factors, since this reduces the effect of long-term data on statistical tests;

• concept drift detection using the Page-Hinkley test and fading factors achieves faster detection rates, while being resilient to false alarms when there are no drifts.
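A minimal sketch of these error estimators follows (assuming the usual recurrences $S_\alpha(i) = L_i + \alpha S_\alpha(i-1)$ and $N_\alpha(i) = 1 + \alpha N_\alpha(i-1)$ for the faded sum and count; a generic illustration, not the authors' implementation):

    from collections import deque

    class PrequentialError:
        """Prequential error kept with a sliding window and a fading factor."""

        def __init__(self, w=1000, alpha=0.995):
            self.window = deque(maxlen=w)   # losses of the w most recent examples
            self.alpha = alpha
            self.s = 0.0                    # faded sum of losses
            self.n = 0.0                    # faded count of examples

        def update(self, loss):
            self.window.append(loss)
            self.s = loss + self.alpha * self.s
            self.n = 1.0 + self.alpha * self.n

        def windowed(self):                 # estimate on recent data, S_w(i)/w
            return sum(self.window) / len(self.window)

        def faded(self):                    # S_alpha(i) / N_alpha(i)
            return self.s / self.n

The memoryless character of the fading factor is visible above: only two scalars are kept, whereas the sliding window stores w losses; comparing two models then amounts to taking the logarithm of the ratio of their faded estimates, as in the Q statistic.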

More than the technical contribution, the work is a contribution to a discussion on good practices in performance assessment, and on differences in performance when learning dynamic models that evolve over time (Gama et al., 2009b).



2.6 Issues in Learning from Ubiquitous Data Streams

Today, data is distributed in nature. Detailed data for almost any task is collected over a broad area, and streams in at a much greater rate than ever before. In particular, advances in miniaturization, the advent of widely available and cheap computing power, and the explosion of networks of all kinds have brought inanimate things to life. Simple objects that surround us are gaining sensors, computational power, and actuators, and are changing from static, inanimate objects into adaptive, reactive systems. Sensor networks and social networks are present everywhere (Gama & Rodrigues, 2007). A large number of applications are distributed. For example, sensors in electrical networks are geographically distributed. Data collected from these sensors have not only a time dimension but also a space dimension. Furthermore, sensor nodes are usually resource-limited. While the time dimension is critical for synchronization aspects, the space dimension makes aspects like bandwidth and battery power critical. In this section, we shall expose the characteristics that ubiquitous data streams inherently possess that make them harder to learn from: distributed data sources and integration, data visualization, data quality, and distributed processing and learning.

2.6.1 Ubiquitous Streaming Data Sources

Data is ubiquitous. For example, data is generated by web clicks or network package routing on each machine; GPS devices produce and process data locally; peer-to-peer applications even disregard centralized server processing; cellphone applications produce data from phone calls, text messaging and wireless connections; deep-sky data is now being generated by telescope ensembles. The amount of data being produced in information and industrial systems is so high that no single database can hold all the information. Rather, this data is produced, possibly stored and definitely should be processed in distributed locations. Paradigmatic examples of distributed streaming data sources are sensor networks and health information systems. Also, mobile computing devices like PDAs, cell phones, wearables, and smart cards are playing an increasingly important role in our daily life. The emergence of powerful mobile devices with reasonable computing and storage capacity is ushering in an era of advanced data- and computationally-intensive mobile applications (Kargupta et al., 2002). Our world is evolving into a setting where all devices, as small as they may be, will be able to include sensing and processing ability (Rodrigues et al., 2010b). However, different types of devices present different levels of resources, and care should be taken in data mining methods that aim to extract knowledge from such restricted scenarios (Gaber et al., 2004b).


2.6.1.1 Sensor Networks

Sensors are usually small, low-cost devices capable of sensing some attribute of a physical phenomenon. In terms of hardware development, the state of the art is well represented by a class of multi-purpose sensor nodes called motes (Culler & Mulder, 2004). In most of the current applications, sensor nodes are controlled by module-based operating systems such as TinyOS (TinyOS, 2000) and are programmed using arguably somewhat ad hoc languages such as nesC (Gay et al., 2003). Recent middleware developments such as Deluge (Hui & Culler, 2004) and Agilla (Fok et al., 2005), and programming languages and environments such as Regiment (Newton & Welsh, 2004) and EnviroSuite (Luo et al., 2006), provide higher-level programming abstractions, including massive code deployment where needed.

Sensor networks are composed of a variable number of sensors (depending on the application), which have several features that put them in an entirely new class when compared to other wireless networks, namely: (a) the number of nodes is potentially very large and thus scalability is a problem; (b) the individual sensors are prone to failure given the often challenging conditions they experience in the field; (c) the network topology changes dynamically; (d) broadcast protocols are used to route messages in the network; (e) they operate with limited power, computational, and memory capacity; and (f) they do not have global identifiers (Akyildiz et al., 2002).

Sensor network applications are, for the most part, data-centric, in that they focus on gathering data about some attribute of a physical phenomenon. The data is usually returned in the form of streams of simple data types without any local processing. In some cases more complex data patterns or processing are possible. Data aggregation is used to solve routing problems (e.g., implosion, overlap) in data-centric networks (Akyildiz et al., 2002). In this approach, the data gathered from a neighborhood of sensor nodes is combined in a receiving node along the path to the sink. Data aggregation uses the limited processing power and memory of the sensing devices to process data online.

Sensors send information at different time scales and in different formats: some sensors send information every minute, others send information each hour, etc.; some send the absolute value of the variable periodically, while others only send information when there is a change in the value of the variable. All this happens in adverse conditions, where they are prone to noise, weather conditions, battery conditions, etc. The available information is therefore noisy. To cut the impact of noise, missing values, and different granularity, data is usually aggregated and synchronized in time windows. Nevertheless, sensors send data asynchronously, and the server needs to aggregate the received data in natural time windows, which should be computed in an incremental way, requiring a single scan over the incoming data.
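A minimal sketch of such single-scan window aggregation follows (a generic illustration; the window width and the simple averaging policy are assumptions, not a particular system's design):

    def aggregate_windows(readings, width=60.0):
        """Single scan over (timestamp, sensor_id, value) tuples, assumed
        roughly time-ordered, emitting per-sensor averages for each closed
        time window."""
        start, sums, counts = None, {}, {}
        for t, sensor, value in readings:
            if start is None:
                start = t
            while t >= start + width:       # close windows the stream has passed
                yield start, {s: sums[s] / counts[s] for s in sums}
                start, sums, counts = start + width, {}, {}
            sums[sensor] = sums.get(sensor, 0.0) + value
            counts[sensor] = counts.get(sensor, 0) + 1
        if sums:                            # flush the last, possibly partial window
            yield start, {s: sums[s] / counts[s] for s in sums}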


Sensor networks can operate with dynamic network topologies and evolving concepts producing the data. In real-world applications, data flows at huge rates, with information usually being forwarded throughout the network into a common sink node, being afterwards available for analysis. Sensors' data may be subject to delays due to the network infrastructure or sensor failure, so a data tuple might arrive at the server with several minutes of delay, or not at all. The server should wait for delayed data, but it cannot wait forever. On the other hand, a sensor's clock might be slightly ahead of the server's clock, so data points could arrive before the announced time.

The sensor networks created by this setting can easily include thousands of sensors, each one being the object of predictive analysis. Sensors are most of the time limited in resources such as memory and computational power, and communication between them is easily narrowed due to distance and hardware limitations. Moreover, given the limited resources and fast production of data, information has to be processed in quasi-real-time, creating a scenario of multi-dimensional streaming analysis.

2.6.1.2 Health Information Systems

In healthcare scenarios, data is often produced as high-speed data streams (e.g., intensive care units), thus requiring stream processing techniques. In other cases, speed is not the issue, but the amount of data being recorded is unbearable. In these settings, traditional databases do not deliver. Data stream management systems can present a viable solution for health information systems, especially regarding the latency of the service and biomedical signal processing. The main problem is that clinical care increasingly requires healthcare professionals to access patient record information that may be distributed across multiple heterogeneous sites (Kalra, 2006). In a healthcare organization, processes are usually supported by several tools, and there is a need to integrate these tools to achieve an integrated and seamless process (Land & Crnkovic, 2003). However, information technologies tend to combine different modules or subsystems, resulting in the coexistence of several information systems aiming at a best-of-breed approach. The need to integrate Electronic Health Records (EHR) for healthcare delivery, management, or research is widely acknowledged. Nevertheless, people will not willingly give up the standalone information systems they use today because they fear data loss, loss of specific system functions customized to their needs, and loss of control of their data (feeling that it represents their gold mine for research purposes), and they also take some pride in their own software implementation. This fact has an important impact on the heterogeneity found when trying to integrate data from different health information systems (Cruz-Correia et al., 2009).


Consistently combining data from heterogeneous sources takes substantial amounts of effort because the individual feeder systems usually differ in several aspects, such as functionality, presentation, terminology, data representation, and semantics. It is still a challenge to make EHRs interoperable because good solutions to the preservation of clinical meaning across heterogeneous systems remain to be explored (Kalra, 2006). The fact that there are many solutions to health systems integration using different standards and data architectures may prove to be the greatest obstacle to semantic interoperability (Ferranti et al., 2006). The main reason for the lack of integration seems to be the poor incentive among healthcare institutions for overall integration (Herzlinger, 2004). Patients themselves, who have a strong incentive for integrated EHRs, are becoming the main drivers for the proliferation of information systems integration and Personal Health Records (MacStravic, 2004). As patients become more aware and subsequently empowered in a patient-driven healthcare system, they will demand integrated EHRs as tools to help them manage their own health (Kukafka & Morrison, 2006). Furthermore, the increasing usage of mobile devices, such as smart phones, both by patients and health professionals, creates new requirements for integrating, visualizing, managing and mining health information, also giving birth to new research problems in ubiquitous data stream mining for healthcare.

2.6.2 Ubiquitous Streaming Data Visualization

For data mining to be effective, it is important to include the human in the data exploration process and combine the flexibility, creativity, and general knowledge of the human with the enormous storage capacity and the computational power of today's computers (Keim, 2002). The main purpose of data visualization is to gain insight about the information space, mapping data onto graphical features and providing a qualitative overview of large data sets, while searching for patterns, trends, or other artifacts in the data. Data in a database or data warehouse can be viewed in various visual forms at different levels of abstraction, with different combinations of attributes or dimensions (Han & Kamber, 2001). Visualization techniques enable the user to overcome major problems of automatic machine learning methods, such as the presentation and interpretation of results, lack of acceptance of the discovered findings, or limited confidence in them (Aguilar-Ruiz & Ferrer-Troyano, 2005). Moreover, they represent a major improvement in learning systems, as they can combine human expert knowledge, human perception and human reasoning with computational power, computational data mining and machine learning. Streaming time series are sometimes hard to interpret, especially if the visualization is produced on the low-resolution screens usually embedded in small mobile devices (Rodrigues & Gama, 2010).



Figure 2.2: Real-world examples of how human behavior directly influences sensor data time patterns: highway traffic volume (Claggett & Miller, 2006), home water demand (Walski et al., 2003), indoor residential sound level (EPA, 1974) and electricity demand (Rodrigues & Gama, 2009). Not only is the evolution cyclic, it is also highly correlated among different daily-life dimensions.

When applying data mining methods to extract knowledge, or to learn a given concept from the data stream, visualizing the data mining results is also an important feature for human experts. Human expert knowledge is most of the time difficult to include in learning processes. Moreover, if data is not easily visualized, the experts may have even more difficulty in analyzing the complete process. Furthermore, current applications are being requested to answer queries on low-resolution mobile devices (Kargupta et al., 2002; Kargupta et al., 2004).

Another key issue is the fact that these streams tend to represent common physical phenomena, human interaction with the environment, or even human behavior, which are most of the time cyclic to some extent. Figure 2.2 presents some examples of domains where this happens. In city surroundings, highway traffic volume presents usual time patterns (Claggett & Miller, 2006). Given its impact on people's location, this is also related to home demand for resources, such as water (Walski et al., 2003) and electricity (Rodrigues & Gama, 2009), and to the time pattern of hourly indoor residential sound levels (EPA, 1974). In these domains, not only is the evolution of sensed data cyclic, it is also highly correlated among different daily-life dimensions. Often, when data presents cyclic behaviors, predictive models learned for these series may also tend to give higher errors at certain recurrent points in time. However, the human eye is not trained to notice these cycles in a long stream, yielding the need for new visualization techniques. Recently, Rodrigues & Gama (2010) have proposed a simple dense-pixel strategy suitable for stream visualization on mobile devices.

2.6.3 Ubiquitous Streaming Data Quality

Another problem with ubiquitous data streams is data quality. Data quality has a huge impact on knowledge discovery (Cruz-Correia et al., 2009), so it is highly important to use only quality data in learning systems. Data points may arrive with error tags, stating that the sensor has information that the produced value is erroneous in some sense. Data quality has a broader impact in online learning systems, where data needs to be used as soon as it is produced. In some cases more complex data patterns or processing are possible, but it is important to note that the sequences of data points are not independent, and are not generated by stationary distributions. However, common applications usually inspect the behavior of single sensors, looking for threshold-breaking values or failures. To increase the ability to understand the inner dynamics of the entire network, deeper knowledge should be extracted from it. For example, in the RETINAE system, data is aggregated in landmark windows. However, data points may arrive with error tags. The final aggregated value is a (possibly weighted) average of non-tagged points. If no quality data points exist in the current landmark window, the system imputes the aggregated point, based on some strategy, but tags it as a missing value. Missing values are imputed using the so-called homologue value, that is, the value of the sensor one hour or one week before. But imputed data points are only used as input data in order to produce valuable predictions. They are not used as real output data in the process of learning (Rodrigues & Gama, 2009).
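A minimal sketch of this tag-aware aggregation with homologue imputation follows (a schematic illustration of the strategy described above; the window indexing and the period are hypothetical choices, not RETINAE's actual code):

    def aggregate_with_imputation(points, history, period=24):
        """points: list of (value, ok_tag) readings for one landmark window.
        history: dict mapping window index -> past aggregated value.
        Returns (value, imputed_flag) for window index len(history)."""
        good = [v for v, ok in points if ok]        # drop error-tagged readings
        if good:
            return sum(good) / len(good), False
        # No quality data: impute with the homologue value (e.g., the window
        # one day earlier) and flag it, so it is never used as a learning target.
        idx = len(history)
        return history.get(idx - period), True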

2.6.4 Ubiquitous Data Clustering

Since current applications generate many pervasive distributed computing environments, data mining systems must nowadays be designed to work not as a monolithic centralized application but as a distributed collaborative process. The centralization of information yields problems not only with resources such as communication and memory, but also with the privacy of sensitive information. Instead of centralizing relevant data in a single server and afterwards performing the data mining operations, the entire process should be distributed and, therefore, parallelized throughout the entire network of processing units. A recent example of such techniques was proposed by Subramaniam et al., where an online system uses density distribution estimation to detect outliers in distributed sensor data (Subramaniam et al., 2006). If data is being produced in multiple locations, on a wide sensor network, different frameworks could be applied, for example, for clustering on these streams. A first approach could consist of a centralized process that would gather data from the sensors, even if just a small sample, analyzing it afterwards as a single multivariate stream. As previously stated, this model tends to become inapplicable as sensor networks grow unbounded. Another approach could process at two levels, clustering the data produced by each sensor separately, and then compiling the results in a centralized process which would define the final clusters based on the clusters transmitted by each sensor. A strategy of cluster ensembles (Strehl & Ghosh, 2002) would operate in this way. This approach would, in fact, decrease the amount of data transmitted over the network.


The ubiquitous setting created by sensor networks implies different requirements for clustering methods. Given the processing abilities of each sensor, clustering results should preferably be localized on the sensors where this information becomes an asset. Thus, information query and transmission should only be considered on a restricted sensor space, either using flooding-based approaches, where communication is only considered between sensors within a spherical neighborhood of the querier/transmitter, or trajectory-based approaches, where data is transmitted step by step on a path of neighbor sensors. A mixture of these approaches is also possible for query re-transmission (Sadagopan et al., 2005).

Distributed data mining appears to have the necessary features to apply clustering to streaming data produced on sensor networks (Park & Kargupta, 2002). Although few works were directly targeted at data clustering on sensor networks, some distributed techniques are obvious starters. Recent research developments in clustering are directed towards distributed algorithms for continuous clustering of examples over distributed data streams. In (Datta et al., 2006) the authors present a distributed majority vote algorithm which can be seen as a primitive to monitor a k-means clustering over peer-to-peer networks. The k-means monitoring algorithm has two major parts: monitoring the data distribution in order to trigger a new run of the k-means algorithm, and actually computing the centroids using the k-means algorithm. The monitoring part is carried out by an exact local algorithm, while the centroid computation is carried out by a centralization approach. The local algorithm raises an alert if the centroids need to be updated. At this point data is centralized, a new run of k-means is executed, and the new centroids are shipped back to all peers.

Cormode et al. (Cormode et al., 2007) proposed different strategies to achieve the same goal, with local and global computations, in order to balance the communication costs. They considered techniques based on the furthest-point algorithm (Gonzalez, 1985), which gives an approximation of the radius and diameter of clusters with a guaranteed cost of two times the cost of the optimal clustering. They also present a parallel guessing strategy, which gives a slightly worse approximation but requires only a single pass over the data. They conclude that, in actual distributed settings, it is frequently preferable to track each site locally and combine the results at the coordinator site. These methods of combining local and central processing are paradigmatic examples of the path that distributed data mining algorithms should traverse.
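The furthest-point heuristic referenced above is easy to sketch (a generic rendering of Gonzalez's farthest-first traversal, not the distributed variant of Cormode et al.):

    import math

    def farthest_first_centers(points, k):
        """Gonzalez's heuristic: start from an arbitrary point, then repeatedly
        pick the point farthest from the centers chosen so far. The resulting
        k-center clustering costs at most twice the optimum."""
        dist = lambda a, b: math.dist(a, b)
        centers = [points[0]]
        while len(centers) < k:
            centers.append(max(points,
                               key=lambda p: min(dist(p, c) for c in centers)))
        return centers

    # Example usage on a few 2-D points.
    print(farthest_first_centers([(0, 0), (1, 0), (10, 0), (10, 1), (5, 5)], k=3))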

Kargupta et al. presented a collective principal component analysis (PCA) and its application to distributed cluster analysis (Kargupta et al., 2001). In this algorithm, each node performs PCA, projecting the local data along the principal components, and applies a known clustering algorithm to this projection. Then, each node sends a small set of representative data points to the central site, which performs PCA on this data, computing global principal components. Each site projects its data along the global principal components, which are sent back by the central node to the rest of the network, and applies its clustering algorithm. A description of the local clusters is re-sent to the central site, which combines the cluster descriptions using, for example, nearest-neighbor methods. However, these techniques still consider a centralized process to define the clusters, which could become overloaded if nodes were required to react to the definition of clusters, forcing the server to communicate with all local nodes. Klusch et al. proposed a kernel-density-based clustering method over homogeneous distributed data (Klusch et al., 2003), which, in fact, does not find a single clustering definition for the whole data set. It defines a local clustering for each node, based on a global kernel density function, approximated at each node using sampling results from signal processing theory. These techniques present a good feature, as they perform only two rounds of data transmission through the network. Other approaches using the K-Means algorithm have been developed for peer-to-peer environments and sensor network settings (Bandyopadhyay et al., 2006).

Recently, the development of global frameworks that are capable of mining data on distributed sources is rising.

Taking into account the lack of resources usually encountered on sensor networks, Resource-Aware Clustering (Gaber & Yu, 2006) was proposed as a stream clustering algorithm for example clustering that can adapt to the changing availability of different resources. The system is integrated in a generic framework that enables resource-awareness in streaming computation, monitoring main resources like memory, battery and CPU usage, in order to achieve scalability in distributed sensor networks by adapting the parameters of the algorithm. The data arrival rate, the sampling and the number of clusters are examples of parameters that are controlled by this monitoring process.

Previous works tend to concentrate a small part of the computation on local devices which may communicate with a centralized control station. An example is the VEDAS system, which aims at mobile vehicle monitoring (Kargupta et al., 2004). In this case, the distributed system monitors several characteristics of the vehicles, alerting for significant changes in their behavior, based on local data mining models. The system may also interact with the control station to alert the network or improve its model.

Learning localized alternative cluster ensembles is a related problem recently targeted by researchers. Wurst et al. (2006) developed the LACE algorithm, which collaboratively creates a hierarchy of clusters in a distributed way. This approach considered nodes as distributed users, who labeled data according to their own clustering definition, and applied ensemble techniques to integrate the clusterings provided by different sources, reflecting the locality of data while keeping user-defined clusters. This trade-off between global and local knowledge is now the key point for example clustering procedures over ubiquitous stream sources.

As previously exposed, clustering streaming data sources has already been targeted by researchers, in order to cope with the high-speed production of data. However, if this data

is produced by sensors on a wide network, the proposed algorithms tend to deal with it as a centralized multivariate stream, without taking into account the locality of data, the transmission and processing resources of sensors, and the breach in the transmitted data quality. Moreover, these algorithms tend to be designed as a single process of analysis, without the distributed features of already developed example clustering systems. Distributed implementations of well-known algorithms may produce both valuable and impractical systems, so the path to them should be carefully inspected. The analysis of clusters of ubiquitous stream sources should comply not only with the requirements for clustering streaming data sources but also with the available resources and setting of the corresponding sensor network. For example, considering the previous example of electricity distribution networks, if a distributed algorithm for clustering streaming sensors is integrated on each sensor, how can local nodes process data, and how can the network interact to cluster similar behaviors produced by sensors far from each other, without a centralized monitoring process? How many hops should the network need for that analysis? What is the relevance of this information? And what is the relation of this information with the geographical location of the sensors?

2.7 Desiderata for Ubiquitous Stream Learning

The scope of this work is the study of emerging techniques capable of dealing with current state-of-the-art stream data mining problems. In this way, three major areas should be addressed, considering the current setting of machine learning applications: streaming, ubiquitous and dynamic. First, given the fact that time and memory seem to be the bottleneck for current machine learning applications, traditional memory-based models with multiple passes over the data are unlikely to achieve high performance. Thus, empirical analysis and practical development of data stream mining techniques should be considered. This area must be complemented with a theoretical study on loss bounds and concept drift detection techniques, aiming at a robust definition of learning guarantees. Secondly, in recent times, information is generated and gathered from distributed data sources, leading efficient knowledge extraction from ubiquitous data sources to become a major research motivation. Considering current user application requirements and the ubiquitous data outbreak with limited-resource computing in distributed dynamical environments, the integration between data stream analysis and ubiquitous environments is also an essential issue to take into account. This integration should provide solutions that offer good performance and scalability in ubiquitous data stream mining.

Ultimately, the increasing number of data stream sources and the improvement of data gathering techniques are allowing the appearance of better machine learning methods, capable of investigating complex models. However, several conditions have to be met by a learning system in order to be capable of addressing data streams. A formal data stream computing theory should be enunciated to increase the robustness of future designs and implementations under the data stream paradigm. Moreover, the formalization of real-time accuracy evaluation will provide self-control for streaming machine learning, touching the global idea of algorithms reflecting on their own learning process.


3 Research Questions, Aims and Methodology Overview

A witty saying proves nothing.
Voltaire, French author, humanist, rationalist, & satirist (1694-1778)

Knowledge discovery from ubiquitous data streams has become a major goal for all sorts of applications. When no other information is available, usual knowledge discovery approaches are based on unsupervised techniques, such as clustering. Then, two sub-problems exist: clustering streaming data examples and clustering streaming data sources. The former searches for dense regions of the data space, identifying hot-spots where data sources tend to produce data, while the latter finds groups of sources that behave similarly through time. In this thesis we try to address three different research problems: first, we would like to explore scientific works which address distributed clustering of ubiquitous data streams; then, we focus on a precise problem of clustering distributed data streams from sensor networks; finally, we challenge the problem of distributed clustering of data stream sources. Overall, this thesis tries to answer three main research questions:

RQ1 Do distributed stream clustering algorithms improve data mining results when applied to ubiquitous data stream scenarios?

RQ2 Does local discretization and representative clustering improve validity, communication and computation loads when applied to distributed sensor data streams?

RQ3 Can a fully distributed clustering algorithm achieve a global clustering definition of the entire network without a centralized server?

The next sections present a detailed definition of the research questions addressed in this work, their aims, and a methodology overview.


3.1 Distributed Clustering of Ubiquitous Data Streams

A first step in this area is the exploration of scientific works which propose new clustering solutions for scenarios where data is produced as ubiquitous data streams.

Research Question: Do distributed stream clustering algorithms improve data mining results when applied to ubiquitous data stream scenarios?

Distributed Clustering from Ubiquitous Data Streams

Problem/Population: Ubiquitous Data Streams
Intervention: Distributed Stream Clustering
Control/Comparison: Centralized or Batch Clustering
Outcome: Data Mining Outcomes

Aim: Review existing scientific evidence of the existence of valuable distributed algorithms addressing clustering for ubiquitous data streams.

Objectives:
• expose scenarios where data is produced as ubiquitous data streams;
• identify clustering algorithms for ubiquitous data streams;
• point out advantages and disadvantages of distributed procedures.

Methodology Overview: Systematic review of papers proposing new distributed clustering algorithms, with descriptive analysis of results.


3.2 Clustering Distributed Sensor Streams

A common problem in sensor networks is the clustering of the data being produced by the network as a whole. We investigate whether local discretization and centralized clustering of representative data can achieve better results than simple centralized clustering.

Research Question: Does local discretization and representative clustering improve validity, communication and computation loads when applied to distributed sensor data streams?

Clustering Distributed Sensor Streams

Problem/Population: Distributed Sensor Data Streams
Intervention: Local Discretization and Centralized Representative Clustering
Control/Comparison: Centralized Data Clustering
Outcome: Communication, Computation and Clustering Validity

Aim: Perform distributed clustering on data produced by sensor networks.

Objectives:
• improve clustering validity when compared to a centralized approach;
• perform clustering with reduced communication load;
• perform clustering with reduced computation load;
• relate data clustering with sensor network comprehension.

Methodology Overview: Centralized clustering of representative points (frequent states) of the sensor network, which are defined by discretized cells of each sensor's observations. Evaluation is performed on synthetic data with Gaussian clusters, and on physiological sensor data.


3.3 Distributed Clustering of Streaming Data Sources

A relevant but less tackled problem is to accomplish a clustering definition of the data sources producing streams of data in a distributed fashion.

Research Question: Can a fully distributed clustering algorithm achieve a global clustering definition of the entire network without a centralized server?

Distributed Clustering of Streaming Data Sources

Problem/Population: Distributed Streaming Data Sources
Intervention: Distributed Clustering with No Central Server
Control/Comparison: Centralized Clustering on a Central Server
Outcome: Agreement between Local and Global Clustering

Aim: Cluster distributed streaming data sources without a centralized control process.

Objectives:
• define the requirements that systems performing clustering of ubiquitous streaming data sources should observe;
• define an efficient method to keep data stream sketches on local sites;
• develop a simple approach to achieve the global clustering definition locally at each node, without a central server;
• validate whether quality clustering can be achieved with such a procedure;
• relate clustering of data sources with sensor network comprehension.

Methodology Overview: Each data stream source acts as a processing node, keeping a sketch of its own data and a definition of the clustering structure of the entire network of data sources, which is the only information that is shared among direct neighbors. Evaluation is performed on synthetic data with Gaussian clusters.


4 Distributed Clustering from Ubiquitous Data Streams: Systematic Review

Beware of the man who works hard to learn something, learns it, and finds himself no wiser than before...
Kurt Vonnegut, Cat's Cradle, US novelist (1922-2007)

A first step in the area is the exploration of scientific works which propose new distributed machine learning solutions for scenarios where data is produced as ubiquitous data streams.

4.1 Chapter Overview

In this chapter, a review on distributed clustering of ubiquitous data streams is presented. In the next section, a small piece of background knowledge and the aim are presented. Section 4.3 presents the methodology implied in this study, while Section 4.4 presents the results of the review, exposing the selection process and analysing the characteristics of each individual study selected in it, and Section 4.5 describes the included studies in detail. Finally, Section 4.6 discusses the impact of the included studies in the research field.

4.2 Motivation and Aim

When no other information is available, usual knowledge discovery approaches are based on unsupervised techniques (e.g. clustering), either clustering streaming data examples or clustering streaming data sources. Clustering streaming data examples searches for dense regions of the data space, identifying hot-spots where data sources tend to produce data. Clustering streaming data sources finds groups of sources that behave similarly through time. The background for this work has already been thoroughly presented in Chapter 2, so we redirect the reader there for literature support. The research question we address in this chapter is:

Do distributed stream clustering algorithms improve data mining results when applied to ubiquitous data stream scenarios?

The aim of this work is to check existing scientific evidence of the existence of valuable distributed algorithms addressing clustering for ubiquitous data streams, namely:



• exposing scenarios where data is produced as ubiquitous data streams;
• identifying clustering algorithms for ubiquitous data streams; and
• pointing out advantages and disadvantages of distributed procedures.

4.3 Review Methodology

A systematic review was designed to answer the aforementioned research question. Systematic reviews are a specific type of methodology that tries to reduce the author bias usually introduced in the process of traditional bibliographic reviews (Pai et al., 2004). In the next sections we present the specific approach undertaken in each part of the selection and analysis process.

4.3.1 Eligibility Criteria

Each selected article should be scanned for its compliance with the aim of the review. Our specific inclusion criteria were:

1. the study addresses a problem where data is produced ubiquitously by distributed data sources in the form of high-speed data streams;
2. a distributed clustering system that learns from ubiquitous data streams is proposed;
3. comparison is made with one of the following: a centralized stream learning method, a distributed batch learner, or another distributed stream learning system;
4. evaluated outcomes include validity, speed, computation load, communication load, resources consumption or general impact.

4.3.2 Search Strategy and Study Selection

On the 17th of March, 2010, a comprehensive search of the literature was conducted by the author, who searched ISI Web of Knowledge using the search terms ubiquitous/distributed, data streams and clustering, and their derivatives and synonyms. The choice of this reference database was based on the assumption that all major journals and major conferences are indexed there, so there should be few relevant articles missing in the search, aside from the unavoidable bias created by the query itself. No date restrictions were applied, but articles were excluded if they were from the general categories Social Sciences or Arts & Humanities. No patent was analyzed in this review. Table 4.1 presents the complete search query.

Table 4.1: Complete query used in the search for articles indexed in ISI Web of Knowledge.

Topic=((ubiq* OR distrib* OR collaborat* OR p2p OR peer-to-peer OR peer2peer OR "peer to peer" OR "peer 2 peer") AND (stream* data OR data stream* OR continuous flow of data OR high-speed data OR incrementa*) AND (clusterin* OR cluster analysis OR automatic classification OR segmentation* OR numerical taxonomy OR botryology OR typological analysis))
Refined by: [excluding] General Categories=(SOCIAL SCIENCES OR ARTS & HUMANITIES) AND [excluding] Document Type=(PATENT)
Timespan=All Years.

A single reviewer scanned all titles and abstracts of retrieved articles. At this stage, only articles that were not about data clustering or, if they were, did not address data streams, were excluded, in order to keep a sensitive selection. After obtaining the full reports of potentially relevant articles, the author assessed eligibility from the full-text documents.

4.3.3 Statistical Analysis

Descriptive analysis is performed using the adequate summary measures for each variable. The impact of each area is measured using the h-index (Hirsch, 2005). Difference in proportions is assessed using Wilson's score 2-sample test for equality of proportions with continuity correction (Wilson, 1927; Fleiss et al., 2003). Association between categorical characteristics should be tested using the χ² test (Pearson, 1900). However, statistical association between dichotomous variables which do not verify the assumptions for Pearson's χ² test is assessed with Fisher's exact test (Fisher, 1922). We consider a significance level of 5%.
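To make the testing procedure concrete, a small sketch in Python with scipy follows; the 2×2 counts are illustrative placeholders, not data from this review, and the equivalence used is that the continuity-corrected 2-sample proportion test matches Pearson's χ² test with Yates' continuity correction on the corresponding 2×2 table.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Illustrative 2x2 contingency table (placeholder counts): rows split
# the papers by area, columns by some dichotomous characteristic.
table = [[10, 3],
         [1, 4]]

# The 2-sample test for equality of proportions with continuity
# correction is equivalent to the chi-squared test with Yates'
# continuity correction on the 2x2 table:
chi2, p, dof, expected = chi2_contingency(table, correction=True)

# When expected counts are too small for the chi-squared assumptions,
# Fisher's exact test is used instead:
odds_ratio, p_exact = fisher_exact(table)

def h_index(citations):
    """h-index: largest h such that h papers have at least h citations."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(ranked) if c >= i + 1)
```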

4.3.4 Study Characteristics

A protocol was designed to record the following properties of each study:

• author, date, country and type of publication;
• type of learning problem (clustering data or clustering data sources);
• setting (e.g. general method, sensor networks, mobile devices, etc.);
• aim of the system (clustering per se or using clustering for another outcome);
• distributed procedures (e.g. fully distributed clustering, central ensemble of local clusters, distributed processing but centralized clustering, etc.);
• data partition (homogeneous or heterogeneous);
• performed comparison (e.g. distributed vs centralized learner, batch vs stream learner, self-analysis, etc.);
• main outcomes evaluated (e.g. validity, speed, computation load, resources, etc.).

4.3.5 Outcome Measures

The primary outcome measure of machine learning systems is directed to the target function that was defined in the machine learning problem (Mitchell, 1997). This way, most proposals assess quality in terms of validity and accuracy. However, when dealing with learning from ubiquitous data streams, other issues become just as important, as stressed in Chapter 2:

• processing speed, as data is produced in high-speed streams, with limited time available to process each single data point;
• memory used, as streams are unbounded in length, hence systems should operate with limited memory availability;
• communication load, as ubiquitous systems need to include interactions among distributed nodes;
• other system resources, such as battery power, which are most of the time the major bottleneck for operation on mobile devices.

Other outcomes might arise from research, as some of the studies might in fact use clustering as a tool to meet a domain-specific outcome.

4.4 Search Results

The search query selected 419 papers published between 1990 and 2010 (216 journal articles, 195 meeting papers and 8 reviews).

Figure 4.1: Articles selection process.

419 papers selected: 216 articles, 195 meeting papers, 8 reviews.

309 papers excluded after title and abstract review:
• 209 not about data clustering (e.g. machine clustering in computer science, star clusters in astronomy, solid-state clustering in physics);
• 100 about data clustering but not streaming algorithms (e.g. water streams in ecology, light streams in astronomy, video streams in optics).

110 papers not excluded after abstract review.

27 papers excluded prior to full-text review:
• 14 full text not available (2 possibly ubiquitous);
• 13 previous versions of other included articles.

83 papers considered for full-text review.

66 papers excluded after full-text review:
• 54 stream data clustering but not ubiquitous data;
• 12 not about streaming data clustering.

17 papers included in review: 3 articles, 14 meeting papers.

4.4.1 Selection Process

Figure 4.1 presents the entire selection process, from the search query to the final included sample. After title and abstract analysis, 309 papers were excluded, either because they did not address data clustering (209 papers) or, if they did, because they did not address data streams (100 papers). It is interesting to note that no article prior to 1997 was selected, exactly the year of publication of a seminal paper presenting BIRCH, one of the first successful algorithms for clustering data streams (Zhang et al., 1997). This way, 110 papers were considered for inclusion after abstract and title review. From these, 14 had no full text available (but only two of them seemed possibly about ubiquitous data streams), and 13 were previous versions of other included articles. Hence, a total of 83 papers were fully reviewed. From these, most were excluded due to not addressing ubiquitous data (54 papers) and 12 were actually not about streaming data clustering. Finally, 17 papers were included in the review, only three of which were articles published in international journals. Table 4.2 presents a chronological list of the main characteristics of the included studies.


Table 4.2: Chronological list of retrieved articles: reference, country of first author's affiliation, addressed problem (CD: clustering data or CS: clustering data sources) and specific setting (SN: sensor networks, MD: mobile devices, WD: web data, or general method when no specific setting applies).

Reference                      Country      Problem   Setting
(Alaybeyoglu et al., 2010)     Turkey       CS        SN
(Wolff et al., 2009)           Israel       CD        general
(Gil-Costa et al., 2008)       Argentina    CD        WD
(Rodrigues et al., 2008b)      Portugal     CD        SN
(Yin & Gaber, 2008)            Australia    CS        SN
(Zhang et al., 2008)           USA          CD        general
(Cormode et al., 2007)         USA          CD        general
(Horovitz et al., 2007)        Australia    CD        MD
(Zhou et al., 2007)            China        CD        general
(Bandyopadhyay et al., 2006)   USA          CD        general
(Qiang et al., 2006)           China        CD        general
(George & Merugu, 2005)        USA          CD, CS    WD
(Qiang et al., 2005)           China        CD        general
(Son et al., 2005)             USA          CD        SN
(Chow et al., 2004)            Hong Kong    CS        MD
(Rabbat & Nowak, 2004)         USA          CD        SN
(Shek et al., 1999)            USA          CS        MD

4.4.2 Overview of Included Studies

In the following, we shall analyse the included studies by the type of problem addressed in them: clustering ubiquitous data streams (clustering data) or clustering ubiquitous streaming data sources (clustering data sources). Article characteristics by type of learning problem are presented in Table 4.3. As expected from prior knowledge of the field, most of the papers address clustering data (76%) rather than clustering data sources (29%), which is the first statistically significant difference found in the results (p=0.016, Wilson's score 2-sample test with continuity correction). We should stress that one of the articles was actually labeled as addressing both problems (George & Merugu, 2005), so the totals in the results may not add up to the sum of both parcels. Regarding year of publication, we can see that this is a rather recent research area, with only 18% of the papers being published before 2005, and with an increasing trend at least for the problem of clustering data (the reduced incidence for years 2009-2010 might be due to delay in indexing). From our sample we can also note that most of the papers were published by

Table 4.3: Aggregated overview of included studies' characteristics by area of clustering: clustering data or data sources. Analyzed variables are the year, place and type of publication, the impact of contributions in the research community, and the specific setting of the proposal.

                                 Clustering Data   p        Clustering Sources   p        Total
How many? n (% within total)     13 (76)           0.016¹   5 (29)               0.016¹   17 (100)

When? n (% within row)
  2009-2010                      1 (50)            0.427²   1 (50)               0.515²   2 (12)
  2007-2008                      6 (86)            0.603²   1 (14)               0.338²   7 (41)
  2005-2006                      5 (100)           0.261²   1 (20)               1.000²   5 (29)
  < 2005                         1 (33)            0.121²   2 (67)               0.191²   3 (18)

Where? n (% within row)
  America                        7 (88)            0.577²   2 (25)               1.000²   8 (47)
  Asia                           4 (80)            1.000²   1 (20)               1.000²   5 (29)
  Europe                         1 (50)            0.427²   1 (50)               0.514²   2 (12)
  Oceania                        1 (50)            0.427²   1 (50)               0.514²   2 (12)

How published? n (% within row)
  Journal Article                2 (67)            1.000²   1 (33)               1.000²   3 (18)
  Meeting Paper                  11 (78)                    4 (28)                        14 (82)

What impact?
  Cited papers, n (% within row) 4 (80)            1.000²   2 (40)               0.600²   5 (29)
  Citations, median (range)      5.5 (1-15)                 3.0 (2-4)                     4.0 (1-15)
  h-index                        2                          2                             3

Which setting? n (% within row)
  General Method                 7 (100)           0.104²   0 (0)                0.044²   7 (41)
  Wireless/Sensor Nets           3 (60)            0.538²   2 (40)               0.600²   5 (29)
  Mobile Devices                 1 (33)            0.121²   2 (67)               0.191²   3 (18)
  Web Data                       2 (100)           1.000²   1 (50)               1.000²   2 (12)

¹ Wilson's score 2-sample test for equality of proportions with continuity correction
² Fisher's exact test

researchers working in America (47%) or in Asia (29%), with a statistically significant minority prevalence of journal papers of 18% (p=0.015, Wilson's score 1-sample test with continuity correction), not significantly different between the two areas (p=1.000, Fisher's exact test).

Regarding the impact of the retrieved articles, we should stress that the citation values are only indicative as, as we observed in some of the articles, the citation is not correctly included in the paper, leading to a missing citation; for example, (Rodrigues et al., 2008b) cites (Cormode et al., 2007) but the citation was not indexed. Nevertheless, from these results, we can note that both areas present a similar h-index (Hirsch, 2005). The reader can also note that none of the included studies which addressed clustering of data sources was proposed as a general method, always bounding the scope of the proposal to a specific setting, such as sensor networks (40%) or mobile devices (40%). In fact, the association between proposing a general method and addressing clustering of data sources was found to be statistically significant (p=0.044, Fisher's exact test). On the other side, most of the papers addressing clustering of data were proposed as a general method (54%), while 23% of the contributions were in the scope of wireless/sensor networks.

Regarding the actual focus of the studies, Table 4.4 presents the analysis of the proposed contributions by type of problem. Statistical differences were found between the two areas regarding the aim of the proposal. While in clustering streaming data 77% of the papers addressed clustering per se, only 20% of the papers on clustering streaming data sources did likewise, with the remaining ones using clustering as a basis for another outcome. The association between proposing a clustering algorithm per se and addressing clustering of streaming data sources was found statistically significant (p=0.028, Fisher's exact test), supporting the idea that researchers in this area are focusing on domain-specific problems, while clustering streaming data is already established as a relevant problem per se.

Another key issue in the analysis of the retrieved studies is the amount and type of distributed procedures applied by each of them. Most of the studies applied some kind of distributed procedure within clustering (76%), especially in the case of clustering streaming data (85%), while the remaining applied only a distributed processing of the data, leaving the clustering procedure to a centralized server. However, the statistical evidence of the association between this variable and addressing clustering of streaming data was not entirely conclusive within the 5% significance level (p=0.053, Fisher's exact test). Nevertheless, more than half of the studies which applied distributed clustering performed only a central ensemble of local clusters (54%) rather than actually applying a fully distributed procedure (46%). Nonetheless, all the papers addressed distributed data, with 71% considering homogeneously distributed data (92% in clustering streaming data) while 29% considered heterogeneously distributed data (80% in clustering streaming data sources).


Table 4.4: Aggregated overview of included studies' approach by area of clustering: clustering data or data sources. Analyzed variables are the setting, the specific aim of the proposal, how clustering is distributed in the system, how data is partitioned across processing nodes, the performed comparison and the outcomes measured for evaluation.

                                        Clustering Data   p         Clustering Sources   p         Total
How many? n (% within total)            13 (76)           0.016¹    5 (29)               0.016¹    17 (100)

Which aim? n (% within row)
  Clustering per se                     10 (91)           0.099²    1 (9)                0.028²    11 (65)
  Clustering as basis for other outcome 3 (50)                      4 (67)                         6 (35)

How distributed? n (% within row)
  Central Ensemble of Local Clusters    6 (86)            0.053²*   2 (29)               0.538²*   7 (41)
  Actually Distributed Clustering       5 (83)                      1 (17)                         6 (35)
  Centralized Clustering                2 (50)                      2 (50)                         4 (24)

How partitioned? n (% within row)
  Homogeneous Clustering                12 (92)           0.002²    1 (20)               0.010²    12 (71)
  Heterogeneous Clustering              1 (8)                       4 (80)                         5 (29)

Which comparison? n (% within row)
  Centralized Stream Clustering         7 (88)            0.577²    1 (12)               0.294²    8 (47)
  Centralized Batch Clustering          3 (50)            0.098²    3 (50)               0.280²    6 (35)
  No Comparison                         3 (75)            1.000²    1 (25)               1.000²    4 (24)
  Distributed Stream Clustering         1 (50)            0.427²    1 (50)               0.515²    2 (12)
  Other Comparison                      2 (100)           1.000²    1 (50)               0.515²    2 (12)
  Distributed Batch Clustering          0 (0)             0.235²    1 (100)              0.294²    1 (6)

Which measured outcomes? n (% within row)
  Scalability and Sensitivity           6 (67)            0.577²    3 (33)               1.000²    9 (53)
  Communication Load                    7 (78)            1.000²    2 (22)               0.620²    9 (53)
  Clustering Validity                   7 (88)            0.577²    1 (12)               0.294²    8 (47)
  System Speed                          7 (88)            0.577²    2 (25)               1.000²    8 (47)
  Other                                 3 (60)            0.538²    2 (40)               0.600²    5 (29)
  Resources Consumption                 2 (50)            0.219²    2 (50)               0.538²    4 (24)

¹ Wilson's score 2-sample test for equality of proportions with continuity correction
² Fisher's exact test
* Recoded into 'Distributed Clustering' and 'Centralized Clustering'

Table 4.5: Aggregated overview of included studies' reported advantages and disadvantages by outcome and area of clustering: clustering data or data sources. Analyzed outcomes include: scalability and sensitivity, communication, validity, speed and resources consumption.

                                  Clustering Data   Clustering Sources   Total
How many? n (% of total)          13 (76)           5 (29)               17 (100)

Scalability and Sensitivity, n (% of outcome)
  Scalable to # Clusters          2 (33)            2 (67)               4 (44)
  Scalable to # Nodes             2 (33)            2 (67)               4 (44)
  Scalable to # Dimensions        1 (17)                                 1 (11)

Communication Load, n (% of outcome)
  Improved                        7 (100)           2 (100)              9 (100)

Clustering Validity, n (% of outcome)
  Improved                        1 (14)                                 1 (12)
  Similar                         6 (86)            1 (100)              7 (88)

System Speed, n (% of outcome)
  Improved                        5 (71)            1 (50)               5 (63)
  Similar                         1 (14)                                 1 (13)

Resources Consumption, n (% of outcome)
  Improved                        2 (100)           1 (50)               3 (75)
  Worsened                                          1 (50)               1 (25)

Proportions should be considered on the number of studies addressing each outcome (see Table 4.4).

Regarding this characteristic, clustering streaming data is being addressed for homogeneously distributed data (p=0.002, Fisher's exact test) while clustering streaming data sources is being addressed for heterogeneously distributed data (p=0.010, Fisher's exact test).

Considering the main outcomes evaluated by the papers, and the comparison used to evaluate them, it is interesting to note that few papers addressed resources consumption (24%), as most were focused on scalability and sensitivity analysis and on communication load (53% each), followed by clustering validity and processing speed (47% each). As far as a comparison has been reported, centralized methods have been widely used, rather than other distributed systems. In fact, only two studies used a distributed stream clustering system for comparison. Worse than that, four studies did not include any external comparison.

From the outcomes, advantages and disadvantages of distributed procedures could be extracted. Table 4.5 presents the results for each outcome. Overall, the studies report that distributed methods improve communication ratios, processing speed and resources consumption, while achieving clustering validity similar to the comparison.


4.5 Description of Included Studies

The quality and impact of the different contributions is rather heterogeneous across the two clustering problems. This section exposes the main characteristics of each study, focusing on what the studies have targeted, what they have proposed, and how they have evaluated the contribution. Given their differences, studies are presented separately according to the clustering sub-problem they address: data points or data sources.

4.5.1 Clustering Ubiquitous Streaming Data Points

Clustering data points is probably the most common sub-problem of clustering, and the ubiquitous streaming setting is no exception (Rodrigues & Gama, 2007), so several recent studies have addressed this problem. From this review, the most recent one is the one presented by Wolff et al. (2009), where a generic local approach is proposed for mining data streams in distributed settings, with applications to clustering of homogeneously distributed data. In fact, all but one of the retrieved articles address homogeneously distributed data. The exception is the work of Rodrigues et al. (2008b), where each node of a sensor network produces a univariate stream of data, and the aim is to cluster the examples produced as the global multivariate state of the network composed by all sensor streams. An interesting result is the fact that only one article addressed performing clustering on mobile devices (Horovitz et al., 2007), focusing on the narrow resource requirements implied in such scenarios. Given the different impact they imply, we shall analyse the studies according to whether they are generalized clustering methods or domain-specific approaches.

4.5.1.1 Generalized Methods for Distributed Clustering of Data Streams

In this area, the proposal of general clustering algorithms is diverse, with most of the studies trying to escape the scope bottleneck. Examples of such proposals appear in (Zhang et al., 2008) and (Zhou et al., 2007), and in the double approach of Qiang et al. (2005, 2006). But the two main breakthroughs in the proposition of generalized methods for clustering ubiquitous data streams were published by Cormode et al. (2007) and Wolff et al. (2009).

Cormode et al. (2007) proposed different strategies to achieve clustering of homogeneously distributed data streams, with local and global computations, in order to balance the communication costs. They considered techniques based on the Furthest Point algorithm (Gonzalez, 1985), which gives an approximation for the radius and diameter of clusters with a guaranteed cost of at most twice the cost of the optimal clustering. They also present the Parallel Guessing strategy, which gives a slightly worse approximation but requires only a single pass over the data.


They studied both the effect of global clustering using distributed procedures (Global-FP and Global-PG) and the effect of merging local clusterings at the coordinator site (Local-FP and Local-PG). Evaluation was performed on both real and synthetic data sets, comparing the four distributed approaches among themselves and against their centralized batch counterparts. Outcomes focused on cluster validity, in terms of the Euclidean radius, and on communication ratios. Overall, the authors observed that, in terms of validity, local algorithms present results similar to their centralized counterparts. On the other hand, PG variants usually outperformed FP variants in terms of communication, being less sensitive to the number of clusters and sites: PG variants are linear with the number of clusters while FP variants are quadratic; global algorithms are linear with the number of sites while local algorithms present worst-case linear or sublinear trends with the number of sites. Empirically, for Local-PG, the communication was kept as low as 0.1% to 0.5% of the full data. A small criticism is that some of the experiments were actually the clustering of time series subsequences, which has been shown to produce non-meaningful results (Keogh et al., 2003). Nevertheless, they conclude that, in actual distributed settings, it is frequently preferable to track each site locally and combine the results at the coordinator site, supporting the approach of central ensembles of local clusterings.

More than a generalized clustering algorithm, Wolff et al. (2009) proposed a local approach for mining data streams in distributed settings, with obvious applications to clustering. They presented a generic local algorithm for computing a given function on the averages of the input vectors, which are being produced homogeneously across distributed nodes. Possible functions included in the article were the L2 norm threshold, vector mean monitoring and k-means monitoring. Although theoretically supported, the proposal was also heavily evaluated with experimental work on synthetic data, focusing on the correctness of the results (distance to optimal centroids) and the communication involved, comparing the effect of sampling on both centralized and distributed algorithms. They observed that, at least for the k-means monitoring problem, most of the validity error of the distributed algorithm is due to sampling and not to decentralization, which speaks in favour of local algorithms for clustering. With respect to communication, the local approach was shown to decrease it to values around 15% for high sample sizes.

Besides the support on local algorithms for clustering, they also opened the path for research on defining the hardness of locally computing a given function, which is still to be studied.
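The core local trick can be caricatured as follows (a deliberately simplified sketch, not the authors' actual stopping rule, whose conditions are geometric rather than a plain norm comparison): a node communicates only when its local view could change the global decision.

```python
import numpy as np

def node_should_transmit(local_avg, agreed_avg, threshold):
    """Simplified local test: while the locally known average and the
    value last agreed with the neighbours fall on the same side of an
    L2-norm threshold, the node can stay silent without risking a wrong
    decision in its neighbourhood; it transmits only on disagreement."""
    return (np.linalg.norm(local_avg) > threshold) != \
           (np.linalg.norm(agreed_avg) > threshold)
```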

Zhou et al. (2007) proposed CluDistream, a test-and-cluster variant of Expectation-Maximization (Dempster et al., 1977) for distributed scenarios, where local sites keep track of current models, and only cluster a new chunk of data if the current model does not explain the data. The central coordinator keeps track of the models at each local site, merging all of them into a global model, but communication between the local sites and the coordinator is reduced by transmitting only the model parameters, and only when the data distribution changes at the local site.

Evaluation was performed on both real and synthetic Gaussian data sets, to assess benefits in terms of cluster validity (using the average log-likelihood of the model), communication cost, CPU time and memory consumption, when compared to scalable EM, a centralized subsample-based stream clustering algorithm (Bradley et al., 2000). The proposed algorithm presents a lower trend in communication with the increasing number of updates, while presenting the same or higher average log-likelihood than the compared sample-based EM. With respect to processing time and memory, CluDistream outperforms the comparison, while being linearly sensitive to the number of clusters and to data dimensionality. Overall, the authors conclude that the test-and-cluster framework has less communication cost, memory consumption and CPU time, and yet higher clustering quality in terms of average log-likelihood.
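A minimal sketch of the test-and-cluster idea follows, using scikit-learn's GaussianMixture as a stand-in for the EM model; the chunk interface, the log-likelihood threshold and the send_to_coordinator callback are illustrative assumptions, not details from the paper.

```python
from sklearn.mixture import GaussianMixture

class TestAndClusterSite:
    """Local site sketch: keep a Gaussian mixture, refit it (and ship
    the new parameters to the coordinator) only when a fresh chunk is
    poorly explained by the current model."""
    def __init__(self, k, loglik_threshold, send_to_coordinator):
        self.k = k
        self.threshold = loglik_threshold
        self.send = send_to_coordinator
        self.model = None
    def process_chunk(self, chunk):
        # score() returns the average per-sample log-likelihood
        if self.model is None or self.model.score(chunk) < self.threshold:
            self.model = GaussianMixture(n_components=self.k).fit(chunk)
            # transmit only the (small) parameter set, not the raw data
            self.send(self.model.weights_, self.model.means_,
                      self.model.covariances_)
```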

Later on, Zhang et al. (2008) also aimed at reducing communication when they presented a collection of algorithms for (1 − ε)-approximate k-median clustering on distributed data streams, for three different network topologies: topology-oblivious, height-aware and path-aware. The system sets its basis on the in-network aggregation approach (Madden et al., 2002), where summaries of data are computed at local sites and then merged at higher-level nodes. Evaluation was performed on both real and synthetic data sets, mainly assessing communication reduction (with respect to the stream size) given an admissible validity error and number of sites. They observed that the total communication of the topology-oblivious algorithm is higher than that of the height-aware algorithm, which in turn is higher than that of the path-aware algorithm (maximum per-node transmission of 20%), indicating a polylogarithmic relationship between data transmission and total stream size, asymptotically linear with the number of sites. Furthermore, all algorithms respect the theoretical bounds the authors have proposed.

Approximate approaches had also been previously proposed by Qiang et al. (2005, 2006), who presented two similar (but rather unsupported by experimental evidence) grid-based methods for incremental and parallel clustering of very large databases. The first one (WINP) uses a space-sliding window to detect approximate locations of clusters before the accurate clustering processing takes place (Qiang et al., 2005). The second one (SMIP) identifies high-density cells, which are used to define a density threshold to filter noise in the data space (Qiang et al., 2006). The system then uses this grid-space approach to homogeneously split the data across the processors, afterwards performing a centralized merge of the local clusters. The authors argue that WINP and SMIP computation complexities are O(N), where N is the length of the stream, but the evaluation was really thin, performed on two-dimensional synthetic data (besides one experiment with 5-dimensional data), where the validity of the centralized incremental approach is only indicated with toy examples, and only processing speed is compared with DBSCAN (Ester et al., 1996), which was not quite a fair comparison given the extent of centralized stream clustering algorithms that could have been tested, e.g. STREAM (O'Callaghan et al., 2002) or CluStream (Aggarwal et al., 2003).
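To illustrate the density-grid idea behind such methods (a generic sketch, not the actual WINP/SMIP procedures; the cell width and the noise threshold are free parameters):

```python
from collections import Counter

def dense_cells(points, cell_width, min_count):
    """Hash each point to its grid cell and keep only cells whose
    population reaches the density threshold, filtering out noise
    before any accurate clustering step."""
    counts = Counter(
        tuple(int(coord // cell_width) for coord in p) for p in points
    )
    return {cell for cell, n in counts.items() if n >= min_count}
```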


From the considered generalized proposals, especially the ones from 2007 onwards, it seems that both fully distributed clustering and central ensembles of local clusterings deliver when dealing with ubiquitous streaming data points. Furthermore, specific domains of application can allow algorithms to better fit the addressed problem.

4.5.1.2 Domain-Specific Methods for Distributed Clustering of Streams

Given their specific constraints, wireless and sensor networks are one of the most targeted scenarios producing ubiquitous data streams. Hence, proposals in this setting can give relevance to important aspects of the clustering process.

Rodrigues et al. (2008b) proposed the Distributed Grid Clustering (DGClust) system, where each node of a sensor network produces a univariate stream of data, and the aim is to cluster the examples produced as the global multivariate state of the network composed by all sensor streams. The system works by using local discretization of streams (local states) to reduce communication in the network, and centralized clustering of the most frequent global states to reduce the dimensionality of the clustering problem. Local site processing time in DGClust is O(log p), using O(p) space, where p is the maximum number of bins in the discretization histogram of any node. Central monitoring of frequent states is done in O(md) space and updated in O(m) time, where m is the number of frequent states to monitor and d the number of sensors (data dimensions) in the network. The clustering update procedure runs in O(km) time, where k is the number of clusters. Communication between the local sites and the central server, every time it is necessary, is in the worst-case scenario O(wd), where w is the maximum number of local states of any node. Hence, worst-case communication should be O(wdc), where c is the small fraction of the data stream points actually triggering this communication.
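A minimal sketch of the local-site behaviour follows, assuming fixed equal-width bins for clarity (the actual system adapts the grid online, and the send callback to the central server is a placeholder):

```python
class LocalGridSite:
    """DGClust-style local site sketch: discretize a univariate stream
    into p bins and transmit the bin index (the local state) only when
    it differs from the last one sent."""
    def __init__(self, low, high, p, send):
        self.low, self.width = low, (high - low) / p
        self.p, self.send = p, send   # send: callback to central server
        self.state = None             # last transmitted local state
    def update(self, value):
        b = int((value - self.low) / self.width)
        b = min(self.p - 1, max(0, b))     # clamp out-of-range readings
        if b != self.state:                # communicate only on change
            self.state = b
            self.send(b)
```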

Evaluation was performed on synthetic and real data sets, measuring validity in terms of loss to the real centroids (synthetic data) and to batch-learned centroids (real physiological data), and reduction in terms of processing time and communication with the central server, when compared with a variant of STREAM (O'Callaghan et al., 2002), a centralized stream clustering algorithm which would require full data transmission. The authors report that the system transmits only around 30% of the total amount of values, while cluster processing was reduced to 10% of the data points. Nonetheless, with the right parameter choice, DGClust could outperform the centralized clustering in terms of loss to the batch centralized centroids. The authors conclude that the grid-based approach improved the reduction in all measured resources, which are usually limited in sensor networks.

Two other related studies addressed clustering data produced by sensor networks. Rabbat & Nowak (2004) proposed an in-network distributed optimization algorithm, showing experimental evidence of its applicability with simulated applications in robust estimation, source localization, cluster analysis and density estimation. Regarding clustering, the authors derived an incremental version of the Distributed EM (Nowak, 2003) but tested it only on a simple example. Nevertheless, they theoretically conclude that, for a network of n nodes uniformly distributed over the unit square (d = 2) or cube (d = 3) and m measurements per node, the number of communications required by the distributed algorithm is roughly a factor of K/(m n^(1/d)) less than the number required to transmit all the data to a centralized location for processing, where K is the number of processing cycles.

Later on, Son et al. (2005) derived a clustered version of the in-network distributed optimization algorithm proposed by Rabbat & Nowak (2004), showing its application to clustering and the benefits of in-cluster approaches for distributed optimization in sensor networks. The authors report that, for a network of n sensors with m observations each, communication reduces from O(mn), in the centralized scenario, to O(√n). Evaluation on synthetic data focused on energy, convergence speed and communication latency, but the generalization of the results is not very easy. Nonetheless, the authors conclude that, for a network with n sensors, by forming √n clusters, the transport cost is equivalent to that of the distributed in-network scheme, while increasing both the accuracy of the estimate and the convergence speed, and reducing latency.

An abstraction of sensor networks can also be a relevant scenario, which is the case where data is produced on peer-to-peer (P2P) networks. Bandyopadhyay et al. (2006) proposed a P2P K-Means approach to cluster homogeneously distributed data, where nodes collaboratively use cluster information from other nodes, either from immediate neighbor nodes or from nodes randomly sampled from the entire network. The proposal also uses concepts extracted from (Domingos & Hulten, 2001) to bound the number of messages transmitted in the network.

The communication complexity of P2P K-Means is roughly bounded by O(pkNK(D + 1)P), where p is the total number of nodes in the network, k is the number of iterations, N is the number of partners that communicate with node U, K is the number of clusters, D is the dimensionality of the data points, and P is the maximum length of all the shortest paths from the current node to its partners.

Evaluation with synthetic data was performed to assess benefits in terms of accuracy and communication costs when compared with a centralized approach. They conclude that P2P K-Means delivers clusters that are very comparable to the clustering done by centralized K-Means (less than 1% mislabeled points in almost all cases), while exchanging less than 20% of the messages necessary to move the data points to a central node.

Ubiquitous streaming data is also being produced in Web-based applications. While sensor networks present difficulties related to resources and linkage, the Web presents hard scalability scenarios, due to the existence of multiple P2P or client-based applications, which often require high-speed processing.

Gil-Costa et al. (2008) addressed the problem imposed by recent web search engines, where efficient parallel query processing is required under continuous streams of queries. Although data is initially produced centrally, clusters are kept distributed across processors, with the update operator (when a new data item arrives) forcing a distributed and incremental clustering procedure to occur. Evaluation was done on two real data sets, measuring cluster validity, running time and processing load (in terms of the number of computed distances), but no external comparison was endured, diminishing the generalization of the results presented in the study.

Another relevant problem in web scenarios is related to recommender systems, where websites may recommend items (e.g. movies or web pages) to users, based on information gathered from both the users and the items. George & Merugu (2005) proposed a co-clustering approach to achieve collaborative filtering of items. The key idea is to simultaneously obtain user and item neighborhoods and generate predictions based on the average ratings of the co-clusters. Also, the system takes into account the individual biases of the users and items to make valuable predictions.
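A sketch of this kind of bias-corrected co-cluster prediction follows (an illustration in the spirit of the paper, not its exact estimator; R, rho and gamma are hypothetical names for the rating matrix and the row/column cluster assignments):

```python
import numpy as np

def predict_rating(R, rho, gamma, u, i):
    """Co-cluster average corrected by user and item biases.
    R: (users x items) rating matrix with np.nan for missing entries;
    rho/gamma: arrays mapping users/items to row/column clusters."""
    users_g = rho == rho[u]            # users in u's cluster
    items_h = gamma == gamma[i]        # items in i's cluster
    coc = np.nanmean(R[np.ix_(users_g, items_h)])   # co-cluster average
    user_bias = np.nanmean(R[u]) - np.nanmean(R[users_g])
    item_bias = np.nanmean(R[:, i]) - np.nanmean(R[:, items_h])
    return coc + user_bias + item_bias
```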

Evaluation was performed on a real data set, focusing on prediction accuracy, prediction training and prediction times. Hence, no clustering-related outcome was evaluated in this proposal, but processing times were reported to be improved using the co-clustering approach.

Finally, a recent research hot-topic is data mining on mobile devices. In this review, only one study was included which addressed this resource-restricted setting. Horovitz et al. (2007) addressed road safety issues, proposing a fuzzy logic approach for ubiquitous data clustering. The clustering is used as a data synopsis, feeding a later stage of online monitoring and classification of driving behavior. Clustering is performed locally at each vehicle, and the results are sent to a central server. The server then merges the clustering results and, using fuzzy logic principles and expert knowledge, creates labeled classes of degree of drunkenness, which are then returned to the vehicles, which classify new unlabeled data. Evaluation was performed on synthetic data, focusing only on the accuracy of the classification task, without any considerations on the resources of mobile devices.

Although proposed in specific settings, some of the contributions are extensible to other ubiquitous scenarios. This is one of the cases where applied research might in fact boost fundamental research on these topics. From the analysis performed in this section, we believe that the works of (Bandyopadhyay et al., 2006) and (Rodrigues et al., 2008b) are worthy references for further research.

4.5.2 Clustering Ubiquitous Streaming Data Sources

As expected, few papers address clustering of ubiquitous data sources. Moreover, most of the existing ones use clustering as a tool to evaluate other outcomes, rather than evaluating clustering per se. The exception is the work of Yin and Gaber (2008), which aimed at performing hierarchical clustering of nodes in a sensor network, based on the similarity among the time series produced by each sensor. It was one of the three articles that apply some kind of distributed clustering, whereas the remaining two articles only apply some distributed processing procedure, leaving the clustering task to a central server. The remaining studies retrieved in the search use clustering as support for evaluating other outcomes. Given the specific application settings of these studies, the most relevant outcomes evaluated in them are not the quality of the clustering, but the effect of the clustering procedure on the target goal.

4.5.2.1 Centralized Clustering of Distributed Streaming Data Sources

The oldest articles included in the review addressing clustering of ubiquitous streaming data sources actually perform centralized clustering. The distributed procedure targets only the processing of the data stream, mainly for data reduction, helping the centralized procedure of actually clustering the sources.

Chow et al. (2004) used clustering of mobile peers which have similar moving patterns and exhibit similar data affinity in order to improve cooperative cache management in data retrieval for mobile clients. Collaborative caching (COCA) in mobile hosts makes use of the node's neighborhood to query for data items in their caches before asking the centralized server, hence reducing global communication. The authors proposed GroCOCA, a group-based approach where cache queries are only sent to neighbors which belong to the cluster of the querying node, with centralized clustering being based on mobility patterns (geographically close nodes) and data access patterns (operationally close nodes). Evaluation is done in simulated scenarios, focused on latency, server request ratios and power consumption, comparing with a stream clustering approach (NC), based only on geographical patterns, and a batch clustering of nodes (SK). The authors report that the performance difference between GroCOCA and SK increases as the server request ratio decreases. Also, in GroCOCA, the accuracy of the clustering algorithm depends on the update rate of the mobility and data access information. Regarding scalability, latency in NC increases with the number of peers, given the increase in workload. However, in COCA and GroCOCA, latency improves slightly with the number of peers, as the probability of finding the required items in neighboring peers also increases. Since there is no collaboration among peers in NC, its performance is not affected by varying the group size. Oppositely, COCA and GroCOCA do better with larger groups, with GroCOCA performing better than COCA. The authors conclude that COCA substantially reduces the access latency and the server request ratio, further improved by GroCOCA, but at the cost of consuming much more power than a conventional caching scheme.

The oldest study included in this review also performed clustering of user profiles and mobile peers by physical location, this time to improve information sharing and dissemination through data broadcast (Shek et al., 1999). User profiles with common interests are aggregated in order to optimize the available bandwidth. Conceptual clustering is performed at a central server, which takes into account tight thresholds to merge, split or create clusters. Evaluation was only performed in terms of assessing the sensitivity of these thresholds, using simulated scenarios, with no clear comparison or assessment of cluster quality. Nonetheless, the authors conclude that their strategy satisfies the unique information dissemination requirements of situation awareness and emergency response applications.

We should stress that, if only data processing is to be performed at local sites, clustering could then be targeted by established centralized clustering algorithms that have addressed distributed data as a centralized multivariate data stream (Beringer & Hüllermeier, 2006; Dai et al., 2006; Rodrigues et al., 2008d). Nevertheless, different scenarios yield different requirements. Thus, the distributed processing could directly influence the performance of centralized clustering, so research in this area is still valuable.

4.5.2.2 Distributed Clustering of Streaming Data Sources

The most challenging task is definitely to perform distributed clustering of ubiquitous streaming data sources, either by an ensemble of local clusters, or by truly distributing the clustering procedure. In this review, three studies were found that used such approaches.

Yin and Gaber (2008) proposed Distributed Single-pass Incremental Clustering (DSIC), a novel algorithm that makes use of network connectivity knowledge to define local clusterings which are then hierarchically agglomerated into higher-level clusters. This hierarchical feature is based on a physical hierarchy of nodes, previously defined as the basic infrastructure of the network. Time series data is first compressed using the Haar wavelet transform (Li et al., 2002) at each node, with the selected wavelet coefficients being sent to the physical cluster head. The cluster head then reconstructs the time series and performs clustering using Dynamic Time Warping (DTW) distances (Chan et al., 2003). Finally, the data clusters are merged across physical clusters until a gateway gathers the global model (Yin & Gaber, 2008). Hence, the system applies a central ensemble of local clusters in which, given the focus on the specific setting of sensor networks, the physical locality of data clusters enables the ensemble to work well in the target scenarios.
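As an illustration of the compression step (a generic Haar decomposition sketch, not the exact coefficient-selection scheme of DSIC; the series length is assumed to be a power of two):

```python
import numpy as np

def haar_compress(series, keep):
    """Full Haar decomposition of a 1-D series, keeping only the `keep`
    largest-magnitude coefficients and zeroing the rest, so that only a
    small parameter set needs to travel to the cluster head."""
    x = np.asarray(series, dtype=float)
    coeffs = []
    while len(x) > 1:
        coeffs.append((x[0::2] - x[1::2]) / 2.0)   # detail coefficients
        x = (x[0::2] + x[1::2]) / 2.0              # running averages
    coeffs.append(x)                                # overall average
    c = np.concatenate(coeffs)
    small = np.argsort(np.abs(c))[:-keep]           # all but the largest
    c[small] = 0.0
    return c
```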


The system is evaluated on synthetic and real data from sensor networks, compared to batch centralized and distributed approaches, focusing on cluster validity (silhouette), network communication and resources consumption (energy). The authors reported that the proposed model achieved silhouette values equivalent to the ones achieved by a raw centralized approach, with the distributed approach outperforming the centralized counterparts. Also, the raw centralized version is far less energy-efficient than all the other algorithms, because the raw time series data needs to be sent back to the gateway for performing global clustering. Using Haar wavelets in both centralized and distributed approaches improved the energy efficiency by only transmitting wavelet coefficients across the network. Furthermore, the two algorithms perform local clustering and, therefore, only cluster representatives need to be transmitted across the network. The authors conclude that the distributed algorithm can achieve much better clustering quality than the centralized versions, but at the cost of a slight increase in energy consumption. A nice property of the work is that most of the proposal was designed in order to comply with the requirements defined by Rodrigues et al. (2008c), acknowledging the benefits of defining clear requirements for such a task. However, the applicability to high-speed streaming data seems difficult to manage, given the techniques implied in the process.

More recently, Alaybeyoglu et al. (2010) used distributed clustering of sensor network nodes, based on the evolution of the readings of those nodes, in order to track objects in the range of the network, while, previously, George & Merugu (2005) had also used clustering as support for the major outcome, performing clustering of users in a web site to improve a collaborative filtering framework. While the latter performed a centralized ensemble of local clusters (see earlier sections), the former actually performed a distributed clustering of sensor nodes, using sensed data information to decide whether or not a node should be included in the clustering that is, at that time, monitoring the moving object. Node clusters (and their respective cluster leaders) are defined pro-actively from a predicted trajectory of the moving object, using a message-passing approach where the nodes cooperatively elect the cluster leader. Evaluation was performed on simulated scenarios, focused on trajectory error and missing ratio (percentage of clusters that missed the target), also comparing with a static clustering approach. The authors report that, while increasing with object speed, the average distance between predicted and real tracks was always less than 5%, with a worst-case missing rate of 10%, always outperforming generic and static clustering approaches by at least 1%. The authors then conclude that estimating the future movement of the track and waking up the clusters along this route enables the tracking of very fast objects (up to 100 m/s).

Given its specific and well-minded proposal, we believe that the work of (Yin & Gaber, 2008) is relevant for further research in this area. Nonetheless, more approaches are needed, especially ones which are network-topology oblivious, and therefore more generalizable to different ubiquitous scenarios.


4.6 Discussion Remarks

Given the quality of the resulting articles and the impact of those proposals in the research area, but bounded by the limitations of this review, such as the simple query strategy, with a single database and a single reviewer, some considerations must be presented at this point, especially regarding directions for further research.

4.6.1 Further Research Needed

Considering the articles that were included in this study, more studies are needed to fulfill certain aspects of research. Regarding clustering of ubiquitous streaming data points:



- there is a lack of clustering algorithms for heterogeneously distributed data streams; heterogeneous data sources are more and more common in real-world applications;
- there is a lack of studies about the impact of clustering procedures on mobile devices.

With respect to clustering of ubiquitous streaming data sources:



- there is a lack of clustering algorithms for homogeneously distributed data;
- researchers should focus on evaluating clustering per se, instead of considering clustering just as a tool for other outcomes; this way, better algorithms could be defined.

Generally, a deep systematic review comparing distributed stream clustering with centralized stream clustering should be undertaken by a research team, so that most of the bias introduced in this review could be avoided. Also, evaluating unsupervised learning, especially in ubiquitous streaming scenarios, is not straightforward; this way, theoretical studies on this issue are required. Furthermore, most of the articles are published in conference proceedings, so peer-reviewed journal publications are needed to give better support to these research areas.

4.6.2 References for Further Research

From the set of included studies, we point out as relevant papers for further reference:



- G. Cormode, S. Muthukrishnan, & W. Zhuang (2007). Conquering the divide: Continuous clustering of distributed data streams. IEEE ICDE 2007: 1036-1045
- R. Wolff, K. Bhaduri, & H. Kargupta (2009). A generic local algorithm for mining data streams in large distributed systems. IEEE TKDE, 21(4): 465-478
- A. Y. Zhou, F. Cao, Y. Yan, C. F. Sha, & X. F. He (2007). Distributed data stream clustering: A fast EM-based approach. IEEE ICDE 2007: 711-720
- S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, K. Liu, & S. Datta (2006). Clustering distributed data streams in peer-to-peer environments. Information Sciences, 176(14): 1952-1985
- P. P. Rodrigues, J. Gama, & L. Lopes (2008). Clustering distributed sensor data streams. ECML PKDD 2008: 282-297
- J. Yin & M. M. Gaber (2008). Clustering distributed time series in sensor networks. IEEE ICDM 2008: 678-687

As the reader can easily note, these studies were published in highly referenced peer-reviewed journals or in top international conferences, supporting the idea that researchers should target more selective forums, which improves the overall quality of research being published in the community.


5. Clustering Distributed Data Streams: The DGClust Algorithm

It is better to keep your mouth closed and let people think you are a fool than to open it and remove all doubt. - Mark Twain (1835-1910)

A common problem in sensor networks is the clustering of the data being produced by the network as a whole. In this chapter we focus on the problem of continuously maintaining a cluster structure over the data points generated by the entire network, where each sensor produces a univariate stream of data.

5.1 Chapter Overview

In this chapter, DGClust is proposed, a new clustering system which applies local discretization and centralized clustering of representative points. In the next section, motivation and aims are presented. Then, in Section 5.3, our method is presented, with relevant analysis of the overall processing. Section 5.4 focuses on the advantages presented by our proposal in terms of memory and communication resources, especially important in distributed sensor networks. Validation of the system and experimental results on real-world scenarios are presented in Section 5.5. Section 5.6 concludes the chapter, including foreseen future work.

5.2 Motivation and Aim

Nowadays, sensor network applications produce infinite streams of data. The main issues of the addressed setting are summarised in Figure 5.1. The main problem in applying clustering to data streams is that systems should consider data evolution, being able to compress old information and adapt to new concepts (Rodrigues & Gama, 2007). Usual techniques operate by forwarding and concentrating the entire data in a central server, processing it as a multivariate stream. Given the precise problems posed by this sensor network setting, we investigated whether local discretization and representative clustering could improve validity, communication and computation loads when applied to distributed sensor data streams.


Figure 5.1: Example of a mote sensor network and a plot of the data each univariate sensor is producing over time. The aim is to find dense regions of the data space. The problems yielded by this particular setting are also summarized.

5.3 DGClust - Distributed Grid Clustering

This chapter addresses the problem of continuously monitoring clustering structures of distributed sensor data. Clustering is known to be an NP-hard problem (Bern & Eppstein, 1996), so it is usually hard to get good results with light processing, while communication is one of the most resource-consuming procedures of sensor networks (Chan et al., 2005). The main intuition supporting our research is to reduce dimensionality, by monitoring and clustering only frequent states, and communication, by applying online sensor data discretization and controlled transmission to a central server.

5.3.1 System Overview

In this section we present DGClust, a distributed grid clustering system for sensor data streams, which operates with the following procedure. Each local sensor receives data from a given source, producing a univariate data stream, which is potentially infinite. Therefore, each sensor's data is processed locally, being incrementally discretized into a univariate adaptive grid. Each new data point triggers a cell in this grid, reflecting the current state of the data stream at the local site. Whenever a local site changes its state, that is, the triggered cell changes, the new state is communicated to a central site. Furthermore, the central site keeps the global state of the entire network, where each local site's state is the cell number of that site's grid.

Nowadays, sensor networks may include thousands of sensors. This scenario yields an exponential number of cell combinations to be monitored by the central site. However, it is expected that only a small number of these combinations are frequently triggered by the network, as observed in the example sketched in Figure 5.2. Thus, parallel to the aggregation, the central site keeps a small list of counters of the most frequent global states. Finally, the current clustering definition is maintained by a simple adaptive partitional clustering algorithm applied on the frequent states' central points. Figure 5.3 summarises the main approach and the outcome of each step of the algorithm.

5.3.2 Notation and Formal Setup

The goal of DGClust is to find k cluster centers of the data produced by a network of d local sensors. Let X = {X1, X2, ..., Xd} be the set of d univariate data streams, each of which is produced by one sensor in one local site. Each local site i keeps a two-layered discretization of the univariate stream, with pi intervals in the first layer and wi intervals in the second layer, where k < wi << pi.

5.3.3 Local Adaptive Grid

Algorithm 1 presents the update procedure of the two-layered grid kept at each local site i:

Algorithm 1 Local adaptive grid update at site i
while a new value x(t) arrives at local site i do
  let c be the layer-one cell triggered by x(t)
  countc <- countc + 1
  if countc > α (as a fraction of the total points seen) then
    split cell c, dividing countc evenly by the two new cells
    update second-layer intervals <b1, b2, ..., bwi>
    send <b1, b2, ..., bwi> to the central site
  end if
  let b be the layer-two cell triggered by c
  countb <- countb + 1
  send si <- <t, b> to the central site
end while
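To make this concrete, the following Python sketch mimics the local procedure above. It is a minimal illustration under stated assumptions, not the actual PiD implementation: the class name LocalGrid, the equal-width initialization of layer one, and the equal-frequency mapping into layer two are choices made here for clarity.

    import bisect

    class LocalGrid:
        # Minimal sketch of a two-layered local discretization (not the full PiD).
        # Layer one keeps a fine histogram; layer two maps it into w coarse
        # equal-frequency cells whose index is the state sent to the central site.

        def __init__(self, lo, hi, p=1000, w=10, alpha=0.05):
            self.edges = [lo + (hi - lo) * i / p for i in range(p + 1)]  # layer-one breaks
            self.counts = [0] * p                                        # layer-one counts
            self.n, self.w, self.alpha = 0, w, alpha

        def update(self, x):
            # Process one reading; return the triggered layer-two cell (local state).
            c = max(0, min(len(self.counts) - 1, bisect.bisect_right(self.edges, x) - 1))
            self.counts[c] += 1
            self.n += 1
            if self.counts[c] > self.alpha * self.n:   # split rule of Algorithm 1
                mid = (self.edges[c] + self.edges[c + 1]) / 2.0
                half = self.counts[c] // 2
                self.edges.insert(c + 1, mid)          # split cell c in two,
                self.counts[c] -= half                 # dividing its count evenly
                self.counts.insert(c + 1, half)
            cum_before = sum(self.counts[:c])          # equal-frequency layer two
            return min(int(cum_before * self.w / max(self.n, 1)), self.w - 1)

A local site would call update() on every reading and transmit the returned state only when it differs from the last state sent.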

5.3.4 Centralized Frequent State Monitoring

In this work we consider synchronous processing of sensor data. The global state is updated at each time stamp as a combination of each local site's state, where each value is the cell number of each local site's grid, s(t) = <s1(t), s2(t), ..., sd(t)>. If in that period no information arrives from a given local site i, the central site assumes that site i stays in the previous local state (si(t) <- si(t - 1)). Given that common sensor networks usually imply asynchronous communication, this problem could easily be coped with using a time frame during which the central server waits for data from the nodes. However, older values arriving after this time frame could only be considered for the next time frame, and only if this is globally acceptable in the domain being studied.
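As an illustration only, this synchronous aggregation can be written as below, assuming states arrive as a dictionary mapping site identifiers to the cell numbers received in the current time stamp, and that every site announces an initial state when joining the network (all names are hypothetical):

    def global_state(received, previous, sites):
        # Combine the per-site states of one time stamp; sites that sent
        # nothing this period are assumed to keep their previous local state.
        state = tuple(received.get(i, previous.get(i)) for i in sites)
        previous.update(received)  # remember the latest known state of each site
        return state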

A major issue with our setting is that the number |E| of cell combinations to be monitored by the central site is exponential in the number of sensors, |E| = O(w^d). However, only a small number of these combinations represent states which are frequently visited by the whole network. This way, the central site keeps a small list, F, of counters of the most frequent global states, whose central points will afterwards be used in the final clustering algorithm, with |F| = O(dk^β), for small β.

Each seen global state e ∈ E is a frequent element fr whose counter countr currently estimates that it is the r-th most frequent state. The system applies the Space-Saving algorithm (Metwally et al., 2005) to monitor only the top-m elements. Basically, if we observe a state s(t) that is monitored at rank r (fr = s(t)), we just increment its counter, countr. If s(t) is not monitored, we replace fm, the element that currently has the least estimated hits, countm, with s(t), and increment countm. For each monitored element fi, we keep track of its over-estimation, εi, resulting from the initialization of its counter when it was inserted into the list. That is, when starting to monitor fi, we set εi to the value of the evicted counter. An element fi is guaranteed to be among the top-m elements if its guaranteed number of hits, counti - εi, exceeds countm+1 (in the following referred to as counte, as it is implemented by storing the count of the last evicted state). Algorithm 2 presents the necessary procedure to keep the list of most frequent states.

The authors in (Metwally et al., 2005) report that, even if it is not possible to guarantee the top-m elements, the algorithm can guarantee the top-m' elements, with m' ≈ m. Hence, suitable values for m should be considered. Furthermore, due to errors in estimating the frequencies of the elements, the order of the elements in the data structure might not reflect their exact ranks. Thus, when performing clustering on the top-m elements, we should be careful not to directly weight each point by its rank. The goal of this strategy is to monitor the top-m states, using only the guaranteed top-m elements as points for the final clustering algorithm.


Algorithm 2 SpaceSavingUpdate(F, m, s(t)) - adapted from (Metwally et al., 2005)
Input: set F of frequent states; maximum number m of states to monitor; current global state s(t) = <s1(t), s2(t), ..., sd(t)>
Output: updated set F of most frequent global states; updated number of states l; updated count countr and over-estimation εr for the current state; count of the last evicted state counte for guarantee assessment
1: l <- |F| (the current number of monitored states)
2: r <- i : fi == s(t), 1 <= i <= l
3: if no such r exists then
4:   if l < m then
5:     fl+1 <- s(t)
6:     εl+1 <- countl+1 <- 0
7:     l <- r <- l + 1
8:   else
9:     counte <- εm <- countm
10:    fm <- s(t)
11:    r <- m
12:   end if
13: end if
14: countr <- countr + 1
15: update ranks, moving <fr, countr, εr> up while countr > countr-1
16: return <F, l, countr, εr, counte>

One important characteristic of this algorithm is that it tends to give more importance to recent examples, enhancing the adaptation of the system to data evolution. This is achieved by assigning to a new state entering the top-m list one plus the count of hits of the evicted state. Hence, even if this is the first time this state has been seen in the data, it will be at least as important to the system as the one being discarded.
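A compact Python rendering of this monitoring step is sketched below. It is a simplification of Algorithm 2, not the original implementation: a flat dictionary replaces the ranked counter list, so a linear search for the victim stands in for the rank bookkeeping of step 15.

    def space_saving_update(F, m, state, count_e):
        # One Space-Saving step. F maps each monitored state to a
        # [count, overestimation] pair; count_e is the count of the last
        # evicted state. Returns the updated count_e.
        if state in F:
            F[state][0] += 1
        elif len(F) < m:
            F[state] = [1, 0]                      # free slot: start a new counter
        else:
            victim = min(F, key=lambda s: F[s][0])  # least estimated hits
            count_e = F.pop(victim)[0]
            F[state] = [count_e + 1, count_e]      # inherit the evicted count
        return count_e

    def guaranteed_top_m(F, count_e):
        # States whose guaranteed hits (count minus overestimation) exceed count_e.
        return [s for s, (c, e) in F.items() if c - e > count_e]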

5.3.5 Centralized Online Clustering

The goal of DGClust is to find and continuously keep a cluster definition, reporting the k cluster centers. Each frequent state fi represents a multivariate point, defined by the central points of the corresponding unit cells si for each local site Xi. As soon as the central site has a top-m set of states, with m > k, a simple partitional algorithm can start, applied to the most frequent states.

In the general task of finding k centers given m points, there are two major objectives: minimize the radius (maximum distance between a point and its closest cluster center) or minimize the diameter (maximum distance between two points assigned to the same cluster) (Cormode et al., 2007). The Furthest Point algorithm (Gonzalez, 1985) gives a guaranteed 2-approximation for both the radius and diameter measures. It begins by picking an arbitrary point as the first center, c1, then finding the remaining centers ci iteratively as the point that maximizes its distance from the previously chosen centers {c1, ..., ci-1}. After k iterations, one can show that the chosen centers {c1, c2, ..., ck} represent a factor-2 approximation to the optimal clustering (Gonzalez, 1985); see (Cormode et al., 2007) for a proof. This strategy gives a good initialization of the cluster centers, computed by finding the center ki of each cluster after attracting the remaining points to the closest center ci. This algorithm is applied as soon as the system finds a set of m' > k guaranteed top-m states.
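The initialization can be sketched directly from this description; squared Euclidean distance and tuple-based points are illustrative choices:

    def furthest_point(points, k):
        # Gonzalez's Furthest Point heuristic: pick an arbitrary first center,
        # then repeatedly add the point furthest from all chosen centers.
        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        centers = [points[0]]                      # arbitrary first center
        for _ in range(1, k):
            nxt = max(points, key=lambda p: min(dist2(p, c) for c in centers))
            centers.append(nxt)
        return centers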

It is known that a single iteration is not enough to converge to the actual centers in simple k-means strategies. Hence we consider two different states in the overall system operation: converged and non-converged. At every new state s(t) that is gathered by the central site, if the system has not yet converged, it adapts the cluster centers using the m' guaranteed top-m states. If the system has already converged, two different scenarios may occur. If the current state is being monitored as one of the m' guaranteed top-m states, then the set of points actually used in the final clustering is the same, so the cluster centers remain the same and no update is performed. However, if the current state has just become guaranteed top-m, then the clusters may have changed, so we move into a non-converged state of the system, updating the cluster centers. Another scenario where the cluster centers require adaptation is when one or more local sites send their new grid intervals, which are used to define the central points of each state; in this case we also update the centers and move to the non-converged state. A different scenario is created when a new state enters the top-m, replacing the least frequent one. In this case, some of the previously guaranteed top-m states may lose their guarantee. However, if the list of frequent items is small (imposed by resource restrictions) this will happen very frequently, so we disregard these scenarios to prevent excessive computation when cluster centers have already converged. Future work will focus on these scenarios for concept drift and cluster evolution purposes. In scenarios where cluster center adaptation is needed, our system updates the clustering definition by applying a single iteration of point-to-cluster assignment and cluster center computation. This process (sketched in Figure 5.4) assures a smooth evolution of the cluster centers, while nevertheless adapting them to the most recent data, as old data points tend to become less frequent.
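A minimal sketch of one such adaptation step follows. Since the text omits ClusterCentersUpdate, the convergence test by center displacement (threshold tol) is an assumption made here for illustration.

    def cluster_centers_update(centers, points, tol=1e-6):
        # One point-to-cluster assignment plus center recomputation;
        # returns the new centers and whether they (approximately) converged.
        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        groups = [[] for _ in centers]
        for p in points:                           # assignment step
            i = min(range(len(centers)), key=lambda j: dist2(p, centers[j]))
            groups[i].append(p)
        new_centers = []
        for c, grp in zip(centers, groups):        # recomputation step
            if grp:
                new_centers.append(tuple(sum(col) / len(grp) for col in zip(*grp)))
            else:
                new_centers.append(c)              # keep empty clusters in place
        conv = all(dist2(a, b) < tol for a, b in zip(centers, new_centers))
        return new_centers, conv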

5.3.6 System Outcome

Figure 5.5 presents an example of a final grid, frequent cells and cluster centers for a specific case with d = 2 and k = 5, for different values of w and m. The flexibility of the system is exposed, as different parameter values yield different levels of results. Moreover, the continuous update keeps track of the most frequent cells, keeping the gathered centers within acceptable bounds. A good characteristic of the system is this ability to adapt to resource-restricted environments: system granularity can be defined given the resources available in the network's processing sites. Algorithm 3 presents the central adaptive procedure executed at the server site. The algorithm for ClusterCentersUpdate(K, F) is omitted for simplicity and space saving.

Figure 5.4: DGClust: Online partitional clustering of most frequent states. There are mainly two modes of operation: converged and non-converged. K-Means iterations are only performed in the non-converged mode, which is triggered by new guaranteed frequent states.

5.4 DGClust Algorithm Analysis

Although working altogether in the distributed stream paradigm, there are three different levels of processing that we should inspect in our proposal, as they may introduce both complexity and error. First, each univariate data stream is discretized, with only the discretized state being forwarded to the central site. At this point, the granularity of each sensor's grid will directly influence the error in that dimension. Since the construction of the second layer is directly restricted to the intervals defined in layer one, the final histograms will also be an approximation of the exact histograms that would be defined directly if all data was considered. Nevertheless, with this two-layer strategy the update of the final grid is straightforward.

Figure 5.5: DGClust: Example of final definition for 2 sensors' data, with 5 clusters. Each coordinate shows the actual grid for each sensor, with top-m frequent states (shaded cells), gathered (circles) and real (crosses) centers, run with: top left (w = 6, m = 20), top right (w = 24, m = 180), bottom left (w = 12, m = 60), and bottom right (w = 24, m = 180), presenting only guaranteed top-m.

Algorithm 3 CentralAdaptiveProcessing(L, k, m)
Input: list of local sites L = {l1, l2, ..., ld}; number of clusters k; and number of frequent states m
Output: set K of k cluster centers
1: F <- {} (set of frequent global states)
2: K <- {} (set of k cluster centers)
3: conv <- false (are the centers stable?)
4: for each timestamp t do
5:   for each site li in L do
6:     if si(t) has not been received then si(t) <- si(t - 1)
7:   end for
8:   s(t) <- <s1(t), s2(t), ..., sd(t)>
9:   F, l, counts(t), εs(t), counte <- SpaceSavingUpdate(F, m, s(t)) (Algorithm 2)
10:  if l > k and K = {} then
11:    K <- FurthestPoint(F, k) (as in (Gonzalez, 1985))
12:  else
13:    if not conv or counts(t) - εs(t) = counte + 1 then
14:      <K, conv> <- ClusterCentersUpdate(K, F)
15:    end if
16:  end if
17: end for
18: return K

The layer-two intervals just need to be recomputed when the split operator in layer one is triggered. Moreover, the number of intervals in the second layer can be adjusted individually for each sensor, in order to address different needs of data granularity and resource requirements, usually present in current real-world applications (Rodrigues et al., 2008c).

In this proposal we address univariate sensor readings. The data stream model we consider in sensor networks assumes that a sensor value represents its state in a given moment in time. If the readings of a local sensor fall consecutively in the same layer-two interval, no new information would be given to the central site. Thus, local sites only centralize information when a new value triggers an interval different from the one previously sent to the central server. The central site only monitors the top-m most frequent global states, reducing the dimensionality of the clustering problem, and disregarding infrequent states which could influence the final clusters. Finally, the system performs partitional clustering over the m' guaranteed top-m frequent states, which are a sample of the actual states, biased to dense cells. Moreover, although the furthest point algorithm may give guarantees on the initial centers for the clustering of the frequent states, the adaptive update is biased towards small changes in the concept generating the data streams.

5.4.1 Time and Space

Each sensor Xi produces a univariate adaptive grid. This process uses the PiD algorithm which, after the initial definition of the two layers based on ni examples, in O(ni log pi) time and O(pi) space, is continuously updated in O(log pi) time and (almost) constant space. Since this is done in parallel across the network, the time complexity of the discretization of one example in the entire network is O(log p), where p = max(pi), for all i in {1, 2, ..., d}.

The central site aggregates the state of the d local sites. The focus is on monitoring the top-m frequent global states, which are kept in O(md) space (the actual m frequent states) and continuously updated in O(md) time (linear search for the current state). The initial clustering of frequent states, and its subsequent adaptation, is made in O(kmd) time.

5.4.2 Communication

Data communication occurs only in one direction, between the local sites and the central site. All queries are answered by the central server. Also, this communication does not include sending the original data, rather informing the central server of the current discrete state of the univariate stream of each local site. This feature of only communicating the state when and if it has changed reduces the network's communication requirements. The main reason for this is that, in usual sensor data, sensor readings tend to be highly correlated with the previously read value (Rodrigues & Gama, 2009), hence tending to stay in the same discretized state. In the worst-case scenario, where all d local sites need to communicate their state to the central site, the system processes d messages of one discrete value. However, every time a local site Xi changes its univariate grid, the central site must be informed of the change so that it can correctly compute the points used to find the cluster centers, which implies sending wi values. In the worst-case scenario, the central site may have to receive O(wd) data, where w = max(wi).

5.5 DGClust Experimental Evaluation

Evaluation of streaming algorithms is a hot topic in research, as no standard exists for evaluating models over streams of data (Gama et al., 2009a), hence the difficulty to define exact evaluation processes. Nevertheless, we conducted a series of experiments to assess the quality of our proposal. On one hand, we evaluated the system on synthetic data with Gaussian clusters. On the other hand, we tested the system on real-world physiological sensor data streams.


All synthetic scenarios were generated by applying the data generator used in (Domingos & Hulten, 2001), considering each dimension separated across sensors, in univariate streams. The global view of the network scenarios is created by mixtures of k spherical Gaussians, with means µi in the unit hypercube. Each scenario was generated according to three parameters: the dimensionality d of the network, the number of mixture components k, and the standard deviation σ of each sensor stream in each component. Each scenario (d, k, σ) is created with 100000 examples. Figure 5.5 showed an example of the final grid obtained for d = 2 and different parameter values.

Given the scope of this validation, the system's quality is measured by assigning each of the found cluster centers to the closest real cluster center, using a greedy strategy. The loss of the chosen assignment is then given by

L_K = \sum_{i=1}^{k} \sum_{j=1}^{d} (\hat{c}_{ij} - c_{ij})^2    (5.1)

where \hat{c}_{ij} and c_{ij} are the gathered and real values, respectively, for center i in dimension j.
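This evaluation can be sketched as follows, assuming found and real centers come as equal-length lists of d-dimensional tuples; matching cheapest pairs first is one plausible reading of the greedy strategy, which the text does not detail further:

    def greedy_loss(found, real):
        # Greedily pair each found center with the closest unused real center
        # and accumulate the squared loss of Equation 5.1.
        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        pairs = sorted((dist2(f, r), i, j)
                       for i, f in enumerate(found) for j, r in enumerate(real))
        used_f, used_r, loss = set(), set(), 0.0
        for d2, i, j in pairs:
            if i not in used_f and j not in used_r:
                used_f.add(i)
                used_r.add(j)
                loss += d2
        return loss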

Besides loss, evaluation in the following sections is also done with respect to two other main outcomes: the number of values communicated to the server, to assess benefits in communication ratios; and the number of clustering updates performed by the central server, to assess the benefits of the system in terms of computation reduction.

Studied parameters are the granularity of the univariate adaptive grid, wi, for all i in {1, 2, ..., d}, and the number of frequent states to monitor, m. We fixed pi = 1000, for all i in {1, 2, ..., d}. Table 5.1 presents the studied parameters and the domain in which each of them was studied.

5.5.1 Scalability and Parameter Sensitivity

The evaluation of stream learning algorithms is still an open issue (Gama et al., 2009b). In this work, we use the analogy of research questions to clarify the different evaluation settings. This way, the first evaluation of DGClust tries to answer the research question: Does DGClust scale in terms of cluster validity, communication and computation loads, when applied to continuous synthetic data streams generated with Gaussian clusters?

For a first evaluation, we have created a set of scenarios based on k = 3 clusters, ranging dimensionality in d ∈ {2, 4, 8, 16, 32, 64, 128}, in order to inspect the impact of the number of sensors on the quality of the results. For each scenario, 10 different datasets were created,


Table 5.1: DGClust Evaluation: Parameter description and corresponding values considered in each section of the experimental evaluation: experiments with synthetic data for evaluation and sensitivity analysis; study on possible parameter dependencies; and real data from physiological sensors.

Parameter                                              | Sensitivity Analysis                   | Parameter Dependencies | PDMC Data
Scenario
d - number of dimensions/sensors                       | d ∈ {2, 4, 8, 16, 32, 64, 128}         | d ∈ {2, 3, 4, 5}       | -
σ - standard deviation of data distribution            | σ = .05                                | σ = .1                 | -
k - number of clusters                                 | k = 3                                  | k ∈ {3, 4, 5}          | k ∈ {2, 3, 4, 5}
Local Grid Granularity
p - number of bins in each sensor's first layer        | p = 1000                               | p = 1000               | p = 1000
α - threshold for splitting each sensor's first layer  | α = .05                                | α = .05                | α = .05
w - number of bins in each sensor's second layer       | w ∈ {5, 7, 9, 11, 13, 15, 17, 19, 21}  | w = ωk + 1             | w = ωk + 1
ω - how much should w be influenced by k?              | -                                      | ω ∈ {1, 2, 3, 4, 5}    | ω ∈ {2, 4, 6, 8, 10}
Frequent States Monitoring
m - number of frequent states to monitor               | m ∈ {6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45} | m = φwd or m = dw²/γ | m = dw²/γ
φ - how much should m be influenced by d and w?        | -                                      | φ ∈ {1, 2, 3, 4, 5}    | -
γ - how much should m be influenced by d and w²?       | -                                      | γ ∈ {10, 8, 6}         | γ ∈ {10, 8, 6, 4, 2}

using σ = .05, and the system's sensitivity to parameters was evaluated: the univariate grid granularity ranging in w ∈ {5, 7, 9, 11, 13, 15, 17, 19, 21} and the number of states to monitor ranging in m ∈ {6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45}. The main rationale behind the choice of values for these parameters is that we are addressing a scenario with k = 3 clusters. This way, it seems empirically right to have an odd number of cells in each sensor's grid, and a multiple of 3 states to monitor (increasing the average number of states monitored per cluster).

The complete set of plots presenting the results for all combinations of parameters is included in Addendum A. There, the reader can find, for each scenario and combination of parameters, an estimate of the mean value for each outcome, with the corresponding 95% confidence interval. Here, we shall present an overview discussion of those results, and focus on some precise results which are relevant to the presentation, such as inspecting the quality of the system with respect to loss (to real centroids), communication (from sensors to the server) and processing (of clustering updates).


5.5.1.1 Loss to Real Centroids

Figure 5.6 presents a plot of the impact of granularity on loss, which seems evident. If we increase w within low values (slightly above k) there is a reduction in loss. However, this benefit fades for higher values of w, indicating that there is a point where increasing w no longer improves the final result. The same observation can be made for the number of states to monitor (m), as increasing it above a given value revealed small (or no) improvement in the resulting structure.

To evaluate the sensitivity of the system to the number of sensors, we analysed the average result for a given value of granularity (w), averaged over all values of m (as loss seemed to be only lightly dependent on this factor). In Figure 5.6 we plot these averaged results on a dimensionality scale, with an exponential increase of the number of sensors. We note a slight increase in loss as the number of sensors increases (top plot). However, the bottom plot indicates that this increase is due to the additive form of the loss function (Equation 5.1), as normalizing the loss by the number of sensors produced results with no clear trend.

5.5.1.2 Communication to the Central Server

To evaluate the communication requirements, we monitor the amount of transmitted values as a percentage of the total number of examples times the total number of sensors. We should stress that, since data is generated at random from Gaussian distributions, the amount of state switching should be much higher than what could be expected from real sensors, which usually produce highly auto-correlated data. For this reason, the reduction in communication might not be as strong as expected. Nonetheless, the transmission of values is highly dependent on the granularity, since as more states are possible at each sensor, more state switching should occur, hence more transmission is required. On the other hand, this outcome does not depend on the number of states to monitor at the central server, as it is only defined by the number of times a sensor sends data to the central server. We can also note that with a higher number of sensors the confidence interval for the mean communication narrows, revealing an interesting effect: it is easier to estimate communication in wider networks.

To evaluate the sensitivity of the system to the number of sensors, we analysed the average result for a given value of granularity (w), averaged over all values of m (as communication is not dependent on this factor). On the top plot of Figure 5.7 we present these averaged results on the same dimensionality scale, with an exponential increase of the number of sensors. The most relevant fact that arises from these results is that the amount of communication does not depend on the number of sensors. This way, the benefits of reduced transmission rates are extensible to wide sensor networks.


Figure 5.6: DGClust Evaluation: Impact of the number of sensors (from d = 2 to d = 128) on loss, for each granularity (from w = 5 to w = 21), averaged over all values of m (from m = 6 to m = 45). The bottom plot presents the same results on a normalized scale (with respect to the number of sensors).

5.5.1.3 Cluster Centroids Computation Load

Processing clustering updates is probably the most demanding task of the system. Clustering is known to be an NP-hard problem, so it is usually hard to get good results with light processing. To assess the impact of parameters on the processing requirements of the system, we monitor the number of times a new example forced an update of the clustering structure.

Results in this matter were challenging. First, we should stress that the number of clustering updates is quite low compared to the total number of data points. Then, with respect to granularity, we could note that there is a clear dependence of the outcome on this parameter. However, it is not the same across different dimensionalities. It seems that processing requirements depend on a combination of w and d altogether (which makes sense, since these are the two factors defining the search domain). A more expected result relates to the number of states to monitor, as we note an increase in processing requirements with the increase of m, supporting the notion that a higher number of states being monitored increases the number of less frequent states being included in the clustering process, hence increasing the number of required updates.

To evaluate the sensitivity of the system to the number of sensors, we analysed the average result for a given value of granularity (w), averaged over all values of m. On the bottom plot of Figure 5.7 we present these averaged results on the same dimensionality scale, with an exponential increase of the number of sensors. As expected, an increase in the number of sensors widened the domain of the problem, hence requiring more clustering updates. Nevertheless, we should point out that the amount of updates is kept extremely low compared to the total number of examples processed by the system.

5.5.1.4 Parameter Sensitivity

From the exposed results, we can stress that increasing the granularity (w) will only produce better results until a given limit is reached, possibly depending on the number of clusters to find (k), above which no further benefits will be drawn. Also, an unbounded increase in the number of states to monitor (m) will not, per se, yield better results, hence supporting our initial thought that only a small number of states are actually frequent and relevant for the clustering process. A problem that arises from this is that the number of states the system should monitor seems to depend on both the dimensionality (d) and the granularity (w).


Figure 5.7: DGClust Evaluation: Impact of the number of sensors (from d = 2 to d = 128) on communication (top plot, in % of examples × sensors) and processing clustering updates (bottom plot, in % of examples), for each granularity (from w = 5 to w = 21), averaged over all values of m (from m = 6 to m = 45).

5.5.2 Inter-Parameter Dependencies

From previous results, it seemed clear that, as granularity should depend on the number of clusters to find, the number of states to monitor should also depend on the granularity and the dimensionality of the problem. We look for a good relation between the scenario (k and d) and the parameters (w and m). Our assumption is that parameters should follow w ∈ O(k) and m ∈ O(dk^β), for small β (possibly β = 1). To study this possibility, we define two factors ω and φ, where w = ωk + 1 (allowing an extra cell which will mostly keep outliers) and m = φwd. As first layers in the univariate grids have size >> 20, we set α = 0.05, stating that a first-layer cell of the univariate grid will be split if it contains more than 5% of the total number of points. The initial range is set to [0, 1]. We set d ∈ {2, 3, 4, 5} and σ = 0.1, varying k ∈ {3, 4, 5}, ω ∈ {1, 2, 3, 4, 5} and φ ∈ {1, 2, 3, 4, 5}. Each scenario was evaluated with results averaged over 10 datasets. All wi are set with the same value w = ωk + 1. We vary k within small values to inspect the ability of the system to find well-separable clusters.

After aggregating all experiments, we computed Pearson's correlation between the parameters (ω and φ) and the resulting loss. The ω parameter reported (as expected) a negative correlation with the loss (ρ = -0.7524), as better granularity diminishes the error implied by performing clustering on a grid instead of the real values. However, higher granularity also implies higher values for m, so a compromise should be found to optimize computational costs. After running some empirical tests (which will be the subject of thorough evaluation in the future), we found that ω should be larger than 1, in order to allow the existence of infrequent cells between frequent ones, hence improving separability.

For the φ parameter, the study reported a positive correlation with loss (ρ = 0.3832), proving that growing m above a given value will not, by itself, increase the quality of the model. Although this might go against empirical intuition, the higher the m, the higher the probability of including infrequent states in the clustering algorithm (because these start to be considered guaranteed top-m). This way, we decided to try a different approach, considering an upper bound on m as m = dw². After some simple testing of admissible values for the parameter, m is then defined as m = dw²/γ, with γ ∈ {10, 8, 6}.

Figures 5.8 and 5.9 present the results gathered for different scenarios, comparing with a simple centralized online k-means strategy, to which we refer as Continuous K-Means, which is a simplification of the STREAM algorithm (O'Callaghan et al., 2002). This strategy performs a K-Means at each chunk of examples, keeping only the centers gathered with the last chunk of data, weighted by the amount of points that were assigned to each center.
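Under our reading of that description, the baseline can be sketched as follows; the seeding, chunking and the number of Lloyd iterations are illustrative assumptions:

    def continuous_kmeans(chunks, k, iters=5):
        # Simplified STREAM-like baseline: at each chunk, run weighted k-means
        # over the new points plus the previous centers, each carried over with
        # the weight of the points it had absorbed. Assumes len(chunk) >= k.
        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        centers, weights = [], []
        for chunk in chunks:
            pts = [(tuple(p), 1.0) for p in chunk] + list(zip(centers, weights))
            if not centers:
                centers = [tuple(p) for p in chunk[:k]]    # seed on the first chunk
            for _ in range(iters):                         # weighted Lloyd steps
                sums = [[0.0] * len(centers[0]) for _ in range(k)]
                wsum = [0.0] * k
                for p, w in pts:
                    i = min(range(k), key=lambda c: dist2(p, centers[c]))
                    wsum[i] += w
                    for j, x in enumerate(p):
                        sums[i][j] += w * x
                centers = [tuple(s / wsum[i] for s in sums[i]) if wsum[i]
                           else centers[i] for i in range(k)]
                weights = wsum
        return centers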

Figure 5.8: DGClust Evaluation: Averaged loss over ω ∈ {2, 4} and γ ∈ {6, 8, 10}, and 10 different data sets for each combination of parameters and scenarios, for DGClust and Continuous K-Means.

Figure 5.9: DGClust Evaluation: Average loss over k ∈ {2, 3, 4, 5, 10} for each fixed parameter, over 10 different datasets.

Figure 5.10: DGClust Evaluation: Average loss comparison between DGClust and Continuous K-Means (plotted as sign(diff) · log(1 + |diff|)) for different dimensions d and number of clusters k.

Once again we note how hard it becomes to define a clear relation between w, d and m, although for higher values of k and d we could see some possible progressive paths towards an improvement in the competitiveness of our proposal, especially expressed in Figure 5.10. However, the identification of general parameters is always debatable. Although we plan to do more exhaustive sensitivity tests, in order to achieve at least acceptable ranges for the parameters, we should stress that the flexibility included in the system allows for better deployment in sensor networks and resource-restricted environments.

5.5.3 Application to Physiological Sensor Data Streams

This evaluation of DGClust tries to answer the research question: Does DGClust outperform Continuous K-Means in terms of cluster validity, communication and computation loads, when applied to data streams from physiological sensors?


5.5.3.1 Sample Characteristics

The Physiological Data Modeling Contest Workshop (PDMC) was held at ICML 2004 and aimed at information extraction from streaming sensor data. The training data set for the competition consists of about 10,000 hours of this data, containing several variables: userID, sessionID, sessionTime, characteristic[1..2], annotation, gender and sensor[1..9]. We have concentrated on sensors 2 to 9, extracted by userID, resulting in several experimental scenarios of eight sensors, one scenario per userID.

5.5.3.2 Evaluation Strategy

For each scenario, we run the system with different values for the parameters, and compare the results both with the Continuous K-Means and a full-data K-Means, the latter serving as the real centers definition. Since different sensors produce readings in different scales, we inspect the distribution of each sensor on an initial chunk of data, defining the initial ranges to percentiles 25% and 75%. This process is acceptable in the sensor networks framework, as expert knowledge about the range is usually available. Hence, we are also allowing the system to adapt the local grids accordingly. The system ran with k ∈ {2, 3, 4, 5}, ω ∈ {2, 4, 6, 8, 10} and γ ∈ {10, 8, 6, 4, 2}.

5.5.3.3 Results

Figure 5.11 presents performance results in terms of loss. Beyond loss, the amount of communication and cluster adaptation in each run was also monitored; Figure 5.12 presents the resulting performance statistics. Given the characteristics of sensor data, subsequent readings tend to stay in the same interval. Hence the advantage of local discretization: the system transmits only around 30% of the total amount of values, including transmissions of recalculated second-layer intervals. Also, we should note that only a small part of the points require cluster center adaptation (less than 10%). Overall, we should stress that, given the flexibility of the system, a suitable combination of parameters can yield better results than centralizing the data, while preventing excessive communication in the sensor network.

5.6 Remarks and Future Work

In this chapter, we have presented DGClust, a new distributed grid clustering algorithm for data produced on wide sensor networks. Its core is based on online discretization of data, frequent state monitoring, and online partitional clustering. These techniques jointly work towards a reduction of both the dimensionality and the communication burdens. Figure 5.13 summarises the main results obtained by applying this strategy, from which we focus on the communication gains and the ability to process and cluster distributed sensor data from a streaming point of view. Experiments are presented in terms of sensitivity tests and application to a real-world data set, from which some advantages of this approach could be exploited.

Figure 5.11: DGClust Evaluation: Performance in terms of loss, with k ∈ {2, 3, 4, 5} and, for each k, ω ∈ {2, 4, 6, 8, 10} and γ ∈ {10, 8, 6, 4, 2}. The circles refer to the loss achieved when a centralized online clustering algorithm is applied to the entire data.

Figure 5.12: DGClust Evaluation: Performance in terms of communicated values (% of total examples × dimensions), evicted states (% of total examples), cluster center adaptations (% of total examples) and number of guaranteed top-m (% of total m). Presented values are for k ∈ {2, 3, 4, 5} and, for each k, averaged over ω ∈ {2, 4, 6, 8, 10} and γ ∈ {10, 8, 6, 4, 2}. The dispersion of results for each k is represented by the standard deviation.

Figure 5.13: DGClust: Main results achieved by applying this strategy.

In Chapter 2 we have already presented an exposition of related areas of research which are relevant for the problem at stake in this chapter: data stream discretization, clustering data streams, and monitoring frequent items from data streams. Regarding distributed clustering, Chapter 4 presented some algorithms that could be applied to the problem of clustering distributed sensor data streams. Nevertheless, we believe that the benefits of our approach put it at a higher level of applicability. Discussion on sensor network comprehension is presented later in Chapter 7, analyzing the connections with clustering streaming data sources.

Current work is concentrated on determining acceptable ranges for the parameters of the system and application to more real-world data. Future work will focus on techniques to monitor the evolution of the clusters, taking advantage of the adaptive update of clusters already implemented in the system. Also, we are preparing the deployment of the system in real wireless sensor networks, in order to better assess the sensitivity and advantages of the system with respect to restricted resource requirements. Furthermore, the use of fading histograms in each node might improve the adaptive features of the system.

6. Distributed Clustering of Streaming Data Sources: Requirements and the L2GClust Algorithm

The human mind treats a new idea the same way the body treats a strange protein; it rejects it. - P. B. Medawar, British (Brazilian-born) anatomist (1915-1987)

Clustering streaming data sources is the task of clustering different sources of data streams, based on the data series similarity. Most of the work in incremental clustering of data streams has been largely concentrated on clustering the examples rather than the sources. However, the data stream paradigm imposes that clustering of streaming data sources should also be addressed as an online procedure, not only due to the dynamics inherent to streams but also because the relations between them can change over time. Moreover, centralized clustering strategies tend to be inapplicable, as usual techniques have quadratic complexity on the number of sources, and several systems include a high number of distributed sources which may grow unbounded.

6.1 Chapter Overview

In this chapter, the task of distributed clustering of streaming data sources is exposed. In the next section, motivation and aims are presented, while Section 6.3 presents definitions and requirements for centralized approaches to the problem. Then, in Section 6.4, a set of definitions and requirements for distributed clustering of streaming data sources is discussed. Section 6.5 presents a new time-biased window model, the α-fading window, which is later used for memoryless maintenance of moving averages. Section 6.6 presents the L2GClust algorithm, a first local approach to global clustering in ubiquitous settings, while Section 6.7 includes a validation of the system with experimental results on synthetic data with Gaussian clusters. Finally, Section 6.8 presents a quick discussion on the advantages extracted from the analysis of empirical results, and concludes the chapter with the main findings and recommendations for future work.


6.2 Motivation and Aim

Nowadays, applications include distributed data sources which produce infinite streams of data at high speed. This ubiquitous scenario raises several obstacles to the usual knowledge discovery workflow, enforcing the need to develop new techniques, with different conceptualizations and adaptive decision-making, while being subject to the same interactions required by previous static and centralized applications. The aim of this exposition is to clarify why and support how clustering of streaming data sources should be addressed. In this scope, we try to discover whether a fully distributed clustering algorithm can achieve the global clustering definition of the entire network without a centralized server. To this extent, we aim at formalizing the extra requirements that should be taken into account in the distributed setting, and their implications for future developments. Other goals are to propose a memoryless window model for keeping a data stream sketch on distributed nodes, and a local clustering strategy to approximate the global clustering of the entire network.

6.3 Centralized Clustering of Streaming Data Sources

Clustering streaming data sources is the task of clustering different sources of data streams, based on the data series similarity. Most works on clustering analysis for distributed sources (e.g. sensor networks) actually concentrate on clustering the sources by their geographical position (Chan et al., 2005) and connectivity, mainly for power management (Younis & Fahmy, 2004) and network routing purposes (Ibriq & Mahgoub, 2004). However, in this topic, we are interested in clustering techniques using the data produced by the sources, instead. The motivation for this is all around us. As networks and communications spread out, so does the distribution of novel and advanced measuring devices. The networks created by current settings can easily include thousands of sources, each one being capable of measuring, analyzing and transmitting data. From another point of view, given the evolution of hardware components, these sources act now as fast data generators, producing information in a streaming environment.

116

Clustering streaming data sources is an emerging area of research that is closely connected to two other fields: clustering of time series, for its application in the variable domain; and clustering of data streams, for its application to data flowing from high-speed streams. Moreover, clustering examples over time presents adaptivity issues that are also required when clustering streaming series. Evolutionary clustering tries to optimize these techniques (Chakrabarti et al., 2006). However, the need to detect and track changes in clusters is not enough; it is also often required to provide some information about the nature of changes (Spiliopoulou et al., 2006).

6.3.1 Requirements for Clustering Streaming Data Sources

The basic requirements usually defended for clustering data streams are that the system must possess a compact representation of clusters, must process data in a fast and incremental way, and should clearly identify changes in the clustering structure (Barbará, 2002). Clustering streaming data sources shares the same constraints and, therefore, the same requirements. However, there are some conceptual differences when addressing multiple streaming sources. Systems that aim to cluster streaming data sources should (Rodrigues et al., 2008c):

- process with constant update time and memory;
- enable an anytime compact representation;
- include techniques for structural drift detection;
- enable the incorporation of new relevant sources;
- operate with adaptable configuration.

The next sections try to explain to what extent these features are required to efficiently cluster streaming data sources.

6.3.1.1 Constant Update Time and Memory

Given the usual dimensionality of data streams, an exponential or even linear growth in the number of computations with the number of examples would make the system lose its ability to cope with streaming data. Therefore, systems developed to address data streams must always process with constant update time. A perfect setting would be to have a system becoming faster with new examples. Moreover, memory requirements should never depend on the number of examples, as these are tendentiously infinite in number. From another point of view, when applying clustering to the streaming data sources, a system could never be supported on total knowledge of available data. Since data is always evolving and multiple passes over it are impossible, all computations should be incrementally conceived. Thus, information is updated continuously, with no increase in memory, and this update requires low time consumption.

6.3.1.2 Anytime Compact Representation

Data streams reveal an issue that imposes the definition of a compact representation of the data used to perform the clustering: it is impossible to store all previously seen data, even considering clipping the streams (Bagnall & Janacek, 2005). In data clustering, a usual compact representation of clusters is either the mean or the medoid of the elements associated with that cluster. This way, only a few examples are required to be stored in order to perform comparisons with new data. However, clustering streaming data sources is not about comparing new data with old data, but determining and monitoring relations between the data sources. Hence, a compact representation must focus on sufficient statistics, used to compute the measures of similarity between the streams, that can be incrementally updated at each new example arrival.
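For instance, a correlation-based dissimilarity between two streams can be maintained from five sums and a counter, updated in constant time per example and without ever storing the series. A minimal sketch follows; the final rnomc-like distance mirrors the mapping used by ODAC, while the class layout and names are illustrative:

    class StreamPair:
        # Sufficient statistics for the Pearson correlation of two streams:
        # five sums and a counter, updated in O(1) per example.
        def __init__(self):
            self.n = self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

        def update(self, x, y):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.syy += y * y
            self.sxy += x * y

        def correlation(self):
            cov = self.sxy - self.sx * self.sy / self.n
            vx = self.sxx - self.sx ** 2 / self.n
            vy = self.syy - self.sy ** 2 / self.n
            return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

        def dissimilarity(self):
            # maps correlation 1 -> 0 and correlation -1 -> 1
            return ((1.0 - self.correlation()) / 2.0) ** 0.5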

6.3.1.3 Structural Drift Detection

Streams present inherent dynamics in the flow of data that are usually not considered in the context of usual data mining. The distribution generating the examples of each stream may (and in fact often does) change over time. Thus, new approaches are needed to consider this possibility of change, and new methods have been proposed to deal with variable concept drift. However, detecting concept drift as usually conceived for one variable is not the same as detecting concept drift on the clustering structure of several streaming sources (Rodrigues et al., 2008d). Structural drift is a point in the stream of data where the clustering structure gathered with previous data is no longer valid, since it no longer represents the new relations of proximity and dissimilarity between the data sources. Systems that aim at clustering streaming data sources should always include methods to detect (and adapt to) these changes in order to maintain an up-to-date definition of the clustering structure through time.

6.3.1.4 Incorporate New Relevant Streams

In current data streams, the number of data sources and the number of interesting correlations can be large. However, almost all data mining approaches, especially those dealing with streaming data, consider incoming data with fixed width, that is, only the number of observations increases with time. Current problems include an extra difficulty, as new sources may be added to the system through time. Given the nature of the task here at hand, a clear process of incorporating new sources into a running process must be available, so that the usual growth in data sources is accepted by the clustering system. Likewise, as data sources arise from all sorts of applications, their importance also fades out as dissemination and redundancy increase, becoming practically irrelevant to the clustering process. A clear identification of such irrelevant streams should also increase the quality of the dissimilarities computed within each cluster.

6.3.1.5 Adaptable Configuration

From the previous requirements, it becomes obvious that the clustering structure and, even more, the number of clusters in the universe of the problem may change over time. This way, approaches with a fixed number of target clusters, though still useful in several problems, should be considered only in that precise scope. In general, approaches with an adaptable number of target clusters should be favored for the task of clustering streaming data sources. Moreover, hierarchical approaches present even more advantages, as they inherently conceive a hierarchical relation of sub-clusters, which can be useful to locally detect changes in the structure of clusters.

6.3.2 Compliance of Existing Approaches

The problem of clustering streaming data sources, assuming data is gathered by a centralized process while it becomes available for online analysis, has already been targeted by recent research. Rather than an exhaustive review, we give a quick overview of some of the recent approaches to the problem, assessing their compliance with the enunciated requirements. The considered algorithms were presented in Section 2.4.3 and include Online KM (Beringer & Hüllermeier, 2006), COD (Dai et al., 2006) and ODAC (Rodrigues et al., 2008d). The main characteristics of these systems, and their compliance with the previously defined requirements, are sketched in Table 6.1. Although complying with most of the requirements for clustering streaming data sources, the previously proposed approaches to the problem assume data is gathered by a centralized process before it is available for analysis. However, in the real world this is often not the case. Data is produced and processed by distributed data sources. In the next section we explore the new features of the ubiquitous setting where, rather than performing centralized streaming analysis, data must be considered spread across the network of data sources, enabling and even compelling the use of distributed procedures.

Table 6.1: Compliance of existing centralized algorithms for clustering of streaming data sources with the enunciated requirements.

                            Online KM (1)   COD (2)       ODAC (3)
    Data Representation     DFT             Wavelet       Dissimilarities
    Model Generation        K-Means         On demand     Hierarchical
    Constant Time/Memory    Yes             Yes           Yes
    Anytime Representation  Yes             Data only     Yes
    Structural Drift        Fuzzy           Human         Local
    New/Relevant            No              No            No
    Adaptable               Stepwise        Offline       Hierarchical

    (1) Beringer & Hüllermeier (2006); (2) Dai et al. (2006); (3) Rodrigues et al. (2008d)

6.4 Distributed Clustering of Streaming Data Sources

Clustering streaming data sources has already been targeted by researchers, in order to cope with the potentially infinite amount of data produced at high speed. However, if this data is produced by distributed data sources, the proposed algorithms tend to deal with it as a centralized multivariate stream. They are designed as a single process of analysis, without taking into account the locality of the data produced by sources in a distributed scenario, the transmission and processing resources of the network, and the breach in the transmitted data quality (Rodrigues et al., 2008c). All of these issues are usually motivated by the energy efficiency demands of both the network and the actual data sources. Moreover, these algorithms tend to be designed as a single process of analysis without the necessary attention to the distributed setting, already addressed in some example clustering systems (Cormode et al., 2007), which creates high levels of data storage, processing and communication. Distributed data mining appears to have the necessary features to apply clustering to streaming data produced in distributed scenarios (Park & Kargupta, 2002). However, distributed implementations of well-known algorithms may produce both valuable and impractical systems, so the path towards them should be carefully inspected.

Figure 6.1: Example of a mote sensor network (left), with links of possible transmission represented by straight lines and physical subnetworks (represented by IDs) separated by dashed lines, and a possible clustering definition of the series produced by each sensor (right). This illustrative example shows the orthogonality expected to exist between the network topology and the sensors' data clustering structure.

6.4.1 Issues in Distributed Settings

Considering the main restrictions of distributed settings, let us focus on the example of a sensor network, where the analysis of clusters of multiple sensor streams should comply not only with the requirements for clustering streaming data sources but also with the available resources and setting of the corresponding sensor network. If a distributed algorithm for clustering streaming data sources is to be integrated into each sensor, how can local nodes process data, and how can the network interact, in order to cluster similar behaviors produced by sensors far from each other, without a fully centralized monitoring process? If communication is required, how should it be done in order to avoid the problems of data communication in sensor networks, which are prone to implosion and overlap? For example, a network of wireless integrated network sensors (WINS) has to support large numbers of sensors in a local area, with short-range and low average bit-rate communication (Pottie & Kaiser, 2000). Moreover, what is the relationship between sensor data and the geographical location of sensors? Common sensor network data aggregation techniques are based on the Euclidean distance (physical proximity) of sensors to perform summaries over a given neighborhood (Chan et al., 2005). However, the clustering structure of the series of data produced by the sensors is orthogonal to the physical topology of the network, as stressed in the example presented in Figure 6.1. These and other questions should be considered in the development of new techniques to efficiently and effectively perform clustering of distributed streaming data sources, as massive sensor networks produce high levels of data processing and transmission, reducing not only the ability to feed the information back to the system in useful time, but also the uptime of the sensors themselves, due to high energy consumption.

We can overview the features that act both as requirements for clustering distributed streaming data sources and as future paths for research in this area (Rodrigues et al., 2008c):

- the requirements for clustering streaming data sources must be considered, with more emphasis on the adaptability of the whole system;
- processing must be distributed and synchronized on local neighborhoods or querying nodes;
- the main focus should be on finding similar data sources irrespective of their physical location;
- the relevance of sensors in the clustering definition can also be based on geographical position, if the querying entity's interest is focused on a local area;
- processes should minimize the consumption of different resources (mainly energy) in order to achieve high uptime;
- operation should consider a compact representation of both the data and the generated models, enabling fast and efficient transmission and access from mobile and embedded devices.

The final goal is to infer a global clustering structure of all relevant data sources. Hence, approximate algorithms should be considered to prevent global data transmission. The main idea behind this task is the following: some (or all) of the nodes in the network should perform some kind of processing over the data gathered by themselves and/or by their neighbors, in order to achieve an up-to-date clustering structure definition of the streaming data sources.

6.4.1.1 Global Clustering Structure

The main question that must be answered is how a distributed system can develop and learn the global clustering structure of distributed streaming data sources, even though communication between nodes is limited (and, to some extent, even nonexistent). The handicap in processing streams is the impossibility of total knowledge of each series' data. One of the most suitable solutions to this problem is the application of approximate algorithms (Gama et al., 2007). This handicap is reinforced in ubiquitous settings since, for a given processing unit, total knowledge of the complete set of sources' data is also improbable. Hence, approximate algorithms must be considered in this direction as well. A first approach could consist of a centralized process that would gather data from distributed sources, even if just a small sample, analyzing it afterwards as a unique multivariate stream. As previously stated, this model tends to become inapplicable as distributed applications (e.g. sensor networks) grow unbounded. Thus, different techniques must be developed. On one side, the data clustering structure could be defined locally, possibly restricted by the network's physical clustering structure, in order to confine communications to nearby nodes. Afterwards, these local structures would be combined by top-level processing units to define a global clustering structure. On the other hand, nodes could be able to define representative data or summary information that would be used by any top-level process to define a single clustering structure, even if roughly approximated.

6.4.1.2 Ubiquitous Data Processing

Although distributed applications usually operate with limited bandwidth, due to energy restrictions, the amount of data produced by these applications can become unbounded, due to the large number of data sources and their fast processing abilities. This can turn out to be an important bottleneck and force some nodes to spend more energy on relaying information to a sink (Pottie & Kaiser, 2000). The key objective of streaming data processing is to maintain information incrementally, in such a way that the system can cope with the high-speed production of data. Sufficient statistics can basically be computed for a data source and its neighbors, complying with the first two requirements defined in Section 6.3. However, the ubiquitous setting narrows the possibility of communication between all the nodes, which is usually required by clustering methods. Even hierarchical approaches such as ODAC (Rodrigues et al., 2008d), which performs local computations at different levels of the hierarchy, would require global referencing of data sources to allow communication between those which, although highly correlated, could be several hops away from each other. Given the processing abilities of each node, clustering results should preferably be localized on the nodes where this information becomes an asset. Thus, information query and transmission should only be considered within a restricted space, either using flooding-based approaches, where communication is only considered between nodes within a spherical neighborhood of the querier/transmitter, or trajectory-based approaches, where data is transmitted step-by-step along a path of neighbor nodes. A mixture of these approaches is also possible for query re-transmission (Sadagopan et al., 2005). These features reveal a key problem to be solved: if, for centralized clustering procedures, the sufficient statistics are used to define a proximity basis between data sources, in a distributed setup the proximity basis between the data sources should also help to determine whether the sufficient statistics for these sources should continue to be maintained.

A paradigmatic example of distributed streaming data sources is the sensor network. Sensors are usually small, low-cost devices capable of sensing some attribute and of communicating with other sensors. These characteristics imply resource restrictions which narrow the possibilities for high-load computation while operating under a limited bandwidth (Gaber & Yu, 2006). The main requirement for ubiquitous processing is to minimize power consumption on a general basis, balancing local computation with data acquisition and transmission. Data stream mining in distributed scenarios (e.g. sensor networks) needs to operate under a limited bandwidth, reducing the capability to represent and transmit the data mining models over the network (Kargupta et al., 2002), which creates an even thicker barrier to an efficient handling of the continuous flow of data (Rodrigues et al., 2008c).
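As a rough illustration of the flooding-based alternative mentioned above, the sketch below relays a query only within a bounded number of hops from the querier; the Node type and its interface are assumptions made for the example, not part of any actual protocol.

    from dataclasses import dataclass, field

    # Hedged sketch: hop-limited, flooding-based query propagation.
    @dataclass
    class Node:
        id: int
        neighbors: list = field(default_factory=list)

        def handle(self, query: str) -> None:
            # placeholder for local processing of the query
            print(f"node {self.id} answers {query!r} from its local sketch")

    def flood_query(node: Node, query: str, ttl: int, seen: set) -> None:
        if node.id in seen or ttl < 0:   # stop at visited nodes or hop limit
            return
        seen.add(node.id)
        node.handle(query)
        for nb in node.neighbors:        # relay only over direct links
            flood_query(nb, query, ttl - 1, seen)

    # Usage: in the chain a - b - c, a query issued at a with ttl=1
    # reaches a and b, but never c.
    a, b, c = Node(1), Node(2), Node(3)
    a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]
    flood_query(a, "global clustering?", ttl=1, seen=set())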

6.4.1.3 Adaptivity to Changes

With respect to the adaptivity of the system to changes in the data clustering structure definition, or structural drift detection (as defined in Section 6.3), the detection of and reaction to changes must be adapted to the new distributed setting. While it may seem straightforward to adapt previously developed techniques, since changes can only be monitored if statistics are maintained to support that decision, there is another kind of change that must be monitored, with even more control: that of the network topology. Keeping our paradigmatic example, sensor networks are often wireless and ubiquitous. Sensors are organized by wireless links, possibly without centralized control. This way, the network topology is highly volatile, evolving with time due to, for example, sensor movement, broken links or sensor failures. Abrupt changes in the short-range links (e.g. a gateway is permanently shut down) can occur unexpectedly, forcing the global system to adapt the clustering structure. Smoother changes may also occur, for example, with the deployment of new sensors, or their deactivation, creating an expansion or contraction behavior. On top of all these issues, the deployment of moving sensors is an emergent technique, used in many applications. Examples of dynamic systems creating transient settings for sensor networks are the deployment of sensors for ocean current monitoring, river flooding alerts, atmospheric phenomena sensing, etc. In these contexts, the requirements for distributed clustering systems become extreme. Given the emergence of these techniques, streaming sensor clustering on these networks becomes even more relevant for research.

6.4.1.4 Mobile Human Interaction

Ubiquitous activities such as clustering of distributed streaming data sources usually imply mobile data access and management, in the sense that even sensor networks with a static topology could be queried by transient devices, such as PDAs, laptops or other embedded devices. Thus, the clustering structure definition should also be accessible from these mobile and embedded devices, so that information is more accurate on a subnetwork enclosing the querying device. Of course, this would give more relevance to the network topology, while preserving the proximity basis between data sources all over the network. In this setup, mining data streams in a mobile environment raises an additional challenge to intelligent systems, as model analysis and the corresponding results need to be visualized on a small screen, requiring alternate multimedia-based human-computer interaction. A previous work which took this issue into account was developed for stock market mining: MobiMine (Kargupta et al., 2002) is a mobile data mining system that allows intelligent monitoring of time-critical financial data, enabling quick reactions to events on the market. Another system presented a light-weight visualization of electricity demand data streams, especially designed for expert usage with mobile devices for reactive or proactive actions (Rodrigues & Gama, 2010). Although these applications are still somehow based on the client-server model, or at least rely on the centralized processing of some data, the mobile restrictions on the interface apply in the same way to sensor network applications.

6.4.2 Desiderata for Distributed Clustering of Streaming Sources

The main goal of a clustering system should be the ability to answer queries for the global clustering definition of the entire network of sources. If sources are distributed over a wide area, with local sites being accessible from transient devices, queries could be issued at each local site, enabling fast answers to be sent to the querying device. However, current setups assume data is forwarded to a central server, where it is processed, this server being the main answering device. This setup forces not only the data but also the queries to be transmitted across the network into a sink. The main ideas serving as desiderata for this task are:

- each node needs to keep a sketch of the data stream it is producing;
- communication is only considered on local neighborhoods, ideally using only direct links with no data forwarding;
- mechanisms of convergence and concept change detection must be included to manage unnecessary communication;
- each node should achieve an approximation of the global clustering structure of the entire network.

Even though processing may be concentrated on local computations and short-range communication, the final goal is to infer a global clustering structure of all relevant sensors. Hence, approximate algorithms should be considered to prevent global data transmission. Given this, when querying a given sensor for the global clustering, we allow (and know beforehand that we will have) an approximate result within a maximum possible error, with a certain probability. Each approximation step (local sketch, local clustering update, merging of different cluster definitions, etc.) should be restricted by some stability bound on the error (Chernoff, 1952). These bounds should serve as balancing deciders in the trade-off between transmission management and resulting errors. One common way to instantiate such a bound is sketched below.
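The following sketch is illustrative only and uses a Hoeffding-style bound (a common choice for bounded-range variables) rather than the thesis's own notation; the function name and parameters are assumptions for the example.

    import math

    # Hedged sketch: Hoeffding-style bound. After n observations of a
    # variable with range r, the true mean is within epsilon of the sample
    # mean with probability at least 1 - delta.
    def hoeffding_epsilon(r: float, n: int, delta: float) -> float:
        return math.sqrt((r * r * math.log(1.0 / delta)) / (2.0 * n))

    # e.g. a node could suppress transmission of its local sketch while the
    # observed change stays below hoeffding_epsilon(r, n, delta=0.05)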

6.5 Memoryless Fading Window for Stream Processing

One of the main requirements discussed in the previous sections is that each node should be able to keep a sketch of its own data. In data stream scenarios, recent data is usually more important than old data (Gama & Rodrigues, 2007), so research usually focuses on sliding-window models over the most recent data. In Chapter 2 we have already presented some of those window models for data stream processing. In this work we defend the usage of a weighted window model to keep the most recent data. However, given the memory constraints of distributed data sources (e.g. sensor networks), we propose a memoryless approximation to the weighted window model. This section presents the theoretical support for the use of fading windows as an approximation of a weighted sliding window, which is later used as the sketching mechanism for the proposed clustering algorithm.

6.5.1 Weighted Sliding Windows

Even within a sliding window, the most recent data point is usually more important than the oldest one, which is about to be discarded. This way, a simple approach could consider giving weights to data points depending on their age within the sliding window. Several weighting models could apply: linear, log-linear, etc. Given its particular characteristics, we will present an exponential approach, where the weight of a data point decreases exponentially with time.

Definition Let $i$ be the number of observations of a given variable $X$ from which we are monitoring a sliding window of size $w$. The α-weighted window is the set of points

$$\dot{X}_{\alpha,w}(i) = \left\{ \alpha^{i-j} x_j \mid j \in \, ]i-w, i], \; 0 < \alpha < 1 \right\}. \qquad (6.1)$$

The main advantages of this window model are two-fold. First, compared to traditional sliding windows, more importance is given to recent data points, as the weight of each observation decreases exponentially with time. Second, compared to other weighting approaches, it can be maintained on the fly. At first, we are tempted to say that the set $\dot{X}_{\alpha,w}(i)$ must be recomputed at every new observation $i$. However, given its exponential definition, we can compute it using the recursive form

$$\dot{X}_{\alpha,w}(i) = \{x_i\} \cup \left( \alpha \times \dot{X}_{\alpha,w}(i-1) \setminus \{\dot{x}_{i-w}\} \right). \qquad (6.2)$$

6.5.1.1 Weighted Statistics

As previously stated, almost every system needs to compute simple statistics over recent data. Examples of such statistics include averages and correlations, which are based on sums and counts. In the following we will focus on averages. Using the α-weighted window model, these statistics need to be defined slightly differently.

Definition Let $i$ be the number of observations of a given variable $X$ from which we are monitoring a sliding window of size $w$. The α-weighted increment $N_{\alpha,w}(i)$ is

$$N_{\alpha,w}(i) = \sum_{j=1+i-w}^{i} \alpha^{i-j} = \sum_{j=0}^{w-1} \alpha^{j}, \qquad (6.3)$$

with $0 < \alpha < 1$.

Definition Let $i$ be the number of observations of a given variable $X$ from which we are monitoring a sliding window of size $w$. The α-weighted moving sum $S_{x,\alpha,w}(i)$ is a weighted sum of the observations of $X$ within a recent window of size $w$, where

$$S_{x,\alpha,w}(i) = \sum_{j=1+i-w}^{i} \alpha^{i-j} x_j, \qquad (6.4)$$

with $0 < \alpha < 1$.

Definition Let $S_{x,\alpha,w}(i)$ be the α-weighted moving sum of variable $X$ after $i$ observations. The α-weighted moving average $\bar{X}_{\alpha,w}(i)$ is a weighted moving average within the window of the $w$ most recent observations of $x$, so that

$$\bar{X}_{\alpha,w}(i) = \frac{\sum_{j=1+i-w}^{i} \alpha^{i-j} x_j}{N_{\alpha,w}(i)}, \qquad (6.5)$$

with $0 < \alpha < 1$.
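To make Definitions 6.3 to 6.5 concrete, the following minimal sketch (illustrative only, assuming nothing beyond the formulas above) computes the α-weighted moving average exactly, at the cost of buffering the $w$ most recent points, which is precisely the memory cost that the fading approach of the next section removes.

    from collections import deque

    # Illustrative sketch: exact alpha-weighted moving average (Eq. 6.5),
    # computed by buffering the w most recent data points (O(w) memory).
    class AlphaWeightedWindow:
        def __init__(self, w: int, alpha: float):
            assert 0 < alpha < 1
            self.alpha = alpha
            self.buffer = deque(maxlen=w)  # keeps the w most recent points

        def update(self, x: float) -> None:
            self.buffer.append(x)          # oldest point dropped when full

        def average(self) -> float:
            # weight alpha^(i-j): the newest point gets weight 1
            n = len(self.buffer)
            num = sum(self.alpha ** (n - 1 - k) * x
                      for k, x in enumerate(self.buffer))
            den = sum(self.alpha ** (n - 1 - k) for k in range(n))
            return num / den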

6.5.1.2 Comparison with the Sliding Window Model

Figure 6.2 (top) presents a comparison between this window model and a traditional window model in the computation of averages. The reader can note that, in the example, while the moving average has a catastrophic forgetting of data points older than 1000 observations, the α-weighted moving average keeps a smooth forgetting of values newer than that. This way, it can better approximate the effect of a small sliding window (better adaptation) while keeping information from older data (smooth behavior). This effect is even more profound when dealing with data with concept drift. Figure 6.2 (bottom) presents the same evaluation for the harsher scenario where the data drifts in terms of the mean of its values. The main feature of the weighted sliding window model is the use of the smooth forgetting recently proposed as a fading factor (Gama et al., 2009b). In fact, this approach was first proposed to deal with the evaluation of learning algorithms, using all previous data. Its application to window models presents some advantages for the processing of data streams, as its recursive form enables the computation of weighted data points on the fly.

6.5.2 Memoryless Fading Windows

To avoid keeping all the data in the window when computing statistics which are based on sums of the data points, and in order to include a smooth forgetting of information, the previous approach can be applied to achieve an approximate value for the elementary statistics on a data stream. This is strongly related to a weighted sum of the points in the sliding window, with more weight being given to the most recent data points. The fading factors are memoryless, an important property in streaming scenarios (Gama et al., 2009b). In this section we define fading quantities, which do not require keeping the data points within the sliding window, while showing at the same time that they are, in fact, approximations of the same statistics computed with the data points from the sliding window. The corresponding α-fading window can be better interpreted in Figure 6.3, where an illustrative example of different types of windows is presented.

6.5.2.1 Fading Statistics

Similarly to the α-weighted window definitions, different statistics can be defined, based on counts and sums, which make use of all the previous data points. When possible, we will use the recursive forms to illustrate their applicability to data streams.

Figure 6.2: Comparison between traditional and weighted moving averages. The data stream (light thin line) is sketched as a moving average with w = 1000 (solid thick line), a moving average with w = 335 (long dashed thick line), and a weighted moving average with w = 1000 and α = 0.997 (dashed thick line).

Figure 6.3: Illustrative example of different window models: sliding (top left), linearly weighted (top right), exponentially weighted (bottom left), and fading (bottom right). Grey boxes represent the windows at two different time steps, while the black line inside each box represents the weight given to the examples at that time step.

Definition Let $i$ be the number of observations of a given variable $X$. The α-fading increment is

$$N_{\alpha}(i) = \begin{cases} 1, & i = 1 \\ 1 + \alpha \times N_{\alpha}(i-1), & i > 1 \end{cases} \qquad (6.6)$$

with $0 < \alpha < 1$.

A first look at the α-fading increment clarifies its differences from usual increments.

Theorem 6.5.1 $N_{\alpha}(i)$ converges to the value $\frac{1}{1-\alpha}$ when $i \to +\infty$.

Proof Using simple arithmetic one can see that $N_{\alpha}(i) = \sum_{j=0}^{i-1} \alpha^{j}$, which is a special form of the infinite geometric series $\sum_{k=0}^{n} a r^{k}$, built from the geometric progression with initial value $a = 1$ and common ratio $r = \alpha$. For values of $|r| < 1$, these series are known to converge to $\frac{a}{1-r}$, hence $\lim_{i \to \infty} N_{\alpha}(i) = \frac{1}{1-\alpha}$.

Given its convergence property, α-fading increments will have a massive effect on the properties of the remaining count-based statistics.

Definition Let $i$ be the number of observations of a given variable $X$. The α-fading sum $S_{x,\alpha}(i)$ of $X$ is a weighted sum of the observations of $X$, where

$$S_{x,\alpha}(i) = \begin{cases} x_1, & i = 1 \\ x_i + \alpha \times S_{x,\alpha}(i-1), & i > 1 \end{cases} \qquad (6.7)$$

with $0 < \alpha < 1$.

As for the weighted window model, the α-fading increment is clearly related to the α-fading sum.

Lemma 6.5.2 $N_{\alpha}(i)$ equals the total amount of weight given to observations used in the α-fading sum $S_{x,\alpha}(i)$.

Proof The α-fading sum $S_{x,\alpha}(i)$ can be rewritten as $S_{x,\alpha}(i) = \sum_{j=1}^{i} \alpha^{i-j} x_j$. Hence, the sum of weights is $\sum_{j=1}^{i} \alpha^{i-j} = \sum_{j=0}^{i-1} \alpha^{j} = N_{\alpha}(i)$.

Given the previous definitions, we can approximate the weighted window model when dealing with sum-based statistics.

6.5.2.2 Approximating the Weighted Window Model

Since we are using all previous data in the computation of α-fading increments and α-fading sums, to approximate a window model of size $w$, an important aspect to check is how much of the data that is older than the objective $w$ is in fact being used in this computation.

Definition Let $w$ be the size of a window with the most recent examples $x_j, \forall j \in \, ]i-w, i]$. The ballast weight $B_{\alpha,w}(i)$ is the proportion of weight given to old observations (with respect to $w$) in the computation of the α-fading sum $S_{x,\alpha}(i)$, i.e.

$$B_{\alpha,w}(i) = \frac{\sum_{j=w}^{i-1} \alpha^{j}}{N_{\alpha}(i)}. \qquad (6.8)$$

Apparently, as the window slides on, the amount of old data points being used in the computation increases. However, the special form of the fading factors yields an important property.

Theorem 6.5.3 $B_{\alpha,w}(i)$ converges to $\alpha^{w}$ when $i \to +\infty$.

Proof Given that

$$\lim_{i \to \infty} B_{\alpha,w}(i) = \lim_{i \to \infty} \frac{\sum_{j=w}^{i-1} \alpha^{j}}{\sum_{j=0}^{i-1} \alpha^{j}} = 1 - \lim_{i \to \infty} \frac{\sum_{j=0}^{w-1} \alpha^{j}}{\sum_{j=0}^{i-1} \alpha^{j}},$$

we can apply Theorem 6.5.1 to the expression, getting

$$\lim_{i \to \infty} B_{\alpha,w}(i) = 1 - \frac{(1-\alpha^{w})/(1-\alpha)}{1/(1-\alpha)}.$$

Then, for $\alpha < 1$, with simple arithmetic we get

$$\lim_{i \to \infty} B_{\alpha,w}(i) = 1 - (1-\alpha^{w})\,\frac{1-\alpha}{1-\alpha} = 1 - 1 + \alpha^{w} = \alpha^{w}.$$

Now, given a fading factor $\alpha$ and an objective window size $w$, we know exactly how much of the sum is based on data points older than $w$. This way, depending on the application domain, we can define an admissible value for the ballast weight, so that we approximate a sliding window with minimum error: the (α,ε)-fading window for $S_{x,\alpha}(i)$.

Theorem 6.5.4 Let $\varepsilon$ be an admissible value for the ballast weight of a given α-fading sum $S_{x,\alpha}(i)$. If $w > \log_{\alpha} \varepsilon$ then the ballast weight converges to $B_{\alpha,w}(i) < \varepsilon$ when $i \to +\infty$.

Proof From Theorem 6.5.3 comes that $\lim_{i \to \infty} B_{\alpha,w}(i) = \alpha^{w}$. Hence, for $\lim_{i \to \infty} B_{\alpha,w}(i) < \varepsilon$, $\alpha^{w} < \varepsilon \Leftrightarrow w > \log_{\alpha} \varepsilon$.

Definition The (α,ε)-fading window of an α-fading sum $S_{x,\alpha}(i)$ is the window of the $w$ most recent observations of $x$, with $w = \left\lceil \frac{\log \varepsilon}{\log \alpha} \right\rceil$.

This window has the nice property of being the smallest one for which the ballast bounds apply. For instance, with $\alpha = 0.997$ and $\varepsilon = 0.05$, the (α,ε)-fading window covers the $w = \lceil \log 0.05 / \log 0.997 \rceil = 998$ most recent observations. Figure 6.4 presents an illustrative example of the convergence of the ballast weight according to a (α,ε)-fading window.

Corollary 6.5.5 Let $B'_{\alpha,w}(i)$ be the proportion of weight given to examples in the window of the $w$ most recent examples, i.e. $B'_{\alpha,w}(i) = 1 - B_{\alpha,w}(i)$. The (α,ε)-fading window is the smallest recent window with total weight proportion $B'_{\alpha,w}(i) \geq 1 - \varepsilon$ in the α-fading sum $S_{x,\alpha}(i)$.

Proof Given the definition of $B'_{\alpha,w}(i)$, it follows that $\lim_{i \to \infty} B_{\alpha,w}(i) < \varepsilon \Leftrightarrow \lim_{i \to \infty} B'_{\alpha,w}(i) > 1 - \varepsilon$. From Theorem 6.5.4 comes that $\lim_{i \to \infty} B_{\alpha,w}(i) < \varepsilon$ if $w > \frac{\log \varepsilon}{\log \alpha}$, so $w = \left\lceil \frac{\log \varepsilon}{\log \alpha} \right\rceil$ is the smallest integer value of $w$ for which $B'_{\alpha,w}(i) \geq 1 - \varepsilon$.

After proving some properties of the α-fading sum, we can take the step of approximating the α-weighted moving average.

Figure 6.4: Illustrative example of the convergence of the ballast weight, for given $\alpha$, $w$ and $\varepsilon$ (in the example, objective window size w = 1000, admissible ballast weight ε = 0.05, and weight factor α = ε^(1/w) ≈ 0.997). The curve represents the weight given to past examples. The density area is the sum of weights in the interval.

Definition Let $S_{x,\alpha}(i)$ be the α-fading sum of variable $X$ after $i$ observations, and $N_{\alpha}(i)$ the corresponding α-fading increment. The α-fading average $M_{x,\alpha}(i)$ is a weighted average of the observations of $x$, where

$$M_{x,\alpha}(i) = \frac{S_{x,\alpha}(i)}{N_{\alpha}(i)}, \qquad (6.9)$$

with $0 < \alpha < 1$.

Similar approaches can be made for α-fading correlations and α-fading histograms, where the sufficient statistics needed to compute the final measure are kept as α-fading sums. Nonetheless, the definition of the α-fading average is in itself a milestone in the processing of streaming data. However, the approximation to the α-weighted average is not error free.

6.5.2.3 Approximation Error Bounds

When dealing with approximate results, a level of error should be associated with them. Hence, research has evolved into the definition of bounds for those errors (Domingos & Hulten, 2001). Considering $M_{x,\alpha}(i)$ an approximation of $\bar{X}_{\alpha,w}(i)$, the error can be clearly bounded.

Definition $\Delta_{x,\alpha,w}(i)$ is the error of approximating the α-weighted moving average $\bar{X}_{\alpha,w}(i)$ with $M_{x,\alpha}(i)$, i.e.

$$\Delta_{x,\alpha,w}(i) = \left\| \bar{X}_{\alpha,w}(i) - M_{x,\alpha}(i) \right\|. \qquad (6.10)$$

Theorem 6.5.6 Let $\varepsilon < 1$ be an admissible ballast weight for the α-fading sum, and $R$ the known range of variable $X$. Then, $\Delta_{x,\alpha,w}(i) \leq 2\varepsilon R$.

Proof From previous definitions comes that

$$\Delta_{x,\alpha,w}(i) = \left\| \frac{\sum_{j=1+i-w}^{i} \alpha^{i-j} x_j}{\sum_{j=1+i-w}^{i} \alpha^{i-j}} - \frac{\sum_{j=1}^{i} \alpha^{i-j} x_j}{\sum_{j=1}^{i} \alpha^{i-j}} \right\|.$$

There are two sources of error when approximating $\bar{X}_{\alpha,w}(i)$ with $M_{x,\alpha}(i)$, which become visible if we separate the computation of $M_{x,\alpha}(i)$ in terms of the sums of points inside and outside the (α,ε)-fading window:

$$\Delta_{x,\alpha,w}(i) = \left\| \Delta^{in}_{x,\alpha,w}(i) - \Delta^{out}_{x,\alpha,w}(i) \right\|,$$

where

$$\Delta^{in}_{x,\alpha,w}(i) = \frac{\sum_{j=1+i-w}^{i} \alpha^{i-j} x_j}{\sum_{j=1+i-w}^{i} \alpha^{i-j}} - \frac{\sum_{j=1+i-w}^{i} \alpha^{i-j} x_j}{\sum_{j=1}^{i} \alpha^{i-j}}$$

and

$$\Delta^{out}_{x,\alpha,w}(i) = \frac{\sum_{j=1}^{i-w} \alpha^{i-j} x_j}{\sum_{j=1}^{i} \alpha^{i-j}}.$$

The first occurs because, in $M_{x,\alpha}(i)$, we are giving less weight to the good data points (the ones within the recent window $w$): while $\bar{X}_{\alpha,w}(i)$ divides by $\sum_{j=1+i-w}^{i} \alpha^{i-j}$, $M_{x,\alpha}(i)$ divides by $\sum_{j=1}^{i} \alpha^{i-j}$. The second source of error comes from including bad data points (the ones outside the recent window $w$). Of course, the definition of good and bad is just illustrative; they might actually have the opposite effect on the average. Since we are looking for an upper bound on the error, the worst case scenario is that these two sources of error do not cancel out, but rather add up their effects:

$$\Delta_{x,\alpha,w}(i) \leq \left\| \Delta^{in}_{x,\alpha,w}(i) \right\| + \left\| \Delta^{out}_{x,\alpha,w}(i) \right\|.$$

Hence

$$\Delta_{x,\alpha,w}(i) \leq \left\| \frac{\left( \sum_{j=1}^{i-w} \alpha^{i-j} \right) \left( \sum_{j=1+i-w}^{i} \alpha^{i-j} x_j \right)}{\left( \sum_{j=1+i-w}^{i} \alpha^{i-j} \right) \left( \sum_{j=1}^{i} \alpha^{i-j} \right)} \right\| + \left\| \Delta^{out}_{x,\alpha,w}(i) \right\|.$$

Then, from previous definitions comes that

$$\Delta_{x,\alpha,w}(i) \leq \left\| \frac{B_{\alpha,w}(i)\, N_{\alpha}(i) \sum_{j=1+i-w}^{i} \alpha^{i-j} x_j}{B'_{\alpha,w}(i)\, N_{\alpha}(i)\, N_{\alpha}(i)} \right\| + \left\| \Delta^{out}_{x,\alpha,w}(i) \right\| = \left\| \frac{B_{\alpha,w}(i) \sum_{j=1+i-w}^{i} \alpha^{i-j} x_j}{B'_{\alpha,w}(i)\, N_{\alpha}(i)} \right\| + \left\| \frac{\sum_{j=1}^{i-w} \alpha^{i-j} x_j}{N_{\alpha}(i)} \right\|.$$

Since we are looking for an upper bound on the error, let us analyse the worst case scenario. For the first term of the error, the worst case scenario is when we are decreasing the weight of data points whose impact on the moving average is higher, those that are farthest apart from the average ($\max \delta_j$, with $\delta_j = \left\| x_j - \bar{X}_{\alpha,w}(i) \right\|$). For the second term of the error, the worst case scenario is when we are giving weight to data points that will have a higher impact on the average, hence, also the farthest apart from it. If the variable being measured is bounded by a certain range $R$ (e.g. each observation is a probability) then we can define an upper bound for each $\delta_j$, i.e. $\delta_j \leq R, \forall j$. If $R$ is not known, we can use $R \sim \hat{R}(i) = \max X_{1..i} - \min X_{1..i}$, the observed range of previous observations of $X$. Since keeping the range of the variable in a sliding window is a blocking operator, we can only keep track of the global range $\hat{R}(i)$ of the variable. Nonetheless, this range is always at least as large as the range of the values in the recent window, so the bound stays robust. Hence, the upper bound on the error is given by considering all $x_j = R$:

$$\Delta_{x,\alpha,w}(i) \leq \left\| \frac{B_{\alpha,w}(i)\, B'_{\alpha,w}(i)\, N_{\alpha}(i)\, R}{B'_{\alpha,w}(i)\, N_{\alpha}(i)} \right\| + \left\| \frac{B_{\alpha,w}(i)\, N_{\alpha}(i)\, R}{N_{\alpha}(i)} \right\| = \left\| B_{\alpha,w}(i)\, R \right\| + \left\| B_{\alpha,w}(i)\, R \right\| = 2 \left\| B_{\alpha,w}(i)\, R \right\|.$$

Fixing $\varepsilon$ as an admissible ballast weight, $\Delta_{x,\alpha,w}(i) \leq 2\varepsilon R$.

Now, given an α-fading average $M_{x,\alpha}(i)$ of a variable $X$ with range $R$, we can state that this average is an approximation of the α-weighted moving average $\bar{X}_{\alpha,w}(i)$, with $w = \left\lceil \frac{\log \varepsilon}{\log \alpha} \right\rceil$, within an error interval of $M_{x,\alpha}(i) \pm 2\varepsilon R$, where $\varepsilon$ is the admissible ballast weight. Given this, we can clearly define the parameters needed to achieve the objective approximation.

Corollary 6.5.7 Let $X$ be a variable with range $R$. In order to compute an estimate $M_{x,\alpha}(i)$ of the α-weighted moving average $\bar{X}_{\alpha,w}(i)$ on a sliding window of size $w$, so that $M_{x,\alpha}(i) \in \bar{X}_{\alpha,w}(i) \pm 2\varepsilon R$, $\alpha$ must be set to $\alpha = \varepsilon^{\frac{1}{w}}$.

Proof From Theorem 6.5.4 and Theorem 6.5.6 comes that, since for having $\lim_{i \to \infty} B_{\alpha,w}(i) < \varepsilon$ we must have $w > \log_{\alpha} \varepsilon$, then to achieve $M_{x,\alpha}(i) \in \bar{X}_{\alpha,w}(i) \pm 2\varepsilon R$ the fading factor must satisfy $\alpha^{w} = \varepsilon$, i.e. $\alpha = \varepsilon^{\frac{1}{w}}$.
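Putting these results together, the following minimal sketch (illustrative, not the thesis's implementation) maintains a memoryless fading average: it keeps only two scalars, updates them by Equations 6.6 and 6.7, and sets α = ε^(1/w) as in Corollary 6.5.7, so that its output approximates the α-weighted moving average over a window of size w within ±2εR.

    # Illustrative sketch: memoryless fading average with alpha chosen as in
    # Corollary 6.5.7. Only two scalars are stored, whatever the stream length.
    class FadingAverage:
        def __init__(self, w: int, epsilon: float):
            self.alpha = epsilon ** (1.0 / w)  # Corollary 6.5.7
            self.s = 0.0  # alpha-fading sum       S_{x,alpha}(i), Eq. 6.7
            self.n = 0.0  # alpha-fading increment N_alpha(i),     Eq. 6.6

        def update(self, x: float) -> float:
            self.s = x + self.alpha * self.s
            self.n = 1.0 + self.alpha * self.n
            return self.s / self.n             # alpha-fading average, Eq. 6.9

    # Usage: approximate a weighted window of w = 1000 points with ballast
    # weight at most 5%; for data in [0, 1] (R = 1) the error is at most 0.1.
    avg = FadingAverage(w=1000, epsilon=0.05)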