
Data Streams

Data Streams: Models and Algorithms

edited by

Charu C. Aggarwal IBM, T.J. Watson Research Center Yorktown Heights, NY, USA

Springer

Contents

List of Figures
List of Tables
Preface

1 An Introduction to Data Streams
Charu C. Aggarwal
1. Introduction
2. Stream Mining Algorithms
3. Conclusions and Summary
References

2 On Clustering Massive Data Streams: A Summarization Paradigm
Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu
1. Introduction
2. The Micro-clustering Based Stream Mining Framework
3. Clustering Evolving Data Streams: A Micro-clustering Approach
    3.1 Micro-clustering Challenges
    3.2 Online Micro-cluster Maintenance: The CluStream Algorithm
    3.3 High Dimensional Projected Stream Clustering
4. Classification of Data Streams: A Micro-clustering Approach
    4.1 On-Demand Stream Classification
5. Other Applications of Micro-clustering and Research Directions
6. Performance Study and Experimental Results
7. Discussion
References

3 A Survey of Classification Methods in Data Streams
Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy
1. Introduction
2. Research Issues
3. Solution Approaches
4. Classification Techniques
    4.1 Ensemble Based Classification
    4.2 Very Fast Decision Trees (VFDT)
    4.3 On Demand Classification
    4.4 Online Information Network (OLIN)
    4.5 LWClass Algorithm
    4.6 ANNCAD Algorithm
    4.7 SCALLOP Algorithm
5. Summary
References

4 Frequent Pattern Mining in Data Streams
Ruoming Jin and Gagan Agrawal
1. Introduction
2. Overview
3. New Algorithm
4. Work on Other Related Problems
5. Conclusions and Future Directions
References

5 A Survey of Change Diagnosis Algorithms in Evolving Data Streams
Charu C. Aggarwal
1. Introduction
2. The Velocity Density Method
    2.1 Spatial Velocity Profiles
    2.2 Evolution Computations in High Dimensional Case
    2.3 On the use of clustering for characterizing stream evolution
3. On the Effect of Evolution in Data Mining Algorithms
4. Conclusions
References

6 Multi-Dimensional Analysis of Data Streams Using Stream Cubes
Jiawei Han, Y. Dora Cai, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W. Wah, and Jianyong Wang
1. Introduction
2. Problem Definition
3. Architecture for On-line Analysis of Data Streams
    3.1 Tilted time frame
    3.2 Critical layers
    3.3 Partial materialization of stream cube
4. Stream Data Cube Computation
    4.1 Algorithms for cube computation
5. Performance Study
6. Related Work
7. Possible Extensions
8. Conclusions
References

7 Load Shedding in Data Stream Systems
Brian Babcock, Mayur Datar and Rajeev Motwani
1. Load Shedding for Aggregation Queries
    1.1 Problem Formulation
    1.2 Load Shedding Algorithm
    1.3 Extensions
2. Load Shedding in Aurora
3. Load Shedding for Sliding Window Joins
4. Load Shedding for Classification Queries
5. Summary
References

8 The Sliding-Window Computation Model and Results
Mayur Datar and Rajeev Motwani
    0.1 Motivation and Road Map
1. A Solution to the BASICCOUNTING Problem
    1.1 The Approximation Scheme
2. Space Lower Bound for BASICCOUNTING Problem
3. Beyond 0's and 1's
4. References and Related Work
5. Conclusion
References

9 A Survey of Synopsis Construction in Data Streams
Charu C. Aggarwal, Philip S. Yu
1. Introduction
2. Sampling Methods
    2.1 Random Sampling with a Reservoir
    2.2 Concise Sampling
3. Wavelets
    3.1 Recent Research on Wavelet Decomposition in Data Streams
4. Sketches
    4.1 Fixed Window Sketches for Massive Time Series
    4.2 Variable Window Sketches of Massive Time Series
    4.3 Sketches and their applications in Data Streams
    4.4 Sketches with p-stable distributions
    4.5 The Count-Min Sketch
    4.6 Related Counting Methods: Hash Functions for Determining Distinct Elements
    4.7 Advantages and Limitations of Sketch Based Methods
5. Histograms
    5.1 One Pass Construction of Equi-depth Histograms
    5.2 Constructing V-Optimal Histograms
    5.3 Wavelet Based Histograms for Query Answering
    5.4 Sketch Based Methods for Multi-dimensional Histograms
6. Discussion and Challenges
References

10 A Survey of Join Processing in Data Streams
Junyi Xie and Jun Yang
1. Introduction
2. Model and Semantics
3. State Management for Stream Joins
    3.1 Exploiting Constraints
    3.2 Exploiting Statistical Properties
4. Fundamental Algorithms for Stream Join Processing
5. Optimizing Stream Joins
6. Conclusion
Acknowledgments
References

11 Indexing and Querying Data Streams
Ahmet Bulut, Ambuj K. Singh
1. Introduction
2. Indexing Streams
    2.1 Preliminaries and definitions
    2.2 Feature extraction
    2.3 Index maintenance
    2.4 Discrete Wavelet Transform
3. Querying Streams
    3.1 Monitoring an aggregate query
    3.2 Monitoring a pattern query
    3.3 Monitoring a correlation query
4. Related Work
5. Future Directions
    5.1 Distributed monitoring systems
    5.2 Probabilistic modeling of sensor networks
    5.3 Content distribution networks
6. Chapter Summary
References

12 Dimensionality Reduction and Forecasting on Streams
Spiros Papadimitriou, Jimeng Sun, and Christos Faloutsos
1. Related work
2. Principal component analysis (PCA)
3. Auto-regressive models and recursive least squares
4. MUSCLES
5. Tracking correlations and hidden variables: SPIRIT
6. Putting SPIRIT to work
7. Experimental case studies
8. Performance and accuracy
9. Conclusion
Acknowledgments
References

13 A Survey of Distributed Mining of Data Streams
Srinivasan Parthasarathy, Amol Ghoting and Matthew Eric Otey
1. Introduction
2. Outlier and Anomaly Detection
3. Clustering
4. Frequent itemset mining
5. Classification
6. Summarization
7. Mining Distributed Data Streams in Resource Constrained Environments
8. Systems Support
References

14 Algorithms for Distributed Data Stream Mining
Kanishka Bhaduri, Kamalika Das, Krishnamoorthy Sivakumar, Hillol Kargupta, Ran Wolff and Rong Chen
1. Introduction
2. Motivation: Why Distributed Data Stream Mining?
3. Existing Distributed Data Stream Mining Algorithms
4. A local algorithm for distributed data stream mining
    4.1 Local Algorithms: definition
    4.2 Algorithm details
    4.3 Experimental results
    4.4 Modifications and extensions
5. Bayesian Network Learning from Distributed Data Streams
    5.1 Distributed Bayesian Network Learning Algorithm
    5.2 Selection of samples for transmission to global site
    5.3 Online Distributed Bayesian Network Learning
    5.4 Experimental Results
6. Conclusion
References

15 A Survey of Stream Processing Problems and Techniques in Sensor Networks
Sharmila Subramaniam, Dimitrios Gunopulos
1. Challenges
2. The Data Collection Model
3. Data Communication
4. Query Processing
    4.1 Aggregate Queries
    4.2 Join Queries
    4.3 Top-k Monitoring
    4.4 Continuous Queries
5. Compression and Modeling
    5.1 Data Distribution Modeling
    5.2 Outlier Detection
6. Application: Tracking of Objects using Sensor Networks
7. Summary
References

Index