IADIS MULTI CONFERENCE ON COMPUTER SCIENCE AND INFORMATION SYSTEMS
18 - 20 June Algarve, Portugal
Proceedings of DATA MINING 2009
Edited by: Ajith P. Abraham
international association for development of the information society
IADIS EUROPEAN CONFERENCE ON
DATA MINING 2009
part of the IADIS MULTI CONFERENCE ON COMPUTER SCIENCE AND INFORMATION SYSTEMS 2009
PROCEEDINGS OF THE IADIS EUROPEAN CONFERENCE ON
DATA MINING 2009
part of the IADIS MULTI CONFERENCE ON COMPUTER SCIENCE AND INFORMATION SYSTEMS 2009
Algarve, Portugal JUNE 18 - 20, 2009
Organised by IADIS International Association for Development of the Information Society
Copyright 2009 IADIS Press. All rights reserved. This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Permission for use must always be obtained from IADIS Press. Please contact
[email protected]
Data Mining Volume Editor: Ajith P. Abraham
Computer Science and Information Systems Series Editors: Piet Kommers, Pedro Isaías and Nian-Shing Chen
Associate Editors: Luís Rodrigues and Patrícia Barbosa
ISBN: 978-972-8924-88-1
SUPPORTED BY
TABLE OF CONTENTS

FOREWORD ix
PROGRAM COMMITTEE xi
KEYNOTE LECTURES xv
CONFERENCE TUTORIAL xviii
KEYNOTE PAPER xix
FULL PAPERS

AN EXPERIMENTAL STUDY OF THE DISTRIBUTED CLUSTERING FOR AIR POLLUTION PATTERN RECOGNITION IN SENSOR NETWORKS
3
Yajie Ma, Yike Guo and Moustafa Ghanem
A NEW FEATURE WEIGHTED FUZZY C-MEANS CLUSTERING ALGORITHM
11
Huaiguo Fu and Ahmed M. Elmisery
A NOVEL THREE STAGED CLUSTERING ALGORITHM
19
Jamil Al-Shaqsi and Wenjia Wang
BEHAVIOURAL FINANCE AS A MULTI-INSTANCE LEARNING PROBLEM
27
Piotr Juszczak
BATCH QUERY SELECTION IN ACTIVE LEARNING
35
Piotr Juszczak
CONTINUOUS-TIME HIDDEN MARKOV MODELS FOR THE COPY NUMBER ANALYSIS OF GENOTYPING ARRAYS
43
Matthew Kowgier and Rafal Kustra
OUT-OF-CORE DATA HANDLING WITH PERIODIC PARTIAL RESULT MERGING
50
Sándor Juhász and Renáta Iváncsy
A FUZZY WEB ANALYTICS MODEL FOR WEB MINING
59
Darius Zumstein and Michael Kaufmann
DATE-BASED DYNAMIC CACHING MECHANISM
67
Christos Bouras, Vassilis Poulopoulos and Panagiotis Silintziris
GENETIC ALGORITHM TO DETERMINE RELEVANT FEATURES FOR INTRUSION DETECTION
75
Namita Aggarwal, R K Agrawal and H M Jain
ACCURATELY RANKING OUTLIERS IN DATA WITH MIXTURE OF VARIANCES AND NOISE
83
Minh Quoc Nguyen, Edward Omiecinski and Leo Mark
TIME SERIES DATA PUBLISHING AND MINING SYSTEM
95
Ye Zhu, Yongjian Fu and Huirong Fu
UNIFYING THE SYNTAX OF ASSOCIATION RULES
103
Michal Burda
AN APPROACH TO VARIABLE SELECTION IN EFFICIENCY ANALYSIS
111
Veska Noncheva, Armando Mendes and Emiliana Silva
SHORT PAPERS

MIDPDC: A NEW FRAMEWORK TO SUPPORT DIGITAL MAMMOGRAM DIAGNOSIS
121
Jagatheesan Senthilkumar, A. Ezhilarasi and D. Manjula
A TWO-STAGE APPROACH FOR RELEVANT GENE SELECTION FOR CANCER CLASSIFICATION
127
Rajni Bala and R. K. Agrawal
TARGEN: A MARKET BASKET DATASET GENERATOR FOR TEMPORAL ASSOCIATION RULE MINING
133
Tim Schlüter and Stefan Conrad
USING TEXT CATEGORISATION FOR DETECTING USER ACTIVITY
139
Marko Kääramees and Raido Paaslepp
APPROACHES FOR EFFICIENT HANDLING OF LARGE DATASETS
143
Renáta Iváncsy and Sándor Juhász
GROUPING OF ACTORS ON AN ENTERPRISE SOCIAL NETWORK USING OPTIMIZED UNION-FIND ALGORITHM
148
Aasma Zahid, Umar Muneer, Shoab A. Khan
APPLYING ASD-DM METHODOLOGY ON BUSINESS INTELLIGENCE SOLUTIONS: A CASE STUDY ON BUILDING CUSTOMER CARE DATA MART
153
Mouhib Alnoukari and Zaidoun Alzoabi and Asim El Sheikh
COMPARING PREDICTIONS OF MACHINE SPEEDUPS USING MICROARCHITECTURE INDEPENDENT CHARACTERISTICS
158
Claudio Luiz Curotto
DDG-CLUSTERING: A NOVEL TECHNIQUE FOR HIGHLY ACCURATE RESULTS
163
Zahraa Said Ammar and Mohamed Medhat Gaber
POSTERS

WIEBMAT, A NEW INFORMATION EXTRACTION SYSTEM
171
El ouerkhaoui Asmaa, Driss Aboutajdine and Doukkali Aziz
CLUSTER OF REUTERS 21578 COLLECTIONS USING GENETIC ALGORITHMS AND NZIPF METHOD
174
José Luis Castillo Sequera, José R. Fernández del Castillo and León González Sotos
I-SOAS DATA REPOSITORY FOR ADVANCED PRODUCT DATA MANAGEMENT
177
Zeeshan Ahmed
DATA PREPROCESSING DEPENDENCY FOR WEB USAGE MINING BASED ON SEQUENCE RULE ANALYSIS
179
Michal Munk, Jozef Kapusta and Peter Švec
GEOGRAPHIC DATA MINING WITH GRR
182
Lubomír Popelínský

AUTHOR INDEX
FOREWORD

These proceedings contain the papers of the IADIS European Conference on Data Mining 2009, which was organised by the International Association for Development of the Information Society in Algarve, Portugal, 18 – 20 June 2009. This conference is part of the Multi Conference on Computer Science and Information Systems 2009, 17 – 23 June 2009, which had a total of 1131 submissions. The IADIS European Conference on Data Mining (ECDM'09) aims to gather researchers and application developers from a wide range of data mining related areas such as statistics, computational intelligence, pattern recognition, databases and visualization. ECDM'09 aims to advance the state of the art in the data mining field and its various real world applications, and provides opportunities for technical collaboration among data mining and machine learning researchers around the globe. The conference accepts submissions in the following areas:

Core Data Mining Topics
- Parallel and distributed data mining algorithms
- Data stream mining
- Graph mining
- Spatial data mining
- Text, video and multimedia data mining
- Web mining
- Pre-processing techniques
- Visualization
- Security and information hiding in data mining

Data Mining Applications
- Databases
- Bioinformatics
- Biometrics
- Image analysis
- Financial modeling
- Forecasting
- Classification
- Clustering
The IADIS European Conference on Data Mining 2009 received 63 submissions from more than 19 countries. Each submission was anonymously reviewed by an average of five independent reviewers to ensure that accepted submissions were of a high standard. Consequently, only 14 full papers were published, which corresponds to an acceptance rate of about 22%. A few more papers were accepted as short papers, reflection papers and posters. Extended versions of the best papers will be published in the IADIS International Journal on Computer Science and Information Systems (ISSN: 1646-3692) and in other selected journals, including journals from Inderscience.

Besides the presentation of full papers, short papers, reflection papers and posters, the conference also included two keynote presentations from internationally distinguished researchers. We would therefore like to express our gratitude to Professor Kurosh Madani, Images, Signals and Intelligence Systems Laboratory (LISSI / EA 3956), PARIS XII University, Senart-Fontainebleau Institute of Technology, France, and Dr. Claude C. Chibelushi, Faculty of Computing, Engineering & Technology, Staffordshire University, UK, for accepting our invitation as keynote speakers. Thanks also to the tutorial presenter, Professor Kurosh Madani, Images, Signals and Intelligence Systems Laboratory (LISSI / EA 3956), PARIS XII University, Senart-Fontainebleau Institute of Technology, France.

As we all know, organising a conference requires the effort of many individuals. We would like to thank all members of the Program Committee for their hard work in reviewing and selecting the papers that appear in these proceedings. This volume has taken shape as a result of the contributions from a number of individuals. We are grateful to all authors who have submitted their papers to enrich the conference proceedings. We wish to thank all members of the organizing committee, delegates, invitees and guests whose contribution and involvement are crucial for the success of the conference.

Last but not least, we hope that everybody has a good time in Algarve, and we invite all participants to next year's edition, the IADIS European Conference on Data Mining 2010, which will be held in Freiburg, Germany.

Ajith P. Abraham
School of Computer Science, Chung-Ang University, South Korea
European Conference on Data Mining 2009 Program Chair

Piet Kommers, University of Twente, The Netherlands
Pedro Isaías, Universidade Aberta (Portuguese Open University), Portugal
Nian-Shing Chen, National Sun Yat-sen University, Taiwan
MCCSIS 2009 General Conference Co-Chairs

Algarve, Portugal
June 2009
PROGRAM COMMITTEE EUROPEAN CONFERENCE ON DATA MINING PROGRAM CHAIR Ajith P. Abraham, School of Computer Science, Chung-Ang University, South Korea
MCCSIS GENERAL CONFERENCE CO-CHAIRS Piet Kommers, University of Twente, The Netherlands Pedro Isaías, Universidade Aberta (Portuguese Open University), Portugal Nian-Shing Chen, National Sun Yat-sen University, Taiwan
EUROPEAN CONFERENCE ON DATA MINING COMMITTEE MEMBERS Abdel-Badeeh M. Salem, Ain Shams University, Egypt Akihiro Inokuchi, Osaka University, Japan Alessandra Raffaetà, Università Ca' Foscari di Venezia, Italy Alexandros Nanopoulos, University of Hildesheim, Germany Alfredo Cuzzocrea, University of Calabria, Italy Anastasios Dimou, Informatics and Telematics Institute, Greece Andreas König, TU Kaiserslautern, Germany Annalisa Appice, Università degli Studi di Bari, Italy Arnab Bhattacharya, I.I.T. Kanpur, India Artchil Maysuradze, Moscow University, Russia Ben Kao, The University of Hong Kong, Hong Kong Carson Leung, University of Manitoba, Canada Chao Luo, University of Technology, Sydney, Australia Christos Makris, University of Patras, Greece Claudio Lucchese, Università Ca' Foscari di Venezia, Italy Claudio Silvestri, Università di Ca' Foscari di Venezia, Italy Dan Wu, University of Windsor, Canada Daniel Kudenko, University of York, UK Daniel Pop, University of the West Timisoara, Romania Daniela Zaharie, West University of Timisoara, Romania Daoqiang Zhang, Nanjing University of Aeronautics and Astronautics, China David Cheung, University of Hong Kong, Hong Kong Dimitrios Katsaros, University of Thessaly, Greece Dino Ienco, Università di Torino, Italy Edward Hung, Hong Kong Polytechnic University, Hong Kong
Eugenio Cesario, Università della Calabria, Italy Fotis Lazarinis, Technological Educational Institute, Greece Francesco Folino, University of Calabria, Italy Gabriela Kokai, Friedrich Alexander University, Germany George Pallis, University of Cyprus, Cyprus Georgios Yannakakis, IT-University of Copenhagen, Denmark Hamed Nassar, Suez Canal University, Egypt Harish Karnick, IIT Kanpur, India Hui Xiong, Rutgers University, USA Ingrid Fischer, University of Konstanz, Germany Ioannis Kopanakis, Technological Educational Institute of Crete, Greece Jason Wang, New Jersey Institute of Technology, USA Jia -Yu Pan, Google Inc., USA Jialie Shen, Singapore Management University, Singapore John Kouris, University of Patras, Greece José M. Peña, Technical University of Madrid, Spain Jun Huan, University of Kansas, USA Junjie Wu, Beijing University of Aeronautics and Astronautics, China Justin Dauwels, MIT, USA Katia Lida Kermanidis, Ionian University, Greece Keiichi Horio, Kyushu Institute of Technology, Japan Lefteris Angelis, Aristotle University of Thessaloniki, Greece Liang Chen, Amazon.com, USA Lyudmila Shulga, Moscow University, Russia Manolis Maragoudakis, University of Crete, Greece Mario Koeppen, KIT, Japan Maurizio Atzori, ISTI-CNR, Italy Minlie Huang, Tsinghua University, China Min-Ling Zhang, Hohai University, China Miriam Baglioni, University of Pisa, Italy Qi Li, Western Kentucky University, USA Raffaele Perego, ISTI-CNR, Italy Ranieri Baraglia, Italian National Research Council (CNR), Italy Reda Alhajj, University of Calgary, Canada Robert Hilderman, University of Regina, Canada Roberto Esposito, Università di Torino, Italy Sandeep Pandey, Yahoo! Research,USA Sherry Y. Chen, Brunel University, UK Stefanos Vrochidis, Informatics and Telematics Institute, Greece Tao Ban, National Institute of Information and Communications Technology, Japan Tao Li, Florida International University, USA Tao Xiong, eBay Inc., USA Tatiana Tambouratzis, University of Piraeus, Greece
Themis Palpanas, University of Trento, Italy Thorsten Meinl, University of Konstanz, Germany Tianming Hu, Dongguan University of Technology, China Tomonobu Ozaki, Kobe University, Japan Trevor Dix, Monash University, Australia Tsuyoshi Ide, IBM Research, Japan Valerio Grossi, University of Pisa, Italy Vasile Rus, University of Memphis, USA Vassilios Verykios, University of Thessaly, Greece Wai-Keung Fung, University of Manitoba, Canada Wei Wang, Fudan University, China Xiaowei Xu, University of Arkansas at Little Rock, USA Xiaoyan Zhu, Tsinghua University, China Xingquan Zhu, Florida Atlantic University, USA Xintao Wu, University of North Carolina at Charlotte (UNCC), USA Yanchang Zhao, University of Technology, Sydney, Australia Ying Zhao, Tsinghua University, China Yixin Chen, University of Mississippi, USA
KEYNOTE LECTURES

TOWARD HIGHER LEVEL OF INTELLIGENT SYSTEMS FOR COMPLEX DATA PROCESSING AND MINING
Professor Kurosh Madani
Images, Signals and Intelligence Systems Laboratory (LISSI / EA 3956), PARIS XII University, Senart-Fontainebleau Institute of Technology, France
ABSTRACT

Real world applications, and especially those dealing with complex data mining, quickly expose the insufficiency of academic (also sometimes called theoretical) approaches in solving such categories of problems. The difficulties begin with the very definition of the "problem's solution" notion. In fact, academic approaches often start by simplifying the problem's constraints in order to obtain a "solvable" model (here, a solvable model means a set of mathematically solvable relations or equations describing a processing flow, a behavior, a set of phenomena, etc.). While theoretical consideration is a mandatory step in studying a given problem's solvability, for a very large number of real world dilemmas it does not lead to a solvable or realistic solution. The difficulty can be related to several issues, among which:
- a large number of parameters to be taken into account, making conventional mathematical tools inefficient,
- strong nonlinearity of the data (describing a complex behavior or the relationship ruling the involved data), leading to unsolvable equations,
- partial or total inaccessibility of relevant features (relevant data), making the model insignificant,
- the subjective nature of relevant features, parameters or data, making such data or parameters difficult to process within conventional quantification,
- the need to take expert knowledge or heuristic information into account,
- imprecise information or data leakage.

Examples illustrating the above-mentioned difficulties are numerous and concern various areas of real world or industrial applications. As a first example, one can point to the difficulties related to economic and financial modeling (data mining, feature extraction and prediction), where the large number of parameters, on the one hand, and human-related factors, on the other hand, make the related real world problems among the most difficult to solve. Another illustrative example concerns the delicate class of dilemmas dealing with complex data and multifaceted information processing, especially when the processed information (representing patterns, signals, images, etc.) is strongly noisy or involves deficient data. In fact, real world and industrial applications comprising image analysis, system and plant safety, complex manufacturing and process optimization, priority selection and decision, classification and clustering often belong to this class of dilemmas.
While much remains to be discovered about how the animal brain trains and self-organizes itself in order to process and mine such varied and complex information, a number of recent advances in neurobiology already allow some key mechanisms of this marvellous machine to be highlighted. Among them, one can emphasize the brain's "modular" structure and its "self-organizing" capabilities. Even if our simple and inadequate binary technology remains too primitive to achieve the processing ability of these marvellous mechanisms, a number of the highlighted points can already serve as sources of inspiration for designing new machine learning approaches leading to higher levels of artificial systems' intelligence. This plenary talk deals with machine learning based modular approaches which could offer powerful solutions to overcome processing difficulties in the aforementioned frame. It focuses on machine learning based modular approaches that take advantage of self-organizing multi-modeling (the "divide and conquer" paradigm). While the machine learning capability provides the processing system's adaptability and offers an appealing alternative for fashioning an adequate processing technique, the modularity may result in a substantial reduction of the treatment's complexity. In fact, this modularity-based complexity reduction may be obtained in several ways: it may result from the distribution of the computational effort over several modules (multi-modeling and macro parallelism); it can emerge from the cooperative or concurrent contribution of several processing modules handling the same task (mixture of experts); or it may derive from the modules' complementary contributions (e.g. specialization of a module on a given task to be performed). One of the most challenging classes of data processing and mining dilemmas concerns the situation where no a priori information (or hypothesis) is available. Within this frame, a self-organizing modular machine learning approach combining the "divide and conquer" paradigm and "complexity estimation" techniques, called the self-organizing "Tree-like Divide To Simplify" (T-DTS) approach, will be described and evaluated.
HCI THROUGH THE ‘HC EYE’ (HUMAN-CENTRED EYE): CAN COMPUTER VISION INTERFACES EXTRACT THE MEANING OF HUMAN INTERACTIVE BEHAVIOUR? Dr. Claude C. Chibelushi Faculty of Computing, Engineering & Technology, Staffordshire University, UK
ABSTRACT

Some researchers advocating a human-centred computing perspective have been investigating new methods for interacting with computer systems. A goal of these methods is to achieve natural, intuitive and effortless interaction between humans and computers, by going beyond traditional interaction devices such as the keyboard and the mouse. In particular, significant technical advances have been made in the development of the next generation of human computer interfaces which are based on processing visual information captured by a computer. For example, existing image analysis techniques can detect, track and recognise humans or specific parts of their body such as faces and hands, and they can also recognise facial expressions and body gestures. This talk will explore technical developments and highlight directions for future research in digital image and video analysis which can enhance the intelligence of computers by giving them, for example, the ability to understand the meaning of communicative gestures made by humans and recognise context-relevant human emotion. The talk will review research efforts towards enabling a computer vision interface to answer the what, when, where, who, why, and how aspects of human interactive behaviour. The talk will also discuss the potential impacts and implications of technical solutions to problems arising in the context of human computer interaction. Moreover, it will suggest how the power of the tools built onto these solutions can be harnessed in many realms of human endeavour.
CONFERENCE TUTORIAL
BIO-INSPIRED ARTIFICIAL INTELLIGENCE AND ISSUED APPLICATIONS Professor Kurosh Madani Images, Signals and Intelligence Systems Laboratory (LISSI / EA 3956) PARIS XII University Senart-Fontainebleau Institute of Technology France
Keynote Paper
TOWARD HIGHER LEVEL OF INTELLIGENT SYSTEMS FOR COMPLEX DATA PROCESSING AND MINING Kurosh Madani Images, Signals and Intelligent Systems Laboratory (LISSI / EA 3956), PARIS-EST / PARIS 12 University Senart-FB Institute of Technology, Bat. A, Av. Pierre Point, F-77127 Lieusaint - France
ABSTRACT

While theoretical consideration is a mandatory step in studying a given problem's solvability, for a very large number of real world dilemmas it does not lead to a solvable or realistic solution. In fact, academic approaches often start by simplifying the problem's constraints in order to obtain "mathematically solvable" models. The animal brain, however, overcomes real-world quandaries while pondering their whole complexity. While much remains to be discovered about how the animal brain trains and self-organizes itself in order to process and mine such varied and complex information, a number of recent advances in neurobiology already allow some key mechanisms of this marvellous machine to be highlighted. Among them, one can emphasize the brain's "modular" structure and its "self-organizing" capability, which could already serve as sources of inspiration for designing new machine learning approaches leading to higher levels of artificial systems' intelligence. One of the most challenging classes of data processing and mining problems concerns the situation where no a priori information (or hypothesis) is available. Within this frame, a self-organizing modular machine learning approach combining the "divide and conquer" paradigm and "complexity estimation" techniques, called the "Tree-like Divide To Simplify" (T-DTS) approach, is described and evaluated.

KEYWORDS

Machine Learning, Self-organization, Complexity Estimation, Modular Structure, Divide and Conquer, Classification.
1. INTRODUCTION

Real world applications, and especially those dealing with complex data mining, quickly expose the deficiency of academic (also sometimes called theoretical) approaches in solving such categories of problems. The difficulties begin with the very definition of the "problem's solution" notion. In fact, academic approaches often start by simplifying the problem's constraints in order to obtain a "solvable" model (here, a solvable model means a set of mathematically solvable relations or equations describing a processing flow, a behavior, a set of phenomena, etc.). While theoretical consideration is a mandatory step in studying a given problem's solvability, for a very large number of real world dilemmas it does not lead to a solvable or realistic solution. The difficulty can be related to several issues, among which:
- a large number of parameters to be taken into account, making conventional mathematical tools inefficient,
- strong nonlinearity of the data (describing a complex behavior or the relationship ruling the involved data), leading to unsolvable equations,
- partial or total inaccessibility of relevant features (relevant data), making the model insignificant,
- the subjective nature of relevant features, parameters or data, making such data or parameters difficult to process within conventional quantification,
- the need to take expert knowledge or heuristic information into account,
- imprecise information or data leakage.

As an example, one can point to the difficulties related to economic and financial modeling (data mining, feature extraction and prediction), where the large number of parameters, on the one hand, and human-related factors, on the other hand, make the related real world problems among the most difficult to solve. However, examples illustrating the above-mentioned difficulties are numerous and concern a wide panel of real world and industrial application areas (Madani, 2003-b). In fact, real world and industrial applications involving complex image or signal analysis, complex manufacturing and process optimization, priority selection and decision, classification and clustering often belong to this class of problems.
It is fascinating to note that the animal brain overcomes complex real-world quandaries while brooding over their whole complexity. While much remains to be discovered about how the animal brain trains and self-organizes itself in order to process and mine such varied and complex information, a number of recent advances in neurobiology already allow some key mechanisms of this marvellous machine to be highlighted. Among them, one can emphasize the brain's "modular" structure and its "self-organizing" capabilities. Even if our simple and inadequate binary technology remains too primitive to achieve the processing ability of these marvellous mechanisms, a number of the highlighted points can already serve as sources of inspiration for designing new machine learning approaches leading to higher levels of artificial systems' intelligence.

The present article deals with a machine learning based modular approach which takes advantage of self-organizing multi-modeling (the "divide and conquer" paradigm). While the machine learning capability provides the processing system's adaptability and offers an appealing alternative for fashioning an adequate processing technique, the modularity may result in a substantial reduction of the treatment's complexity. In fact, this modularity-based complexity reduction may be obtained in several ways: it may result from the distribution of the computational effort over several modules (multi-modeling and macro parallelism); it can emerge from the cooperative or concurrent contribution of several processing modules handling the same task (mixture of experts); or it may derive from the modules' complementary contributions (e.g. specialization of a module on a given task to be performed). One of the most challenging classes of data processing and mining dilemmas concerns the situation where no a priori information (or hypothesis) is available. Within this frame, a self-organizing modular machine learning approach combining the "divide and conquer" paradigm and "complexity estimation" techniques, called the self-organizing "Tree-like Divide To Simplify" (T-DTS) approach, is described and evaluated.
2. T-DTS: A MULTI-MODEL GENERATOR WITH COMPLEXITY ESTIMATION BASED SELF-ORGANIZATION

The main idea of the proposed concept is to take advantage of self-organizing modular processing of information, where the self-organization is controlled (regulated) by the data's "complexity" (Madani, 2003-a), (Madani, 2005-a), (Madani, 2005-b). In other words, the modular information processing system is expected to self-organize its own structure taking into account the complexity of the data and of the processing models. Of course, the goal is to reduce the processing difficulty, to enhance the processing performance and to decrease the global processing time (i.e. to increase the global processing speed). Taking into account the above-expressed ambitious objective, three dilemmas must be solved:
- the self-organization strategy,
- the modularity regulation and decision strategies,
- the local model construction and generation strategies.

It is important to note that a crucial assumption here is the availability of a database, called the "Learning Data-Base" (LDB), which is supposed to be representative of the (processing) problem to be solved. Thus, the learning phase represents a key operation in the proposed self-organizing modular information processing system. There may also be a pre-processing phase, which arranges (prepares) the data to be processed. The pre-processing phase could include several steps (such as data normalization, appropriate data selection, etc.).
2.1 T-DTS Architecture and Functional Blocs

The architecture of the proposed self-organizing modular information processing system is defined around three main operations, interacting with each other:
- data complexity estimation,
- database splitting decision and self-organizing procedure control,
- processing model (module) construction.
Figure 1. General bloc diagram of T-DTS, presenting its main operational levels: data (D) and targets (T) enter a preprocessing bloc (normalizing, removing outliers, principal component analysis); a learning bloc performs feature space splitting and NN based local models' generation under a complexity estimation loop, producing the multi-model's structure used by the generalization bloc to deliver the processing results.
Figure 1 gives the operational bloc diagram of the proposed architecture. The T-DTS architecture includes three main operational blocs. The first is the "pre-processing bloc", which arranges (prepares) the data to be processed; the pre-processing phase can include several steps resulting in a convenient format (representation) of the involved data. The second is the "learning bloc", a chief stage in the T-DTS system's operational structure, which is in charge of the "learning phase". Finally, the third one is the "generalization bloc" (or "working bloc"), which processes incoming data that have not been learned.
Figure 2. General bloc diagram of the T-DTS tree-like splitting process (left) and of the learning process operations' flow: the learning database is recursively divided by splitting units (SP) into learning sub-databases, each handled by a Local Neural Model (LNM).
The learning phase is an important phase during which T-DTS performs several key operations: splitting the learning database into several sub-databases, building a set of "Local Neural Models" (LNM) for each sub-database issued from the tree-like splitting process, and constructing (dynamically) the "Supervision/Scheduling Unit" (SSU). Figure 2 represents the feature space splitting (SP) and LNM construction process
bloc diagram. As this figure shows, after the learning phase a set of neural network based models (trained from the sub-databases) is available and covers the behaviour (maps the complex model) region by region in the problem's feature space. In this way, a complex problem is decomposed recursively into a set of simpler sub-problems: the initial feature space is divided into M sub-spaces. For each sub-space k, T-DTS constructs a neural based model describing the relations between inputs and outputs (data). If a neural based model cannot be built for an obtained sub-database, then a new decomposition is performed on the concerned sub-space, dividing it into several other sub-spaces. Figure 3 gives the bloc diagram of the constructed solution (i.e. the constructed multi-model). As this figure shows, the resulting processing system appears as a multi-model including a set of local models and a supervision unit (the SSU). When processing unlearned data, the SSU first determines the most suitable LNM for the incoming data; then the selected LNM processes the data.
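To make the recursive decomposition and the SSU routing concrete, the following is a minimal Python sketch under stated assumptions: the complexity estimator, the splitter and the local-model trainer are generic callables supplied by the caller, and the whole interface is illustrative rather than the actual (MATLAB-based) T-DTS implementation.

```python
class TDTSNode:
    """One node of the T-DTS tree: either an internal splitter or a leaf local model."""
    def __init__(self, model=None, splitter=None, children=None):
        self.model = model            # leaf: local neural model, callable on one sample
        self.splitter = splitter      # internal node: callable returning a child index
        self.children = children or []

def build_tree(X, y, complexity, split, fit_local, threshold, depth=0, max_depth=6):
    """Recursively decompose (X, y) while the estimated complexity exceeds the threshold."""
    if depth >= max_depth or complexity(X, y) <= threshold:
        return TDTSNode(model=fit_local(X, y))        # simple enough: build a local model
    splitter, parts = split(X, y)                      # e.g. a competitive-network splitter
    children = [build_tree(Xk, yk, complexity, split, fit_local, threshold, depth + 1, max_depth)
                for Xk, yk in parts]
    return TDTSNode(splitter=splitter, children=children)

def predict(node, x):
    """Generalization phase: route x through the supervision unit down to a local model."""
    while node.model is None:
        node = node.children[node.splitter(x)]
    return node.model(x)
```

In this sketch the tree itself plays the role of the Supervision/Scheduling Unit: internal nodes route an unlearned sample to the sub-space it falls into, and the leaf's local model produces the answer.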
Figure 3. General bloc diagram of the T-DTS generalization phase: the input Ψ(t) follows a control path through the Supervisor/Scheduler Unit, which routes it along the data path to one of the local neural models LNM 1 … LNM M, producing the corresponding output Y1 … YM.

Figure 4. Taxonomy of classification complexity estimation methods: Bayes error estimation, either indirect bounds (Chernoff bound (Chernoff, 1966), Bhattacharyya bound (Bhattacharya, 1943), divergence (Lin, 1991), Mahalanobis distance (Takeshita, 1987), Jeffries-Matusita distance (Matusita, 1967)) or non-parametric estimates (error of the classifier itself, entropy measures (Chen, 1976), k-Nearest Neighbours (Cover et al., 1967), Parzen estimation (Parzen, 1962), boundary methods (Pierson, 1998)); space partitioning methods (class discriminability measures (Kohn, 1996), purity measure, neighborhood separability and collective entropy (Singh, 2003)); and other approaches (correlation-based approach (Rahman, 2003), Fisher discriminant ratio (Fisher, 2000), interclass distance measure (Fukunaga, 1990), volume of the overlap region (Ho et al., 1998), feature efficiency (Friedman et al., 1979), minimum spanning tree (Ho, 2000), inter-intra cluster distance (Maddox, 1990), space covered by epsilon neighbourhoods, ensembles of estimators).
The "Complexity Estimation Loop" (CEL) plays a capital role in the splitting process (the division of the initial complex problem into a set of sub-problems of reduced complexity), providing the self-organization capability of T-DTS. It acts as a kind of "regulation" mechanism which controls the splitting process in order to handle the global task more efficiently. The complexity estimation based decomposition can be performed according to two general strategies: a "static regulation policy" and an "adaptive regulation policy". In both strategies, the resulting solution can either be a binary tree-like ANN based structure or a multiple-branch tree-like ANN based framework. The main difference between the two strategies lies in the nature of the complexity estimation indicators and of the splitting decision operator performing the splitting process: a "static splitting policy" in the first case and an "adaptive decomposition policy" in the second. Figure 4 gives the general
taxonomy of different "complexity estimation" approaches, including references describing a number of techniques involved in the above-mentioned two general strategies. In a general way, the techniques used for complexity estimation can be sorted into three main categories: those based on "Bayes Error Estimation", those based on "Space Partitioning Methods" and others based on "Intuitive Paradigms". "Bayes Error Estimation" may involve two classes of approaches, known as indirect and non-parametric Bayes error estimation methods, respectively. Concerning "Intuitive Paradigms" based complexity estimation, an appealing approach is to use ANN learning itself as a complexity estimation indicator (Budnyk et al., 2008). The idea is based on the following assumption: the more complex a task (or problem) is, the more neurons will be needed to learn it correctly. However, the choice of an appropriate neural model is of major importance here. In fact, the learning rule of the neural network model used as a complexity estimator has to be sensitive to the problem's complexity. If m represents the number of data to learn and g_i(m) is a function relating to the learning complexity, then a first indicator can be defined as relation (1). An adequate candidate satisfying the above-mentioned condition is the class of kernel-like neural networks: in this kind of neural model, the learning process acts directly on the number of connected (i.e. involved) neurons in the unique hidden layer of the ANN. For this class of ANN, g_i(m) can be the number of neurons needed in order to achieve a correct learning of the m data, leading to the simple form of relation (1) expressed as relation (2), where n is the number of connected neurons in the hidden layer.

Q_i(m) = g_i(m) / m,   with m ≥ 1 and g_i(m) ≥ 0   (1)

Q = n / m,   with m ≥ 1 and n ≥ 0   (2)

An appealingly simple version of kernel-like ANN is implemented by the IBM ZISC-036 neuro-processor (De Tremiolles, 1998). In this simple model a neuron is an element which is able to:
- memorize a prototype (64 components coded on 8 bits), the associated category (14 bits), an influence field (14 bits) and a context (7 bits),
- compute the distance, based on the selected norm (L1 or LSUP), between its memorized prototype and the input vector (the distance is coded on fourteen bits),
- compare the computed distance with the influence fields,
- communicate with other neurons (in order to find the minimum distance, category, etc.),
- adjust its influence field (during the learning phase).

The simplicity of the ZISC-036 learning mechanism makes it a suitable candidate for implementing the above-described intuitive complexity estimation concept.
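As a hedged illustration of this intuitive indicator, the sketch below emulates an RCE/kernel-like incremental learner in plain Python (one prototype, influence field and category per "neuron"; the initial radius and the simple shrinking rule are assumptions, not the ZISC-036's exact mechanism) and returns Q = n/m after learning m samples with n committed neurons.

```python
import numpy as np

def intuitive_complexity_indicator(X, y, init_radius=1.0, min_radius=1e-3):
    """Return Q = n/m, where n is the number of neurons an RCE-like learner commits on m samples."""
    prototypes, radii, categories = [], [], []
    for x, label in zip(np.asarray(X, dtype=float), y):
        covered = False
        for i, (p, r, c) in enumerate(zip(prototypes, radii, categories)):
            d = np.abs(x - p).sum()                      # L1 norm, as on the ZISC-036
            if d <= r:
                if c == label:
                    covered = True                       # correctly covered by an existing neuron
                else:                                    # misfire: shrink the wrong neuron's field
                    radii[i] = max(min(radii[i], 0.99 * d), min_radius)
        if not covered:                                  # commit a new neuron for this sample
            prototypes.append(x)
            radii.append(init_radius)
            categories.append(label)
    return len(prototypes) / len(X)
```

For a fixed problem, Q decreases as m grows (relation (2)); for harder class geometries more neurons get committed, so Q rises, which is the sensitivity expected of the indicator.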
2.2 Software Implementation

The T-DTS software incorporates three databases: decomposition methods, ANN models and complexity estimation modules. The T-DTS software engine is the Control Unit. This core module controls and activates several software packages: normalization of the incoming database (if required), splitting and building a tree of prototypes using the selected decomposition method, sculpting the set of local results, and generating the global result (learning and generalization rates). The T-DTS software can be seen as a Lego system of decomposition methods and processing methods powered by a control engine and accessible to the operator through a Graphic User Interface. The current T-DTS software (version 2.02) includes the following units and methods (an illustrative sketch of one such decomposition step is given after this list):
- Decomposition Units:
  - CN (Competitive Network)
  - SOM (Self-Organizing Map)
  - LVQ (Learning Vector Quantization)
- Processing Units:
  - LVQ (Learning Vector Quantization)
  - Perceptrons
  - MLP (Multilayer Perceptron)
  - GRNN (General Regression Neural Network)
  - RBF (Radial Basis Function network)
  - PNN (Probabilistic Neural Network)
  - LN
- Complexity estimators (Bouyoucef, 2007), based on the following criteria:
  - MaxStd (sum of the maximal standard deviations)
  - Fisher measure (Fisher, 2000)
  - Purity measure (Singh, 2003)
  - Normalized mean distance (Kohn, 1996)
  - Divergence measure (Lin, 1991)
  - Jeffries-Matusita distance (Matusita, 1967)
  - Bhattacharyya bound (Bhattacharya, 1943)
  - Mahalanobis distance (Takeshita, 1987)
  - Scatter-matrix method based on inter-intra matrix criteria (Fukunaga, 1972)
  - ZISC© IBM® based complexity indicator (Budnyk et al., 2007)
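As an illustration of one such decomposition unit, the following sketch splits a (sub-)database around competitive prototypes in the spirit of the CN/SOM/LVQ options listed above; it is a plain k-means-like stand-in, not the actual MATLAB implementation, and its return signature matches the generic split callable assumed in the earlier tree-building sketch.

```python
import numpy as np

def competitive_split(X, y, n_prototypes=2, iters=50, seed=0):
    """Split a (sub-)database into n_prototypes sub-databases around competitive prototypes."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    protos = X[rng.choice(len(X), n_prototypes, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - protos[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(n_prototypes):
            if np.any(assign == k):
                protos[k] = X[assign == k].mean(axis=0)   # move prototype to its cluster centre
    parts = [(X[assign == k], y[assign == k]) for k in range(n_prototypes)]
    router = lambda x: int(np.argmin(((protos - np.asarray(x, float)) ** 2).sum(-1)))
    return router, parts
```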
Figure 5. General bloc diagram of the T-DTS system’s software architecture (left) and an example of its 2D-data representation graphic option (right).
Figure 6. Screenshot of Matlab-implementation of T-DTS User Graphic Interface showing parameterization screenshot (left) and results control panel (right).
The output result panel offers the user several graphic variants. Figure 6 shows the parameterization and results-display control panels. Among the offered possibilities, one of the most useful is the option providing a 2D-data representation that sorts the decomposed sub-databases and their representative centers conformably to the performed decomposition process. In this representation, the final graphic shows the obtained tree and the obtained clusters. The right picture of figure 5 gives an example of such a representation.
2.3 Experimental Evaluation

A specific benchmark has been designed in order to investigate the complexity estimation strategies described in the reported references. The benchmark has been elaborated on the basis of a 2-class classification framework and has been defined in the following way: three databases, containing data in a 2-D feature space (meaning that the class to which a given datum belongs depends on two parameters) belonging to two classes, have been generated. Two of them, including 1000 vectors each, represent two different distributions (of data). In the
first database, data is distributed according to a "circle" geometry (symmetrical). In the second database, data is distributed according to a two-spirals-like geometry. Each database has been divided into two equal parts (learning and generalization databases) of 500 vectors each. The databases are normalized (to obtain a mean equal to 0 and a variance equal to 1). The third database contains a set of data distributions (databases generated according to a similar philosophy to the previous ones) with gradually increasing classification difficulty. Figure 7 gives three examples of data distributions with gradually increasing classification difficulty: (1) corresponds to the simplest classification problem and (12) to the most difficult.
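For illustration, a two-class benchmark of this kind could be generated as in the hedged sketch below; the exact radii, number of turns and noise level used by the authors are not given, so the values here are assumptions.

```python
import numpy as np

def two_spirals(n_per_class=500, turns=2.0, noise=0.05, seed=0):
    """Two interleaved spirals in a 2-D feature space, labelled 0 and 1, normalized to zero mean / unit variance."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.1, turns * 2 * np.pi, n_per_class)
    r = t / (turns * 2 * np.pi)
    base = np.c_[r * np.cos(t), r * np.sin(t)]
    s0 = base + noise * rng.standard_normal((n_per_class, 2))
    s1 = -base + noise * rng.standard_normal((n_per_class, 2))    # second spiral, rotated by 180 degrees
    X = np.vstack([s0, s1])
    y = np.r_[np.zeros(n_per_class, dtype=int), np.ones(n_per_class, dtype=int)]
    X = (X - X.mean(axis=0)) / X.std(axis=0)                      # normalization, as described in the text
    return X, y
```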
Figure 7. Benchmark: database examples (1), (6) and (12) with gradually increasing complexity.
Figure 8. Results for the modular structure with the static complexity estimation strategy (left) and the adaptive complexity estimation strategy (right): learning rate, generalization rate and time versus problem reference (1 to 12). "Time" corresponds to the "learning phase" duration and PU to the "number of generated local models".
Based on the above-presented set of benchmark problems, two self-organizing modular systems have been generated in order to solve the corresponding classification problem: the first one using the "static" complexity estimation method with a threshold based decomposition decision rule, and the second one using an "adaptive" complexity estimation criterion based on Fisher's discriminator. As the Fisher's discriminator based complexity estimation indicator measures the distance between two classes (relative to the averages and dispersions of the data representative of each class), it can be used to adjust the splitting decision proportionally to the problem's difficulty: a short distance between two classes (of data) reflects higher difficulty, while well separated classes of data delimit two well identified regions (of data) and thus lower processing complexity. Figure 8 gives the classification results obtained for each of the above-considered cases. One can note from the left diagram of figure 8 that the processing times are approximately the same for each dataset, while the classification rate drops significantly for more complicated datasets. This shows that, as the databases' complexity increases, such a modular system cannot maintain the processing quality. Concerning the adaptive strategy (right diagram of figure 8), one can notice a significant enhancement in the generalization phase. The classification rates in learning mode are alike, achieving good learning performance. In fact, in the generalization (test) phase there is only a small dropping tendency of the classification rate when the classification difficulty increases. However, in this case the processing time (concerning essentially the learning phase) increases significantly for more complex datasets. This is in contrast with the results obtained for the previous structure (presented in the left diagram of figure 8). In fact, in this case the dynamic structure adapts the decomposition (and so the modularity) in order to reduce the processing complexity, by creating a modular processing structure proportional to the processed data's complexity.
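A minimal sketch of such a Fisher-discriminator-based splitting decision is given below for the two-class case; the per-feature ratio summed over dimensions and the example threshold are assumptions, since the paper does not give the exact formula or threshold used.

```python
import numpy as np

def fisher_ratio(X, y):
    """Between-class distance relative to within-class dispersion for a two-class dataset."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    c0, c1 = X[y == 0], X[y == 1]
    between = (c0.mean(axis=0) - c1.mean(axis=0)) ** 2
    within = c0.var(axis=0) + c1.var(axis=0) + 1e-12
    return float((between / within).sum())

def should_split(X, y, threshold=0.75):
    """Low Fisher ratio = poorly separated classes = high complexity, so keep decomposing."""
    return fisher_ratio(X, y) < threshold
```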
Figure 9. Number of generated models versus the splitting threshold for each complexity estimation technique: results obtained for “circular” distribution (a) and ”two spiral” distribution (b) respectively.
Extending the experiment to other complexity estimation indicators, similar results have been obtained, with more or less sensitivity depending on the indicator. Figure 9 gives the number of generated models versus the splitting threshold for each complexity estimation technique, for the "circular" and "two spiral" distributions respectively. For both databases the best classification rate was obtained when the decomposition (splitting) was decided using the "purity" measurement based complexity indicator. At the same time, the "Fisher" discriminator based complexity estimation achieved performances close to the previous one. Regarding the number of generated models, the first complexity estimation indicator (purity) leads to a much greater number of models.
Figure 10. ZISC-036 neuro-computer based complexity indicator versus the learning database's size (m).

Table 2. Correct classification rates within different configurations: T-DTS with different complexity estimators. Gr represents the "correct classification rate", Std. Dev. is the "standard deviation", "TTT" abbreviates the Tic-tac-toe end-game problem and "DNA" abbreviates the second benchmark. LGB is the data fraction (in %) used in the learning phase and GDB is the data fraction used in the generalization phase.

Experimental Conditions              Complexity Estimator used in T-DTS      Max Gr (± Std. Dev.) (%)
TTT with LGB = 50% & GDB = 50%       Mahalanobis com. est.                   84.551 (± 4.592)
TTT with LGB = 50% & GDB = 50%       ZISC based com. est.                    82.087 (± 2.455)
TTT with LGB = 50% & GDB = 50%       Normalized mean com. est.               81.002 (± 1.753)
DNA with LGB = 20% & GDB = 80%       Mahalanobis com. est.                   78.672 (± 4.998)
DNA with LGB = 20% & GDB = 80%       Jeffries-Matusita based com. est.       75.647 (± 8.665)
DNA with LGB = 20% & GDB = 80%       ZISC based com. est.                    80.084 (± 3.176)
Finally, a similar verification benchmark, including five increasing levels of complexity (resulting in "Q1" to "Q5" different sets of Qi indicator values: Q1 corresponds to the easiest problem and Q5 to the hardest one) and eight different database sizes, indexed from "1" to "8" respectively (containing 50, 100, 250, 500, 1000,
2500, 5000 and 10000 patterns respectively), has been carried out. For each set of parameters, tests have been repeated 10 times in order to obtain reliable statistics and to check the deviation and average of the obtained results. In total, 800 tests have been performed. Figure 10 gives the evaluation results within the above-described experimental protocol. The expected behavior of the Qi indicator can be summarized as follows: for a given problem (i.e. the same index), increasing the number of representative data (i.e. the parameter m) tends to decrease the indicator's value (i.e. the enhancement of representativeness reduces the problem's ambiguity); on the other hand, for problems of increasing complexity, the Qi indicator tends to increase. In fact, as one can see from Figure 10, the proposed complexity estimator's value decreases when the learning database's size increases. In the same way, the value of the indicator ascends from the easiest classification task (i.e. Q1) to the hardest one (i.e. Q5). These results show that the proposed complexity estimator is sensitive to the classification task's complexity and behaves conformably to the aforementioned expectations. In order to extend the frame of the evaluation tests, two pattern classification benchmark problems have been considered. The first one, known as the "Tic-tac-toe end-game problem", consists of predicting whether each of 958 legal endgame boards for tic-tac-toe is won for 'x'. The 958 instances encode the complete set of possible board configurations at the end of a tic-tac-toe game; this problem is hard for the covering family of algorithms because of multi-overlapping. The second one, known as the "Splice-junction DNA Sequences classification problem", aims to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This problem consists of two subtasks: recognizing exon/intron boundaries (referred to as EI sites) and recognizing intron/exon boundaries (IE sites). There are 3190 instances from Genbank 64.1, each of them comprising 62 attributes defining the DNA sequences (ftp-site: ftp://ftp.genbank.bio.net). Table 2 gives the results obtained for the different configurations of T-DTS with different complexity estimators. It is pertinent to emphasize the relative stability of the "correct classification rates" and of the corresponding "standard deviations" when the intuitive ANN based complexity estimator is used.
3. CONCLUSION

A key point on which one can act is the reduction of the processing complexity. It may concern not only the problem representation level (data) but also the processing procedure level. One approach is to reduce the processing model's complexity by splitting a complex problem into a set of simpler sub-problems: multi-modeling, where a set of simple models is used to sculpt a complex behavior. The main goal of this paper was to show that, by introducing "modularity" and a "self-organization" ability obtained from "complexity estimation based regulation" mechanisms, it is possible to obtain powerful adaptive modular information processing systems carrying out higher level intelligent operations. In particular, concerning the classification task, a chief step in the data-mining process, the presented concept shows appealing potential to meet the ever-increasing needs of today's complex data-mining applications.
ACKNOWLEDGEMENT

I would like to express my gratitude to Dr. Abrennasser Chebira, a member of my research team, who has devoted his valuable efforts to this topic since 1999. I would also like to express many thanks to Dr. Mariusz Rybnik and Dr. El-Khier Bouyoucef, who strongly contributed to the advances carried out within their PhD theses. Finally, I would like to thank Mr. Ivan Budnyk, my PhD student, who currently works on T-DTS and the intuitive complexity estimator.
REFERENCES

Abiteboul, S. et al, 2000. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann Publishers, San Francisco, USA.
Beck, K. and Ralph, J., 1994. Patterns Generates Architectures. Proceedings of the European Conference on Object-Oriented Programming. Bologna, Italy, pp. 139-149.
Bhattacharya, A., 1943. On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Maths Society, Vol. 35, pp. 99-110.
Bodorik, P. et al, 1991. Deciding to Correct Distributed Query Processing. In IEEE Transactions on Data and Knowledge Engineering, Vol. 4, No. 3, pp. 253-265.
Budnyk, I., Bouyoucef, E., Chebira, A. and Madani, K., 2008. Neuro-computer Based Complexity Estimator Optimizing a Hybrid Multi-Neural Network Structure. COMPUTING, ISSN 1727-6209, Vol. 7, Issue 3, pp. 122-129.
Chen, C.H., 1976. On information and distance measures, error bounds, and feature selection. Information Sciences, pp. 159-173.
Chernoff, A., 1966. Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics, Vol. 18, pp. 179-189.
Cover, T. M. and Hart, P. E., 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, Vol. 13, pp. 21-27.
De Tremiolles, G., 1998. Contribution to the theoretical study of neuro-mimetic models and to their experimental validation: a panel of industrial applications. Ph.D. Report, University of PARIS 12 (in French).
Fisher, A., 2000. The mathematical theory of probabilities. John Wiley.
Friedman, J. H. and Rafsky, L. C., 1979. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, Vol. 7(4), pp. 697-717.
Fukunaga, K., 1990. Introduction to statistical pattern recognition. Academic Press, New York, 2nd ed.
Ho, K. and Baird, H. S., 1993. Pattern classification with compact distribution maps. Computer Vision and Image Understanding, Vol. 70(1), pp. 101-110.
Ho, T. K., 2000. Complexity of classification problems and comparative advantages of combined classifiers. Lecture Notes in Computer Science.
Kohn, A., Nakano, L.G. and Mani, V., 1996. A class discriminability measure based on feature space partitioning. Pattern Recognition, Vol. 29(5), pp. 873-887.
Lin, J., 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, Vol. 37(1), pp. 145-151.
Madani, K., Chebira, A. and Rybnik, M., 2003-a. Data Driven Multiple Neural Network Models Generator Based on a Tree-like Scheduler. LNCS "Computational Methods in Neural Modeling", Ed. J. Mira, J.R. Alvarez, Springer Verlag, ISBN 3-540-40210-1, pp. 382-389.
Madani, K., Chebira, A. and Rybnik, M., 2003-b. Nonlinear Process Identification Using a Neural Network Based Multiple Models Generator. LNCS "Computational Methods in Neural Modeling", Ed. J. Mira, J.R. Alvarez, Springer Verlag, ISBN 3-540-40211-X, pp. 647-654.
Madani, K., Thiaw, L., Malti, R. and Sow, G., 2005-a. Multi-Modeling: a Different Way to Design Intelligent Predictors. LNCS, Ed. J. Cabestany, A. Prieto and D.F. Sandoval, Springer Verlag, Vol. 3512, pp. 976-984.
Madani, K., Chebira, A., Rybnik, M. and Bouyoucef, E., 2005-b. Intelligent Classification Using Dynamic Modular Decomposition. 8th International Conference on Pattern Recognition and Information Processing (PRIP 2005), May 18-20, 2005, Minsk, Byelorussia, ISBN 985-6329-55-8, pp. 225-228.
Matusita, K., 1967. On the notion of affinity of several distributions and some of its applications. Annals of the Institute of Statistical Mathematics, Vol. 19, pp. 181-192.
Parzen, E., 1962. On estimation of a probability density function and mode. Annals of Mathematical Statistics, Vol. 33, pp. 1065-1076.
Pierson, W.E., 1998. Using boundary methods for estimating class separability. PhD Thesis, Dept. of Electrical Engineering, Ohio State University.
Rahman, A. F. R. and Fairhurst, M., 1998. Measuring classification complexity of image databases: a novel approach. Proceedings of the International Conference on Image Analysis and Processing, pp. 893-897.
Singh, S., 2003. Multiresolution estimates of classification complexity. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Takeshita, T., Kimura, F. and Miyake, Y., 1987. On the estimation error of Mahalanobis distance. Trans. IEICE, pp. 567-573.
Full Papers
AN EXPERIMENTAL STUDY OF THE DISTRIBUTED CLUSTERING FOR AIR POLLUTION PATTERN RECOGNITION IN SENSOR NETWORKS Yajie Ma Information Science and Engineering College, Wuhan University of Science and Technology 947, Heping Road, Wuhan, 43008, China Department of Computing, Imperial College London 180 Queens Gate, London, SW7 2BW, UK
Yike Guo, Moustafa Ghanem Department of Computing, Imperial College London 180 Queens Gate, London, SW7 2BW, UK
ABSTRACT

In this paper, we present an experimental study of urban air pollution pattern analysis within the MESSAGE system. A hierarchical network framework consisting of mobile sensors and stationary sensors is designed, and a sensor gateway core architecture suited to grid-based computation is developed. We then carry out an experimental analysis, including the identification of pollution hotspots and the dispersion of pollution clouds, based on a real-time peer-to-peer clustering algorithm. Our results provide a typical air pollution pattern in an urban environment, which gives a real-time track of the air pollution variation.

KEYWORDS

Pattern recognition, Distributed clustering, Sensor networks, Grid, Air pollution.
1. INTRODUCTION

Road traffic makes a significant contribution to the emissions of the following pollutants: benzene (C6H6), 1,3-butadiene, carbon monoxide (CO), lead, nitrogen dioxide (NO2), ozone (O3), particulate matter (PM10 and PM2.5) and sulphur dioxide (SO2). In the past decade, environmental applications, including air quality control and pollution monitoring [1–3], have been receiving steadily increasing attention. Under the current Environment Act of the UK [4], most local authorities operate air quality monitoring stations that provide environmental information to the public daily via the Internet. The conventional approach to assessing pollution concentration levels is based on data collected from a network of permanent air quality monitoring stations. However, permanent monitoring stations are frequently situated so as to measure ambient background concentrations or at potential 'hotspot' locations, and are usually several kilometers apart. From our earlier research in the 'Discovery Net EPSRC e-Science Pilot Project' [5], we learnt that the pollution levels and the hotspots change with time. This kind of change in pollution levels and hotspots can be calculated as dispersion under given sets of meteorological conditions. Whatever dispersion model is used, it should relate the source, meteorology and spatial patterns to the air quality at receptor points [6]. Until now, much attention has been paid to the spatial patterns in the relationships between sources and receptors, such as how the arrangement of sources affects the air quality at receptor locations [7] and how to employ various kinds of atmospheric pollution dispersion models [8, 9]. However, the phenomenon of road traffic air pollution shows considerable variation within a street canyon as a function of the distance to the source of pollution [10]. Therefore, the levels, and consequently the number of affected inhabitants, vary. Information on a number of key factors, such as individual driver/vehicle activity, pollution concentration and individual human exposure, has traditionally either simply not been available or only been available at high levels of spatial and temporal
aggregation. This is mainly caused by critical data gaps and asymmetries in data coverage, as well as the lack of the on-line data processing capability offered by e-Science. We can fill these data gaps in two ways: by generating new forms of data (e.g., on exposure and driver/vehicle activity) and by generating data at higher levels of spatial and temporal resolution than existing sensor systems. Taking advantage of a low-cost mobile environmental sensor system, we construct the MESSAGE (Mobile Environmental Sensor System Across Grid Environments) system [11], which fully integrates existing static sensor systems and complementary data sources with the mobile environmental sensor system. It can provide radically improved capability for the detection and monitoring of environmental pollutants and hazardous materials. In this paper, based on our former work on MoDisNet [12], we introduce the experimental analysis for urban air pollution monitoring within the MESSAGE system. The main contributions of this paper are as follows. First, we propose a sensor gateway core architecture for the sensor grid that provides processing, integration and analysis of heterogeneous sensor data in both centralized and distributed ways. With the support of the hierarchical network architecture formed by the mobile sensors carried by public vehicles and the stationary sensors fixed on roadside devices, the MESSAGE system fully considers the urban background and the pollution features, which makes it highly effective for air pollution monitoring. Second, we carry out an experimental study of typical air pollution pattern analysis in an urban environment, based on a real-time distributed clustering algorithm for the sensor grid, which gives a real-time track of the air pollution variation. The results also provide important information for environmental protection and individual supervision. In the remainder of this paper, we first present, in section 2, the system architecture designed to meet the demands of the project. We also discuss the novel techniques we provide to address the problems arising when a sensor grid is constructed in a mobile, high-throughput, real-time data environment. In section 3, the distributed clustering algorithm is introduced together with its performance analysis. We describe the real-time pollution pattern recognition experiments in section 4. Section 5 concludes the paper with a summary of the research and a discussion of future work.
2. METHODOLOGY

2.1 Modeling Approach

The key feature of the MESSAGE system is the use of a variety of vehicle fleets, including buses, service vehicles, taxis and commercial vehicles, as platforms for environmental sensors. In collaboration with the static sensors fixed on the roadside, the whole system can detect the real-time air pollution distribution in London. To satisfy this demand, the MESSAGE system is constructed on a two-layer network architecture cooperating with the e-Science Grid architecture. The Grid structure is characterised by the sensor gateway core architecture, which enables the sensors themselves to naturally form networks and communicate with each other in P2P style within large-scale mobile sensor networks. This provides MESSAGE with the ability to support the full range of analytical tasks, from dynamic real-time mining of sensor data to the analysis of off-line data warehoused for historical analysis. The sensors in the MESSAGE Grid are equipped with sufficient computational capabilities to participate in the Grid environment and to feed data to the warehouse, as well as to perform analysis tasks and communicate with their peers. The network framework and the sensor gateway core architecture are illustrated in Figure 1(a) and (b). The mobile sub-network is formed by the Mobile Sensor Nodes (MSNs) and the stationary sub-network is organized by the Static Sensor Nodes (SSNs). MSNs are installed in the vehicles. They sample the pollution data and execute A/D conversion to obtain digital signals. According to the system requirements, the MSNs may pre-process the raw data (noise reduction, local data cleaning and fusion, etc.) and then send these data to the nearest SSN. The SSNs are in charge of receiving, updating, storing and exchanging data. The sensors (both SSNs and MSNs) connect to the MESSAGE Grid through several Sensor Gateways (SGs) according to different wireless access protocols. The sensors are capable of collecting air pollution data at up to 1 Hz frequency and sending the data to the remote Grid service hop by hop. This capability enables the sensors to exchange their raw data locally and then carry out data analysis and mining in a distributed way.
The SGs are in charge of connecting the wireless sensor network with the IP backbone, which can be either wired or wireless. All SGs are managed by a Root Gateway (RG), a logical entity that may consist of a number of physical root nodes operating in a peer-to-peer fashion to ensure reliability. The RG is the central element of the Sensor Gateway architecture. The SG service maintains details of the SGs that are available and of their available capacity. The aim of the RG is to load-balance across the available SGs, which is very useful for improving the throughput and performance of the Grid architecture. A database that can be accessed via SQL is managed by the Grid architecture; it centrally stores and maintains all the archived data, including derived sensor data and third-party data such as traffic, weather and health data. These data provide a wealth of information for the Grid computation to generate short-term or long-term models relating air pollution and traffic. Furthermore, they may support the prediction of forthcoming events such as traffic changes and pollution trends.
Figure 1. The Network Framework and Sensor Gateway Core Architecture within MESSAGE: (a) network framework, showing Static Sensor Nodes (SSN) and Mobile Sensor Nodes (MSN); (b) sensor gateway core architecture
2.2 Preparation of Input Data

The input data, drawn from our former research [5], consist of air pollution data sampled from 140 sensors, marked as red dots (see Figures 3 and 4 in Section 4), distributed over a typical urban area around the Tower Hamlets and Bromley areas in east London. Typical landmarks include the main roads extending from A6 to L10 and from M1 to K10, the hospitals around B5 and K4, the schools at B7, C8, D6, F10, G2, H8, K8 and L3, the train stations at D7 and L5, and a gasworks between D2 and E1. The 140 sensors collect data from 8:00 to 17:59 at 1-minute intervals to monitor the pollution volumes of NO, NO2, SO2 and ozone. There are therefore 600 data items for each node and a total of 84,000 data items for the whole network. Each data item is identified by a time stamp, a location and a four-pollutant volume reading. Once sensor data are collected, data cleaning and pre-processing are necessary before further analysis and visualization can be performed. Most importantly, missing data must be interpolated, using bounding data from the same sensor or data from nearby sensors at the same time. Interpolated data may be stored back to the original database, with provenance information including the algorithm used. Such pre-processing is standard, and has been conducted using the available MESSAGE component. The relatively high spatial density of sensors also allows a detailed map of pollution in both space and time to be generated.
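For illustration only, the following is a minimal sketch of how such gap filling could be done; it is not the MESSAGE pre-processing component itself, and the DataFrame layout and the `nearby` neighbour map are assumptions made for the example.

```python
import pandas as pd

def interpolate_missing(readings: pd.DataFrame, nearby: dict) -> pd.DataFrame:
    """Fill gaps in per-sensor pollutant time series.

    readings: rows indexed by timestamp, one column per sensor id,
              values are pollutant volumes (NaN where a sample is missing).
    nearby:   maps a sensor id to a list of neighbouring sensor ids.
    """
    filled = readings.copy()
    # First pass: interpolate each sensor from its own bounding readings in time.
    filled = filled.interpolate(method="time")
    # Second pass: any gap still left is filled from the average of nearby
    # sensors sampled at the same time stamp.
    for sensor, neighbours in nearby.items():
        still_missing = filled[sensor].isna()
        if still_missing.any() and neighbours:
            filled.loc[still_missing, sensor] = (
                readings.loc[still_missing, neighbours].mean(axis=1)
            )
    return filled
```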
3. DISTRIBUTED CLUSTERING ALGORITHM

Data mining for pollution monitoring in sensor networks in an urban environment faces several challenges. First, the methods of data collection and pre-processing rely heavily on the complexity of the environment. For example, the distribution and features of pollution data are correlated with the inter-relationships between the environment, geography, topography, weather, climate and the pollution source, which may guide the design of the data mining algorithms. Also, the mobility of the sensor nodes increases the complexity of sensor data collection and analysis [13, 14]. Second, the resource-constrained (power, computation or communication), distributed and noisy nature of sensor networks makes it difficult to store the historical data in each sensor, or even the summaries/patterns derived from the historical data [15]. Third, sensor data come in time-ordered streams over the network, which makes traditional centralized mining techniques inapplicable. As a result, real-time distributed data mining (DDM) schemes are in significant demand in such scenarios. Considering the pattern recognition application, in this section we introduce a peer-to-peer clustering algorithm together with its performance analysis.
3.1 P2P Clustering Algorithm

To realize a DDM algorithm capable of exchanging information in P2P style, a P2P clustering algorithm is designed to find the pollution patterns in the urban environment according to the sampled air pollutant volumes. The algorithm is a hierarchical clustering algorithm based on DBSCAN [16]. However, our algorithm has the following characteristics: 1. Nodes only require local synchronization at any time, which is better suited to a dynamic environment. 2. Nodes only need to communicate with their immediate neighbors, which reduces the communication complexity. 3. Data are inherently distributed over all the nodes, which makes the algorithm applicable to large, complex systems. The algorithm runs in each SSN (MSNs are only in charge of collecting data and sending them to the closest SSN). In order to describe the algorithm, we first give some definitions (suppose the total number of SSNs is n, n > 0):
SSNi: the SSN node with identity i (i = 0, …, n-1);
Si: the Information Exchange Node Set (IENS) of SSNi, i.e. the set of SSNs that can exchange information with SSNi;
CS: the candidate cluster centre set; each element of CS is a cluster centre;
Cli,j: the centre of the jth (j ≥ 0) cluster computed in SSNi in the lth recursion (l ≥ 0), with Cli,j ∈ CS;
Numi,j: the number of members (data points) belonging to the jth cluster in SSNi;
E(X, Y): the Euclidean distance between data items X and Y;
D: a pre-defined distance threshold;
δ: a pre-defined offset threshold.
The algorithm proceeds as follows.
1. Generate Si and the local data set. Node SSNi receives data from MSNs as its local data and chooses a certain number of SSNs as Si by means of a random algorithm (the details of the random algorithm are beyond the scope of this article).
2. Generate CS. This process is described by the following pseudo code:
   SSNi puts a data item j from its local data set into CS as C0i,j;
   for each other data item k in the local data set of SSNi
     if E(k, m) > D for every data item m ∈ CS
       put k into CS as C0i,k;
3. Distribute data. For each candidate cluster centre C0i,j ∈ CS and each data item Y, if E(C0i,j, Y) < D, then assign Y to that cluster. Each local cluster of SSNi can thus be described as (C0i,j, Numi,j).
4. Update CS. Node SSNi exchanges its local data description with all the nodes in Si. After SSNi receives all the data descriptions it needs, it checks whether any two cluster centres C0i,j and C0i,k satisfy E(C0i,j, C0i,k) < 2D; if so, it combines these two clusters and updates the cluster centre as C1i,j.
5. Compare C0i,j and C1i,j. Compute the offset between C1i,j and C0i,j. If the offset is ≤ δ, the algorithm finishes; otherwise SSNi replaces C0i,j with C1i,j and goes back to step 3.
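As an illustration of steps 2 and 3 running at a single SSN, the sketch below uses the Euclidean distance on the four pollutant volumes. It is a simplified, single-node illustration rather than the authors' implementation, and a data item lying within D of several candidate centres is assigned to the first match; the inter-node exchange of steps 4 and 5 is not shown.

```python
import numpy as np

def local_clusters(local_data: np.ndarray, D: float):
    """Steps 2-3 at one SSN: choose candidate centres that are more than D
    apart, then assign every data item to the first centre within D.

    local_data: array of shape (num_items, 4) -- NO, NO2, SO2, O3 volumes.
    Returns a list of (centre, member_count) pairs describing local clusters.
    """
    # Step 2: generate the candidate cluster centre set CS.
    centres = [local_data[0]]
    for item in local_data[1:]:
        if all(np.linalg.norm(item - c) > D for c in centres):
            centres.append(item)

    # Step 3: distribute data items to the candidate centres.
    counts = [0] * len(centres)
    for item in local_data:
        for j, c in enumerate(centres):
            if np.linalg.norm(item - c) < D:
                counts[j] += 1
                break

    return list(zip(centres, counts))
```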
3.2 Clustering Accuracy Analysis

The evaluation of the accuracy of the algorithm aims to investigate to what degree our P2P clustering algorithm can assign data items to the correct clusters in comparison with a centralized algorithm. To do so, we designed an experimental environment for data exchange and algorithm execution. The network topology of the simulation is shown in Figure 2. We use 18 sensor nodes: 12 SSN nodes (node 0 to node 11) and 6 MSN nodes (node 12 to node 17). Data are sampled at each MSN and sent to the nearest SSN. The air pollution data consist of the volumes of four pollutants, NO, NO2, SO2 and O3, sampled at 1-minute intervals in an urban environment from 8:00 to 17:59 within one day and collected from the 6 MSNs (as described in Section 2.2). The total number of data items in the dataset is therefore 3600. Data can be sent and received in both directions along the edges. The comparison of the average clustering accuracy of the centralized and distributed clustering algorithms is shown in Table 1. For the centralized clustering algorithm, we take node 8 as the sink (the central point for data processing), which means that every other node sends its data to node 8, and the classic DBSCAN algorithm runs in node 8 for centralized clustering. For the accuracy measurement, let X_i denote the dataset at node i, and let L_i^{km}(x) and L_i(x) denote the labels (cluster membership) of sample x (x ∈ X_i) at node i under the centralized DBSCAN algorithm and under our distributed clustering algorithm, respectively. We define the Average Percentage Membership Match (APMM) as

$$\mathrm{APMM} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|\{x \in X_i : L_i(x) = L_i^{km}(x)\}\right|}{|X_i|} \times 100\% \qquad (1)$$

where n is the total number of SSNs. For the distributed clustering algorithm, we vary the number of nodes in the Information Exchange Node Set (IENS) of each SSN from 1 to 10, with D = 10 and δ = 1. Data are randomly assigned to each SSN. Table 1 shows the APMM results.
Figure 2. The Network Topology of the Simulation

Table 1. Centralized Clustering vs. Distributed Clustering (APMM results)
IENS   1      2      3       4       5       6       7       8       9       10
APMM   86.3%  91.2%  92.67%  93.46%  93.55%  93.74%  93.93%  94.23%  94.59%  94.97%
From Table 1 we can see that when the number of nodes in the IENS is at least 2, in other words when each SSN exchanges data with at least two other SSNs, the APMM exceeds 91%. When the number of nodes in the IENS is at least 4, the APMM exceeds 93%. These results are obtained under the condition that the data are assigned to each SSN randomly. In reality, if the patterns in the dataset vary across locations, the APMM may be lower than the results in Table 1. In such situations, a good scheme for choosing the nodes that form the IENS becomes very important.
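To make the APMM of Eq. (1) concrete, the following is a small sketch of how it could be computed once the two label sets are available; it assumes the cluster labels of the two algorithms have already been matched to each other, which is not shown here.

```python
def apmm(centralized_labels: dict, distributed_labels: dict) -> float:
    """Average Percentage Membership Match over all SSNs.

    Both arguments map an SSN id i to the list of labels assigned to the
    data items of X_i by the centralized (DBSCAN) and distributed algorithms.
    """
    total = 0.0
    for node in centralized_labels:
        ref = centralized_labels[node]
        dist = distributed_labels[node]
        matches = sum(1 for a, b in zip(ref, dist) if a == b)
        total += matches / len(ref)
    return 100.0 * total / len(centralized_labels)
```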
4. EXPERIMENTAL ANALYSIS OF PATTERN RECOGNITION

4.1 Pollution Hotspots Identification

Pollution hotspot identification uses the air pollution data to find the distribution of key pollution locations within the research area. Our former work in Discovery Net could only classify the pollution data into several pollution levels, such as high or low, but could not tell us the distribution of different pollutants at different locations and their contributions to the pollution levels. To improve the data analysis capability, in this experiment we use the distributed clustering algorithm to cluster the pollutants into groups, which allows different pollution patterns to be recognized. From the experimental results of Discovery Net, we pick all the high-pollution-level locations in the research area at 15:30 and 17:00 respectively to check the contribution of the different pollutants (NO, NO2, SO2 and ozone) to the pollution levels. The results are shown in Figure 3. In this figure, different clusters/patterns correspond to different colors, which reveal the relationship between the combination and the volumes of the different pollutants. According to the clustering result, red denotes the pattern of high volumes of NO2 and ozone with a low volume of NO; blue marks the pattern of high volumes of SO2 and ozone; yellow marks a high volume of SO2 only. From the figures we can see that at 15:30 the hotspots are located at the schools (highlighted by circles and almost all featuring high volumes of NO2 and ozone) and at the gasworks (highlighted by a square and featuring a high volume of SO2). At 17:00, the hospitals (highlighted by ellipses) and the gasworks all contribute to the SO2 pollution. Another kind of hotspot is located on the main roads. However, these present different patterns at different times on different roads. Main road A6-L10 is covered in blue at 15:30 but in red at 17:00. There are two reasons for this. First, the road transport sector is the major source of NOX emissions, while solid fuel and petroleum products are the two main contributors of SO2. Second, NO2 and ozone are formed through a series of photochemical reactions involving NO, CO, hydrocarbons and PM, and generating NO2 and ozone takes a period of time. This is why the density of NO2 is always high on the main road, whereas ozone at 17:00 is higher than at 15:30. Another interesting fact is that, at 17:00, the main roads A6-L10 and M1-K10 show different pollution patterns. From the figure we can see that the pollution pattern on M1-K10 is very similar to the patterns at the gasworks and hospital areas, but not to the pattern on the other main road. We investigated this area and found that a brook flows along it to the near east, and a factory area is located on the opposite side of the brook, beyond the scope of this map. This can explain why the pollution patterns differ on these two main roads.
Figure 3. Pollution Hotspots Identification at 15:30 and 17:00
4.2 Pollution Clouds Dispersion Analysis

In this experiment, we investigate the dispersion of different pollution clouds to see their movements and changes. We pick the pollutants NOX (NO + NO2) and SO2 and calculate their pollution clouds at the time points 17:15, 17:30 and 17:45. The results are shown in Figure 4(a) and (b). According to the environmental reports of the UK, the period after 17:00 is always the worst pollution distribution period within a day. The road transport sector contributes more than 50% of the total emission of NOX, especially in urban areas. Meanwhile, factories are another emission source of nitrogen-based pollutants. Besides the major sources of SO2, namely solid fuel and petroleum products from transport emissions, some other locations such as hospitals contribute pollutants of this kind, including sulphur and nitrogen compounds. These features are well illustrated by Figure 4. In Figure 4(a), the main road A6-L10 and its surrounding areas are severely covered by a high volume of NOX. The same situation appears in the area from A1 to N2, which includes a gasworks (between D2 and E1), side roads (A1 to J2), factories and parking lots (K1 to L2). We can also notice that the dispersion of the NOX clouds fades as time goes by, especially around the main road area; however, the NOX clouds stay for a long time in the A1 to N2 area. The dispersion of the SO2 cloud in Figure 4(b), however, shows a different feature. The cloud mainly covers the main roads, as well as the two hospitals (around B5 and K4). In comparison with the result at 17:15, the SO2 cloud blooms at 17:30, lying over almost all of the two main roads and the hospitals. However, it fades quickly at 17:45 and uncovers many areas, especially the main road M1-K10 and hospital K4. This may be due to the different environmental conditions in this area (the dispersion of SO2 depends on many factors such as temperature, wind direction, humidity, air pressure, etc.). Besides, it can also be attributed to the existence of the brook to the near east: SO2 is very easily absorbed into water to form sulphurous acid, which decreases the volume of SO2 in the air while increasing the pollution of the water.
Figure 4. Pollution Clouds Dispersion of (a) NOX (NO + NO2) and (b) SO2 at 17:15, 17:30 and 17:45
5. CONCLUSION

In this paper, we have presented an experimental study of urban air pollution pattern analysis within the MESSAGE system. Our work is characterised by the sensor gateway core architecture in the sensor grid, which provides a
platform for different wireless access protocols, and by the experiments on air pollution analysis based on a distributed P2P clustering algorithm, which investigate the distribution of pollution hotspots and the dispersion of pollution clouds. The experimental results are useful for the government and local authorities in reducing the impact of road traffic on the environment and on individuals. We are currently extending the application case studies to monitor PM10 and finer particulates (e.g. PM2.5). As addressing global warming becomes more important, there are increasing requirements for greenhouse gas emission monitoring and reduction. Information on greenhouse gases is therefore also needed for long-term monitoring purposes, with similar linkages to traffic and weather data in order to understand the contribution of traffic to environmental conditions.
ACKNOWLEDGEMENTS This work was funded by the Engineering and Physical Sciences Research Council (EPSRC) project Mobile Environmental Sensing System Across a Grid Environment (MESSAGE), Grant No. EP/E002102/1.
REFERENCES

1. A. Vaseashta, G. Gallios, M. Vaclavikova, et al, 2007. Nanostructures in Environmental Pollution Detection, Monitoring, and Remediation. In Science and Technology of Advanced Materials, Vol. 8, Issues 1-2, pp. 47-59.
2. M. Ibrahim, E. H. Nassar, 2007. Executive Environmental Information System (ExecEIS). In Journal of Applied Sciences Research, Vol. 3, No. 2, pp. 123-129.
3. N. Kularatna, B.H. Sudantha, 2008. An Environmental Air Pollution Monitoring System Based on the IEEE 1451 Standard for Low Cost Requirements. In IEEE Sensors Journal, Vol. 8, Issue 4, pp. 415-422.
4. Environment Act 1995. http://www.opsi.gov.uk/acts/acts1995/ukpga_19950025_en_1.
5. M. Richards, M. Ghanem, M. Osmond, et al, 2006. Grid-based Analysis of Air Pollution Data. In Ecological Modelling, Vol. 194, Issues 1-3, pp. 274-286.
6. R. J. Allen, L. R. Babcock, N. L. Nagda, 1975. Air Pollution Dispersion Modeling: Application and Uncertainty. In Journal of Regional Analysis and Policy, Vol. 5, No. 1.
7. A. Robins, E. Savory, A. Scaperdas, et al, 2002. Spatial Variability and Source-Receptor Relations at a Street Intersection. In Water, Air and Soil Pollution: Focus, Vol. 2, No. 5-6, pp. 381-393.
8. R. Slama, L. Darrow, J. Parker, et al, 2008. Meeting Report: Atmospheric Pollution and Human Reproduction. In Environmental Health Perspectives, Vol. 116, No. 6, pp. 791-798.
9. M. Rennesson, D. Maro, M.L. Fitamant, et al, 2005. Comparison of the Local-Scale Atmospheric Dispersion Model CEDRAT with 85Kr Measurements. In Radioprotection, Suppl. 1, Vol. 40, pp. S371-S377.
10. G. Wang, F. H. M. van den Bosch, M. Kuffer, 2008. Modeling Urban Traffic Air Pollution Dispersion. In Proceedings of The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XXXVII, Part B8, Beijing, China, pp. 153-158.
11. MESSAGE: Mobile Environmental Sensing System Across Grid Environments. http://www.message-project.org.
12. Y. Ma, M. Richards, M. Ghanem, et al, 2008. Air Pollution Monitoring and Mining Based on Sensor Grid in London. In Sensors, Vol. 8, pp. 3601-3623.
13. M.J. Franklin, 2001. Challenges in Ubiquitous Data Management. In Lecture Notes in Computer Science, Vol. 2000, pp. 24-31.
14. F. Perich, A. Joshi, T. Finin, et al, 2004. On Data Management in Pervasive Computing Environments. In IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 5, pp. 621-634.
15. Y. Diao, D. Ganesan, G. Mathur, et al, 2007. Rethinking Data Management for Storage-centric Sensor Networks. In Proceedings of The Third Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, pp. 22-31.
16. M. Ester, H.-P. Kriegel, J. Sander, et al, 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of The 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, pp. 226-231.
A NEW FEATURE WEIGHTED FUZZY C-MEANS CLUSTERING ALGORITHM Huaiguo Fu, Ahmed M. Elmisery Telecommunications Software & Systems Group Waterford Institute of Technology, Waterford, Ireland
ABSTRACT

In the field of cluster analysis, most existing algorithms assume that each feature of the samples makes a uniform contribution to cluster analysis. Feature-weight assignment is a special case of feature selection in which different features are ranked according to their importance. Each feature is assigned a value in the interval [0, 1] indicating its importance; we call this value the "feature-weight". In this paper we propose a new feature weighted fuzzy c-means clustering algorithm that is able to obtain the importance of each feature and then use it in an appropriate assignment of feature-weights. These weights are incorporated into the distance measure so that clusters are shaped based on variability, correlation and the weighted features.

KEYWORDS

Cluster Analysis, Fuzzy Clustering, Feature Weighted.
1. INTRODUCTION

The goal of cluster analysis is to assign data points with similar properties to the same groups and dissimilar data points to different groups [3]. Generally, there are two main clustering approaches: crisp clustering and fuzzy clustering. In crisp clustering the boundary between clusters is clearly defined. However, in many real cases the boundaries between clusters cannot be clearly defined, and some objects may belong to more than one cluster. In such cases, fuzzy clustering provides a better and more useful way to cluster these objects [2]. Cluster analysis has been widely used in a variety of areas such as data mining and pattern recognition [e.g. 1, 4, 6]. Fuzzy c-means (FCM), proposed by [5] and extended by [4], is one of the most well-known methodologies in clustering analysis. Basically, FCM clustering depends on the measure of distance between samples. In most situations, FCM uses the common Euclidean distance, which supposes that each feature has equal importance. This assumption seriously affects the performance of FCM, so that the obtained clusters are not logically satisfying, since in most real-world problems features are not equally important. Consider the example in [17] of the Iris database [9], which has four features: sepal length (SL), sepal width (SW), petal length (PL) and petal width (PW). Fig. 1 shows a clustering of the Iris database based on features SL and SW, while Fig. 2 shows a clustering based on PL and PW. From Fig. 1 one can see that there is much more crossover between the star class and the point class; it is difficult to discriminate the star class from the point class. On the other hand, it is easy to see that Fig. 2 is crisper than Fig. 1. This illustrates that, for the classification of the Iris database, features PL and PW are more important than SL and SW. Here we can consider the weight assignment (0, 0, 1, 1) to be better than (1, 1, 0, 0) for Iris database classification.
Figure 1. Clustering Result of Iris Database Based on Feature Weights (1, 1, 0, 0) by FCM Algorithm
Figure 2. Clustering Result of Iris Database Based on Feature Weights (0, 0, 1, 1) by FCM Algorithm
Feature selection and weighting have been hot research topics in cluster analysis. Desarbo [8] introduced the SYNCLUS algorithm for variable weighting in k-means clustering. It is divided into two stages: first it uses k-means clustering with an initial set of weights to partition the data into k clusters; it then determines a new set of optimal weights by optimizing a weighted mean-square criterion. The two stages iterate until an optimal set of weights is obtained. Huang [7] presented W-k-means, a new k-means-type algorithm that can calculate variable weights automatically. Based on the current partition in the iterative k-means clustering process, the algorithm calculates a new weight for each variable based on the variance of the within-cluster distances. The new weights are used in deciding the cluster memberships of objects in the next iteration, and the optimal weights are found when the algorithm converges. The weights can be used to identify important variables for clustering; variables that may contribute noise to the clustering process can be removed from the data in future analysis. With respect to FCM clustering, it is sensitive to the selection of the distance metric. Zhao [12] stated that the Euclidean distance gives good results when all clusters are spheroids of the same size or when all clusters are well separated. In [13, 10], the authors proposed the G–K algorithm, which uses the well-known Mahalanobis distance as the metric in FCM, and reported that the G–K algorithm is better than Euclidean-distance-based algorithms when the shape of the data is considered. In [11], the authors proposed a new robust metric, distinct from the Euclidean distance, to improve the robustness of FCM. Since FCM's performance depends on the selected metric, it will depend on the feature-weights that are incorporated into the Euclidean distance. Each feature should have an importance degree, which is called its feature-weight. Feature-weight assignment is an extension of feature selection [17]: the latter allows only 0-weight or 1-weight values, while the former can take weight values in the interval [0, 1]. Generally speaking, a feature selection method cannot be used as a feature-weight learning technique, but the inverse is possible. To be able to deal with such cases, we propose a new FCM algorithm that takes into account the weight of each feature in the data set to be clustered. After a brief review of FCM in Section 2, a number of feature ranking methods are described in Section 3; these methods will be used to determine the FWA (feature-weight assignment) of each feature. In Section 4, distance measures are studied and a new one is proposed to handle the different feature-weights. In Section 5 we propose the new FCM for clustering data objects with different feature-weights.
2. FUZZY C-MEANS ALGORITHM

Fuzzy c-means (FCM) is an unsupervised clustering algorithm that has been applied to a wide range of problems involving feature analysis, clustering and classifier design. FCM has a wide domain of applications such as agricultural engineering, astronomy, chemistry, geology, image analysis, medical diagnosis, shape analysis and target recognition [14]. Unlabeled data are classified by minimizing an objective function based on a distance measure and cluster prototypes. Although the description of the original algorithm dates back to 1974 [4, 5], derivatives have been described with modified definitions for the distance measure and for the cluster centre prototypes [12, 13, 11, 10], as explained above. FCM minimizes an objective function J_m, the weighted sum of squared errors within groups, defined as follows:
$$J_m(U, V; X) = \sum_{k=1}^{n}\sum_{i=1}^{c} u_{ik}^{m}\,\|x_k - v_i\|_A^2, \qquad 1 < m < \infty \qquad (1)$$

where V = (v_1, v_2, ..., v_c) is a vector of unknown cluster prototypes (centers), v_i ∈ ℝ^p. The value u_ik represents the grade of membership of data point x_k of the set X = {x_1, x_2, ..., x_n} in the ith cluster. The inner product defined by a distance measure matrix A defines a measure of similarity between a data object and the cluster prototypes. A fuzzy c-means partition of X is conveniently represented by a matrix U = [u_ik]. It has been shown in [4] that if ||x_k − v_i||_A^2 > 0 for all i and k, then (U, V) may minimize J_m only when m > 1 and

$$v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m\, x_k}{\sum_{k=1}^{n} (u_{ik})^m}, \qquad \text{for } 1 \le i \le c \qquad (2)$$

$$u_{ik} = \frac{1}{\displaystyle\sum_{j=1}^{c} \left( \frac{\|x_k - v_i\|_A^2}{\|x_k - v_j\|_A^2} \right)^{\frac{1}{m-1}}}, \qquad \text{for } 1 \le i \le c,\ 1 \le k \le n \qquad (3)$$
Among other methods, J_m can be minimized by a Picard iteration approach. This method minimizes J_m by initializing the matrix U randomly and computing the cluster prototypes (Eq. 2) and the membership values (Eq. 3) in each iteration. The iteration is terminated when it reaches a stable condition, for example when the changes in the cluster centers or in the membership values between two successive iteration steps are smaller than a predefined threshold value. The FCM algorithm always converges to a local minimum, and a different initial guess of u_ik may lead to a different local minimum. Finally, to assign each data point to a specific cluster, defuzzification is necessary, e.g., by attaching a data point to the cluster for which its membership value is maximal [14].
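A compact sketch of this Picard iteration with the Euclidean norm (A = I) follows; the fuzzifier m, the number of clusters c and the tolerance are free parameters, and the code is only an illustration of Eqs. (1)-(3), not the authors' implementation.

```python
import numpy as np

def fcm(X, c, m=2.0, tol=1e-5, max_iter=100, seed=0):
    """Standard fuzzy c-means by Picard iteration with Euclidean distance.

    X: data matrix of shape (n, p).  Returns (U, V): U is the n x c
    membership matrix, V the c x p matrix of cluster prototypes.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to one

    for _ in range(max_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]                 # Eq. (2)
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1) + 1e-12
        ratio = d2[:, :, None] / d2[:, None, :]                  # d2_ik / d2_jk
        U_new = 1.0 / (ratio ** (1.0 / (m - 1))).sum(axis=2)     # Eq. (3)
        if np.abs(U_new - U).max() < tol:
            return U_new, V
        U = U_new
    return U, V
```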
3. ESTIMATING FWA OF FEATURES

In Section 1 we mentioned that we propose a new clustering algorithm for data objects with different feature-weights, which means that data with features of different FWA should be clustered. A key question that arises here is how we can determine the importance of each feature. In other words, we are about to assign a weight to each feature so that this weight determines its FWA. To determine the FWA of the features of a data set, two major approaches can be adopted: a human-based approach and an automatic approach. In the human-based approach we determine the FWA of each feature through negotiation with an expert who has enough experience and knowledge in the field that is the subject of the clustering. In the automatic approach, on the other hand, we use the data set itself to determine the FWA of its features. We discuss these approaches below. Human-based approach: as described above, in the human-based approach we choose the FWA of each feature by negotiating with an expert. This approach has some advantages and some drawbacks. In some cases, using the data set itself to determine the FWA of each feature may fail to recover the real FWAs, and the human-based approach should be adopted instead. Fig. 3 demonstrates a situation in which this happens.
Figure 3. Data Objects with Two Features (sample data objects plotted as Feature A versus Feature B)
Suppose Fig. 3 shows a data set in which the FWA of feature A is in reality two times the FWA of feature B. Since the automatic approach uses the positions of the data points in the data space to determine the FWA of the features, using the data set itself to determine the FWA of features A and B will lead to equal FWAs for A and B. Although this case (a data set with homogeneously and equidistantly distributed data points) rarely happens in the real world and is somewhat exaggerated, it shows that sometimes the human-based approach is the better choice. On the other hand, the human-based approach has its own drawbacks. We cannot guarantee that the behaviours observed by a human expert and used to determine the FWAs cover all situations that can occur due to disturbances, noise or plant parameter variations. Also, suppose a situation in which there is no human expert available to determine the FWAs; how should this problem be dealt with? Structure in the signal can be found using linear transforms, although this approach does not take into account that the system has some structure. In the time domain, filtering is a linear transformation. The Fourier, wavelet and Karhunen-Loeve transforms have compression capability and can be used to identify some structure in the signals; when using these transforms, we do not take into account any structure in the system. Automatic approach: several methods based on fuzzy set theory, artificial neural networks, fuzzy-rough set theory, principal component analysis and neuro-fuzzy methods have been reported [16] for weighted feature estimation. Some of these methods just rank features, but with some modifications they can calculate the FWA of the features. Here we introduce a feature-weight estimation method, extending the one proposed in [15], which can be used to determine the FWA of features. Let the pth pattern vector (each pattern is a single data item in the data set, and a pattern vector is a vector whose elements are the values that the pattern's features take in the data set) be represented as

$$x^p = [x_1^p, x_2^p, \ldots, x_n^p] \qquad (4)$$

where n is the number of features of the data set and x_i^p is the ith element of the vector. Let prob_k and d_k(x^p) stand for the a priori probability of the class C_k and the distance of the pattern x^p from the kth mean vector

$$m_k = [m_{k1}, m_{k2}, \ldots, m_{kn}], \qquad (5)$$

respectively.
The feature estimation index for a subset (Ω) containing some of these n features is defined as

$$E = \sum_{k} \sum_{x^p \in C_k} \frac{s_k(x^p)\,\alpha_k}{\sum_{k' \ne k} s_{k'k}(x^p)} \qquad (6)$$

where x^p is constituted by the features of Ω only, and

$$s_k(x^p) = \mu_{C_k}(x^p) \times \left[1 - \mu_{C_k}(x^p)\right] \qquad (7)$$

$$s_{k'k}(x^p) = \tfrac{1}{2}\,\mu_{C_k}(x^p)\left[1 - \mu_{C_{k'}}(x^p)\right] + \tfrac{1}{2}\,\mu_{C_{k'}}(x^p)\left[1 - \mu_{C_k}(x^p)\right] \qquad (8)$$

Here μ_{C_k}(x^p) and μ_{C_{k'}}(x^p) are the membership values of the pattern x^p in classes C_k and C_{k'}, respectively, and α_k is a normalizing constant for class C_k which takes care of the effect of the relative sizes of the classes. Note that s_k is zero (minimum) if μ_{C_k} = 1 or 0, and is 0.25 (maximum) if μ_{C_k} = 0.5. On the other hand, s_{k'k} is zero (minimum) when μ_{C_k} = μ_{C_{k'}} = 1 or 0, and is 0.5 (maximum) for μ_{C_k} = 1, μ_{C_{k'}} = 0 or vice versa. Therefore, the term s_k / Σ_{k'≠k} s_{k'k} is minimum if μ_{C_k} = 1 and μ_{C_{k'}} = 0 for all k' ≠ k, i.e., if the ambiguity in the belongingness of a pattern x^p to classes C_k and C_{k'} is minimum (the pattern belongs to only one class). It takes its maximum value when μ_{C_k} = 0.5 for all k. In other words, the value of E decreases as the belongingness of each pattern to only one class increases (i.e., the compactness of the individual classes increases) and at the same time decreases for the other classes (i.e., the separation between classes increases). E increases when the patterns tend to lie at the boundaries between classes (i.e. μ → 0.5). The objective in the feature selection problem, therefore, is to select those features for which the value of E is minimum [15].
In order to achieve this, the membership μ_{C_k}(x^p) of a pattern x^p to a class is defined with a multi-dimensional π-function, which is given by

$$\mu_{C_k}(x^p) = \begin{cases} 1 - 2\,d_k^2(x^p) & \text{if } 0 \le d_k^2(x^p) < 0.5 \\ 2\left[1 - d_k(x^p)\right]^2 & \text{if } 0.5 \le d_k^2(x^p) < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

The distance d_k(x^p) of the pattern x^p from m_k (the center of class C_k) is defined as

$$d_k(x^p) = \left[\sum_{i} \left(\frac{x_i^p - m_{ki}}{\lambda_{ki}}\right)^2\right]^{1/2}, \qquad (10)$$

where

$$\lambda_{ki} = 2 \max_{p} \left|x_i^p - m_{ki}\right| \qquad (11)$$

and

$$m_{ki} = \frac{1}{|C_k|} \sum_{p \in C_k} x_i^p \qquad (12)$$
Let us now explain the role of α_k. E is computed over all the samples in the feature space irrespective of the sizes of the classes. Therefore, it is expected that the contribution of a class of bigger size (i.e. with a larger number of samples) will be greater in the computation of E. As a result, the index value will be biased by the bigger classes, which might affect the process of feature estimation. In order to normalize this effect of class size, a factor α_k corresponding to the class C_k is introduced. In the present investigation we have chosen α_k = 1/|C_k|; however, other expressions like α_k = 1/prob_k or α_k = 1 − prob_k could also have been used. If a particular subset (F1) of features is more important than another subset (F2) in characterizing/discriminating the classes, then the value of E computed over F1 will be less than that computed over F2; in that case, both the individual class compactness and the between-class separation would be greater in the feature space constituted by F1 than in that of F2. In the case of individual feature ranking (which fits our need for feature estimation), the subset F contains only one feature [15]. Now, using the feature estimation index, we are able to calculate the FWA of each feature. As mentioned above, the smaller the value of E of a feature, the more significant that feature is; on the other hand, the larger the FWA value of a feature, the more significant that feature is. So we calculate the FWA of a feature as follows: suppose a_1, a_2, ..., a_n are the n features of a data set, and E(a_i) and FWA(a_i) are the feature estimation index and the feature-weight assignment of feature a_i, respectively; then

$$\mathrm{FWA}(a_i) = \frac{\left(\sum_{j=1}^{n} E(a_j)\right) - E(a_i)}{\sum_{j=1}^{n} E(a_j)}, \qquad 1 \le i \le n \qquad (13)$$

With this definition, FWA(a_i) always lies in the interval [0, 1]. We thus define the vector FWA whose ith element is FWA(a_i). So far we have calculated the FWA of each feature of the data set. Now we should take these values into account when calculating the distance between data points, which is of great significance in clustering.
4. MODIFIED DISTANCE MEASURE FOR THE NEW FCM ALGORITHM

Two distance measures are widely used with FCM in the literature: the Euclidean and the Mahalanobis distance. Suppose x and y are two pattern vectors (pattern vectors were introduced in Section 3). The Euclidean distance between x and y is

$$d^2(x, y) = (x - y)^T (x - y) \qquad (14)$$

and the Mahalanobis distance between x and a center t (taking into account the variability and correlation of the data) is

$$d^2(x, t, C) = (x - t)^T C^{-1} (x - t) \qquad (15)$$

In the Mahalanobis distance measure, C is the covariance matrix; using it takes the variability and correlation of the data into account. To take the weights of the features into account when calculating the distance between two data points, we suggest the use of (x − y)_m (a modified (x − y)) instead of (x − y) in the distance measure, whether it is Euclidean or Mahalanobis. (x − y)_m is a vector whose ith element is obtained by multiplying the ith element of the vector (x − y) by the ith element of the vector FWA. With this modification, Eq. 14 and Eq. 15 take the following form:
$$d_m^2(x, y) = (x - y)_m^T (x - y)_m \qquad (16)$$

and

$$d_m^2(x, t, C) = (x - t)_m^T C^{-1} (x - t)_m \qquad (17)$$

respectively, where

$$(x - y)_m(i) = (x - y)(i) \times \mathrm{FWA}(i) \qquad (18)$$
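For illustration, a direct sketch of the modified distances of Eqs. (16)-(18) is given below; how the covariance matrix C is estimated (per cluster or from the whole data set) is an implementation choice not fixed by the text, and the function names are illustrative.

```python
import numpy as np

def weighted_diff(x, y, fwa):
    """(x - y)_m of Eq. (18): element-wise product with the FWA vector."""
    return (x - y) * fwa

def modified_euclidean(x, y, fwa):
    """d_m^2(x, y) of Eq. (16)."""
    d = weighted_diff(x, y, fwa)
    return d @ d

def modified_mahalanobis(x, t, C, fwa):
    """d_m^2(x, t, C) of Eq. (17)."""
    d = weighted_diff(x, t, fwa)
    return d @ np.linalg.inv(C) @ d
```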
We will use this modified distance measure in our algorithm for clustering data sets with different feature-weights in the next section. To illustrate different aspects of the distance measures mentioned above, let us look at the graphs in Fig. 4. The points in each graph are at equal distance (under different distance measures) from the center. The circumference in graph A represents points with equal Euclidean distance to the center. In graph B, points are of equal Mahalanobis distance to the center; here the covariance matrix is

$$C = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix}$$

In this case the variable Y has more variability than the variable X; then, even if the values on the y-axis appear further from the origin with respect to the Euclidean distance, they have the same Mahalanobis distance as those on the x-axis or on the rest of the ellipsoid.
Figure 4. Points with Equal Distance to the Center (graphs A–E)
In the third case, let us assume that the parameter C is given by

$$C = \begin{pmatrix} 2.5 & -1.5 \\ -1.5 & 2.5 \end{pmatrix}$$

Now the variables have a covariance different from zero. As a consequence, the ellipsoid rotates and the direction of its axes is given by the eigenvectors of C. In this case, greater values of Y are associated with smaller values of X; in other words, every time we move up we also move to the left, so the axis given by the y-axis rotates to the left (see graph C). Graphs D and E show points with equal modified Euclidean and modified Mahalanobis distance to the centre, respectively. In both of them the FWA vector is FWA = (0.33, 0.67), and in graph E, C is equal to what it was in graph C. Comparing graphs C and E, we can conclude that in graph E, in addition to the variability and correlation of the data, the FWA of the features is considered in calculating the distances.
5. NEW FEATURE WEIGHTED FCM ALGORITHM

In this section we propose the new clustering algorithm, which is based on FCM, extends the method proposed by [15] for determining the FWA of features and, moreover, uses the modified Mahalanobis distance measure, which takes into account the FWA of the features in addition to the variability of the data. As mentioned before, unlike FCM, this algorithm clusters the data set based on the weights of the features. In the first step of the algorithm we calculate the FWA vector using the method proposed in [15]. To do so, we need some clusters over the data set in order to calculate m_{ki} and d_k(x^p); with these parameters in hand, we can easily calculate the feature estimation index for each feature (see Section 3). To obtain these clusters we apply the FCM algorithm with the Euclidean distance to the data set. The created clusters help us to calculate the FWA vector. This step is, in fact, a pre-computing step. In the next and final step, we apply our feature weighted FCM algorithm to the data set, but here we use the modified Mahalanobis distance in the FCM algorithm. The result is a set of clusters with two major differences from the clusters obtained in the first step. The first difference is that the Mahalanobis distance is used, which means that the variability and correlation of the data are taken into account in calculating the clusters. The second difference, which is the main contribution of this investigation, is that the feature-weight index plays a great role in shaping the clusters.
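As a rough outline of how these two steps could be wired together, the sketch below reuses the `fcm`, `fwa_vector` and `modified_mahalanobis` sketches given earlier (all illustrative names, not from the paper); defuzzification uses the maximum membership, and the final step here simply re-assigns points around the FCM prototypes with the modified distance, whereas a full implementation would fold that distance into the FCM iteration itself.

```python
import numpy as np

def feature_weighted_fcm(X, c, m=2.0):
    """Two-step sketch of the proposed feature weighted FCM procedure.

    Step 1 (pre-computing): plain FCM with Euclidean distance gives crisp
    labels from which the FWA vector is estimated.
    Step 2: clustering with the modified Mahalanobis distance so that
    feature weights, variability and correlation shape the clusters.
    """
    # Step 1: pre-clustering and feature-weight estimation.
    U, V = fcm(X, c, m=m)
    labels = U.argmax(axis=1)                    # defuzzification
    fwa = fwa_vector(X, labels)

    # Step 2: weighted assignment around the prototypes (simplified).
    C = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    new_labels = np.array([
        np.argmin([modified_mahalanobis(x, v, C, fwa) for v in V])
        for x in X
    ])
    return new_labels, fwa
```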
6. CONCLUSIONS

In this paper we have presented a new clustering algorithm based on the fuzzy c-means algorithm, whose salient feature is that it clusters the data set based on weighted features. We used a feature estimation index to obtain the FWA of each feature. The index is defined based on an aggregated measure of the compactness of the individual classes and the separation between the classes in terms of class membership functions. The index value decreases with an increase in both the compactness of the individual classes and the separation between the classes. To calculate the feature estimation index we carried out a pre-computing step, which was a fuzzy clustering using FCM with the Euclidean distance. We then transformed the values into the FWA vector, whose elements lie in the interval [0, 1] and each of which shows the relative significance of its corresponding feature. We then merged the FWA vector with the distance measures and used this modified distance measure in our algorithm. The result was a clustering of the data set in which the weight of each feature plays a significant role in forming the shape of the clusters.
ACKNOWLEDGEMENTS This work is supported by FutureComm, the PRTLI project of Higher Education Authority (HEA), Ireland.
REFERENCES

1. Hall, L.O., Bensaid, A.M., Clarke, L.P., et al., 1992. "A comparison of neural network and fuzzy clustering techniques in segmenting magnetic resonance images of the brain". IEEE Trans. Neural Networks 3.
2. Hung M. and Yang D., 2001. "An efficient fuzzy c-means clustering algorithm". In Proc. the 2001 IEEE International Conference on Data Mining.
3. Han J., Kamber M., 2001. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, San Francisco.
4. Bezdek, J.C., 1981. "Pattern Recognition with Fuzzy Objective Function Algorithms". Plenum, New York.
5. Dunn, J.C., 1974. "Some recent investigations of a new fuzzy partition algorithm and its application to pattern classification problems". J. Cybernetics.
6. Cannon, R.L., Dave, J., Bezdek, J.C., 1986. "Efficient implementation of the fuzzy c-means clustering algorithms". IEEE Trans. Pattern Anal. Machine Intell.
7. Huang J.Z., Ng M.K., Rong H. and Li Z., 2005. "Automated Variable Weighting in k-Means Type Clustering". IEEE Transactions on Pattern Analysis & Machine Intelligence, Vol. 27, No. 5.
8. Desarbo W.S., Carroll J.D., Clark, and Green P.E., 1984. "Synthesized Clustering: A Method for Amalgamating Clustering Bases with Differential Weighting of Variables". Psychometrika, Vol. 49.
9. Fisher, R., 1936. "The use of multiple measurements in taxonomic problems". Ann. Eugenics 7.
10. Krishnapuram, R., Kim, J., 1999. "A note on the Gustafson–Kessel and adaptive fuzzy clustering algorithms". IEEE Trans. Fuzzy Syst. 7.
11. Wu, K.L., Yang, M.S., 2002. "Alternative c-means clustering algorithms". Pattern Recog. 35.
12. Zhao, S.Y., 1987. "Calculus and Clustering". China Renming University Press.
13. Gustafson, D.E., Kessel, W., 1979. "Fuzzy clustering with a fuzzy covariance matrix". In: Proceedings of the IEEE Conference on Decision and Control, San Diego, CA.
14. Höppner F., Klawonn F., Kruse R., Runkler T., 1999. "Fuzzy Cluster Analysis". John Wiley & Sons.
15. Pal S. K. and Pal A. (Eds.), 2002. "Pattern Recognition: From Classical to Modern Approaches". World Scientific, Singapore.
16. de Oliveira J.V., Pedrycz W., 2007. "Advances in Fuzzy Clustering and its Applications". John Wiley & Sons.
17. X. Wang, Y. Wang and L. Wang, 2004. "Improving fuzzy c-means clustering based on feature-weight learning". Pattern Recognition Letters 25.
A NOVEL THREE STAGED CLUSTERING ALGORITHM Jamil Al-Shaqsi, Wenjia Wang School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK
ABSTRACT

This paper presents a novel three staged clustering algorithm and a new similarity measure. The main objective of the first stage is to create the initial clusters, the second stage refines the initial clusters, and the third stage refines the initial BASES, if necessary. The novelty of our algorithm originates mainly from three aspects: automatically estimating the k value, a new similarity measure, and starting the clustering process with a promising BASE. A BASE acts similarly to a centroid or a medoid in common clustering methods but is determined differently in our method. The new similarity measure is defined particularly to reflect the degree of relative change between data samples and to accommodate both numerical and categorical variables. Moreover, an additional function has been devised within the algorithm to automatically estimate the most appropriate number of clusters for a given dataset. The proposed algorithm has been tested on 3 benchmark datasets and compared with 7 other commonly used methods, including TwoStep, k-means, k-modes, GAClust, Squeezer and some ensemble-based methods including k-ANMI. The experimental results indicate that our algorithm identified the appropriate number of clusters for the tested datasets and also showed overall better clustering performance than the compared clustering algorithms.

KEYWORDS

Clustering, similarity measures, automatic cluster detection, centroid selection
1. INTRODUCTION

Clustering is the process of splitting a given dataset into homogenous groups so that elements in one group are more similar to each other than to elements in different groups. Many clustering techniques and algorithms have been developed and used in a variety of applications. Nevertheless, each individual clustering technique has its limits in some areas, and none of them can adequately handle all types of clustering problems and produce reliable and meaningful results; thus, clustering is still considered a challenge, and there is still a need to explore new approaches. This paper presents a novel clustering algorithm based on a new similarity definition. The novelty of our algorithm comes mainly from three aspects: (1) employing a new similarity measure that we define to measure the similarity of the relative changes between data samples, (2) being able to estimate the most probable number of clusters for a given dataset, and (3) starting the clustering process with a promising BASE. The details of these techniques are described in Section 3, after a review of related work in Section 2. Section 4 gives the new definition of the similarity measure. Section 5 presents the experiments and the evaluation of the results. The conclusions, highlighting the fundamental issues and future research, are given in the final section.
2. RELATED WORK Jain and Fred published a paper (Jain K. and Fred L.N. 2002) that reviewed the clustering methods and algorithms developed up to 2002, in general. Here we only review the six methods and algorithms that are closely related to our work and so have been used in comparison with our newly proposed algorithm.
The k-means algorithm is the most commonly used algorithm because it is simple and computationally efficient. However, it has several weaknesses, including (1) the need to specify the number of clusters and (2) its sensitivity to the initial seeds. The k-means algorithm requires the number of clusters, k, to be set in advance in order to run. However, finding the appropriate number of clusters in a given dataset is one of the most difficult tasks to accomplish if there is no good prior knowledge of the data. Thus, a common strategy in practice is to try a range of numbers, such as from 2 to 10, for a given dataset. Concerning the initial seeds, k-means relies on the initial seeds as the centroids of the initial clusters to produce the final clusters after some iterations. If appropriate seeds are selected, good clustering results can be generated; otherwise, poor clustering results might be obtained. In standard k-means, the initial seeds are usually selected at random; thus, it is usually run several times in order to find better clustering results. k-modes algorithm. This is an extension of the traditional k-means developed to handle categorical datasets (Huang Z. 1997; Huang Z. 1998). Unlike the traditional k-means, this algorithm uses modes instead of means for clusters. It also uses a frequency-based method to update the modes. To calculate the similarity between data samples, the algorithm employs a simple matching dissimilarity measure in which the similarity between a data sample and a particular mode is determined by the number of corresponding features that have identical values. Huang (Huang Z. 1997) evaluated the performance of k-modes on the Soybean dataset and used two different methods for selecting the mode. Out of 200 experiments (100 experiments per method), the k-modes algorithm succeeded in producing correct clustering results (0 misclassifications) in only 13 cases with the first selection method and only 14 cases with the second. He concluded that, overall, there is a 45% chance of obtaining a good clustering result when the first selection method is used, and a 64% chance with the second. k-ANMI algorithm. This algorithm, proposed by (He Z., et al. 2008), is viewed as suitable for clustering data with categorical variables. It works in the same way as the k-means algorithm, except that it employs a mutual-information-based measure, the average normalized mutual information (ANMI), as the criterion to evaluate its performance at each step of the clustering process. The authors tested it on 3 different datasets (Votes, Cancer and Mushroom) and compared it with 4 other algorithms. They claimed that it produced the best average clustering accuracy for all the datasets. However, it suffers from the same problem as k-means, i.e. requiring k to be set to the right value in advance as the basis of finding good clustering results. TwoStep algorithm. This was designed by SPSS (SPSS1 2001), based on the BIRCH algorithm of (Zhang, T. et al. 1996), as a clustering component in its data mining software Clementine. As its name suggests, it basically consists of two steps for clustering: a pre-clustering step and a merging step. The first step creates many micro-clusters by growing a so-called modified clustering feature tree, with each of its leaf nodes representing a micro-cluster. The second step then merges the resulting micro-clusters into the “desired” number of larger clusters by employing the agglomerative hierarchical clustering method.
Moreover, it uses the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) to estimate the number of clusters automatically. The SPSS technical report (SPSS2 2001) shows that it was tested on 10 different datasets and succeeded in finding the correct number of clusters in all of them, whilst achieving clustering accuracy above 90% in all cases and 100% on three datasets. Moreover, it is claimed that TwoStep is scalable and able to handle both numerical and categorical variables (SPSS2 2001). These good results and features appear to make TwoStep a very promising technique for clustering. Nevertheless, the report did not mention any comparison against other clustering algorithms. Squeezer algorithm. This algorithm was introduced by (He Z., et al. 2002) to cluster categorical data. It is efficient, as it scans the data only once. It reads each record and then, based on some similarity criteria, decides whether the record should be assigned to an existing cluster or should start a new cluster. The Squeezer algorithm does not require the number of clusters as an input parameter and can handle outliers efficiently and directly (He Z., et al. 2002). The Squeezer algorithm was tested on the Votes and Mushroom datasets and compared with other algorithms such as ROCK (Guha S., et al. 1999). The authors concluded that the Squeezer and ROCK algorithms can generate high-quality clusters.
ccdByEnsemble algorithm. ccdByEnsemble (He Z., et al. 2005), an abbreviation of Clustering Categorical Data By Cluster Ensemble, aims to obtain the best possible partition by running multiple clustering algorithms, including the Cluster-based Similarity Partition Algorithm (CSPA) (Kuncheva L.I., et al.
2006), the HyperGraph Partitioning Algorithm (HGPA) (Mobasher B., et al. 2000), and the Meta-Clustering Algorithm (MCLA) (Strehl A. and Ghosh J. 2003) on the same dataset and then selecting the one that produces the highest ANMI (He Z., et al. 2005) as the final partition. ccdByEnsemble was tested on four datasets and compared with Squeezer and GAClust, a genetic-algorithm-based clustering algorithm proposed by (Cristofor D. and Simovici D. 2002). The algorithm won on two datasets (Votes and Cancer) and lost on the other two (Mushroom and Zoo); in general, it performed more or less the same as the compared methods. As can be seen, all the reviewed algorithms use existing similarity or dissimilarity measures, and common weaknesses among them are sensitivity to the initial seeds (k-means), inability to determine the k value (k-means, k-modes and k-ANMI), and unsuitability for large datasets (TwoStep). Besides this, no single algorithm performed consistently well in the tested cases.
3. A NOVEL CLUSTERING ALGORITHM
Based on the literature study, we propose a three-stage clustering algorithm. The main objective of the first stage is to build up the initial clusters, the second stage is to refine the initial clusters, and the third stage is responsible for refining the initial BASES, if necessary. Another important feature of our algorithm is a mechanism to estimate the number of clusters, which is done in a preprocessing step. The framework of the algorithm is shown in Figure 1.
Figure 1. Framework of the Proposed Algorithm (Stage 1: Produce Initial Clusters; Stage 2: Refine Initial Clusters; Stage 3: Refine Bases; data samples move between the clusters until no further movement occurs).
3.1 First Stage
The first task in the first stage is to find a BASE. A BASE is a real sample that acts like a medoid or a centroid. The major steps in the first stage are:
1. Find a BASE:
1) Find a mode or a centroid:
• Find a mode (medoid) for each categorical feature. Calculate the frequency of each category for all the categorical features and then take the most frequent category in each feature as its mode.
• Calculate the average (centroid) for each numerical feature.
2) Construct a sample with the modes and centroids.
3) Calculate the similarity between the constructed sample and all the samples in the dataset by using the proposed similarity measure (described in section 4.2).
4) Select the sample that has the highest similarity value with the constructed sample as a BASE.
2. Calculate the similarity between the obtained BASE and the remaining samples.
3. Those samples that have a similarity value higher than or equal to the set threshold are assigned to the BASE’s cluster.
4. If there are any samples that have not been assigned to any cluster, then a new BASE is required.
5. Repeat steps 1 to 4 until no samples are left.
We give a simple example here to illustrate how a BASE is found. Table 1 below shows a sample of the Balloon dataset (Merz C.J. and P. 1996). As all the features are categorical: (1) calculate the frequency of all categorical features (see Table 2-A); (2) construct the mode sample, which comprises the most frequent category of each feature (see Table 2-B); (3) calculate the similarity between the mode sample and all the samples using the similarity measure; (4) select the sample that has the highest similarity value with the constructed mode sample as a BASE. As the constructed mode sample is identical to the second sample, that sample has a similarity value of 1 and is therefore selected as the first BASE. At the end of the first stage, the algorithm will produce k clusters with all the data samples assigned to their clusters.
Table 1. Balloon Dataset

No.  Inflated  Color   Size   Act      Age
1    T         Purple  Large  Stretch  Child
2    T         Yellow  Small  Stretch  Adult
3    T         Yellow  Small  Stretch  Child
4    T         Yellow  Small  DIP      Adult
5    T         Yellow  Large  Stretch  Adult
6    T         Purple  Small  DIP      Adult
7    T         Purple  Small  Stretch  Adult
8    T         Purple  Large  DIP      Adult
9    F         Yellow  Small  DIP      Child
10   F         Yellow  Small  Stretch  Child

Table 2-A. Frequency of Categories of Each Feature

Feature    Frequency
Inflated   T = 8, F = 2
Color      Purple = 4, Yellow = 6
Size       Large = 3, Small = 7
Act        Stretch = 6, DIP = 4
Age        Child = 4, Adult = 6

Table 2-B. Constructed Mode Sample (Mode of Each Feature)

mode: T, Yellow, Small, Stretch, Adult
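As an illustration only, the mode construction and BASE selection of steps 1(1)-1(4) can be sketched for the all-categorical Balloon data as follows; the matching-based similarity used here is just the categorical special case of Equation (1) in Section 4.2, and all function names are our own.

```python
import numpy as np

# Balloon-style rows: all features categorical (Inflated, Color, Size, Act, Age).
X = np.array([
    ["T", "Purple", "Large", "Stretch", "Child"],
    ["T", "Yellow", "Small", "Stretch", "Adult"],
    ["T", "Yellow", "Small", "Stretch", "Child"],
    ["T", "Yellow", "Small", "DIP",     "Adult"],
    ["T", "Yellow", "Large", "Stretch", "Adult"],
    ["T", "Purple", "Small", "DIP",     "Adult"],
    ["T", "Purple", "Small", "Stretch", "Adult"],
    ["T", "Purple", "Large", "DIP",     "Adult"],
    ["F", "Yellow", "Small", "DIP",     "Child"],
    ["F", "Yellow", "Small", "Stretch", "Child"],
])

def mode_sample(X):
    # Steps 1(1)-1(2): take the most frequent category of every feature (Table 2-B).
    modes = []
    for j in range(X.shape[1]):
        values, counts = np.unique(X[:, j], return_counts=True)
        modes.append(values[np.argmax(counts)])
    return np.array(modes)

def categorical_similarity(a, b):
    # For purely categorical data the similarity reduces to the fraction of
    # features with identical values (categorical term of Equation (1)).
    return float(np.mean(a == b))

def find_first_base(X):
    # Steps 1(3)-1(4): the sample most similar to the constructed mode sample is the BASE.
    m = mode_sample(X)
    sims = np.array([categorical_similarity(x, m) for x in X])
    return int(np.argmax(sims)), sims

base_idx, sims = find_first_base(X)
print(mode_sample(X))              # ['T' 'Yellow' 'Small' 'Stretch' 'Adult']
print(base_idx, sims[base_idx])    # 1 1.0  -> the second sample, similarity 1
```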
3.2 Second Stage
The second stage commences by selecting the BASE of the second obtained cluster and calculating its similarity with all the samples in the first cluster. This is done because the second BASE has not yet been used to calculate the similarity with the samples in the first cluster. If any record has a greater similarity to the second BASE than to the BASE of its original cluster, the record is moved to the second cluster. This process goes through all the remaining clusters.
3.3 Third Stage
The objective of the third stage is to refine the initial BASES to see whether the solution can be further improved. The main steps in this stage include:
1. Calculate the frequency of all categorical features in the first refined cluster.
2. Construct a mode/centroid sample by following the steps mentioned in stage 1.
3. Calculate the similarity between the constructed sample and the cluster’s samples.
4. Select the new BASE, which is the sample most similar to the constructed mode/centroid sample.
5. Calculate the similarity between the new BASE and the cluster’s samples.
6. Repeat steps 1 to 5 for the remaining refined clusters.
7. If the obtained BASES differ from the original ones, repeat the second stage; otherwise, the clustering process is terminated.
8. Repeat the third stage until no data sample is moved between clusters.
3.4 Automatically Estimating the Appropriate Number of Clusters
Determining the appropriate number of clusters is a critical and challenging task in clustering analysis. Sometimes, for the same dataset, there may be different answers depending on the purpose and criterion of the study (Strehl A. 2002). We devised a mechanism, as a component of our proposed algorithm, to identify the appropriate number of clusters, k, automatically. This is achieved by running the proposed algorithm with a varying similarity threshold (θ), starting from a value of 1%, until the interval lengths (L) start getting very small continuously (L < 2%). An interval is the range of θ over which the algorithm continuously produces a constant value of k. We then terminate the scan, take the first longest interval at which k is constant, and stop the algorithm at the θ value that produces the better average intra-cluster similarity for all clusters (a sketch of this scan is given after the example below). This approach has been integrated into our clustering method as a preprocessing function and tested in the experiments. The results confirmed that it works well in most cases, because the numbers of clusters it identified are either the same as or very close to the number of true classes.
To better understand this, consider the following example on the Cancer dataset. In Table 3, columns 1 to 3 present the range of the similarity threshold (θ), the number of clusters, and the interval length, respectively. Although the longest interval is at k = 1, this interval is ignored since k = 1 is of no interest. Therefore, the appropriate number of clusters is 3, as it has the longest remaining interval (see Table 3 and Figure 2).
Table 3. Intervals of Cancer Dataset

Threshold values (%)   k   Interval length, L (%)
1 – 33.1               1   32.1
33.2 – 47.7            3   12.5
47.8 – 53.4            4   5.6
53.5 – 55.7            5   2.2
55.8 – 56.5            6   0.7
56.6 – 57.4            7   0.8
57.5 – 58.6            8   1.1

Figure 2. Intervals of Cancer Dataset (number of clusters k plotted against the threshold value, in %).
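As an illustration only, the scan described in Section 3.4 could be sketched as follows. The helper cluster_at_threshold, the 0.1% step, the stopping rule of three consecutive short intervals and the omission of the intra-cluster-similarity tie-break are all our own assumptions, not part of the paper.

```python
def estimate_k(cluster_at_threshold, theta_step=0.1, min_length=2.0, max_theta=100.0):
    """Scan the similarity threshold theta (in %) and return the k of the first
    longest interval over which k stays constant, ignoring k = 1.

    cluster_at_threshold(theta) is assumed to run the three-stage algorithm for
    a fixed threshold and return the number of clusters it produces."""
    intervals = []                          # (k, interval length in % of theta)
    theta, current_k, start, short_run = 1.0, None, 1.0, 0
    while theta <= max_theta:
        k = cluster_at_threshold(theta)
        if k != current_k:
            if current_k is not None:
                length = theta - start
                intervals.append((current_k, length))
                # "interval lengths start getting very small continuously"
                short_run = short_run + 1 if length < min_length else 0
                if short_run >= 3:
                    break
            current_k, start = k, theta
        theta += theta_step
    candidates = [(k, length) for k, length in intervals if k > 1]
    return max(candidates, key=lambda c: c[1])[0] if candidates else 1
```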
4. SIMILARITY MEASURES
Since measuring the similarity between data samples plays an essential role in all clustering techniques and can determine their performance, after studying the common existing similarity measures and evaluating their weaknesses, we propose a new similarity measure.
4.1 Existing Similarity Measures
In practice, most similarity measures are defined based on the ‘distance’ between data points, and some popular measures are listed in Table 4. The common major weaknesses of these measures are:
1. They are unable to handle categorical features.
2. They are unsuitable for unweighted features; therefore, one feature with large values might dominate the distance measure.
3. They are unable to reflect the degree of change between data samples.
To address these weaknesses we propose a novel similarity measure, described in the next section.
Table 4. Existing Similarity Measures

Similarity measure            Equation
Squared Euclidean distance    $d(x, y) = \sum_{i=1}^{N} (x_i - y_i)^2$
Euclidean distance            $d(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$
Correlation                   $\mathrm{Correlation}(x, y) = \frac{\sum_{i=1}^{N} z_{x_i} z_{y_i}}{N - 1}$
Cosine                        $\mathrm{cosine}(x, y) = \frac{\sum_{i=1}^{N} x_i y_i}{\sqrt{(\sum_i x_i^2)(\sum_i y_i^2)}}$
Chebychev (chy)               $\mathrm{chy}(x, y) = \max_i |x_i - y_i|$
Manhattan distance            $\mathrm{Block}(x, y) = \sum_i |x_i - y_i|$
Minkowski (p)                 $d_p(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$
Power (p, r)                  $\mathrm{Power}(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{1/r}$
4.2 A Novel Similarity Measure
The new similarity measure, given in Equation (1), was defined particularly to reflect the degree of relative change between samples and to cope with both numerical and categorical variables. For numerical variables, Term 1 in Equation (1) is used. For categorical variables, the similarity between two data samples is the number of variables that have the same categorical values in the two samples being considered, and is calculated by Term 2. Here, Sim is an abbreviation of similarity, N is the number of features, x represents a sample, i is the sample index, j is the feature index, B is the BASE, and k is the index of clusters and BASES. R and Cat represent the numerical and categorical features, respectively. In this similarity measure, the similarity value between an input x_ij and a BASE value B_kj is scaled to [0, 1]; thus, no single feature can dominate the similarity measure. This definition can be extended to measure the similarity between any two samples, not only between a sample and a BASE. A more detailed analysis and test of this new definition will be presented in a separate paper.
$$\mathrm{Sim}(x_i, B_k) = 1 - \frac{1}{N}\left[\,\sum_{j=1,\; x_{ij} \in R}^{N} \frac{|x_{ij} - B_{kj}|}{\max\{x_{ij}, B_{kj}\}} \;+\; \sum_{j=1,\; x_{ij} \in Cat}^{N} \big(1 \text{ if } x_{ij} \neq B_{kj},\ 0 \text{ otherwise}\big)\right] \qquad (1)$$
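A minimal sketch of Equation (1) in code, assuming positive-valued numerical features (so the max in the denominator is non-zero) and reading the categorical term as a mismatch indicator so that identical samples obtain similarity 1, as in the Balloon example; the function and variable names are ours.

```python
import numpy as np

def sim(x, base, is_categorical):
    """Equation (1): similarity between a sample x and a BASE B_k for mixed data.

    x, base        : 1-D arrays of length N (object dtype for mixed types)
    is_categorical : boolean mask marking the categorical features"""
    n = len(x)
    penalty = 0.0
    for j in range(n):
        if is_categorical[j]:
            # categorical term: contributes only when the categories differ
            penalty += 0.0 if x[j] == base[j] else 1.0
        else:
            xj, bj = float(x[j]), float(base[j])
            m = max(xj, bj)
            # numerical term: relative change, scaled to [0, 1] for positive features
            penalty += abs(xj - bj) / m if m != 0 else 0.0
    return 1.0 - penalty / n

x = np.array(["T", "Yellow", 3.0], dtype=object)
b = np.array(["T", "Yellow", 6.0], dtype=object)
print(sim(x, b, [True, True, False]))   # 1 - (1/3) * (|3 - 6| / 6) = 0.8333...
```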
5. EXPERIMENTS AND EVALUATION
To evaluate the accuracy of the proposed algorithm and the effectiveness of the new similarity measure, we implemented them and conducted experiments using the same benchmark datasets that were used by the compared methods mentioned in the earlier section, to make the comparison as fair as possible. The basic strategy of our comparison is to take the most commonly used k-means as a baseline, and TwoStep as a competing target because it is generally considered a more accurate algorithm. Before presenting the experimental results and carrying out the intended comparison, we give the criteria for measuring clustering accuracy in section 5.1 and the method of using the data in section 5.2.
5.1 Measuring the Clustering Accuracy
One of the most important issues in clustering is how to measure and evaluate the clustering performance, usually in terms of accuracy. In unsupervised clustering, there is no absolute criterion for measuring the accuracy of clustering results. However, in some cases where the class labels are available, the quality of a partition can be assessed by measuring how close the clustering results are to the known groupings in the dataset. Thus, the correct clusters should be those clusters that have all the samples with the same labels within their own cluster. It should be noted that, with such a strategy, the class label is not included during the clustering process but is only used at the end of the clustering procedure to assess the partition quality. In practice, the accuracy r is commonly measured by
$$r = \frac{1}{n} \sum_{i=1}^{k} a_i$$
(Huang Z. 1998), where a_i is the number of majority samples with the same label in cluster i, and n is the total number of samples in the dataset. Hence, the clustering error can be obtained as e = 1 - r.
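A small helper computing r as defined above; the function and variable names are illustrative.

```python
from collections import Counter

def clustering_accuracy(cluster_assignments, true_labels):
    """r = (1/n) * sum_i a_i, where a_i is the count of the majority label in cluster i."""
    n = len(true_labels)
    clusters = {}
    for c, y in zip(cluster_assignments, true_labels):
        clusters.setdefault(c, []).append(y)
    majority_total = sum(Counter(labels).most_common(1)[0][1] for labels in clusters.values())
    return majority_total / n

# two clusters over six samples: a_1 = 3, a_2 = 2, so r = 5/6 and e = 1 - r = 1/6
r = clustering_accuracy([0, 0, 0, 1, 1, 1], ["a", "a", "a", "a", "b", "b"])
print(r, 1 - r)
```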
5.2 Testing Datasets
We used a total of 13 other datasets in our testing experiments, but for a fair comparison with other algorithms we present in this paper just the results for 3 datasets, because they are the only 3 datasets used by all the compared methods, and we do not have the programs of those methods to run on other datasets. Table 5 shows the demographic details of the 3 benchmark datasets (obtained from the UCI Machine Learning Repository (Merz C.J. and P. 1996)) that have been used in our experiments. Following a commonly used strategy in clustering experiments, the whole dataset is used in the experiments.

Table 5. Details of the Benchmark Datasets

Datasets   No. of classes   No. of Samples   No. of categorical features (C)   No. of numerical features (N)
Mushroom   2                8124             22                                0
Votes      2                435              16                                0
Cancer     2                699              0                                 9
5.3 Experimental Results
He et al. (He Z., et al. 2008) used the Squeezer, GAClust, ccdByEnsemble, k-modes, and k-ANMI algorithms to cluster the Votes, Cancer and Mushroom datasets. As most of these algorithms lack the ability to identify the appropriate number of clusters, different numbers of clusters ranging from 2 to 9 were chosen in their experiments. Thus, to make like-to-like comparisons, we used the accuracy of each algorithm at the value of k that is estimated automatically by our proposed algorithm. Table 6 lists the results and comparison rankings for the Cancer, Mushroom and Votes datasets. With respect to the clustering accuracy on the Cancer dataset, it has been shown by He et al. (He Z., et al. 2008) that the k-ANMI algorithm produces the best clustering results for this dataset among their 5 methods. However, in our experiments with 8 clustering algorithms, k-ANMI was only ranked in the 4th position, about 1.2% lower than the result (96.7%) of our algorithm. For this particular dataset, k-means produced a slightly better result (by 0.2%) than ours, and TwoStep ranked third. For the Mushroom dataset, it is clear that none of the compared algorithms used in (He Z., et al. 2008) managed to reach an accuracy of even 70%. In contrast, both our proposed algorithm and the TwoStep algorithm achieved an accuracy of 89%. This accuracy is nearly 20% better than ccdByEnsemble and k-means, and 30% better than Squeezer, GAClust, k-modes and k-ANMI.
For the Votes dataset, the k-modes and k-means algorithms yield the best and the second best accuracy, respectively. Our proposed algorithm achieved the third best accuracy and TwoStep the fourth best. However, our accuracy is comparable to the accuracy of k-modes, which performed the best.
Table 6. Experimental Results of Cancer, Mushroom and Votes Datasets

Algorithms       Cancer (C), k=3        Mushroom (M), k=2      Votes (V), k=4
                 Accuracy%  Ranking     Accuracy%  Ranking     Accuracy%  Ranking
Squeezer         ≈90        7           ≈54        8           ≈84.9      8
GAClust          ≈80        8           ≈61        5           ≈85.1      7
ccdByEnsemble    ≈94        5           ≈68        3           ≈88        5
k-modes          ≈92        6           ≈57        7           ≈92        1
k-ANMI           ≈95.5      4           ≈58        6           ≈88        5
k-means          96.9       1           67.8       4           90.8       2
TwoStep          96.4       3           89         1           87.6       4
Our Algorithm    96.7       2           89         1           89.9       3
5.4 Evaluation and Discussion
As TwoStep is used as a comparison target, it can be seen that our algorithm performed better than TwoStep: it won in two cases (the Cancer and Votes datasets) and tied in one case (the Mushroom dataset). Regarding the k-means algorithm, although it performed the best on the Cancer dataset and the second best on the Votes dataset, it performed only the fourth best on the Mushroom dataset, where its accuracy was 21% lower than that of our algorithm. k-modes worked best only for the Votes dataset; it scored the sixth and seventh positions on the Cancer and Mushroom datasets, respectively, and its overall ranking is fifth (see Table 7). So both k-means and k-modes are not consistent, and it is hard to know when they will do better and when they will do worse.
The relative performance of the proposed algorithm is summarised in Table 7. In this table, columns C, M, and V represent the rank of each algorithm among the 8 algorithms on the Cancer, Mushroom and Votes datasets, respectively. The second last column presents the sum of the rankings of an algorithm over all the testing datasets; the smaller the sum, the better the overall performance in terms of accuracy and consistency on all the datasets. The final ranking of each algorithm is presented in the last column. As shown, among the 8 algorithms our algorithm achieved the best overall clustering performance.

Table 7. Ranking of the Algorithms

Algorithms       C   M   V   Sum of Rankings   Final Ranking
Squeezer         7   8   8   23                8
GAClust          8   5   7   20                7
ccdByEnsemble    5   3   5   13                4
k-modes          6   7   1   14                5
k-ANMI           4   6   5   15                6
k-means          1   4   2   7                 2
TwoStep          3   2   4   9                 3
Our Algorithm    2   1   3   6                 1

Table 8 summarises the accuracy differences between our algorithm and the compared algorithms. As shown, when our algorithm achieved the best clustering results, its accuracy was 21% higher than the second best performance and 35% better than the worst performance. When our algorithm scored the second best accuracy, it was 0.2% less than the best algorithm; in contrast, it was 16.7% better than the lowest accuracy. Concerning the case where our algorithm scored the third position, although it was 2.1% less than the highest clustering accuracy, it was 5% higher than the worst algorithm.

Table 8. Summary of Experimental Results

Ranking of our algorithm   1       2        3
Max.                       +21%    −0.2%    −2.1%
Min.                       +35%    +16.7%   +5%

Please note that we conducted more extensive experiments on other datasets, including Wine, Soybean, Credit Approval, Cleve, Zoo, Half-rings and 2-spirals, and compared them with other clustering algorithms. The experimental results showed that the proposed algorithm outperformed the compared clustering techniques on most of the datasets. It worked best for the Soybean and Wine datasets, as its accuracies were 100%
and 93.3%, respectively. The reason for not including all the results in this paper is that the compared algorithms did not use the above datasets.
6. CONCLUSIONS
In this paper, we proposed a novel clustering algorithm and a new similarity definition. The proposed algorithm consists of three stages. With respect to clustering accuracy, the experimental results show that our algorithm was consistently ranked highly among the 8 clustering algorithms compared, which indicates that our algorithm is, in general, accurate, consistent and reliable. More importantly, our algorithm does not need the number of clusters, k, to be specified, as it is estimated automatically. In addition, it is able to handle both numerical and categorical variables. As the similarity value between features is scaled to [0, 1], all features have the same weight in calculating the overall similarity value. On the other hand, it should be pointed out that our proposed algorithm has a relatively high computational complexity, because the process of finding and refining the BASES is time consuming; this has not been particularly addressed as it was not the focus of the study at this stage, but it can be improved later. Future work will involve conducting more experiments to refine the proposed algorithm, improving its complexity and investigating clustering ensemble methods.
REFERENCES
Cristofor D. and Simovici D. 2002. "Finding median partitions using information-theoretical-based genetic algorithms." Journal of Universal Computer Science, Vol. 8, No. 2, pp. 153–172.
Guha S., et al. 1999. Rock: A robust clustering algorithm for categorical attributes. Proceedings 15th International Conference on Data Engineering, Sydney, Australia, pp. 512–521.
He Z., et al. 2002. "Squeezer: An efficient algorithm for clustering categorical data." Journal of Computer Science and Technology, Vol. 17, No. 5, pp. 611–624.
He Z., et al. 2005. "A cluster ensemble method for clustering categorical data." Information Fusion, Vol. 6, No. 2, p. 143.
He Z., et al. 2008. "K-ANMI: A Mutual Information Based Clustering Algorithm for Categorical Data." Information Fusion, Vol. 12, No. 2, pp. 223–233.
Huang Z. 1997. A fast clustering algorithm to cluster very large categorical data sets in data mining. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 1–8.
Huang Z. 1998. "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values." Data Mining and Knowledge Discovery, Vol. 2, No. 3, pp. 283–304.
Jain K. and Fred L.N. 2002. Data Clustering Using Evidence Accumulation. 16th International Conference on Pattern Recognition ICPR'02, Quebec City.
Kuncheva L.I., et al. 2006. Experimental Comparison of Cluster Ensemble Methods. 9th International Conference on Information Fusion, Florence.
Merz C.J. and M. P. 1996. "UCI Repository of Machine Learning Databases."
Mobasher B., et al. 2000. Discovery of Aggregate Usage Profiles for Web Personalization. Proceedings of the Workshop on Web Mining for E-Commerce.
SPSS1. 2001. "TwoStep Cluster Analysis." From http://www1.uni-hamburg.de/RRZ/Software/SPSS/Algorith.120/twostep_cluster.pdf.
SPSS2. 2001. "The SPSS TwoStep Cluster Component: A scalable component enabling more efficient customer segmentation." Retrieved April 2007, from http://www.spss.com/pdfs/S115AD8-1202A.pdf.
Strehl A. 2002. Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining. Austin, University of Texas, p. 232.
Strehl A. and Ghosh J. 2003. "Cluster ensembles - a knowledge reuse framework for combining multiple partitions." Journal of Machine Learning Research, Vol. 3, No. 3, pp. 583–617.
Zhang, T., et al. 1996. BIRCH: An Efficient Data Clustering Method for Very Large Databases. International Conference on Management of Data, Montreal, Canada, ACM.
BEHAVIOURAL FINANCE AS A MULTI-INSTANCE LEARNING PROBLEM Piotr Juszczak General Practice Research Database, MHRA, 1 Nine Elms Lane, SW8 5NQ, London, United Kingdom
ABSTRACT
In various application domains, including image recognition, text recognition and the subject of this paper, behavioural finance, it is natural to represent each example as a set of vectors. However, most traditional data analysis methods are based on relations between individual vectors only. To cope with sets of vectors, instead of single-vector descriptions, existing methods have to be modified. The main challenge is to derive meaningful similarity or dissimilarity measures between sets of vectors. In this paper, we derive several dissimilarity measures between sets of vectors. The derived dissimilarities are used as building blocks of data analysis methods, such as kernel-based clustering and SVM classification. The performance of the proposed methods is examined on consumer credit card behaviour problems. These problems are shown to be an example of multi-instance learning problems.
KEYWORDS
behavioural finance, multi-instance learning
1. INTRODUCTION
Multiple Instance Learning (MIL) is a variation of supervised learning with labelled sets of vectors rather than individual vectors. In supervised learning, every training example is assigned a discrete or real-valued label. In comparison, in MIL the labels are only assigned to sets (also called bags) of examples. In the binary case, a set is labelled positive if all examples in that set are positive, and the set is labelled negative if all the examples in it are negative. The goal of MIL is, for example, to classify unseen sets based on the labelled sets as the training data. For example, a grey-value image can be considered as a collection of pixels, i.e. vectors of intensity values. It is natural to compute the distance between images represented in this way as a distance between sets of vectors. This approach has a major advantage over the approach where features are derived from images, e.g. Gabor filters: namely, if a meaningful distance is derived, the classes may become separable, as the information about the examples is not reduced to features. Early studies on MIL were motivated by the problem of predicting drug molecule activity levels. Subsequently, many MIL methods have been proposed, such as learning axis-parallel concepts (Dietterich et al., 1997), diverse density (Maron & Lozano-Pérez, 1998) and extended Citation kNN (Wang & Zucker, 2000), to mention a few. These methods have been applied to a wide spectrum of applications ranging from image concept learning and text categorisation to stock market prediction. The early MIL work (Dietterich et al., 1997) was motivated by the problem of determining whether a drug molecule will bind strongly to a target protein. As examples, some molecules that bind well (positive examples) and some molecules that do not bind well (negative examples) are provided. A molecule may adopt a wide range of shapes or conformations. Therefore, it is very natural to model each molecule as a set, with the shapes it can adopt as the instances in that set. (Dietterich et al., 1997) showed that MIL approaches significantly outperform normal supervised learning approaches which ignore the multi-instance nature of MIL problems. (Maron & Lozano-Pérez, 1998) partitioned natural scene pictures from the Corel Image Gallery into
fixed-sized sub-images and applied a multi-instance learning algorithm, called diverse density (DD), to classify them into semantic classes. (Yang & Lozano-Perez, 2000) used a similar approach for content-based image retrieval. (Zhang & Goldman, 2001) compared DD with an expectation-maximisation (EM) version of the DD algorithm for image retrieval. Similar to the argument made about images, a text document can consist of multiple passages concerned with different topics, and thus descriptions at the document level might be too rough. (Andrews et al., 2002) applied SVM-based MIL methods to the problem of text categorisation, where each document is represented by overlapping passages. The popular k-Nearest Neighbour (k-NN) approach can be adapted for MIL problems if the distance between sets is defined. In (Wang & Zucker, 2000), the minimum Hausdorff distance was used as the set-level distance metric, defined as the shortest distance between any two instances from each set. Using this set-level distance, we can predict the label of an unlabelled set. (Wang & Zucker, 2000) proposed a modified k-NN algorithm called Citation k-NN. In this paper, we follow a similar approach. We focus on the derivation of distances between sets of vectors and then use the derived distances as a basis to define kernels, or use them directly with distance-based analysis methods. The proposed distances are evaluated on behavioural finance problems.
2. DISTANCES BETWEEN SETS OF VECTORS
To show the challenge of deriving a distance for MIL, consider the following example shown in figure 1. The first subfigure, figure 1(a), shows a toy problem: a cup on a table, with the table located on a floor. We would like to define a distance between these three objects: cup (C), table (T) and floor (F). We are not necessarily restricted to a Euclidean distance, but may choose another possibility.
Figure 1. (a) Illustration Of A Toy Problem: A Cup (C), A Table (T) And A Floor (F), (b) Distances Between The Three Objects, (c) Embedding Of Distances In A Euclidean Space, (d) Embedding Of Distances In A Pseudo-Euclidean Space.
As the Euclidean distance is defined between points, the three objects have to be reduced to points, e.g. to their centres of mass or centres of gravity. Therefore, the distance is not defined between the objects but between simplified versions of them. Alternatively, we may define a distance between these objects as the smallest distance between any of their parts. Thus, since the cup touches the table, the distance between them is zero. Since the table touches the floor, the distance between them is also zero. However, because the cup does not touch the floor, the distance between the cup and the floor is different from zero. For example, it may be set to one. Figure 1(b) shows a table with such computed distances between these three objects. Note, however, that we cannot embed these distances into a Euclidean space; see figure 1(c). To embed these relations one needs a much richer space, called a Pseudo-Euclidean space (Pekalska & Duin, 2005). The space is constructed from a negative and a positive part; see figure 1(d). Dashed lines indicate zero distances between objects and continuous lines a boundary of allowed distances. For a broader discussion on possible distance measures see (Pekalska & Duin, 2005). This simple example shows the importance of the definition of distance. The example also shows some pitfalls that arise when objects are not represented as points. As a concrete example, in modern mathematical physics, string theoreticians face the problem of defining distance relations between strings and not simply points as in classical physics. This gives rise to new methods and new mathematics. In data analysis we often find similar problems and would benefit from not reducing objects to points in a feature space.
2.1 Proposed Distance Measures
In this paper we investigate the case when examples are presented as collections S = {x_1, x_2, ..., x_n} of d-dimensional vectors x_i ∈ R^d or x_i ∈ S ⊂ R^d. Examples are associated with sets and labels are assigned to sets. We denote positive sets as S_i^+, and the jth example in that set as S_ij^+. Suppose each example can be represented by a vector, and we use S_ijk^+ to denote the value of the kth feature of that example. Likewise, S_i^- denotes a negative set and S_ij^- the jth example in that set. Let us describe each set S_i by a descriptor that encloses all vectors from that set. For example, we can describe each set by the minimum volume sphere or ellipsoid that encloses all vectors from the set (Juszczak, 2006; Juszczak et al., 2009). The similarity between two sets can then be measured by the volume of the common part of these descriptors. Figure 2 shows two examples of descriptors, with the data vectors omitted for clarity.
2.1.1 Distance Based On Volume Of Overlapping Spheres
The simplest type of such a descriptor is a minimum volume sphere. As the first distance measure between two sets, S_1 and S_2, we propose to measure distance based on the volume of the overlap between spheres. We describe each set by the sphere of minimum volume that contains all the vectors from that set (Tax & Duin, 2004). The volume of the overlap V_o is a measure of the similarity between the two sets; this volume is scaled by dividing by the sum of the volumes of the spheres:
$$D(S_1, S_2) = 1 - \frac{V_o}{V_{S(A_1, R_1)} + V_{S(A_2, R_2)}} \qquad (1)$$
where S(A_i, R_i) denotes a sphere with centre A_i and radius R_i.
Figure 2. Examples Of Similarity Measures Between Sets. Vectors Are Omitted For Clarity. (a) A Similarity Measure Based On An Overlap Between Two Spheres S(A_1, R_1) And S(A_2, R_2). (b) A Similarity Measure Based On An Overlap Between Two Arbitrarily Shaped Descriptors.
The volume of a single overlap V_o between two spheres equals the sum of the volumes of two spherical caps. A spherical cap (Harris & Stocker, 1998) is a part of a sphere defined by its height h_c ∈ [0, 2R] and radius r_c ∈ [0, 2R]; see figure 2(a), where the overlap, the grey region in the figure, is formed from two spherical caps of the spheres S(A_1, R_1) and S(A_2, R_2). Note that these two spherical caps have the same radius r_c but different heights h_c. We have derived (see the Appendix for a derivation) the volume of a d-dimensional spherical cap as an integral of (d−1)-dimensional spheres over the height h_c of the cap:
$$V_{cap}(R, h_c) = \frac{\pi^{(d-1)/2} R^{d-1}}{\Gamma((d-1)/2 + 1)} \int_{0}^{\beta_{max}} \sin^{d-1}(\beta)\, d\beta \qquad (2)$$
where $\beta_{max} = \arcsin\left(\sqrt{(2R - h_c)\, h_c / R^2}\right)$.
Therefore, the volume of a single overlap V_o between two spheres can be computed as the sum of two spherical caps:
$$V_o = V_{cap}(R_1, h_{c1}) + V_{cap}(R_2, h_{c2}) \qquad (3)$$
and therefore the distance D(S_1, S_2) between two sets S_1 and S_2 can be computed as:
$$D(S_1, S_2) = 1 - \frac{V_{cap}(R_1, h_{c1}) + V_{cap}(R_2, h_{c2})}{V_{S(A_1, R_1)} + V_{S(A_2, R_2)}} \qquad (4)$$
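The following is a rough sketch of Equations (2)-(4), with the cap heights h_c1, h_c2 taken from the Appendix. The enclosing sphere is approximated here simply by the centroid and the farthest sample, whereas the paper uses the minimum-volume enclosing sphere of Tax & Duin (2004); the numerical integration and all function names are our own choices.

```python
import numpy as np
from math import gamma, pi, asin, sqrt

def enclosing_sphere(X):
    # Crude stand-in for the minimum-volume enclosing sphere used in the paper
    # (Tax & Duin, 2004): centre at the mean, radius to the farthest sample.
    A = X.mean(axis=0)
    R = float(np.max(np.linalg.norm(X - A, axis=1)))
    return A, R

def sphere_volume(R, d):
    return pi ** (d / 2) / gamma(d / 2 + 1) * R ** d

def cap_volume(R, hc, d, steps=2000):
    # Equation (2): volume of a d-dimensional spherical cap of height hc.
    if hc <= 0:
        return 0.0
    hc = min(hc, 2 * R)
    beta_max = asin(sqrt((2 * R - hc) * hc / R ** 2))
    beta = np.linspace(0.0, beta_max, steps)
    integral = np.trapz(np.sin(beta) ** (d - 1), beta)
    return pi ** ((d - 1) / 2) * R ** (d - 1) / gamma((d - 1) / 2 + 1) * integral

def sphere_overlap_distance(X1, X2):
    # Equations (3)-(4): distance based on the overlap of the two enclosing spheres,
    # with the cap heights hc1, hc2 taken from the Appendix.
    d = X1.shape[1]
    A1, R1 = enclosing_sphere(X1)
    A2, R2 = enclosing_sphere(X2)
    dist = float(np.linalg.norm(A1 - A2))
    if dist >= R1 + R2:                        # disjoint spheres: no overlap
        Vo = 0.0
    elif dist + min(R1, R2) <= max(R1, R2):    # one sphere contained in the other
        Vo = sphere_volume(min(R1, R2), d)
    else:
        hc1 = R1 - (dist ** 2 - R2 ** 2 + R1 ** 2) / (2 * dist)
        hc2 = R2 - (dist ** 2 + R2 ** 2 - R1 ** 2) / (2 * dist)
        Vo = cap_volume(R1, hc1, d) + cap_volume(R2, hc2, d)
    return 1.0 - Vo / (sphere_volume(R1, d) + sphere_volume(R2, d))
```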
2.1.2 Related Work
There are several well-known definitions of similarity or distance between distributions, including the Kullback-Leibler divergence, the Fisher kernel and the χ² distance (Duda et al., 2001). As an example we describe in more depth a similarity measure based on Bhattacharyya's distance between distributions (Bhattacharyya, 1943),
$$K(x, x') = K(p, p') = \int \sqrt{p(x)}\, \sqrt{p'(x)}\, dx, \qquad (5)$$
For multivariate normal distributions N(μ, Σ), Bhattacharyya's distance can be computed in closed form (Duda et al., 2001) as:
$$K(p, p') = |\Sigma|^{-\frac{1}{4}} |\Sigma'|^{-\frac{1}{4}} \left|\tfrac{1}{2}\Sigma^{-1} + \tfrac{1}{2}\Sigma'^{-1}\right|^{-\frac{1}{2}} \exp\!\Big(-\tfrac{1}{4}\mu^T\Sigma^{-1}\mu - \tfrac{1}{4}\mu'^T\Sigma'^{-1}\mu' + \tfrac{1}{4}(\Sigma^{-1}\mu + \Sigma'^{-1}\mu')^T(\Sigma^{-1} + \Sigma'^{-1})^{-1}(\Sigma^{-1}\mu + \Sigma'^{-1}\mu')\Big) \qquad (6)$$
This similarity measure is proportional to the common part between the two Gaussians that describe the data.
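For illustration, Equation (6) can be evaluated directly from the sample means and covariances of two sets; this is only a sketch (the names are ours, and the small ridge term added for numerical stability is not part of the formula).

```python
import numpy as np

def bhattacharyya_kernel(X1, X2, ridge=1e-6):
    # Fit a Gaussian N(mu, Sigma) to each set and evaluate the closed form (6).
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    d = X1.shape[1]
    S1 = np.cov(X1, rowvar=False) + ridge * np.eye(d)
    S2 = np.cov(X2, rowvar=False) + ridge * np.eye(d)
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)      # precision matrices
    m = P1 @ mu1 + P2 @ mu2
    det_term = (np.linalg.det(S1) ** -0.25 *
                np.linalg.det(S2) ** -0.25 *
                np.linalg.det(0.5 * P1 + 0.5 * P2) ** -0.5)
    exponent = (-0.25 * mu1 @ P1 @ mu1
                - 0.25 * mu2 @ P2 @ mu2
                + 0.25 * m @ np.linalg.inv(P1 + P2) @ m)
    return det_term * np.exp(exponent)
```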
3. BEHAVIOURAL FINANCE AS A MULTI-INSTANCE LEARNING PROBLEM
MIL has been used in several applications, including image classification, document categorisation and text classification. In this section we show that problems in behavioural finance can also be approached using MIL techniques. Financial institutions would like to classify customers not only based on static data, e.g. age, income or address, but also based on financial activities. These financial activities can be a set of transactions, loans, or investments on the market. In this paper we investigate financial behaviour based on transactions made in personal banking. In particular, we would like to find different patterns of behaviour for legitimate and fraudulent users in personal banking. The behaviour of an account owner is a set of transactions rather than a single vector; therefore, MIL methods can be applied to this problem. We would like to find out whether legitimate or fraudulent patterns are clustered, and if so, determine whether the clusters are consistent with the following statements:
1. Because fraudulent users have a single purpose in withdrawing money from an account, their transactions should exhibit some patterns.
2. The constraints of daily life, the infrastructure and the availability of card transaction facilities mean that we can expect certain patterns of legitimate behaviour. For example, each of us has a particular way of life: we work at certain hours and therefore cannot use bank cards at those times; we earn a certain amount of money, so our spending is limited; and we withdraw money from ATMs that are close to work or home. However, the singular objective of fraudulent use suggests we should expect characteristically different behaviour.
We explore the correctness of these two statements with two real-world data sets and present an empirical analysis of the behavioural patterns. Each activity record, i.e. a transaction on a bank account, is described by several features provided by our commercial collaborators. From this set we select a subset of features which are relevant to behavioural finance. We describe activities on accounts by answering questions like: when?, where?, what?, how often? Specifically, the features we use to describe a transaction are:
• amount m_i: the amount, in pence, of a transaction;
• amount difference m_{i−1} − m_i: the difference in amount between the current and last transaction (the difference between the first and second transaction is also calculated for the first transaction, and likewise for any other feature where this applies);
• amount sum m_{i−1} + m_i: the sum of the amounts of the current and previous transaction;
• amount product m_{i−1} × m_i: the product of the amounts of the current and previous transaction;
• time t_i: the transaction time, in seconds after midnight;
• transaction interval t_i − t_{i−1}: the time, in seconds, since the previous transaction;
• service ID: indicator for a POS (point of sale) or ATM (automated teller machine) transaction;
• merchant type: categorical indicator of the merchant type;
• ATM-ID: categorical identifier for each ATM.
Looking at individual accounts, one can observe several types of behaviour. For example, examine the three accounts shown in figure 3. The figure shows three accounts (columns) described by three features (rows). The estimated probability density function (pdf) is computed for each feature. We can see that the first account owner (first column) withdraws money in the mornings, in small amounts, and within a period of 24 hours. The second account owner makes transactions during the entire day, withdrawing a little more money but less frequently. The third account owner transacts mostly in the evenings, withdrawing the greatest amounts of money at periodic intervals. Financial institutions group their customers based on a set of rules and assign them to the associated clusters. We would like to verify whether there are instead ”natural” groups of behaviour, supported by real data, in plastic card transaction data, and assess their number and nature.
4. EXPERIMENTS
Our commercial collaborators have provided a number of plastic card transaction data sets observed over short periods since 2005. We use two of these data sets, D1 and D2, with characteristics described in Table 1. A fraudulent account in D1 has 15 fraud transactions on average. The average for D2 is different, reflecting the slightly different nature of the contributing organisations. For reasons of commercial confidentiality we cannot divulge more information about the data. Note that both data sets are large and richly structured, and significant pre-processing was required to extract suitable data for analysis. In the following experiments, we omit accounts for which fraud occurred within the first 10 transactions, as models need some minimal information for training. We used accounts with at least 50 legitimate transactions and at least 5 fraudulent transactions.
Table 1. Characteristics Of Data Sets

     # of accounts   # transactions   # fraud accounts   # fraud transactions   period (months)
D1   44637           2374311          3742               58844                  3
D2   11383           646729           3217               18501                  6
Figure 3. Probability Density Functions (PDF) Estimated On Legitimate Transactions, From Three Accounts (Columns), Described By Three Features (Rows). The First Row Shows The Time Of Day, In Seconds, When Transactions Are Made. The Second Row Shows The Amount Of Money, In Pennies, That Has Been Withdrawn And The Third Row Time Periods Between Transactions, In Seconds.
4.1 Clustering Of Fraudulent And Legitimate Behaviours
To verify whether fraudulent or legitimate patterns are clustered, we compute the distance defined in equation (4) between 2000 randomly selected legitimate and 2000 fraudulent accounts from D1 and D2. The sorted distance matrices are shown in the first row of figure 4; the matrices were sorted using the VAT algorithm (Bezdek & Hathaway, 2002). To cluster the patterns we use the kernel-based clustering proposed in (Girolami, 2001) with a Gaussian kernel and complexity parameters optimised by maximum likelihood. The dendrograms are shown in the bottom row of figure 4. From figure 4 we can see that fraudulent behaviours are more clustered than legitimate behaviours; however, in both groups there are visible clusters, corresponding to patterns of behaviour. Since fraudsters have a single purpose in withdrawing money, they tend to be more clustered than legitimate patterns. These conclusions agree with our expectations, but now they are also supported by real data. A new set of transactions can now be assessed as being similar to a set of transactions that is already in the dataset.
5. CONCLUSIONS
We have proposed a distance measure between sets of vectors based on the volume of overlap between the smallest enclosing spheres. It has been shown that problems in behavioural finance can be approached with multi-instance methods. Based on the proposed distance measure we investigated ”natural” groups in legitimate and fraudulent personal banking behaviour. It has been shown that, as fraudsters have a single purpose in withdrawing money, their behavioural patterns tend to be more clustered than those of legitimate users.
Figure 4. Distance Matrices And Dendrograms For Fraudulent And Legitimate Patterns In Data Sets D1 And D2 .
6. ACKNOWLEDGEMENTS The views expressed in this paper are those of the authors and do not reflect the official policy or position of the MHRA.
Appendix: The Volume Of A Spherical Cap
The spherical cap is part of a sphere, as shown in figure 5. When two spheres S(A_1, R_1) and S(A_2, R_2) intersect, the heights h_c and the radius r_c of the two caps can be derived simply from Pythagoras' theorem as:
$$h_{c1} = R_1 - \frac{\|A_1 - A_2\|^2 - R_2^2 + R_1^2}{2\|A_1 - A_2\|}, \qquad h_{c2} = R_2 - \frac{\|A_1 - A_2\|^2 + R_2^2 - R_1^2}{2\|A_1 - A_2\|}$$
$$r_{c1} = \sqrt{R_1^2 - (R_1 - h_{c1})^2} = r_{c2} = \sqrt{R_2^2 - (R_2 - h_{c2})^2}$$
The volume of a single cap can be computed by integrating the volumes of (d−1)-dimensional spheres of radius r_c over the height of the cap, from 0 to h_c:
$$V_{cap} = \frac{2\pi^{(d-1)/2}}{\Gamma((d-1)/2 + 1)} \int_{0}^{h_c} r_c^{d-1}(h_c)\, dh_c \qquad (7)$$
Figure 5. Spherical Cap.
From figure 5 it can be seen that:
$$r_c^2 + (R - h_c)^2 = R^2, \qquad (8)$$
$$r_c = R \sin(\beta_{max}). \qquad (9)$$
Substituting these equations gives:
$$V_{cap} = \frac{2\pi^{(d-1)/2} R^{d-1}}{\Gamma((d-1)/2 + 1)} \int_{0}^{\beta_{max}} \sin^{d-1}(\beta)\, d\beta \qquad (10)$$
where
$$\beta_{max} = \arcsin\left(\sqrt{(2R - h_c)\, h_c / R^2}\right) \qquad (11)$$
The integral $\int \sin^{d-1}(\beta)\, d\beta$ can be handled by recursion (Bronshtein et al., 1997, § 8):
$$\int \sin^{d-1}(\beta)\, d\beta = -\frac{\sin^{d-2}\beta \cos\beta}{d-1} + \frac{d-2}{d-1}\int \sin^{d-3}\beta\, d\beta \qquad (12)$$
7. REFERENCES
Andrews, S., Tsochantaridis, I., & Hofmann, T. (2002). Support vector machines for multiple-instance learning. Neural Information Processing Systems (pp. 561–568).
Bezdek, J. C., & Hathaway, R. J. (2002). VAT: a tool for visual assessment of (cluster) tendency. Proceedings of the International Joint Conference of Neural Networks (pp. 2225–2230). IEEE Press, Piscataway, NJ.
Bhattacharyya, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math Soc.
Bronshtein, I. N., Semendyayev, K. A., & Hirsch, K. A. (1997). Handbook of mathematics. Springer-Verlag Telos.
Dietterich, T., Lathrop, R. H., & Lozano-Pérez, T. (1997). Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89, 31–71.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. John Wiley & Sons, second edition.
Girolami, M. (2001). Mercer kernel based clustering in feature space. IEEE Transactions on Neural Networks.
Harris, J. W., & Stocker, H. (1998). Handbook of mathematics and computational science. New York: Springer-Verlag.
Juszczak, P. (2006). Learning to recognise. A study on one-class classification and active learning. Doctoral dissertation, Delft University of Technology. ISBN: 978-90-9020684-4.
Juszczak, P., Tax, D. M. J., Pekalska, E., & Duin, R. (2009). Minimum volume enclosing ellipsoid data description. Journal of Machine Learning Research, under revision.
Maron, O., & Lozano-Pérez, T. (1998). A framework for multiple-instance learning. Neural Information Processing Systems (pp. 570–576). The MIT Press.
Pekalska, E., & Duin, R. P. W. (2005). The dissimilarity representation for pattern recognition: foundations and applications. River Edge, NJ, USA: World Scientific Publishing Co., Inc.
Tax, D. M. J., & Duin, R. P. W. (2004). Support vector data description. Machine Learning, 54, 45–56.
Wang, J., & Zucker, J. (2000). Solving the multiple-instance problem: A lazy learning approach. Proc. 17th International Conf. on Machine Learning (pp. 1119–1125). Morgan Kaufmann, San Francisco, CA.
Yang, C., & Lozano-Perez, T. (2000). Image database retrieval with multiple-instance learning techniques. Proc. of the 16th Int. Conf. on Data Engineering (pp. 233–243).
Zhang, Q., & Goldman, S. A. (2001). EM-DD: An improved multiple-instance learning technique. Neural Information Processing Systems (pp. 1073–1080). MIT Press.
BATCH QUERY SELECTION IN ACTIVE LEARNING Piotr Juszczak General Practice Research Database, MHRA, 1 Nine Elms Lane, SW8 5NQ, London, United Kingdom
ABSTRACT
In the active learning framework it is assumed that initially a small training set Xt and a large unlabelled set Xu are given. The goal is to select the most informative object from Xu. The most informative object is the one that, after its true label is revealed by the expert and it is added to the training set, improves the knowledge about the underlying problem the most, e.g. improves the performance of a classifier in a classification problem the most. In some practical problems, however, it is necessary to select more than a single unlabelled object at the same time to be labelled by the expert. In this paper, we study the pitfalls and merits of such selection. We introduce active learning functions that are especially useful in multiple query selection. The performance of the proposed algorithms is compared with standard single query selection algorithms on toy problems and UCI repository data sets.
KEYWORDS
active learning, multiple query selection
1. INTRODUCTION
In the traditional approach to statistical learning, one tries to determine a functional dependency between some data inputs and their corresponding outputs, e.g. their class labels. This is usually estimated based on a given, fixed set of labelled examples. Active sampling (Lewis & Gale, 1994; Cohn et al., 1995; Roy & McCallum, 2001; Juszczak & Duin, 2004) is an alternative approach to automatic learning: given a pool of unlabelled data Xu, one tries to select a set of training examples in an active way so as to reach a specified classification error with a minimum cardinality. Therefore, ideally the same classification error is achieved with a significantly smaller training set. The criterion of how informative a new object is depends on what we are interested in. We may select a new object to be maximally informative about the parameter values of a classifier, or select objects only from some regions, e.g. around a region that we are not able to sample directly, to improve the classification accuracy only locally. Finally, we may select objects so as to minimise the probability of training models with large errors (Juszczak, 2006).
Definition 1. We can define an active learning function as a function F that assigns a real value to each unlabelled object, F(x_i) → R, x_i ∈ X_u. Based on this criterion we can rank unlabelled objects and select the most informative object, x*, according to F:
$$x^* \equiv \arg\max_{x_i \in X_u} F(x_i) \qquad (1)$$
In this paper, the most informative object, x* ∈ X_u, is defined as the one that, after its label is revealed and it is added to the training set, improves the performance of a classifier the most. We focus on pool-based active learning (Lewis & Gale, 1994; Cohn et al., 1995; Roy & McCallum, 2001; Juszczak & Duin, 2004). In this learning scheme there is access to a small set of labelled objects Xt, the training set, and a large pool of unlabelled objects Xu. Objects are usually selected one by one according to a specified active learning function F. Active learning is usually compared to passive learning, in which unlabelled objects are sampled randomly, i.e. according to the probability density function P(X). The performance is measured as the generalisation error obtained on an independent test set. The general framework of pool-based active learning is presented in Algorithm 1, and learning curves for the two sampling schemes are shown in figure 1.
Algorithm 1. A General Framework Of Pool-based Active Learning.
Assume that an initial labelled training set Xt, a classifier h, an active learning function F and unlabelled data Xu are given.
1. Train the classifier h on the current training set Xt; h(Xt).
2. Select an object x* from the unlabelled data Xu according to the active query function x* ≡ arg max_{x_i ∈ X_u} F(x_i).
3. Ask an expert for the label of x*. Enlarge the training set Xt and reduce Xu.
4. Repeat steps (1)–(3) until a stopping criterion is satisfied, e.g. the maximum size of Xt is reached.
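A compact, illustrative rendering of Algorithm 1 under our own assumptions: the classifier exposes a fit(X, y) method, the active learning function is a callable scoring one pool object at a time, the expert is a callable oracle, and the stopping criterion is simply a fixed query budget.

```python
import numpy as np

def pool_based_active_learning(clf, F, X_train, y_train, X_pool, oracle, budget):
    """Algorithm 1: repeatedly train, score the pool with F, and query the top object.

    clf    : any classifier with a fit(X, y) method
    F      : active learning function, F(clf, x) -> informativeness of a pool object x
    oracle : callable returning the true label of a queried object (the expert)"""
    X_train, y_train, X_pool = list(X_train), list(y_train), list(X_pool)
    for _ in range(budget):
        clf.fit(np.array(X_train), np.array(y_train))       # step 1: train h on X_t
        scores = [F(clf, x) for x in X_pool]                 # step 2: rank the pool
        x_star = X_pool.pop(int(np.argmax(scores)))          # remove x* from X_u
        X_train.append(x_star)                               # step 3: label x* and add to X_t
        y_train.append(oracle(x_star))
    clf.fit(np.array(X_train), np.array(y_train))
    return clf, X_train, y_train
```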
In some practical problems, however, there is a need to select multiple queries to be labelled by the expert at the same time. Such problems are usually problems where expert knowledge is required not constantly but at time intervals, e.g. financial markets, medical intensive care units or remote control of machines. Let us take as an example a medical intensive care device, where the condition labels (normal/abnormal) that a doctor assigns change in time. Therefore, the classifier needs to be updated over time as the condition of a patient improves or deteriorates, changing the meaning of the measurements. It is much more practical if the doctor adjusts the classifier by assigning labels to several preselected measurements each hour or day, rather than constantly assigning a label to a single selected measurement, e.g. every minute or even every second. The same holds for sensors in automated factories or predictors in the stock market. Measurement labels may need adjustment as their definitions change in time.
Figure 1. Expected Learning Curves For Active And Random Sampling (mean error against the number of queries, illustrating the cost reduction and error reduction of active sampling).
Similarly to the active learning approach, where one draws a single unlabelled object, here the goal is to learn an input-output mapping X → ω from a set of n training examples X_t = {x_i, ω_i}_{i=1}^{n}, where x_i ∈ X, ω_i ∈ [ω^{(1)}, ..., ω^{(C)}], and the size n of the training set should be minimised. However, now, at each iteration, the active learner is allowed to select a multiple new training input {x_1, ..., x_k}* ∈ X_u of k elements from the unlabelled data X_u. Note that k can be much larger than 1. The selection of {x_1, ..., x_k}* may be viewed as a query based on the following active sampling function:
$$\{x_1, \ldots, x_k\}^* \equiv \arg\max_{\{x_1, \ldots, x_k\}^* \in X_u} \sum_{i=1}^{k} F(x_i \mid h(X_t), X_t) \qquad (2)$$
Having selected {x_1, ..., x_k}*, the learner is provided with the corresponding labels [ω_1, ..., ω_k] by an expert, as a result of an experiment or by some other action. The new training input {x_j*, ω_j}_{j=1}^{k} is added to the current training set, X_t = X_t ∪ {x_j*, ω_j}_{j=1}^{k}, and removed from the unlabelled set, X_u = X_u \ {x_1, ..., x_k}*. The classifier is retrained and the learner selects another set {x_1, ..., x_k}*. The process is repeated until the resources are exhausted, the training set reaches a maximum number of examples, or the error reaches an acceptable level. Since in the standard active sampling algorithms the classifier is recomputed after each query, the values of an active sampling function F based on such a classifier change as well. Therefore, the unlabelled objects that are queried differ in consecutive draws. However, if we consider the simultaneous selection of multiple objects, k > 1, similar objects are selected in a single draw. To illustrate this, figure 2(a) shows the five objects with the highest values of two active learning functions, F1 and F2. Here we used Probability Density Correction (pdc) and MinMax (Juszczak, 2006). It can be observed that for each sampling method the selected objects are chosen from the same region in the feature space. These active learning methods do not consider the effect of revealing the label of a single query on the remaining queries. Therefore, adding such a batch is not significantly more beneficial than adding just a single representative of it. For clarity of the figure we only show these two methods; however, the same problem holds for all active learning methods that select a single query.
When we would like to select multiple queries in active sampling we should consider not only the criterion they are based on, e.g. the uncertainty of labels of unlabelled objects, but also the effect of obtaining a class label of a single candidate object on the remaining candidate objects. In this paper we investigate this problem.
Figure 2. (a) Five Objects Selected By F1 And F2 Active Learning Functions. The Classifier, 1-NN (1-Nearest Neighbour), Is Drawn As A Solid Line. (b) Five Objects Selected By F1 And F2 Considering Also The Influence Of Revealing Labels Of Other Objects In The Batch.
2. QUADRATIC PROGRAMMING QUERY DIVERSIFICATION
We formulate three active query selection algorithms suitable for the selection of informative batches of unlabelled data. The diversification criterion in these algorithms is related to the type of classifier, e.g. distance- or density-based, that is going to be trained on the selected objects. Such 'personalisation' of queries is helpful in human learning: different people may require different examples to learn certain concepts efficiently. The same also holds for classifiers. For some classifiers, e.g. the 1-NN (1-Nearest Neighbour) rule, selecting unlabelled objects with, for example, maximum uncertainty changes their decision boundary locally. Selecting the same object for a parametric classifier, e.g. the mixture of Gaussians classifier, changes the decision boundary globally. This also holds for kernel-based methods such as SVM. Therefore, a good sampling function should also include, in its estimation of potential queries, properties of the classifier itself. In particular, we propose three active sampling methods with diversification criteria based on distances, densities and inner products between labelled and unlabelled objects. The presented three active sampling methods are generic and can be used with any type of classifier; however, because they compute the utility criteria in a certain way, they are especially useful for classifiers that are based on the same principles, e.g. the 1-NN rule, the Parzen classifier and SVM.
2.1 Distance-based Diversification
As was mentioned in the introduction, the most informative batch of unlabelled objects should contain objects that have the minimum influence on the classification labels of each other. Intuitively, this can be related to distances between objects in a batch, i.e. by maximising the sum of distances $\sum_i^k \sum_j^k D(x_i, x_j)$ between the objects to be selected. For small distances between objects we expect redundant class information. If we give a weight 0 ≤ α_i ≤ 1 to each object x_i ∈ X_u, the above sum can be written as $\max_\alpha \alpha^T D \alpha$, $\alpha^T \mathbf{1} = k$ for a sparse solution. Additionally, we are interested in objects that carry information about the labels of other, not yet selected, unlabelled objects; e.g. centres of clusters in X_u can be expected to describe the remaining unlabelled data in their clusters. To impose this, we can simply demand that the distance $D_{nn}(x_i) = \|\mu_{nn}(x_i) - x_i\|$ between an object x_i and the mean µ_nn(x_i) of its nearest neighbours should be minimal; see figure 3. Note that µ_nn(x_i) is computed on the set of nearest neighbours of x_i but without x_i. Since D_nn(x_i) describes only a single object, this linear term can be subtracted from the previous quadratic term, giving $\max_\alpha \alpha^T D \alpha - \alpha^T D_{nn}$, $\alpha^T \mathbf{1} = k$. An active learning function F(x) should have a high value for the selected objects. This is also a linear term; therefore, the selection of objects with the highest values for these three criteria can be written as $\max_\alpha \alpha^T D \alpha + \alpha^T (F - D_{nn})$, $\alpha^T \mathbf{1} = k$. We compute the utility of the batch of k unlabelled objects by maximising the above formula using a quadratic programming technique. This allows us to optimise the entire batch at once, compared to iterative procedures which examine a single object in the batch at a time. The diversification of queries, for a particular active learning function F, based on distances as the diversification criterion is written as:
$$\max_{\alpha}\; \alpha^T D \alpha + \alpha^T \rho, \quad \text{s.t.}\;\; \alpha^T \mathbf{1} = k,\; 0 \le \alpha_i \le 1,\quad \rho = F - D_{nn}. \qquad (3)$$
Figure 3. The Distance Between An Object x_i And The Mean Of Its Nearest Neighbours.
The Euclidean distance matrix D is not, in general, positive definite ($z^T D z \ge 0,\ \forall z \in R^N$). The positive definiteness, or the negative definiteness, of the matrix D is required by the Kuhn-Tucker conditions (Rustagi, 1994) for the quadratic optimisation to converge to the global minimum or maximum. However, several techniques can be used to transform a symmetric matrix into a positive (negative) definite one. For example, one can apply clipping, $D = Q_p \Lambda_p^{1/2}$, where $\Lambda_p$ contains only the positive eigenvalues, or simply take the square Hadamard power $D^{*2}$ ($D^{*2} = d_{ij}^2$) and add a small constant to the diagonal, $D = \mathrm{diag}(D^{*2}) + c$ (Gower, 1986). In the optimisation (3), we are looking for the k unlabelled objects x for which the optimised function $\alpha^T D \alpha + \alpha^T \rho$ is maximum. Such a criterion can be used in general with any type of classifier but is especially suited for the Nearest-Neighbour classifier, since it is based on distance relations between objects.
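A sketch of how (3) might be solved in practice; the paper does not prescribe a solver, so we simply hand the negated objective to a general constrained optimiser with the box and sum constraints. The nearest-neighbour count, the diagonal constant and every name here are our own choices, and F is assumed to be the vector of active-learning scores for the pool objects.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def select_batch(X_pool, F, k, n_neighbours=5):
    """Approximately solve (3): pick k diverse, informative pool objects."""
    dists = squareform(pdist(X_pool))
    D = dists ** 2                                  # square Hadamard power keeps D symmetric
    np.fill_diagonal(D, 1e-3)                       # small constant on the diagonal
    # D_nn: distance from each object to the mean of its nearest neighbours
    order = np.argsort(dists, axis=1)
    D_nn = np.array([
        np.linalg.norm(X_pool[order[i, 1:n_neighbours + 1]].mean(axis=0) - X_pool[i])
        for i in range(len(X_pool))
    ])
    rho = F - D_nn                                  # linear term of (3)

    n = len(X_pool)
    objective = lambda a: -(a @ D @ a + a @ rho)    # negated: the solver minimises
    constraints = [{"type": "eq", "fun": lambda a: a.sum() - k}]
    bounds = [(0.0, 1.0)] * n
    a0 = np.full(n, k / n)                          # feasible starting point
    res = minimize(objective, a0, method="SLSQP", bounds=bounds, constraints=constraints)
    return np.argsort(res.x)[-k:]                   # indices of the k largest weights
```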
2.2 Density-based Diversification For density-based classifiers the diversification criterion can include a vector of densities P (x) instead of distances Dnn as such classifiers are based on densities. We can consider densities of unlabelled objects or the relative difference between densities of labelled and unlabelled objects ∆P (xi ) = P (xi |Xu ) − P (xi |Xt ), where x ∈ Xu ; see figure 4. The quadratic programming optimisation looks similar to the above optimisation for the Nearest-Neighbour classifier, except now the linear term depends on the difference in density estimates.
max
αT Dα + αT ρ
α
s.t.
αT 1 = k, 0 ≤ αi ≤ 1, (4) ρ = F + ∆P.
P (x|Xu ) − P (x|Xt )
Figure 4. The Positive Difference ∆P In Density Estimates For Labelled Xt : {+, •} And Unlabelled Xu : {◦} Objects Plotted As Isolines, The Current Classifier Boundary, Parzen Is Drown As A Solid, Thick Line. Such a method selects a batch of unlabelled objects with large distances D between selected objects and with the large density value in places where we have no samples yet. Finally the value of the active learning function F should be also significantly large. In figure 4 we can easily point to five unlabelled objects with high value of ∆P indicated by the centres of the concentric isolines. Such objects are remote from each other and are centres of clusters. These make them a potentially informative batch to ask an expert for labels. Since this diversification method is based on a density estimation it is particularly suitable for density based classifiers e.g. the Parzen, Quadratic (QDA) or Linear Discriminant Analysis (LDA).
38
IADIS European Conference Data Mining 2009
2.3 Boundary-based Diversification The last type of a classifier we are considering is the Support Vector Machine (SVM). For SVM it is convenient to express the mutual label relations of possible labels of unlabelled objects in terms of inner products or similarly the angle between vectors. The angle between two vectors xi and xj can be expressed as follows: ∠(xi , xj ) = arccos
xTi xj K(xi , xj ) = arccos p kxi kkxj k K(xi , xi )K(xj , xj )
(5)
where xTi xj denotes the inner product and K is the between Gram matrix. Similarly to the optimisation (3) for 1-NN we would like to select objects for which the sum of their angles is maximum and additionally they are centres of clusters in the Hilbert space H. For the Gaussian kernel the denominator in equation (5) becomes 1. Since the Gramm matrix K is already positive definite it is easier to minimise the sum of inner products between objects in the batch, instead of maximising the sum of square angles between selected objects: sv3 x3
sv4
min
x2
w0
αT Kα − αT ρ αT 1 = k,
0 ≤ αi ≤ 1,
x1
sv1
α
s.t.
sv2
(6) sv5
ρ = F + Knn . Figure 5. Equal Division Of The Approximated Version Space By Three Unlabelled Objects {x1 , x2 , x3 }. sv Indicates Five Support Vectors And Gray Circle Margin Of SVM. where Knn (xi ) = kµnn (K(xi , :)) − K(xi , :)k is the difference between vector K(xi , :) and the mean of its neighbours in H. Figure 5 presents a general idea of such a sampling. Let us assume that the problem is linearly separable in the feature space. This means that a version space (Mitchell, 1997) of a particular problem is non-empty. In the case of SVM, we can approximate the version space by the support objects and select objects that for two possible labels divide equally such an approximated version space (Tong & Koller, 2000). Regardless of the true class labels we always reject that half of the classifiers that is inconsistent with the labels of the training data. The tacit assumption is that classifiers are uniformly distributed, i.e. each classifier from the version space is equally probable. When we consider the selection of a batch of unlabelled objects an informative batch should contain objects that divide the version space equally. The selection of such objects implies that for all their possible labels the size of the version space is maximally minimised; see figure 5. Unlabelled objects {x1 , x2 , x3 } divide equally the version space restricted by support vectors {sv1 , . . . , sv5 }.
3. RELATED WORK In this section we shortly explain the difference between our query diversification algorithms and existing methods. In particular, we relate our work to (Brinker, 2003; Park, 2004; Lindenbaum et al., 2004). These papers present query diversification methods based on various criteria, e.g. similar to the proposed methods, distance between queries (Lindenbaum et al., 2004) or angles between queries in a batch (Brinker, 2003; Park, 2004). In particular (Lindenbaum et al., 2004) proposed for the k-NN rule to construct a batch of unlabelled data using an iterative procedure. At each step a single object is added to a batch that has a large value of an active learning function and a large distance to already selected objects in a batch. Next, such a constructed batch is presented as a query to an expert. (Brinker, 2003; Park, 2004) discusses similar iterative algorithms for the SVM. However, instead of selecting objects with the maximum sum of distances, they proposed to select objects based on the inner product relations. The algorithms select unlabelled objects that after including them
39
ISBN: 978-972-8924-88-1 © 2009 IADIS
to the training set yield the most orthogonal hyperplanes. Algorithm 2 Standard Diversification AlgoA simplified scheme of the existing algorithms is shown in rithm. Algorithm 2. First, an algorithm selects a single unlabelled B = [ ] ∗ object with the maximum value of a particular active learn- x = arg maxx∈Xu F (x) ing function F . Then the next objects are added to a batch repeat B for which either the sum of F and distances to the objects 1. B = B ∪ {x∗ }; Xu = Xu \{x∗ } already present in the batch D(x, B) is large (Lindenbaum 2. x∗ = arg maxx∈Xu [F (x) + D(x, B)] et al., 2004) or the inner products are small (Brinker, 2003; Park, 2004). The process is repeated until the required cardi- until |B| = k nality of B is reached. Because existing algorithms consider a single candidate to be added to a batch and not an entire batch, they do not necessary select the most informative set of unlabelled objects. The sum of distances and an active learning function do not necessarily reach their maxima for the selected batch. Moreover, the methods presented in these papers maximise distances, or minimise inner products, only between the selected objects; they do not take into account the distribution of unlabelled data. Such methods are sensitive to the presence of outliers, by selecting objects that are far from each other, and not, like in the proposed method, centres of local neighbourhood.
4. EXPERIMENTS As an illustration, we first test the proposed query diversification algorithms on the artificial Rubik’s cube data set (Juszczak, 2006). It is a 3 × 3 × 3 mode, 3D, two-class data set with an equal number of objects per class. This data set, although not realistic to occur in practice, shows clearly the point of the query diversification for active learning methods when multiple queries are to be selected. In our experiments the initial labelled training set Xt contains two randomly drawn objects per class. The learning proceeds with the queries of k = {1, 8, 16, 32, 64} elements. First, in each iteration, a single object is added to the current training set, a query of size k = 1, then the learning process is repeated for the query of size k = 8, 16 and so on. The learning curve determined for the query based on a single object, k = 1, is used as the baseline. The goal is to achieve the performance which is at least as good as obtained for the single object selection algorithm. Objects are selected according to the uncertainty sampling (Lewis & Gale, 1994) 1 . The learning curves for the 1-NN rule, the Parzen and the ν-SVM with a radial-basis kernel are shown in figure 6. The error is measured on an independent test set. The results are averaged over 100 random splits of all data into an initial training set, unlabelled data and a test set. The smoothing parameter of the Parzen classifier is optimised according to the maximum likelihood criterion (Duin, 1976) and the ν for ν-SVM is set to the 1-NN leave-oneout error on the training set. In the cases when this p error is zero, the ν is set to ν = 0.01. σ in the radial-basis kernel is chosen as the averaged distance to the ⌊ |Xt |⌋-nearest neighbour in Xt . Gray learning curves represent sampling without query diversification and black learning curves present sampling with query diversification. By observing gray curves it can be seen that by increasing the query size, the number of objects necessary to reach the minimum error classifier increases. This phenomenon is understandable since data are highly clustered and selecting queries based on the active sampling criterion, e.g. the uncertainty sampling, leads to the selection of similar objects from a single mode 2 . The black learning curves in figure 6 show the results of the same experiments with the proposed query diversification algorithms, for three types of classifiers for the same batch size. It can be seen that by diversifying queries using the proposed algorithms, the error drops, in this particular learning problem on average about 5% and the difference in the number of queries that is necessary to reach the certain classification error is in average 50 − 100 in all figures. Next, we tested the proposed query diversification algorithm on data sets from the UCI repository (Hettich et al., 1998). Initial training sets consist of two objects per class. Because the experiments with all three 1 The uncertainty sampling was chosen as an example, however the experimental results are similar for other selective sampling methods (Juszczak, 2006). 2 For a two-class problem with an equal number of objects per class, the average error in the beginning of the learning process is larger than 0.5. This is caused by the symmetric mode structure of the data set itself. Since every mode is surrounded by modes belonging to the other class additional labelled set causes, in the beginning, misclassification of objects from adjacent modes. 
By increasing the number of clusters, this phenomenon lasts longer.
40
IADIS European Conference Data Mining 2009
0.5
0.3
0.2
0.4
0.1
0.3
0.2
100
300 200 number of queries
0 0
500
400
(a) 1-NN k = {1, 8, 32}
0 0
500
400
300 200 number of queries
100
0.2
0.1
300 200 number of queries
0.2
0 0
(d) 1-NN k = {1, 16, 64}
1 16 16 qd 64 64 qd
0.4
0.3
500
400
500
400
(c) SVM k = {1, 8, 32}
0.1
100
300 200 number of queries
100
0.5 1 16 16 qd 64 64 qd
0.4 mean error
0.3
0 0
0.2
0.1
0.5 1 16 16 qd 64 64 qd
0.4
0.3
(b) Parzen k = {1, 8, 32}
0.5
1 8 8 qd 32 32 qd
0.4
0.1
0 0
mean error
0.5 1 8 8 qd 32 32 qd
mean error
mean error
0.4
mean error
1 8 8 qd 32 32 qd
mean error
0.5
0.3
0.2
0.1
100
300 200 number of queries
0 0
500
400
(e) Parzen k = {1, 16, 64}
300 200 number of queries
100
500
400
(f) SVM k = {1, 16, 64}
Figure 6. Learning Curves For The 1-NN, Parzen and ν-SVM For The Rubik’s Cube Data Set. Batches Of The Sizes k = {1, 8, 32} And k = {1, 16, 64} Are Selected According To The Uncertainty Criterion. The Black And Gray Curves Present The Error On An Independent Test Set As Functions Of The Training Size With And Without The Query Diversification, Respectively. The Results Were Averaged Over 100 Trials. classifiers and all diversification methods give similar outcomes, we present the results with the ν-SVM and query diversification based on the inner products. The settings of ν and σ are the same as in the experiments with the Rubik’s cube data set. The resulting learning curves for the uncertainty sampling with the query sizes of k = {1, 8, 16, 32, 64} are presented in Fig 7. The results are averaged over 100 random splits of data into initial training sets, unlabelled sets and test sets.
0.17
0.43
0.3
1 8 8 qd 32 32 qd
0.25 mean error
mean error
0.18
1 8 8 qd 32 32 qd
0.2 0.15
1 8 8 qd 32 32 qd
0.4 mean error
0.2 0.19
0.35
0.16
0.1 0.3
0.15
25
50
125 100 75 number of queries
(a) waveform k = {1, 8, 32}
(b) ionosphere k = {1, 8, 32} 0.3
0.17
100
200 number of queries
383
300
(c) diabetesk = {1, 8, 32} 0.43
1 16 16 qd 64 64 qd
0.25 mean error
0.18
1 16 16 qd 64 64 qd
0.28 0
175
150
0.2 0.19
mean error
100 200 300 400 500 600 700 800 900 1000 number of queries
0.2 0.15
1 16 16 qd 64 64 qd
0.4 mean error
0
0.05 0
0.35
0.16
0.1 0.3
0.15 0
100 200 300 400 500 600 700 800 900 1000 number of queries
(d) waveform k = {1, 16, 64}
0.05 0
25
50
125 100 75 number of queries
150
175
(e) ionospherek = {1, 16, 64}
0.28 0
100
200 number of queries
300
383
(f) diabetes k = {1, 16, 64}
Figure 7. Learning Curves For The UCI Repository Data Sets With The Query Sizes of k = {1, 8, 16, 32, 64} For The Uncertainty Sampling Approach With (Black) And Without (Gray) Query Diversification Algorithm. The Results Are Averaged Over 100 Trails. From our experiments, it can be seen that the proposed query diversification algorithm decreases the classification error. The improvement depend on the batch size. When the size of the batch increases, e.g. when k = {16, 32, 64}, the performance of the classifier decreases for all data sets. However, when query diversification is applied the performance increases significantly sometimes even outperforming the single query selection algorithm (diabetes). When we decrease the batch size to k = 8, the classification error is almost comparable with single query selection algorithm.
41
ISBN: 978-972-8924-88-1 © 2009 IADIS
5. CONCLUSIONS We have studied the problem of selecting multiple queries in a single draw based on a specified active learning function. In such a selection, a learner might yield a systematic error by selecting neighbouring objects that contain similar class information. Because of that, the learner should consider not only a particular active learning function but also investigate the influence of retrieving a label of an unlabelled object on other classification labels of potential candidates to a batch. We have formulated the problem of query diversification by using a convex quadratic programming optimisation technique. Different types of classifiers need different queries to reach the same classification error for a given size of a training set. Because of that, the presented algorithm uses properties of the individual classifier type to derive the objective criterion to select batches of queries. Moreover, comparing to the existing iterative procedures we take into account the distribution of labelled and unlabelled data which prevent from selecting outliers.
6. ACKNOWLEDGEMENTS The views expressed in this paper are those of the authors and do not reflect the official policy or position of the MHRA.
7.
REFERENCES
Brinker, K. (2003). Incorporating diversity in active learning with support vector machine. Proceedings of the 20th International Conference on Machine Learning (pp. 59–66). Menlo Park, California: AAAI Press. Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1995). Active learning with statistical models. Advances in Neural Information Processing Systems (pp. 705–712). MIT Press. Duin, R. (1976). On the choice of the smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers, C-25, 1175–1179. Gower, J. C. (1986). Metric and euclidean properties of dissimilarity coefficients. J. of Classification, 3, 5–48. Hettich, S., Blake, C. L., & Merz, C. J. (1998). http://www.ics.uci.edu/˜mlearn/MLRepository.html.
UCI repository of machine learning databases.
Juszczak, P. (2006). Learning to recognise. A study on one-class classification and active learning. Doctoral dissertation, Delft University of Technology. ISBN: 978-90-9020684-4. Juszczak, P., & Duin, R. P. W. (2004). Selective sampling based on the variation in label assignments. Proceedings of 17th International Conference on Pattern Recognition (pp. 375–378). Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. Proceedings of 17th International Conference on Research and Development in Information Retrieval (pp. 3–12). Lindenbaum, M., Markovitch, S., & Rusakov, D. (2004). Selective sampling for nearest neighbor classifiers. Machine Learning, 54. Mitchell, T. M. (1997). Machine learning. New York: McGraw-Hill. Park, J. M. (2004). Convergence and application of online active sampling using orthogonal pillar vectors. IEEE Trans. Pattern Anal. Mach. Intell., 26, 1197–1207. Roy, N., & McCallum, A. (2001). Toward optimal active learning through sampling estimation of error reduction. Proceedings of the International Conference on Machine Learning (pp. 441–448). Rustagi, J. S. (1994). Optimization techniques in statistics. Academic Press. Tong, S., & Koller, D. (2000). Support vector machine active learning with applications to text classification. Proceedings of the 17th International Conference on Machine Learning (pp. 999–1006).
42
IADIS European Conference Data Mining 2009
CONTINUOUS-TIME HIDDEN MARKOV MODELS FOR THE COPY NUMBER ANALYSIS OF GENOTYPING ARRAYS Matthew Kowgier and Rafal Kustra Dalla Lana School of Public Health, University of Toronto 155 College Street, Toronto, ON, Canada
ABSTRACT We present a novel Hidden Markov Model for detecting copy number variations (CNV) from genotyping arrays. Our model is a novel application of HMM to inferring CNVs from genotyping arrays: it assumes a continuous time framework and is informed by prior findings from previously analysed real data. This framework is also more realistic than discrete-time models which are currently used since the underlying genomic sequence is few hundred times denser than the array data. We show how to estimate the model parameters using a training data of normal samples whose CNV regions have been confirmed, and present results from applying the model to a set of HapMap samples containing aberrant SNPs. KEYWORDS Hidden Markov Models; EM algorithm; copy number variation; HapMap; genotyping arrays.
1. INTRODUCTION In this paper we propose a novel application of a continuous-time Hidden Markov Model (CHMM) for interrogating genetic copy number (CN) information from genome-wide Single Nucleotide Polymorphism (SNP) arrays. Copy number changes (either deletion or amplifications of a region of DNA which, respectively, results in having less or more than the usual 2 versions of DNA sequence) are an important class of genetic mutations which are proving themselves extremely useful in understanding genetic underpinnings of many diseases and other phenotypic information. While SNP arrays were originally developed for genome-wide genotyping, the technology has also proven to be capable of producing copy number calls. The copy number analysis of SNP arrays consists of the following sequence of steps: (1) the preprocessing of low-level data, (2) single locus copy number estimation at each SNP location, and (3) chromosome-wide modelling to infer regions of copy number changes. This paper focuses on the improvement of methodology for the third step. Discrete-time Hidden Markov models (DHMMs) are commonly used in the genome-wide detection of copy number variations (CNVs). Examples of software that use the DHMM framework are dChip [5] and VanillaICE [9]. These methods account for the spatial correlation that exists along the genome by modeling genomic locations into the transition probability matrix. One problem with the discrete time framework is that the CN state changes are very likely to occur between the locations interrogated by the SNP arrays. This is a consequence of the fact that even the most dense SNP arrays interrogate only a small fraction (less than 0.1%) of the genome. Another problem is that SNPs on the current arrays are not uniformly spread over the genome, leading to the need to specify ad-hoc transition probability models that take into account the genetic distances. As further pointed out by [11], a problem with these models is that the resulting transition probabilities are usually either very close to 0 (for transitions between states) or 1 (for remaining in the same state). This leads to challenging parameter estimation since we are operating on the boundary of the parameter space. In this paper we propose and investigate a more realistic framework for modelling SNParray data based on a continuous-time HMM in which the copy number process is modelled as being continuous along the genome.
43
ISBN: 978-972-8924-88-1 © 2009 IADIS
2. BODY OF PAPER 2.1 Overview of Our Procedure for Copy Number Determination The SNP arrays produce a number of intensity values for each interrogated SNP. The description of the underlying technology and meaning of these values is beyond the scope of this paper but please consult [1] and the references therein. For the purpose of CN determination, a summary of total intensity, regardless of the underlying genotype present at the site, is useful. Once such a continuous summary is obtained, it is assumed to be generated from a conditional Gaussian model, whose parameters depend on the underlying and hidden, CN state. These Gaussians are usually called emission distributions. Since regions with altered CN states are assumed to be of genetic length that usually encompasses more than one SNP site, a Hidden Markov Model is used to estimate Gaussian model parameters and hence the underlying CN states across each chromosome. To provide priors for the underlying Bayesian model we use some previously analyzed datasets described below.
2.2 Copy-number Estimation Using Robust Multichip Analysis (CRMA) We use a popular procedure called CRMA [1] to obtain single-locus continuous copy number (CN) surrogates that summarize the total raw intensity data from each SNP loci. We refer to these estimates as raw CNs. Specifically, the raw CNs are the logarithm of a relative CN estimate, where the reference signal at locus represents the mean diploid signal. The reference signal is estimated as the median signal across a large control set which is expected to be diploid for most samples. For this purpose, we utilize the 270 HapMap [2] samples.
2.3 Titration Data The X chromosome titration data set (3X, 4X, and 5X) contains three artificially constructed DNA samples containing abnormal amplification of the whole X chromosome (aneuploidies). There are four replicates of each DNA sample. The anueoploidies are a X trisomy (presence of three copies of chromosome X); a X chromosome tetrasomy (presence of four copies of chromosome X); and a X chromosome pentasomy (presence of five copies of chromosome X). These data were downloaded from the Affymetrix data resource center. The Coriell Cell Repository numbers for these three cell lines are NA04626 (3X), NA01416 (4X), and NA06061 (5X). We use this data to estimate hyperparameters of the emission distribution.
2.4 Human Population Data [12] reports genomic coordinates for 578 copy number variable regions (CNVRs) from a large North American population consisting of 1,190 normal subjects. We use these data to estimate the parameters of the transition intensity matrix. For these data, the median length of gains is 66,921 bps, the median length of losses is 57,589 bps, the median of the proportion of SNPs on the Nsp chip in regions of gain is 0.00047, and median of the proportion of SNPs on the Nsp chip in regions of loss is 0.00057.
2.5 Emission Distribution for the Raw CNs Let
denote the observed raw CN for individual and SNP . We assume we have a study that involves subjects and SNPs on a given chromosome. The distribution of the raw CN estimates depends on the value of the copy number process. We assume that, independently for all and , (1)
44
IADIS European Conference Data Mining 2009
2.6 The Copy Number Process The copy number process records the number of DNA copies at specific locations along the genome. We let denote the unobserved copy number process of one sample which we wish to is the length of the chromosome in bp. We model as a stochastic process, specifically a infer, where homogeneous continuous-time Markov process. We allow the process to take three possible values: 1 (haploid), 2 (diploid) or 3 (triploid). This could easily be extended to include more states. Unlike a discrete-time HMM whose state transitions are defined in terms of transition probabilities, the copy number process is defined by its instantaneous transition intensities: • , the rate of deletions; • , the rate of diploidy (normal state of two copies); • , the rate of amplification. whose rows sum to zero. These intensities form a matrix
specifically, the rates are defined as (2) A different interpretation of the model is to consider the distribution of waiting times in between jumps or, equivalently, the distribution of interval lengths along the genome in between jumps, and the probabilities of these jumps. Given the occurrence of a deletion, the chain will remain there for a random stretch of the . Therefore, the expected length of a deletion genome following a exponential distribution with rate . Similarly, the expected length of a diploid region is ; and the expected length is . Next, the chain may return to the diploid state with probability of a amplification is , or change to a amplification with probability . Under this model, the equilibrium distribution is
(3) Since the observations are at a discrete number of loci, , corresponding to the SNPs on the array, we start from the transition probability matrix, , corresponding to our observed data. The transition probability matrix between loci at distance for this model is derived from the continuous-time ; see [8] for details. Under the Markov chain by taking the matrix exponential of , i.e. as described above, the corresponding distance transition probabilities are model with
where
, and
.
2.7 Hierarchical Prior Specifications We place prior distributions on the unknown parameters of the emission distribution which depend on the underlying copy number as follows.
45
ISBN: 978-972-8924-88-1 © 2009 IADIS
(4)
and
(5) where
are the degrees of freedom for the
and
is the variance of a typical
specification for the variance, ; see [3] for a definition. This locus. This is the scaled specification facilitates borrowing of strength of information across loci in estimating the locus-specific parameters. We also place priors on the parameters of the transition intensity matrix: (6) The hyperparameters, , are estimated from biological data with known structure. For example, Figure 1 shows the sample SNP-specific medians for data that are expected to have one copy of the X chromosome.
Figure 1. Histogram of SNP-specific Medians of Raw CNs for the 5605 SNPs on the Nonpseudoautosomal Portion of the X chromosome. At each SNP, Sample Medians are Based on 142 Male HapMap Samples
2.8 Parameter Estimation The hyperparameters are estimated using biological data with known structure. For example, we can use male Hapmap samples on the nonpseudoautosomal portion of the X chromosome to estimate the . More specifically, to estimate hyperparameters for method which is explained in detail in [10].
and
We find the marginal posterior mode of the remaining parameters,
, we use an efficient empirical Bayes . (7)
46
IADIS European Conference Data Mining 2009
When we have access to multiple samples we can use the EM aglorithm to find the posterior mode. Using the priors in Section 2.7, this leads to the following updates. given the other parameters by combining the 1. For each locus, , and copy number, , update with the normal mixture distribution for the samples on locus : normal population distribution for
(8) 2.
For each locus, , and copy number, , update population distribution for
given the other parameters by combining the
with the normal mixture distribuion for the
samples on locus :
(9) In the applications to the data we used the method of moments procedure to estimate and . Specific details are given in the data analyses section. One could use Markov chain Monte Carlo (MCMC) to estimate these parameters in a more optimal way. This is a focus of our current research. Given the parameter estimates, we then use the Viterbi algorithm to calculate the most probable sequence of CN states, (10)
2.9 Data Analyses We analyzed data from a set of HapMap samples containing aberrant SNPs that have been experimentally verified by quantitative real-time PCR (qPCR) in a separate study [6]. For estimation of the parameters of the transition intensity matrix, we used the method of moments based on the human population data. was set , was set to , and the expected length of an amplification was set to to . We used empirical Bayes methods to estimate the parameters of the bps, resulting in an of emission distribution based on the titration data which have known biological structure. For fitting the discrete-time HMM, we use the VanillaICE package with the default settings (see [9], for details), except that we set the emission distribution to be the same as the continuous-time model. The results are presented in Table 1. Among the models the CHMM performed the best with 13 out of 14 SNPs called correctly. The discrete-time HMM was next with 11 out 14 SNPs called correctly. Table 1. Predictions by Various HMMs on a Set of Aberrant SNPs that have been Experimentally Verified by qPCR. DHMM is the Discrete-time HMM. CHMM is the Continuous-Time HMM SNP SNP_A-1941019 SNP_A-4220257 SNP_A-2114552 SNP_A-1842651 SNP_A-4209889 SNP_A-2102849 SNP_A-2122068 SNP_A-1932704 SNP_A-1889457 SNP_A-4204549 SNP_A-2125892 SNP_A-2217320 SNP_A-2126506 SNP_A-1851359
Chr Sample 13 NA10851 8 NA10851 22 NA10863 17 NA10863 3 NA12801 8 NA10863 8 NA10863 7 NA10863 8 NA10863 8 NA10863 22 NA12707 22 NA12707 17 NA12707 17 NA12707
qPCR 0.86 1.40 2.74 4.27 1.24 0.88 0.85 0.00 1.05 0.82 0.00 1.40 4.51 2.53
DHMM 1.00 2.00 2.00 3.00 2.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 3.00 3.00
dChip 1.00 2.00 2.00 2.00 2.00 2.00 1.00 2.00 1.00 1.00 2.00 2.00 2.00 2.00
CHMM 1.00 2.00 3.00 3.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 3.00 3.00
47
ISBN: 978-972-8924-88-1 © 2009 IADIS
2.10 Results from a Simulation Study CNV breakpoints were simulated from a heterogeneous HMM with a loss/gain frequency of 0.01, except in one region of CN polymorphism of length 1 Mb which had an elevated loss/gain frequency of 0.2. These breakpoints were simulated over a 140 Mb stretch, the length of chromosome 8, independently for 100 samples, and then were mapped onto the genomic locations corresponding to the observed SNP markers for the Affymetrix 500K Nsp chip. For each sample, this resulted in underlying copy number calls for 14,839 SNPs. With these simulated copy number calls, observed data were then simulated from the following hierarchical model. and SNP ; 1. For each copy number class (a)
sample
(b)
sample
For This
and was
done
, sample for
. ,
,
, and . These values were chosen to mimic estimates from the titration data which have known biological structure. For estimation of each simulated data set, we used the method of moments to estimate the parameters of was set to , the transition intensity matrix based on the training data described in Section 2.4. was set to , and the expected length of an amplification was set to bps, resulting in of . For the parameters of the emission distribution, we fixed the class means to an across all SNPs, and we estimated variances from the data. For comparison, we also employed a different approach using the EM algorithm in which SNP-specific parameters, shown in equations Error! Reference source not found. and Error! Reference source not found., are updated until convergence. Note that this approach uses information across the 100 samples to estimate the parameters. We compared the results of these continuous-time HMMs to two other methods: GLAD [4] and CBS [7]. For GLAD, the default settings were used. GLAD provides output labels which correspond to loss/gain/diploid status for each SNP. For CBS, we post-processed the results by merging classes with predicted means within 0.25 of one another. Furthermore, the class with mean closest to zero was assigned the diploid class (normal class of two copies). The remaining classes were assigned to either gain or loss depending on whether their predicted class mean was larger or smaller than the diploid class. The results are presented in Table 2. The discrete-time HMM performed the best in terms of detecting aberrant loci. The reason for this is that the transition probability matrix of the discrete-time HMM in VanillaICE is actually quite similar to that derived from the continuous-time HMM, the difference being that , whereas in DHMM is fixed. The value used by the in CHMM discrete-time HMM appears to be more optimal for this simulated data set. Table 2. Prediction Results for the Simlution Study. The Second Column is the Misclassification Error Rate, the Third Column is the True Positive Rate of Detection, and the Fourth Column is the True Negative Rate. These Error Rates are based on Averages Across the 100 Samples Method CHMM CHMM.EM GLAD CBS CHMM
48
Misclassification rate 0.42% 0.17% 1.84% 0.18% 0.24%
TPR 53.30% 82.04% 64.12% 81.43% 88.41%
TNR 99.99% 99.98% 98.46% 99.98% 99.90%
IADIS European Conference Data Mining 2009
3. CONCLUSION In this paper we develop and apply a continuous-time Hidden Markov Model for the analysis of genotyping data, to infer regions of altered copy number. We use a number of previously published results to help specify priors for the Bayesian models underlying the HMM. The copy number analysis and databases are a novel development in the area of genomics, hence it is important for models to be flexible enough to enable novel discoveries. In particular the data analysis in this paper underlines the importance of developing a reliable estimation procedure for the parameters of the transition intensity matrix, as the results produced by the Viterbi algorithm are quite sensitive to the specification of these parameters. We are currently working on more sophisticated estimation procedures which would avoid the need for a training data set as is needed for the method of moments estimation. The continuous-time HMM framework we use is a more natural setting, compared to discrete-time HMMs, to develop new prior and parameter specification models. Our results indicate that our CHMM is already competitive with the specialized DHMM implementation for such data (a VanillaICE package) while allowing for a more consistent modeling framework. Future work also includes extending the model to the analysis of multiple samples, with the ultimate goal of detecting copy number polymorphisms.
REFERENCES [1] Henrik Bengtsson et al. Estimation and assessment of raw copy numbers at the single locus level. Bioinformatics, 24(6):759–767, 2008. [2] The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449:851–861, 2007. [3] Andrew Gelman et al. Bayesian Data Analysis. Chapman and Hall, second edition, 2003. [4] Philippe Hupe et al. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20(18):3413–3422, 2004. [5] Ming Lin et al. dchipSNP: signficance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics, 20(8):1233–1240, 2004. [6] Laura E MacConaill et al. Toward accurate high-throughput SNP genotyping in the presence of inherited copy number variation. BMC Genomics, 8(211), 2007. [7] Adam B. Olshen et al. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557–572, 2004. [8] Sheldon M. Ross. Introduction to probability models. Academic Press, seventh edition, 2000. [9] Robert Scharpf. VanillaICE: Hidden markov models for the assessment of chromosomal alterations using highthroughput SNP arrays. R vignette, 2008. [10] Gordon K. Smyth. Linear model and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3, 2004. [11] Susann Stjernqvist et al. Continuous-index hidden Markov modelling of array CGH copy number data. Bioinformatics, 23(8):1006–1014, 20076. [12] George Zogopoulos et al. Germ-line DNA copy number variation frequencies in a large North American population. Human Genetics, 122(3-4):345–353, 2007.
49
ISBN: 978-972-8924-88-1 © 2009 IADIS
OUT-OF-CORE DATA HANDLING WITH PERIODIC PARTIAL RESULT MERGING Sándor Juhász, Renáta Iváncsy Department of Automation and Applied Informatics Budapest University of Technology and Economics Goldmann Gy. ter 3., Budapest, Hungary
ABSTRACT Efficient handling of large amount of data is hindered by the fact that the data and the data structures used during the data processing do not fit into the main memory. A widely used solution for this problem is to use the partitioning approach, where the data set to be processed is split into smaller parts that can be processed in themselves in the main memory. Summarizing the results created from the smaller parts is done in a subsequent step. In this paper we give a brief overview of the different aspects of the partitioning approach, and seek for the most desirable approach to aggregate web log data. Based on these results we suggest and analyze a method that splits the original data set into blocks with equal sizes, and processes these blocks subsequently. After a processing step the main memory will contain the local result based on the currently processed block, that is merged afterwards with the global result of the blocks processed so far. By complexity analysis and experimental results we show that this approach is both fault tolerant and efficient when used in record-based data processing, if the results are significantly smaller than the original data, and a linear algorithm is available for merging the partial results. Also a method is suggested to adjust the block sizes dynamically in order to achieve best performance. KEYWORDS Out-of-core data processing, partitioning, efficient data handling with checkpoints
1. INTRODUCTION Nowadays there are even more applications where the computer-based logging of the different events is a straightforward demand, such as web log files, telephone call records, public service company records and so on. In this way thousands and millions megabytes of raw data are generated daily providing an important raw material for business analysis, workflow optimization and decision support. However, handling such huge amount of data is much more demanding than merely recording them into a log file. While in such data handling tasks (aggregating, creating statistics etc.) the calculations are rather simple, the most important problem to face is the memory limitation, as during the execution the computational costs are usually significantly lower compared to the costs of the memory and I/O operations. The first step of the most real-world data handling processes is preparing the data to fit the need of the following complex data processing algorithms. Such preprocessing task can be the filtering, data transformation, compression, or collecting derived data, such as generating statistics along different dimensions, joining related records from a log file, aggregating certain features of them. As these operations work on the complete large dataset their complexity should be kept as near as possible to linear in order to achieve a good efficiency. A widely used solution to increase the efficiency of handling of out-of-core data is using a so-called partitioning approach (Savasere et al., 1995), where the data set that has to be processed is split into smaller blocks that themselves can be processed in the main memory. The aggregation of the results created during the processing of the smaller blocks is carried out in a subsequent step. In this paper we give a brief overview of the different aspects of the partitioning method, and investigate the most desirable solution for our problem, namely aggregating different fields of web log data. Based on these results we suggest and analyze a method, called Periodic Partial Result Merging, that splits the original data set into smaller data blocks with
50
IADIS European Conference Data Mining 2009
equal sizes, and processes these blocks subsequently in a way, that after processing each one, the main memory will contain the local result belonging to that block, and this local result is merged afterwards with the global result on the disk created from the blocks processed so far. We show that this approach provides fault tolerance by supporting a check pointing technique, produces a valid global result any time form the data processed so far, and at the same time it is efficient for record-based data processing in the cases, where the results are significantly smaller than the original data, and a linear algorithm is available for merging the partial results. The most important benefits of our block-based algorithm include the following: • The original dataset is read only once from the disk. This is achieved by the linear partitioning, that allows processing each record right as they arrive, without leaving any part of the task to a subsequent processing turn. It means that generating a new complete result is based only on the block being currently processed and on the results of the blocks processed previously. • After each block a valid global sub-result is created. That means that the processing is complete regarding the data processed so far. This would not be the case, if the partitions were selected randomly or if the merging phase were a distinct final step of the whole process. • The algorithm is free from memory bottlenecks, the full control over the memory consumption allows processing data files of any length. • Similarly to nearly all partitioning algorithms our method can be well parallelized. Block-wise creation of a final result makes the complexity of the merging phase become quadratic, because of the nature of the algorithm. Of course in order to process long files, the quadratic complexity of merging should be handled somehow. As, the size of the results is usually small compared to the size of the original data records, the coefficient of the quadratic part is small, which can be reduced further by choosing the block size well, thus we will show that the execution time of the merging task can be approximated by a linear complexity. We will give a simple method allowing setting the optimal block size automatically without any user intervention. During our experiments we will compare our method to the “brute force” approach with eager memory use which is more efficient for short inputs, but suffers from serious limitations when processing large amount of data. We will also draw the attention to the fact that using our method makes the final complexity of the complete processing become independent from the complexity of the original algorithm creating the partial results when the original algorithm fulfills some constraints. Based on these facts we can draw the conclusion that under the later specified circumstances the time need of our algorithm is substantially linearly proportional to the number of the input records. The organization of the paper is as follows. Section 2 gives an overview of the related work regarding the ouf-of-core data handling. Section 3 introduces our novel algorithm called Periodic Partial Result Merging. Section 4 contains complexity analysis, while Section 5 introduces the experiments done on real data files. We summarize our work in Section 6.
2. RELATED WORK Efficient handling of large datasets that do not fit into the main memory is a challenging task because the I/O costs are significantly higher than that of the main memory accesses. For this reason different approaches are used to solve this problem. In this section we give a brief overview how large data is handled in the literature in a way that the data that has to be processed actually fit into the main memory. These main approaches are based on sampling, compression and partitioning. If the result to be generated does not necessary have to be complete, the sampling method (Toivonen, 1996) (Zaki et al, 1997) can be used for creating an approximate result in the main memory. In this case, a well chosen, representative part of the whole dataset is read into the memory (using some heuristics to obtain a good sample), and the processing task is carried out on this small part of the data only. In some cases the results are verified with a subsequent read of the complete database (Lin. and Dunham,1998). The disadvantage of the method is that it does not create a complete result, thus it may not find all the necessary results or in case of aggregation like tasks the method obtains an approximate result only. Furthermore, the heuristics for the sampling phase is not trivial for all cases. The advantage of sampling is that it has to read the database only twice in worst case.
51
ISBN: 978-972-8924-88-1 © 2009 IADIS
Another method for handling large datasets is to compress the original data, so that the compressed form fits into the memory. The subsequent steps generate the result based on the compressed form of the data (Han et al., 1999) (Grahne and Zhu. 2004). Of course not all dataset can be compressed to the necessary extent, thus this approach has to be combined with the other approaches mentioned in this section. A typical form of this solution is to use a compact data structure, such as a tree, for storing the data. A widely used approach of out-of-core data handling is the partitioning method (Savasere et al., 1995). Here the input data is split into smaller blocks that fit into the memory, and the processing algorithm is executed on these parts of data successively. The main difference between the sampling and the partitioning is that in case of partitioning all the input records are used, that is, the union of the used blocks completely covers the original input dataset, so that all records are used once and only once. The partitions may contain subsequent or arbitrary grouped records. The processing task is executed on the distinct partitions, the results are written to the disk, and the global result, that is, the result based on the whole input dataset, is created in a subsequent step by a new disk read by merging the locally found results. The way the local results can be used to generate a single global result is based on the way the local results are created. In some cases the global result is only the union of the local ones (Grahne and Zhu. 2004) (Nguyen et al., 2005) (Nguyen et al., 2006). In other cases a simple merging task has to be accomplished (Savasere et al., 1995) (Lin and Dunham, 1998) ,or another more complex algorithm has to be executed for generating the global result (Tang et al, 2005). For example for generating the maximal closed itemsets in itemset mining, for generating the global maximal closed itemsets all frequent itemsets have to be known, thus a further database read is needed for obtaining the final results (Lucchese et al, 2006).. It is a particular task in case of partitioning algorithms to determine the size and the content of each partition. The trivial way is to split the input data into successively following parts of the same sizes (Savasere et al., 1995) (Lin and Dunham, 1998) (Lucchese et al, 2006). However, in some cases it is worth to do some extra work for reduce the complexity of the processing task. For example block creation based on clustering the records of the input file reduces the dependencies between the items of different blocks, thus it can accelerate the processing and the merging of the partitions significantly, or it can help to reduce the volume of the false result candidates (Nguyen et al., 2005) (Nguyen et al., 2006). After introducing the state of the art of out-of-core data handling, in the next section we will suggest a novel partitioning method for this same purpose.
3. PERIODIC PARTIAL RESULT MERGING The here introduced Periodic Partial Result Merging method uses a partition-based approach for processing data that does not fit into the main memory. The main purpose of developing this algorithm was to handle time-ordered log files continuously and efficiently in a web log mining project (Iváncsy and Juhász, 2007) . The algorithm builds on the previously mentioned idea of splitting the input data set into subsequent blocks of equal size. The specialties of the approach are the following: • In the basic version each block processing phase is followed by a merge phase. The reason for that is to periodically provide a global result for all the records that have been already processed. Furthermore this approach fits well the immediate processing of the newly generated records, and support easy implementation of pipeline based parallelism. • The merging phase is facilitated using a hash table for storing the local and in a certain sense the global results as well. Hash tables are not only useful in creating the results (finding the correct aggregate structure for each record), but they also support fast merging of the partial results. The solution presented in this paper expects a file containing a sequence of records as input. Each record has the same structure and length; the order and sizes of the fields in the record are identical. Although the input file is a series of logical related records, it is not necessary that these records are contained by a single file. While the algorithm executes the processing step on the blocks individually as usual (Savasere et al., 1995) , instead of storing each partial result on the disk and merging them after each block in a final step, our method merges the locally generated results with the global result created based on the blocks processed so far. Periodic Partial Result Merging method splits the original input set of N records into S number of blocks containing the same amount of records (M records each, S = N/M). The size of the blocks has to be chosen
52
IADIS European Conference Data Mining 2009
arbitrary according the needs of the processing provided they fit into the main memory together with all the structures that are used during the processing step. These blocks are processed then one by one. During a block is processed an ordered local result is generated that is stored in the main memory. The size of this storage structure is estimated when calculating the size of the blocks. At the end of each iteration the memory-based local result is merged with the ordered global result created so far, which is stored on the disk. The merged new result is also written to the hard disk. The iterations are repeated continuously until all blocks are finished. An important advantage of this approach is that local results are never written to the disk, as merging is done from the memory, thus the disk only stores the different iterations of the global result. The detailed steps of the Periodic Partial Result Merging algorithm and an illustration for the merging phase are depicted in Figure 1, while the notations used in the code are shown in Table 1. The example for merging is presented by using an aggregation (summing) operation. Table 1. Notations and their Meaning Used in the Pseudo Code in Figure 1 Notation
Meaning
Bi RLi RGi RG RLi(j) RGi(j) |RGi|
The block i. The local result set generated from the block i. The global sub result set generated after processing the block i. The global result set The record j of the block i. The record j of the global result set after processing the block i. The size of the global result set after iteration i.
Figure 1. Pseudo Code of the Algorithm and Illustration of the Merging Step It is important to note that ordering is a key feature to allow easy and linear merging of the partial results. Normally this is done by a separate sorting operation with a complexity of m*log m preferring lower block sizes (m) as included in Figure 1 as well. Using hash tables provides an implicit ordering of records enforced by the storage structure itself, thus the cost of sorting is eliminated in our approach which place absolutely no restriction on the block sizes. A further advantage of the method is that after each block we have a complete global result for all the records processed so far. Thus in case of a system crash only the processing step of the last block has to be repeated that was not finished before the system was down. This is supported by the feature of merging the global result with a disk to disk operation preserving the previous global result as well.
53
ISBN: 978-972-8924-88-1 © 2009 IADIS
4. COMPLEXITY ANALYSIS Our analysis of complexity is based on summarizing the cost of the two periodically repeated steps, namely the cost of generating the local result and that of the merging phase. Note, that we assume the existence of a method to store the local and the global results in an ordered manner and also an algorithm that allows the linear, pair by pair merging these two types of results (as shown above).Considering the huge amount of data the complexity analysis is accomplished separately regarding the number of disk accesses and the processing costs. The disk complexity of the algorithm is calculated as follows. The disk usage cost kdisk is measured in records and has three components for each block: reading the input data, reading the global result produced in the previous step, and writing next iteration of global result. Assuming that processing x records of data generates αx amount of result (usually α«1), the cost of processing block j can be calculated using the following equation: kdisk( j ) = kdata read( j ) + kresult read( j −1) + kresult write( j )kdisk( j ) = m + m( j − 1)α + mjα = m(1 + α( 2 j − 1 )) where m is the number of records found in one block. The I/O cost of the algorithm for the whole process (where n/m=s is kept in mind): s
s
s
α
j =1
j =1
j =1
m
k disk = ∑ k disk ( j ) = ∑ m (1 + α ( 2 j − 1) ) == ms + m α ∑ (2 j − 1) = ms + m α s 2 = n + n 2
From this equation we can draw the conclusion that the disk demand can be considered nearly linear when α is small enough (that is the processing algorithm creates an aggregated (compressed) result of the input data) and/or when the block size m is great enough. The I/O independent part of the processing costs can be calculated by considering the processing complexity of the algorithm executed on each block, denoted by f(m), and considering the number of comparison steps when creating the global results. The number of the comparison steps after each block equals to the sum of the number of the local result records (mα) and the number of the global result records (mα(j-1)). The processing cost of the block j can be calculated as follows: k proc ( j ) = f ( m) + ( mα + m( j − 1)α ) β = f ( m) + jmαβ
where β is a proportional factor to handle the two types of operation (processing step and the merging step) in the same manner. The total cost of the whole process can be calculated as follows: k proc =
s
∑k j =1
proc ( j )
=
s
∑ ( f (m ) + j =1
jm αβ ) = sf ( m ) +
s (1 + s ) f (m ) ⎛ n n 2 ⎞ m αβ ⎛ f ( m ) αβ ⎞ 2 αβ m αβ = n = n⎜ + + ⎜⎜ + 2 ⎟⎟ ⎟+n 2 m m m 2 ⎠ 2m ⎝ m ⎠ 2 ⎝
From the above formula we can see that the complexity becomes quadratic independently of the original processing complexity f(m); moreover, the weight of the quadratic term is low when a large block size is chosen, provided that the processing step significantly reduces the volume of the results compared to the volume of the input records (α is small enough). An important advantage of the method, as mentioned earlier, is that the partial processing can be suspended at the end of every block and restarted later. This makes the algorithm fault tolerant, because errors arising during the long processing time and unplanned halts do not force the system to process the whole dataset again from the beginning; only the last, uncompleted block has to be reprocessed. From these facts we conclude that the method can be used efficiently when the disk demand of the global result is significantly smaller than that of the records from which the result is created. When processing large datasets this prerequisite is generally fulfilled, because the main task is exactly to derive well-arranged, informative and compact data from the disordered, unmanageably large input.

Block-wise creation of the final result makes the complexity of the merging phase quadratic due to the nature of the algorithm. It is important to note that this seems avoidable by applying a single merging step after producing all the local results, as suggested in the literature. Unfortunately, this approach has several drawbacks: not only do we lose the possibility of continuously having a global result, but the method would not guarantee better performance at all. First, it requires writing the local results of each block to disk and reading them all again at the end, which comes with extra (although linear) I/O costs. Secondly, and more importantly, the merging would still not be linear, as it is very unlikely to be completely feasible in the main memory; thus a tree-like merging of logarithmic depth, or a one-by-one merging of quadratic complexity (the same as in our case), would again be needed. This causes extra overhead and makes the algorithm more complicated to implement as well.

Of course, to process long files the quadratic complexity of merging has to be handled somehow. As the above analysis showed, larger block sizes help the merging, but might be disadvantageous when the basic block processing complexity f(m) is above linear. It is also hard to determine in advance the maximum block size that fits into the main memory together with all the auxiliary structures and the partial results derived from the block. This difficulty can be overcome by using blocks significantly (10-50 times) smaller than the main memory would allow and modifying the basic algorithm to handle several blocks in memory. The modified version keeps processing the blocks one by one, but monitors the amount of remaining memory after each processing step. If there is enough available memory (estimated from the needs of the previous block), the next block is processed by continuing to use the same aggregation structures as before. This way the partial result belonging to several blocks is produced together in memory. The process continues until the available memory appears insufficient (or a preset memory limit is reached); at that point we apply the usual disk-based global merging, clear the memory and start processing the next block. Thus, without any further effort, we obtain automatic control of memory usage, as the necessary number of smaller blocks is united during processing to reach the optimal granularity. The complete global result is still periodically created, but instead of being produced after each block, it is only available after processing a larger group (10-50 pieces) of blocks.
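To illustrate the adaptive block handling described above, the following Python fragment gives a minimal sketch of the loop: blocks are aggregated in memory (reduced here to a simple count per key), the memory footprint is estimated after each block, and a disk-based merge of the sorted partial result into the sorted global result is triggered only when a preset limit is reached. The names (process_block, merge_into_global, MEMORY_LIMIT) and the crude memory estimate are assumptions made for this sketch, not the authors' C++ implementation.

```python
import os

MEMORY_LIMIT = 512 * 2**20      # preset memory limit in bytes (assumption)

def estimate_memory(partial):
    # Crude footprint estimate of the aggregation structure; the paper derives
    # the estimate from the memory needs of the previous block instead.
    return len(partial) * 64

def process_block(records, partial):
    # Local processing step: aggregate one block into the in-memory structure
    # (reduced here to a simple count per key).
    for key in records:
        partial[key] = partial.get(key, 0) + 1

def merge_into_global(partial, global_path):
    # Linear, pair-by-pair merge of the sorted local result with the sorted
    # global result on disk; writes the next iteration of the global file.
    local = sorted(partial.items())
    i = 0
    tmp = global_path + ".tmp"
    with open(tmp, "w") as out:
        if os.path.exists(global_path):
            with open(global_path) as old:
                for line in old:
                    gkey, gval = line.rstrip("\n").split("\t")
                    while i < len(local) and local[i][0] < gkey:
                        out.write(f"{local[i][0]}\t{local[i][1]}\n")
                        i += 1
                    if i < len(local) and local[i][0] == gkey:
                        out.write(f"{gkey}\t{int(gval) + local[i][1]}\n")
                        i += 1
                    else:
                        out.write(f"{gkey}\t{gval}\n")
        for key, value in local[i:]:
            out.write(f"{key}\t{value}\n")
    os.replace(tmp, global_path)   # checkpoint: a complete global result exists

def run(blocks, global_path="global_result.tsv"):
    # Process blocks one by one, but merge to disk only when the estimated
    # memory consumption reaches the preset limit (adaptive block grouping).
    partial = {}
    for block in blocks:           # blocks: iterable of iterables of record keys
        process_block(block, partial)
        if estimate_memory(partial) >= MEMORY_LIMIT:
            merge_into_global(partial, global_path)
            partial.clear()
    if partial:
        merge_into_global(partial, global_path)
```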
5. MEASUREMENT RESULTS

To validate the above approach, a series of measurements was carried out on two different hardware configurations. Configuration A represents a memory-limited environment (CPU: AMD Athlon at 2800 MHz, memory: 512 MB, operating system: Windows XP Professional), while Configuration B is a stronger computer with a large memory (CPU: Intel Pentium 4 at 3200 MHz, memory: 4 GB, operating system: Microsoft Windows Server 2003). All the algorithms were implemented in C++. The main objectives of our tests were to highlight the necessity of enhanced out-of-core processing and to compare the characteristics of the Periodic Partial Result Merging approach to the eager, "brute force" approach.

The input data used during the tests came from one of our web log mining projects (Iváncsy and Juhász, 2007), which aims to create meaningful web usage user profiles. The information related to each specific user is extracted and aggregated based on certain identifiers found in the web log records. The input file is a compressed extract of log files gathered from a few hundred web portals, and contains information about 3-4 million users browsing these web servers. The users in our scope produce 200 million entries (24 bytes each, i.e. 4.5 GB of data) each day. For technical reasons, the daily entries are grouped into files containing 4 million records, plus an additional shorter file that is closed at the end of the day. Our measurements cover the processing of 30 days of data (140 GB), and in the figures we use the number of processed files (of 4 million records each) as the unit of the horizontal axes. The input-output size ratio (referred to earlier as α) was about 2-3% during the processing. This example demonstrates that processing as little as one month of web log data can be a significant challenge that requires special handling.

To demonstrate the memory-limited behavior and to create a performance baseline for the subsequent measurements, we implemented a "brute force" algorithm, which continuously reads the input files and builds the result structures in the main memory. This results in an eager use of the main memory, creating an ever-growing memory footprint for storing the results. The first experiment (Figure 2) presents the behavior of this eager method in a memory-limited environment (Configuration A). It is visible that the behavior of the algorithm changes radically when it runs out of memory and the operating system starts swapping. This happened after processing 39 files, which corresponds to the log file amount produced during 0.8 day. After this point the processing time of a single file (4 million records) abruptly grew from the average of 2-3 minutes by a factor of about 100. (Note the logarithmic scale of the chart in the middle of Figure 2, which depicts the execution time of the individual files.)
[Figure 2 charts: total execution time [min], memory usage [MBytes] and file-specific execution time [min], each plotted against the number of processed log files [pcs].]
Figure 2. Execution Time and Memory Demand of the Eager Processing in Limited Memory Environment
The eager method was also tested on Configuration B (see later as part of Figure 6). In this case the operation managed to finish 874 files, which corresponds to the log file amount produced during 17.5 days. The reason for stopping there was the addressing limitation of the 32-bit operating system, which allowed no more than 2 GB of RAM to be allocated to a single process. The above results show that the eager method is rather efficient, but after a certain point it fails to fulfill its goal (although the causes are slightly different for Configurations A and B). This outcome reinforces the widely known fact that voluminous data processing requires a structural change in the processing system. Although our block-based algorithm performs more operations than the eager approach, its basic goal is to have full control over the memory consumption and thereby to allow continuous operation on arbitrarily sized data, while keeping the processing efficiency as high as possible. As in all partitioning-based approaches, one key issue for the efficiency of the Periodic Partial Result Merging algorithm is the appropriate choice of the block size. The measurements presented in Figure 3 were carried out using Configuration B, varying the number of files processed between two global result merging phases between 1 and 16. We can conclude that the memory need increases as the block size grows, but to a smaller degree than the growth of the block size, as the contents of the files overlap to a certain extent.
[Figure 3 charts: total execution time [min] and memory usage [MB] against the number of processed log files [pcs], with one curve per block size (1, 2, 4, 8 and 16 files).]
Figure 3. Execution Time and Memory Demand of the Block-Based Algorithm with Various Checkpoint Periods

It is also apparent that the longer the period between the merging phases, the more linear the total execution time gets, thanks to the reduction of the amount of merging of quadratic complexity. The Periodic Partial Result Merging algorithm used up to this point works according to the pseudo code presented in Figure 1. The total execution time of the algorithm is divided into three phases: local processing, sorting the local results, and merging the sorted local results with the global result. Figure 4 shows the distribution of the execution time among these three phases when 1, 4 and 16 files are used as the block size for producing the local result.
[Figure 4 charts: file-specific execution time [min] against the number of processed log files [pcs] for block sizes of 1, 4 and 16 files, split into processing, sorting and merging time.]
Figure 4. Distribution of the File Specific Execution Times Between the Processing Phases with Various Checkpoint Periods
Figures 5 and 6 compare the execution time and the memory demand of the eager and the periodic approaches for Configurations A and B, respectively. As the processing is inherently slow (processing the 30 days of data takes several days), the partial results are saved to disk from time to time in the case of the eager method as well (a checkpoint after every 16 files). We present three versions of the Periodic Partial Result Merging algorithm. The first one is the basic method with a block size of 16 files. The second one is the automatic approach with adaptive block sizes, where the memory limit was set to 512 MB. The third version contains a further optimization, which uses the ordering provided by the internal hash table instead of a complete explicit sort (sorting is still needed, but only inside the slots of the hash table). It is clearly visible that the automatic choice of the block sizes provides a comfortable way of obtaining the best performance of the algorithm while preserving full control over the memory consumption. The elimination of the sorting cost resulted in a further 5-6% performance gain.
[Figure 5 charts: total execution time [min] and memory usage [MB] against the number of processed log files [pcs] for the eager method, the block-based method (16 files), the adaptive method (300 MB) and the adaptive method (300 MB) without sorting.]
Figure 5. Execution Time and Memory Demand of the Different Algorithms in Memory Limited Environment
[Figure 6 charts: total execution time [min] and memory usage [MB] against the number of processed log files [pcs] for the eager method and the adaptive method (512 MB) without sorting.]
Figure 6. Execution Time and Memory Demand of the Eager and the Best Block-Based Algorithm Using Configuration B

Figure 6 shows that, where both methods work as they are supposed to, Periodic Partial Result Merging is visibly (by up to 50%) slower than the eager method. This is the price paid for the increased amount of I/O operations (the global result is periodically read completely and written back to the disk). Although in Figure 6 the quadratic complexity is still not apparent after 3300 minutes (55 hours) of execution, Figure 3 reminds us that the optimization does not suppress this component entirely. It is important to note that the memory limit of the adaptive algorithm was set to 512 MB, which is far less than the 2 GB the eager algorithm was allowed to use. Choosing the memory limit too high would cause the algorithm to run for long periods (here 55 hours) without writing out any result, which would considerably reduce the effectiveness of check-pointing.
6. CONCLUSIONS AND FUTURE WORK

When processing huge amounts of data, it is important to control the memory handling of the algorithms explicitly. We have to aim at processing the highest possible amount of data in memory, without the intervention of the virtual memory management of the operating system. In this paper we described and analyzed a partitioning-based approach called Periodic Partial Result Merging, and showed that under certain circumstances it exhibits a nearly linear behavior, while it provides a valid global result continuously during the processing and by its nature allows easy check-pointing. One of the most important questions for all partitioning methods is how to estimate the size of the blocks. In this paper we suggested a method to adjust the block sizes dynamically by (logically) creating a number of smaller blocks as basic units and processing them in groups respecting a preset memory limit. This is achieved by continuously monitoring the memory allocation during the processing phase; when the available memory does not seem sufficient, the processing of the next block is postponed until after the merging of the current partial result into the complete global result. Another optimization suggested in this paper was to use hash tables to organize the partial results where possible, as they provide a constant average access time to their elements and also offer an implicit internal ordering of the partial and global results, allowing the explicit sorting phase to be omitted. Although it was not shown in the paper, the algorithm is also well suited for parallel processing, either in a data-parallel or in a pipelined manner, but in this case it is harder to take advantage of the automatic choice of block sizes and to maintain the continuous presence of a valid global result.
ACKNOWLEDGEMENTS This work was completed in the frame of Mobile Innovation Centre’s integrated project Nr. 3.2. supported by the National Office for Research and Technology (Mobile 01/2004 contract).
REFERENCES

Benczúr, A. A., Csalogány, K., Lukács, A., Rácz, B., Sidló, Cs., Uher, M. and Végh, L., Architecture for mining massive web logs with experiments, In Proc. of the HUBUSKA Open Workshop on Generic Issues of Knowledge Technologies.
Grahne, G. and Zhu, J., 2004, Mining frequent itemsets from secondary memory, ICDM '04, Fourth IEEE International Conference on Data Mining, pp. 91-98.
Han, J., Pei, J. and Yin, Y., 1999, Mining frequent patterns without candidate generation. In Chen, W., Naughton, J. and Bernstein, P. A., editors, Proc. of ACM SIGMOD International Conference on Management of Data, pp. 1-12.
Iváncsy, R. and Juhász, S., 2007, Analysis of Web User Identification Methods, Proc. of IV. International Conference on Computer, Electrical, and System Science, and Engineering, CESSE 2007, Venice, Italy, pp. 70-76.
Lin, J. and Dunham, M. H., 1998, Mining association rules: Anti-skew algorithms, In 14th Intl. Conf. on Data Engineering, pp. 486-493.
Lucchese, C., Orlando, S. and Perego, R., 2006, Mining frequent closed itemsets out of core. In SDM '06: Proceedings of the Third SIAM International Conference on Data Mining, April 2006.
Nguyen Nhu, S. and Orlowska, M. E., 2005, Improvements in the data partitioning approach for frequent itemsets mining, 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-05), pp. 625-633.
Nguyen, S. N. and Orlowska, M. E., 2006, A further study in the data partitioning approach for frequent itemsets mining, ADC '06, Proceedings of the 17th Australasian Database Conference, pp. 31-37.
Salvatore, C. L. and Perego, O. R., 2006, Mining frequent closed itemsets out-of-core, 6th SIAM International Conference on Data Mining, pp. 419-429.
Savasere, A., Omiecinski, E. and Navathe, S., 1995, An efficient algorithm for mining association rules in large databases, VLDB '95: Proceedings of the 21st International Conference on Very Large Data Bases, pp. 432-444.
Tang, P., Ning, L. and Wu, N., 2005, Domain and data partitioning for parallel mining of frequent closed itemsets. In Proceedings of the 43rd Annual Southeast Regional Conference - Volume 1 (Kennesaw, Georgia, March 18-20, 2005). ACM-SE 43. ACM, New York, NY, pp. 250-255. DOI= http://doi.acm.org/10.1145/1167350.1167423
Toivonen, H., 1996, Sampling Large Databases for Association Rules, Morgan Kaufmann, pp. 134-145.
Zaki, M. J., Parthasarathy, S., Li, W. and Ogihara, M., 1997, Evaluation of Sampling for Data Mining of Association Rules, 7th International Workshop on Research Issues in Data Engineering (RIDE, in conjunction with ICDE), pp. 42-50, Birmingham, UK, April 7-8.
A FUZZY WEB ANALYTICS MODEL FOR WEB MINING

Darius Zumstein, Michael Kaufmann
Information Systems Research Group, University of Fribourg
Boulevard de Pérolles 90, 1700 Fribourg (Switzerland)
ABSTRACT

Analysis of web data and metrics has become a crucial task of electronic business to monitor and optimize websites, their usage and online marketing. First, this paper gives an overview of the use of web analytics, of different web metrics measured by web analytics software like Google Analytics, and of other Key Performance Indicators (KPIs) of e-business. Second, an architecture of a fuzzy web analytics model for web usage mining is proposed to measure, analyze and improve website traffic and success. In a fuzzy classification, values of web data and metrics can be classified into several classes at the same time, which allows gradual ranking within classes. Therefore, the fuzzy logic approach enables a more precise classification and segmentation of web metrics and the use of linguistic variables or terms, represented by membership functions. Third, a fuzzy data warehouse as a promising web usage mining tool allows fuzzy dicing, slicing and (dis)aggregation, and the definition of new query concepts like "many page views", "high traffic period" or "very loyal visitors". Fourth, Inductive Fuzzy Classification (IFC) enables an automated definition of membership functions using induction. These inferred membership degrees can be used for analysis and reporting.

KEYWORDS

Fuzzy classification, fuzzy logic, web analytics, web metrics, web usage mining, electronic business.
1. INTRODUCTION Since the development of the World Wide Web 20 years ago, the Internet presence of companies has become a crucial instrument of information, communication and electronic business. With the growing importance of the web, the monitoring and optimization of a website and online marketing has become a central task too. Therefore, web analytics and web usage mining gain in importance for both business practice and academic research. Web analytics helps to understand the traffic on the website and the behavior of visitors. Today, many companies are already using web analytics software like Google Analytics to collect web data and analyze website traffic. They provide useful dashboards and reports with important metrics to the responsible persons of the website, marketing or IT. Like web analytics software, a data warehouse is an often-used information system for analysis and decision making purposes. So far, classifications of web metrics or facts in data warehouses have always been done in a sharp manner. This often leads to inexact evaluations. This paper shows a fuzzy logic approach applied to web analytics, where classification of metrics yields a gradual degree of membership in several classes. After an introduction to web analytics in section 2, section 3 proposes a process and an architecture of a web analytics model with seven layers. Section 4 explains the fuzzy classification approach and discusses two fuzzy induction methods for web usage mining. Section 5 gives a conclusion and an outlook.
2. WEB ANALYTICS According to the Web Analytics Association (2009), web analytics is the measurement, collection, analysis and reporting of Internet data for the purposes of understanding and optimizing web usage. Phippen et al. (2004) quote the Aberdeen Group by defining web analytics as the monitoring and reporting of website usage so that enterprises can better understand the complex interactions between web site visitor actions and website offers, as well as leverage insight to optimise the site for increased customer loyalty and sales.
However, the use of web analytics is manifold and is not restricted to the optimization of websites (Table 1; see e.g. Peterson 2005, Kaushik 2007). By analyzing log files or using page tagging, website traffic and visitors' behaviour can be observed and analyzed in detail with the different web metrics listed in Table 2.

Table 1. Use of Web Analytics
Web analytics is necessary for the ongoing optimization of:
- website quality (navigation, structure, content, design, functionality & usability)
- online marketing (awareness, image, campaigns, banner & keyword advertising)
- online CRM (customer relationship management: customer acquisition/retention)
- individual marketing (personalized recommendation/content, mass customization)
- segmentation of the traffic, visitors and online customers
- internal processes and communication (contacts, interactions, relations)
- search engine optimization (visibility, search engine rankings, PageRank, reach)
- traffic (page views, visits, visitors)
- e-business profitability (efficiency & effectiveness of the web presence)

Table 2. Definitions of Web Metrics (also Measured in Google Analytics (2009))
Page views: The number of page views (page impressions) of a web page accessed by a human visitor (without crawlers, spiders or robots)
Visits: A sequence of page view requests (a session or click-stream) of a unique visitor without interruption (of usually 30 minutes)
Visitors: The number of unique visitors (users) on a website
Pages/visits: The Ø number of pages visitors have seen during a visit
Time on site: The Ø length of time all visitors spent on the website
Stickiness: The capability of a web page to keep a visitor on the website
Bounce rate: The percentage of single page view visits
Frequency: The number of visits a visitor made on the site (loyalty indicator)
Recency: The number of days passed since a visitor's last visit to the site
Length of visit: The time of a visit the visitor spent on the website (in seconds)
Depth of visit: The number of pages the visitor visited during one visit
Conversion rate: The percentage of visitors who convert to customers
The numbers of page views, visits and visitors are standard metrics in web analytics practice and are often discussed in the literature. However, their significance is limited, since a high number of page views, visits or visitors does not necessarily generate high monetary or non-monetary e-business value. Stickiness, visit frequency, depth of visit and length of visit are considered Key Performance Indicators (KPIs) of web usage behaviour. If a website sells products or services in an online shop, further KPIs of e-business have to be considered (see Figure 1): conversion and order rates, loyalty (e.g. purchase frequency), the number of new and returning customers, and online revenues or profits (e.g. measured per visit, visitor, customer or order, or in total).
[Figure 1 diagram: web metrics along the click path (entry page, web pages, product site, basket, order, exit page), traffic sources (search engines, external links, bookmarks/URL entries, online advertising), metric ratios (reach, ad click rate, display click rate, click-to-basket rate, basket-to-buy rate, order rate, conversion rate, ad conversion rate) and KPIs such as purchase frequency & recency, the number of new and returning customers, and online revenue & profit (e.g. per visit, visitor, order).]
Figure 1. Relations between Web and e-Business Metrics
3. WEB ANALYTICS MODEL

3.1 Web Analytics Process

The selection of web metrics and KPIs always depends on the strategy and objectives of a website. Different metrics and KPIs have to be measured to analyse, for instance, a news website, an e-commerce site, a support website, a blog or a community website. Therefore, the web analytics process starts with the task of identifying and defining the main goals and KPIs of a website. The collection, preparation and analysis of web data belong to the website, integration and data layers of the web analytics model (compare Figure 2). Within the data warehouse and web mining layers, a web metrics system is modelled and implemented by a data warehouse, which handles and classifies the integrated web data fuzzily. On the presentation layer, query results, reports and dashboards are provided to analysts. Together with the managers of IT, marketing or accounting, they analyze and control website-related results, and they plan and carry out activities to optimize e-business success.
3.2 Website & Data Layer

In the domain of web analytics, two technical methods are mainly used to analyse website traffic: server-side and client-side data collection. Server-side data collection methods extract and analyse data from the log files (1 in Figure 2). However, server-side methods do not measure web traffic exactly, because of caching in web browsers and proxy servers and due to requests by search engine crawlers or robots. In addition, page views and visitors cannot be identified distinctly with log files, nor are events (like mouse clicks) recorded. Finally, log file extraction and reporting are complicated and time-consuming, particularly if multiple websites or several web servers are used. As a result, this method has become less important. Using client-side data collection or page tagging (2), data about the visitor and each page view is sent to a tracking server by a JavaScript code (or a one-pixel tag) inserted in each (X)HTML page. If a website is built with a Content Management System (CMS; 3), the JavaScript snippet can be embedded easily. With the client-side data collection method, all visits and all actions of a visitor on every web page can be tracked exactly (i.e. each mouse click and all keyboard entries), as well as technical information about the visitor (e.g. size/resolution of the screen; type/language of the browser or operating system used). Further advantages are that JavaScript is not cached by proxies or browsers, and crawlers of search engines do not read it. Google, WebTrends, Nedstat and many other companies provide web analytics software using page tagging. Google Analytics (2009) is the most used freeware and is well documented in books (Clifton 2008). However, from an e-business point of view, not only click-stream data about the visitors is interesting, but especially customer-related information about online orders, the purchase history, payment behaviour, the customer profile, and so on. Data of Customer Relationship Management (CRM; 4) is stored in operational databases. To ensure a holistic view of online customers, data from the various sources has to be cleaned, transformed and consolidated in a data preparation step, before it is integrated into the data warehouse (5).
3.3 Data Warehouse & Web Mining Layer

For analysis and reporting purposes, web and online customer data is loaded into a Data Warehouse (DWH; 5). A DWH is defined as a multidimensional, subject-oriented, integrated, time-varying, non-volatile collection of data in support of the management's decision-making process (Inmon et al. 2008). A DWH is a promising tool for web analytics, since facts like page views or online revenues can be analysed and aggregated over different dimensions like website, visitor and time (6). Another strength of a DWH is the possibility to provide user- and context-relevant information using slicing and dicing. Slicing, dicing, drill-down and roll-up (i.e. dis-/aggregation) based on the fuzzy logic approach discussed in the following section enable the definition of linguistic variables and extended dimensional concepts. For example, new classification concepts like "high traffic period" (e.g. in the evening or at the weekend), "many page views" (of a web page or of a visitor), "very loyal visitors" (with a high visit or purchase frequency) or "high value customers" (with high online revenues) can be defined and queried by data analysts.
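As an illustration of such a query concept, the following Python fragment sketches how "many page views" could be represented as a membership function and used for fuzzy filtering and ranking of fact rows; the breakpoints, the alpha-cut threshold and the example data are assumptions chosen for this sketch and are not part of the proposed model.

```python
def mu_many_page_views(page_views, low=30, high=70):
    # Piecewise-linear membership degree of "many page views" in [0, 1];
    # the breakpoints 30 and 70 are assumptions for the example.
    if page_views <= low:
        return 0.0
    if page_views >= high:
        return 1.0
    return (page_views - low) / (high - low)

# Hypothetical fact rows: (visitor, page views per month).
facts = [("visitor_1", 65), ("visitor_2", 70), ("visitor_3", 12)]

# Fuzzy "slice": keep rows above an alpha-cut and rank them gradually.
alpha = 0.5
ranked = sorted(
    ((v, mu_many_page_views(pv)) for v, pv in facts
     if mu_many_page_views(pv) >= alpha),
    key=lambda pair: pair[1], reverse=True)
print(ranked)   # [('visitor_2', 1.0), ('visitor_1', 0.875)]
```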
It is proposed by Fasel and Zumstein (2009) to define the fuzzy classes, dimensions, linguistic variables and membership degrees of the fuzzy data warehouse by means of meta data (a meta fuzzy table). This method is more flexible than other approaches and it fulfils the additional requirements of a DW 2.0 (Inmon et al. 2008).
[Figure 2 diagram: layers of the web analytics model. Website layer: web pages with page tagging, CMS and log files. Data layer: client-side data collection (tracking server, visitor identification via login, IP address, cookies), server-side data collection, CRM software, customer and other databases, internal and external web data. Integration layer: data preparation, Application Programming Interface (API), web analytics software (Google Analytics, eTracker, WebTrends, Omniture, etc.). Data warehouse layer: fuzzy data warehouse with meta data (meta fuzzy tables); facts and web metrics (e.g. page views, visits, visitors, new visitors, pages/visits, time on site, stickiness, depth of visit, purchase frequency & recency, revenues, conversion rates) aggregated over the dimensions web page, visitor and time (slicing, dicing, aggregation). Web mining layer: inductive fuzzy classification, predictive web analytics, modelling and implementation of the web metrics system. Presentation layer: queries, results, reports, standard dashboards for IT, eMarketing and accounting. Management layer: analyze, plan, act and control against the strategies and goals of the website and e-business.]
Figure 2. Architecture of the web analytics model
Additionally, a fuzzy DWH is also a powerful basis for web usage mining (7). Web usage mining is a part of web mining, besides web structure and web content mining (Liu 2007, Markov & Larose 2007). It refers to the discovery and analysis of patterns in click-streams and associated data collected or generated as a result of user interactions with web resources on one or more websites (Mobasher 2007). In this paper, web usage mining is considered as the application of data mining techniques like fuzzy classification or fuzzy clustering to web analytics in order to detect, analyse or query promising segments of website traffic and of visitor or customer behaviour. As shown in Section 4.2, inductive fuzzy classification allows the automated calculation of probability-based membership degrees for reporting and analysis. This method can improve predictive web analytics (8), as well as predictive analytics in online and individual marketing, as Kaufmann and Meier (2009) show in a case study.
3.4 Presentation & Management Layer

On the presentation layer, content- or user-specific queries, reports and dashboards are prepared and presented by analysts to the responsible person(s) of the IT, marketing or accounting department and to the management board (9), who plan and decide about website-related activities to optimize the website (10).
4. FUZZY WEB ANALYTICS

4.1 Fuzzy Classification of the Web Metric Page Views

The theory of fuzzy sets and fuzzy logic goes back to Lotfi A. Zadeh in 1965. It takes the subjectivity, imprecision, uncertainty and vagueness of human thinking and language into account, and expresses them with mathematical membership functions. A fuzzy set can be defined formally as follows (Zimmermann 1992, Meier et al. 2008, Werro 2008): if X is a set, then the fuzzy set A in X is defined as

A = {(x, μA(x)) | x ∈ X}     (1)

where μA : X → [0, 1] is the membership function of A and μA(x) ∈ [0, 1] is the membership degree of x in A.
[Figure 3 plots: (a) sharp and (b) fuzzy classification of the page views per month (0-100) into the classes "few", "medium" and "many"; in (b) visitor 1 and visitor 2 obtain gradual membership degrees (around 0.40-0.60) in two classes at the same time.]
Figure 3. Sharp (a) and Fuzzy (b) Classification of the Web Metric Page Views
For example, in a sharp classification (Figure 3a), the terms "few", "medium" or "many" of the linguistic variable page views can be either true (1) or false (0). A value of 1 of the membership function μ (Y-axis in Figure 3a) means that the number of page views (on the X-axis) belongs to the corresponding set; a value of 0 indicates that it does not. The number of page views of a visitor per month is defined as "few" between 0 and 32, 33 to 65 page views are "medium" and more than 66 are classified as "many" page views. However, classifying page views – or any other web metric – sharply is problematic near the classification boundaries, as the following example shows. If visitor 1 has 65 page views, he is classified in the "medium" class; visitor 2 with 70 page views has "many" page views. Although the two visitors have visited nearly the same number of pages (visitor 2 visited only 5 pages more), they are assigned to two different sets, or classes respectively. By defining fuzzy sets (Figure 3b), represented by the membership functions, there are continuous transitions between the terms "few", "medium" and "many". Fuzzily, the number of page views of visitor 1 is classified both as "medium" (0.55, resp. 55%) and "many" (0.45, resp. 45%). Visitor 2 also belongs partly to two classes (60% to "many" and 40% to "medium") at the same time. Obviously, the use of fuzzy classes allows a more precise classification of web metric values, and the risk of misclassification can be reduced. The fuzzy classification of KPIs like online revenue, profit or conversion rates is especially pertinent, since their valuations usually have far-reaching consequences for e-business.
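The following Python sketch imitates the membership functions of Figure 3b with trapezoidal functions; the breakpoints are only read approximately off the figure and are therefore assumptions, so the computed degrees are close to, but not identical with, the values quoted above.

```python
def trapezoid(x, a, b, c, d):
    # Membership that rises from a to b, is 1 between b and c, and falls to 0 at d.
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def classify_page_views(x):
    # Breakpoints are assumptions read roughly off Figure 3b.
    return {
        "few":    trapezoid(x, -1, 0, 20, 45),
        "medium": trapezoid(x, 20, 45, 55, 80),
        "many":   trapezoid(x, 55, 80, 101, 101),
    }

print(classify_page_views(65))  # visitor 1: medium ~0.6, many ~0.4
print(classify_page_views(70))  # visitor 2: medium 0.4, many 0.6
```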
4.2 Inductive Fuzzy Classification for Web Usage Mining Web usage mining is a data mining method to recognize patterns in web site navigation by web site visitors (Spiliopoulou 2000, Srivastava et al. 2000). A common web usage mining task is the analysis of associations between visited pages (Escobar-Jeria et al. 2007). Two inductive fuzzy classification methods are proposed
here to discover knowledge about web usage patterns. Inductive Fuzzy Classification (IFC) is the process of grouping elements into a fuzzy set whose membership function is inferred by induction from data (Kaufmann and Meier 2009). First, Inductive Fuzzy Classification by Percentile Rank (IFC-PR) generates fuzzy membership functions for common linguistic terms like "low", "medium" and "high" for the number of page views. Second, Inductive Fuzzy Classification by Normalized Likelihood Ratios (IFC-NLR) can be applied to infer the membership function of a web page in a target page (like a product or order page).
4.2.1 Motivation for the Proposed Methods

The two methods for IFC proposed in this paper use simple probabilistic measures, such as percentile ranks and likelihood ratios, for generating fuzzy membership functions. The aim is to apply this fuzzification in web data analysis for knowledge discovery, reporting and prediction, which has several advantages. First, most data mining methods are dichotomous in nature, especially classification. As proposed by Zimmermann (1997), fuzzy classification methods become appropriate when class membership is supposed to be gradual. Thus the advantage of fuzzy classification in web mining is the possibility to rank web pages by a gradual degree of membership in classes. Second, the results of knowledge discovery are often not directly understandable by human decision makers (Mobasher 1997). The advantage of fuzzy logic based methods is that the generated models (i.e. membership functions) are easy to interpret. In fact, using simple probabilistic measures with clear semantics makes the membership functions more understandable and thus suitable for human decision support. Third, not only do the IFC methods provide gradual rankings that are easily interpretable, but these fuzzy classifications can also be derived and defined from data automatically, by induction. Consequently, they are suitable for application to web usage mining.
4.2.2 Web Usage Mining with IFC-PR

In web usage mining, the importance of web pages is measured by the number of page views per page. This number alone does not have much meaning; only the context of the number in relation to the number of views of other pages provides valuable knowledge. Fuzzy classification is used here to put the number of web page views into a linguistic context (i.e. to use linguistic terms like "low", "medium" and "high"). The IFC-PR method is proposed here to induce the membership functions for these linguistic terms automatically, using the sampled probability distribution (P) of the values, which is defined as follows. For a metric M, the empirical percentile rank of a value x determines the membership in the fuzzy class "high":

μhigh(x) := P(M < x)     (2)
For M, the classification as being "low" is the negation of the membership in the class "high":

μlow(x) := 1 − μhigh(x)     (3)
For a web metric M, the classification as being "medium" is defined by the following formula:

μmedium(x) := 1 − abs(μhigh(x) − 0.5) − abs(μlow(x) − 0.5)     (4)
For example, web page 1 (W1) has 1035 visits per month. To classify this number, the distribution of the number of visits per page is used. Assume that 56 of the 80 pages have fewer visits than W1. Then the calculation is:

μhigh(visits(W1)) := P(number of visits < visits(W1)) = 56/80 = 0.7     (5)
Therefore, the fuzzy classification of the visits per month for the web page W1 being “high” is 0.7. The fuzzy classification for the linguistic term “low” is 0.3, and for “medium” it is 0.6.
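A minimal sketch of this IFC-PR computation (formulas 2 to 4) is given below; the list of per-page visit counts is a made-up sample constructed so that 56 of the 80 pages have fewer visits than W1, mirroring the example above.

```python
def ifc_pr(value, sample):
    # Fuzzy classification of `value` against the sampled distribution `sample`
    # using the empirical percentile rank (formulas 2-4).
    high = sum(1 for v in sample if v < value) / len(sample)   # P(M < x)
    low = 1.0 - high
    medium = 1.0 - abs(high - 0.5) - abs(low - 0.5)
    return {"low": low, "medium": medium, "high": high}

# Hypothetical sample of 80 pages: 56 pages have fewer visits than W1 (1035).
visits_per_page = [10] * 56 + [5000] * 23 + [1035]
print(ifc_pr(1035, visits_per_page))   # high = 56/80 = 0.7, low = 0.3, medium = 0.6
```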
4.2.3 Web Usage Mining with IFC-NLR

For web usage mining it is interesting to know which web pages (page X) are visited together with a target page such as the online shop (target page Y). Therefore, a fuzzy classification is calculated for each web page with a degree of membership in the target page. That type of fuzzy classification indicates the degree of association in web usage between two web pages. To analyze the influence of an analytic variable X on a target variable Y in terms of fuzzy classification, the IFC-NLR method is applied to calculate the membership degree of values x ∈ dom(X) in the values y ∈ dom(Y). Thus, the values of the target variable
become fuzzy classes with a membership function over the values of the analytic variable. To define this function, the IFC-NLR method proposes to calculate a normalized likelihood (L) ratio:

μy(x) := 1 / (1 + L(¬y | x) / L(y | x)) = P(x | y) / (P(x | y) + P(x | ¬y))     (6)

Equation 6 shows the degree of membership of x in y. For example, the following web usage data is considered.

Table 3. Example of Visits of the Web Pages W2 and W3 and the Online Shop

                         Online shop was visited:  yes    no
W2 was visited: yes                                345    234
W2 was visited: no                                 123    456
Total                                              468    690

                         Online shop was visited:  yes    no
W3 was visited: yes                                253    389
W3 was visited: no                                 215    220
Total                                              468    609
The fuzzy classification of page W2 as leading customers to the online shop is calculated as follows:

μonline shop(W2) := 1 / (1 + P(W2 visited = yes | online shop visited = no) / P(W2 visited = yes | online shop visited = yes)) = 1 / (1 + (234/690) / (345/468)) = 0.68491     (7)
The inductive fuzzy classification of the web pages W2 and W3 shows that the visit of the online shop is more likely after a page view of web page W2 (0.68491) than after a page view of web page W3 (0.45839). As a result, probabilistic induction facilitates identifying web pages that generate additional page views for the online shop. These insights can be applied to augment click rates (in Figure 1), online sales (i.e. high order and conversion rates) and, in the end, to increase online revenues.
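The membership degrees of formula 6 can be recomputed directly from the contingency counts of Table 3, as the following sketch shows; only the function name and the rounding are choices made for this example.

```python
def ifc_nlr(x_and_y, x_and_not_y, total_y, total_not_y):
    # Membership of page X in target page Y from co-visit counts (formula 6).
    p_x_given_y = x_and_y / total_y               # P(x | y)
    p_x_given_not_y = x_and_not_y / total_not_y   # P(x | not y)
    return p_x_given_y / (p_x_given_y + p_x_given_not_y)

# Web page W2 vs. the online shop (Table 3): 345 of 468 shop visits also viewed
# W2, while 234 of 690 non-shop visits viewed W2.
print(round(ifc_nlr(345, 234, 468, 690), 5))   # 0.68491
# Web page W3: 253 of 468 shop visits, 389 of 609 non-shop visits.
print(round(ifc_nlr(253, 389, 468, 609), 5))   # 0.45839
```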
4.2.4 Real Data Example of Web Usage Mining with IFC

In order to provide an example, the anonymous Microsoft.com web usage data (1998) has been analyzed with the proposed methods. The data set consists of web visit cases, each containing the subset of web pages viewed in that case. First, an IFC-PR of the number of page views per web page has been calculated. Second, an IFC-NLR of web pages with the target page "Products" has been computed. These can be combined into a two-dimensional fuzzy classification, as shown in Figure 4. This scatter plot allows identifying web pages associated with the products page that also have a high number of page views, in the top right corner of Figure 4.
[Figure 4 scatter plot: μProducts(w) against μHigh(page views(w)) for the Microsoft.com web pages, with labelled points such as Windows NT Workstation, Exchange, MS Access, Windows 95, MS Office, OutLook, FrontPage, Windows NT Server and MS Word.]
Figure 4. Inductive Fuzzy Classification of Microsoft.com Web Usage Data
5. CONCLUSION & OUTLOOK In the Internet age, websites (should) create value both for their visitors and operators. Web analytics provide insights about added value, traffic on the website and about behavior of visitors or customers on web pages. So far, reports and dashboards of web analytics mostly classify and evaluate web metrics values sharply. Nevertheless, sharp classifications of metrics are often inadequate, as academic research shows.
Therefore, this paper proposes a fuzzy web analytics model to overcome the limitations of sharp data handling in data warehouses and web usage mining. In a fuzzy classification, elements can belong to several classes at the same time with a gradual membership degree. In addition, inductive fuzzy classification provides methods to define these membership degrees automatically. The advantage of the fuzzy methods is that they provide a gradual ranking of web metrics induced from data, suitable for the decision support of web managers. A real data example showed how to present the knowledge discovered by web usage mining graphically. The architecture of the proposed fuzzy web analytics model provides a theoretical framework to master the huge amount of Internet data companies are confronted with. To validate the discussed web analytics model, real web data from e-business practice has to be analyzed in future studies. In addition, further case studies with companies are planned to show the advantages and limitations of the fuzzy classification approach. The research center Fuzzy Marketing Methods (www.FMsquare.org) applies fuzzy classification to database technologies and online marketing. It provides several open source prototypes, for example the fuzzy Classification Query Language (fCQL) toolkit, which allows fuzzy queries and the calculation of membership degrees for data stored in MySQL or PostgreSQL databases.
REFERENCES & FURTHER READING

Clifton, B., 2008: Advanced Web Metrics with Google Analytics, Wiley, New York, USA.
Escobar-Jeria, V. H., Martín-Bautista, M. J., Sánchez, D., Vila, M., 2007: Web Usage Mining Via Fuzzy Logic Techniques. In: Melin, P., Castillo, O., Aguilar, I. J., Kacprzyk, J., Pedrycz, W. (Eds.), 2007: Lecture Notes in Artificial Intelligence, Vol. 4529, Springer, New York, USA, pp. 243-252.
Fasel, D., Zumstein, D., 2009: A Fuzzy Data Warehouse for Web Analytics, In: Proceedings of the 2nd World Summit on the Knowledge Society (WSKS 2009), September 16-18, Crete, Greece.
Galindo, J. (Ed.), 2008: Handbook of Research on Fuzzy Information Processing in Databases, Idea, Hershey, USA.
Google Analytics, 2009: http://www.google.com/analytics (accessed 12th of May 2009).
Inmon, W., Strauss, D., Neushloss, G., 2008: DW 2.0 – The Architecture for the Next Generation of Data Warehousing, Elsevier, New York, USA.
Kaufmann, M., Meier, A., 2009: An Inductive Fuzzy Classification Approach applied to Individual Marketing, In: Proceedings of the 28th North American Fuzzy Information Processing Society Annual Conference, Ohio, USA.
Kaushik, A., 2007: Web Analytics, Wiley, New York, USA.
Liu, B., 2007: Web Data Mining – Exploring Hyperlinks, Contents, and Usage Data, Springer, New York, USA.
Markow, Z., Larose, D., 2007: Data Mining the Web, Wiley, New York, USA.
Meier, A., Schindler, G., Werro, N., 2008: Fuzzy Classification on Relational Databases, In: (Galindo 2008, pp. 586-614).
Microsoft web usage data, 1998: http://archive.ics.uci.edu/ml/databases/msweb/msweb.html (accessed 12th of May 2009).
Mobasher, B., Cooley, R., Srivastava, J., 1997: Web Mining: Information and Pattern Discovery on the World Wide Web, In: Proc. of the 9th IEEE International Conf. on Tools with Artificial Intelligence (ICTAI'97).
Mobasher, B., 2007: Web Usage Mining, In: (Liu 2007, pp. 449-483).
Phippen, A., Sheppard, Furnell, S., 2004: A practical evaluation of Web analytics, Internet Research, Vol. 14, pp. 284-93.
Peterson, E., 2005: Web Site Measurement Hacks, O'Reilly, New York, USA.
Spiliopoulou, M., 2000: Web usage mining for web site evaluation. Communications of the ACM, Vol. 43, pp. 127-134.
Srivastava, J., Cooley, R., Deshpande, M., Tan, P.-N., 2000: Web Usage Mining: Discovery and Application of Usage Patterns from Web Data, In: ACM SIGKDD, Vol. 1, Is. 2, pp. 1-12.
Waisberg, D., Kaushik, A., 2009: Web Analytics 2.0: Empowering Customer Centricity, In: SEMJ.org, Vol. 2, available: http://www.semj.org/documents/webanalytics2.0_SEMJvol2.pdf (accessed 12th of May 2009).
Web Analytics Association, 2009: http://www.webanalyticsassociation.org/aboutus (accessed 12th of May 2009).
Weischedel, B., Huizingh, E., 2005: Website Optimization with Web Metrics: A Case Study, In: Proceedings of the 8th International Conference on Electronic Commerce (ICEC'06), August 14-16, Fredericton, Canada, pp. 463-470.
Werro, N., 2008: Fuzzy Classification of Online Customers, Dissertation, University of Fribourg, Switzerland, available: http://ethesis.unifr.ch/theses/downloads.php?file=WerroN.pdf (accessed 12th of May 2009).
Zadeh, L. A., 1965: Fuzzy Sets. In: Information and Control, Vol. 8, pp. 338-353.
Zimmermann, H.-J., 1992: Fuzzy Set Theory and its Applications, Kluwer, London, England.
Zimmermann, H.-J., 1997: Fuzzy Data Analysis. In: Computational Intelligence: Soft Computing and Fuzzy-Neuro Integration with Applications; Kaynak, O., Zadeh, L. A., Turksen, B., Rudas, I. J. (Eds.), Springer, New York, USA.
DATE-BASED DYNAMIC CACHING MECHANISM

Christos Bouras, Vassilis Poulopoulos
Research Academic Computer Technology Institute, N. Kazantzaki, Panepistimioupoli and
Computer Engineering and Informatics Department, University of Patras, 26504 Rion, Patras, Greece

Panagiotis Silintziris
Computer Engineering and Informatics Department, University of Patras, 26504 Rion, Patras, Greece
ABSTRACT

News portals based on the RSS protocol are nowadays becoming one of the dominant ways Internet users locate the information they are looking for. Search engines, which operate at the back-end of a large portion of these web sites, receive millions of queries per day on any and every walk of web life. While these queries are submitted by thousands of unrelated users, studies have shown that small sets of popular queries account for a significant fraction of the query stream. A second observation concerns the high frequency with which a particular user tends to submit the same or highly related search requests to the engine. Combining these facts, in this paper we design and analyze the caching algorithm deployed in our personalized RSS portal, a web-based mechanism for the retrieval, processing and presentation in a personalized view of articles and RSS feeds collected from major Internet news portals. Using moderate amounts of memory and little computational overhead, we are able to cache query results on the server, both for personalized and non-personalized user searches. Our caching algorithm does not operate in a stand-alone manner but co-operates with the rest of the modules of our portal in order to achieve maximum integration with the system.

KEYWORDS

Query results caching, search engine, personalized search, query locality, data retrieval, date-based caching.
1. INTRODUCTION

The technological advances in the World Wide Web, combined with the low cost and the ease of access to it from any place in the world, have dramatically changed the way people face the need for information retrieval. More and more users migrate from traditional mass media to more interactive digital solutions such as Internet news portals. As the number of users increases exponentially and the volume of data involved is high, it is very important to design efficient mechanisms that enable search engines to respond quickly to as many queries as possible. One of the most common design solutions is to use caching, which may improve efficiency if the cached queries recur in the near future. Regarding earlier classic works on the topic of caching, Markatos investigated the effectiveness of caching for Web search engines (Markatos E.P. 2001). The reported results suggested that there are important efficiency benefits from using caches, due to the temporal locality in the query stream. Xie and O'Hallaron also found a Zipf distribution of query frequencies (Xie and Halaron, 2002), where different users issue very popular queries, and longer queries are less likely to be shared by many users. On cache management policies, Lempel & Moran (Lempel and Moran, 2003) proposed one that considers the probability distribution over all queries submitted by the users of a search engine. Fagni et al. (2006) described a Static Dynamic Cache (SDC), where part of the cache is read-only or static and comprises a set of frequent queries from a past query log; the dynamic part is used for caching the queries that are not in the static part. Regarding locality in search queries, in their earlier works Jansen & Spink (Jansen and Spink, 2006) provide insights about short-term interactions of users with search engines and show that there is a great amount of locality in the submitted requests. Teevan et al. (2006) examined the search behaviour of more than a hundred anonymous users over the course of one year. The findings were that, across the year, about
one third of the user queries were repetitions of queries previously issued by the same user. Although these studies have not focused on caching search engine results, all of them suggest that queries have significant locality, which particularly motivates our work. In our work, we take advantage of this space and time locality and cache the results of very recently used queries in order to reduce the latency on the client and the database-processing load on the server. Because the caching is server-side, both registered and unregistered users of the portal can benefit. Furthermore, for registered users, the algorithm takes into account the dynamic evolution of their profile and provides them with results even closer to their preferences and interests. The rest of the paper is structured in the following way: in the next section, a description of the architecture of the system is presented, with the focus on the caching algorithm. The caching algorithm is analyzed in Section 3. In Section 4, we present some of the experimental results and the evaluation of our work, regarding algorithmic performance, result accuracy and storage space requirements. We conclude in Section 5 with some remarks about the described techniques and future work.
2. ARCHITECTURE

The architecture of the system is distributed and based on standalone subsystems, but the procedure to reach the desired result is actually sequential, meaning that the data flow follows the sequence of subsystems of which the mechanism consists. Another noticeable architectural characteristic is the modularity present throughout the system. This section describes how these features are integrated into the mechanism. We focus on the subsystem responsible for caching, though an analysis of the other modules is presented in order to cross-connect the features of our system.
Figure 1. Search Module Architecture
The general procedure of the mechanism is as follows: first, web pages are captured and useful text is extracted from them. Then, the extracted text is parsed, followed by summarization and categorization. Finally, the personalized results are presented to the end user. For the first step, a simple web crawler is deployed, which uses as input the addresses extracted from the RSS feeds. These feeds contain the web links to the sites where the articles exist. The crawler fetches only the HTML page, without elements such as referenced images, videos, CSS or JavaScript files. Thus, the database is filled with pages ready for input to the 1st level of analysis, during which the system isolates the "useful" text from the HTML source. Useful text contains the article's title and main body. This procedure is analyzed in (self reference). In the 2nd level of analysis, XML files containing the title and the body of articles are received as input, and pre-processing algorithms are applied to this text in order to provide as output the keywords, their location in the text and their absolute frequency in it. These results are the primary input to the 3rd level of analysis. In the 3rd level of analysis, the summarization and categorization technique takes place. Its main scope is to label the articles with a category and to come up with a summary of each. This procedure is described in (self reference).
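As a rough illustration of the kind of output the 2nd level of analysis produces, the following Python fragment indexes the keywords of a text together with their positions and absolute frequencies; the tokenisation and the stop-word list are assumptions, since the paper does not detail the pre-processing algorithms.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}   # illustrative list

def keyword_index(text):
    # Collect, for each non-stop-word token, its positions and frequency.
    positions = defaultdict(list)
    for pos, token in enumerate(text.lower().split()):
        word = token.strip(".,;:!?\"'()")
        if word and word not in STOP_WORDS:
            positions[word].append(pos)
    return {w: {"positions": p, "frequency": len(p)} for w, p in positions.items()}

print(keyword_index("The circuit of Monaco hosts the Monaco Grand Prix."))
# 'monaco' appears at positions [3, 6] with frequency 2
```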
In Figure 1, we can see the general schema and flow of the advanced and personalized search subsystem of our portal. The caching algorithm, which is the core of our work and will be analyzed in the next section, is used to search for identical queries submitted in the past by the same or other users; if matches are found, cached results are obtained directly, improving the search speed and reducing the workload on the server. The cached data is stored in the server's database.
3. ALGORITHMIC ASPECTS In this section of the paper, we shall analyze the caching algorithm of our search implementation.
3.1 Search Configuration and Keyword Stemming

Before the engine triggers the search procedure, the user first has to configure the query request. Apart from the specified keywords, a few other options are offered, including the date period for the target results (DATE FROM and DATE TO), the selection of the logical operation ("OR" or "AND") that will be performed in the article matching, and the thematic category of the desired output. Before proceeding with the query search operation, we should also note that the engine passes the keywords through a stemmer, which implements the Porter Stemming Algorithm for the English language. Thus, we enable the integration of the search engine with the rest of the system, which is implemented using stems rather than full words for the article categorization. Additionally, simple duplicate elimination is performed on the stems.
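As an illustration of this preprocessing step, the following Python fragment stems the submitted keywords and removes duplicates; NLTK's PorterStemmer is used here purely for demonstration, since the paper does not state which Porter implementation the portal uses.

```python
from nltk.stem import PorterStemmer

def prepare_keywords(query):
    # Stem each keyword and eliminate duplicates while keeping the input order.
    stemmer = PorterStemmer()
    stems = [stemmer.stem(word) for word in query.lower().split()]
    return list(dict.fromkeys(stems))

print(prepare_keywords("circuit circuits formula"))   # ['circuit', 'formula']
```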
3.2 Caching Algorithm

Prior to searching for the result articles, the system searches for cached data from previous search sessions. All cached data is stored in the server's storage space and the caching algorithm also operates in the server's memory, so the procedure described below benefits both registered users (members) and unregistered users (guests) of the portal without creating any computational overhead for their machines. For each submitted query, we store in a separate table in our database information about the configuration of the search request. This information includes the id of the user who submitted the search, the exact time of the search (as a timestamp), the keywords used in the query formatted as a comma-separated list, and a string containing information about the desired category of the results and the logical operation selected for the matching. For the above data, our caching algorithm operates in a static manner. For example, if a user submits a query containing the keywords "nuclear technology" and selects "science" as the target category for the returned articles, this query will not match an existing (cached) query that contains the same keywords but was cached for results in the "politics" category. Also, when a query containing more than one keyword is submitted, it will not match cached queries containing subsets or supersets of the keyword set of the submitted query. For example, if the incoming query contains the keywords "Monaco circuit formula", probably referring to the famous Grand Prix race, it will not be considered the same as a cached query containing the keywords "circuit formula", which probably refers to an electrical circuit formula of physics. This implementation decision was taken in order to avoid semantic ambiguities in the keyword matching process. The dynamic logic of our caching algorithm lies in the target date intervals of a search request, which are represented by the DATE FROM and DATE TO fields in the search configuration form of the portal. This perspective of caching was chosen after considering the fact that it is very common for many web users to submit identical queries repeatedly in the same day or during a very short period of days. The algorithm designed for this reason takes into account the following four cases for the cached and the submitted DATE FROM and DATE TO fields (a minimal sketch of this date-interval logic is given at the end of this subsection):
• 1st Case: the DATE FROM-TO interval of the submitted query is a subset of the DATE FROM-TO interval of the cached query. In this case, we have the whole set of the desired articles in our cache, plus some articles outside the requested date interval. The implementation fetches all the cached results and filters out the articles which were published before the DATE FROM and those which were published after the DATE TO
attribute of the submitted request. The server's cache is not updated with new articles because in this case no search is performed in the articles database. • 2nd Case: the DATE FROM of the submitted query is before the DATE FROM of the cached query and the DATE TO of the submitted query is after the DATE TO of the cached query. In this case, the desired articles are a superset of the articles cached in the database. As a consequence, the algorithm fetches all the cached results but also performs a new search for articles in the date intervals before and after the cached date interval. When the search procedure finishes, the algorithm updates the cache by extending it to include the new articles and by changing the DATE FROM and DATE TO attributes so that they can be properly used for future searches. • 3rd Case: the DATE FROM of the submitted query is before the DATE FROM of the cached query and the DATE TO of the submitted query is between the DATE FROM and DATE TO of the cached query. In this case, a portion of the desired articles exists in the cache. The algorithm first fetches all the cached results and then filters out the articles published after the DATE TO of the submitted request. Furthermore, a new search is initiated for the articles not existing in the cache memory; for this search, the DATE FROM and DATE TO dates become the DATE FROM date of the submitted query and the DATE FROM date of the cached query respectively. • 4th Case: the fourth case is similar to the third but in the opposite date direction. The final results consist of the cached results between the DATE FROM date of the submitted request and the DATE TO date of the cached request, plus the new articles coming from a new search between the DATE TO date of the cached query and the DATE TO date of the submitted query. We should note that an expiration mechanism is deployed for the cached result data. Every cached query is valid for a small number of days, in order to keep the engine's output to the end user as accurate as possible. Whenever a match against the cached results is attempted, cached data that have expired are deleted from the database and replaced with new ones. It is also possible for the same query to have more than one cached record, as long as they have not expired. The selection of the proper expiration time for the cached data is discussed in Section 4.3. By examining the cache matching algorithm, which operates on the server machines, we can see that in all four cases we manage to limit the computational overhead of a fresh search in the database by replacing it with some overhead for filtering the cached results. However, this filtering is implemented with simple XML parsing routines and cannot be considered a heavy task for the server. The most significant improvement happens in the first case, where no new search is performed and all the results are fetched directly from the cache. This is a great benefit to our method as this is the most common case, where the user submits the same query over and over without changing the DATE FROM and DATE TO fields, or while shrinking the desired date borders. The worst case is the second, where the user expands his query in both time directions (before and after) in order to get more results. In this case, the engine has to perform two new searches, followed by an update query in the database cache.
However, this is the rarest case, as the average user tends to shrink the date interval rather than expand it when repeatedly submitting an identical query in order to get more date-precise and date-focused results. In the other two situations, one new search is executed each time and one update is committed to the database. This means that, in the average case, we can save more than 50% of the computational overhead when the expansion of the date borders in the newly submitted query is not bigger than the date interval of the cached results.
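To make the four cases concrete, the sketch below classifies a submitted date interval against a cached one. The function and field names are hypothetical and only illustrate the interval logic described above, not the portal's actual implementation.

from datetime import date

def classify_interval(sub_from, sub_to, cached_from, cached_to):
    """Return which of the four caching cases applies (hypothetical helper)."""
    if sub_from >= cached_from and sub_to <= cached_to:
        return 1  # cache is a superset: filter cached results only, no new search
    if sub_from < cached_from and sub_to > cached_to:
        return 2  # cache is a subset: two new searches (before and after), then extend the cache
    if sub_from < cached_from and cached_from <= sub_to <= cached_to:
        return 3  # overlap on the left: one new search for [sub_from, cached_from]
    if cached_from <= sub_from <= cached_to and sub_to > cached_to:
        return 4  # overlap on the right: one new search for [cached_to, sub_to]
    return None   # disjoint intervals: treat as a fresh, uncached search

# Example: the submitted interval lies inside the cached one, so case 1 applies.
print(classify_interval(date(2009, 3, 1), date(2009, 3, 15),
                        date(2009, 2, 1), date(2009, 4, 1)))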
4. EXPERIMENTAL EVALUATION
In our experiments to evaluate the caching algorithm described in the previous section, we created a virtual user that submits queries to the server. The executed queries consist of keywords from several thematic categories (sports, science, politics, etc.) used throughout the articles database of our system. We chose to test caching performance on queries containing no more than three keywords, so that the output contains a large number of articles and the overall procedure lasts long enough for our time measurements to support analysis and conclusions.
4.1 Caching Algorithm
In the previous section of this paper, we analyzed the way in which the algorithm tries to match a submitted query with an identical cached record. During the experiment, we tested several queries, requesting articles from different categories and covering the period of the last six months. In the first phase, we used an empty cache memory and the server was configured with the caching feature disabled. As expected, queries consisting of very focused and specific keywords were processed very quickly. These queries are not of high interest for our analysis, as the number of articles containing such keywords is always quite limited and requires little computational time to process. The major problem exists with queries consisting of generic keywords, which can be found in a plethora of articles in the database. This class of queries makes heavier use of system resources and is a good starting point for evaluating our method. In Figure 2, we can examine the effect of caching on execution speedup for three generic queries (‘sports’, ‘computers’, ‘health or body’), each of which returned over 5000 articles. The graphic depicts the time in seconds that the system needed to fetch the matching output from the database. The cases considered in this figure are cases 2, 3 and 4 of our algorithm, where only a subset of the results for the submitted query exists in the cache memory and the system must initiate a new search in the database to fetch articles for the missing date periods. The date period for which the results were cached in the first place, before the actual queries were submitted, was a random number of days varying from 60 to 90. The actual query to be evaluated required articles published in the last 180 days. This means that the system still had to search for more articles than the number it had already stored in its cache memory. In the results presented, we can notice that in some situations the benefit reached almost half of the time needed without caching. As expected, the worst case is case 2, where two new un-cached searches have to be executed, one before and one after the date period of the cached set. After that, we end up with three different sets of articles which, prior to presenting them to the end user, have to be re-sorted according to their degree of relevance to the initial query. For cases 3 and 4, the results are almost similar. The higher times in case 4 could be a consequence of a possibly high concentration of desired articles in the date period for which the new search was initiated, combined with a reduced concentration of articles in the date period stored in the cache memory of the server.
Figure 2. Time in seconds for un-cached searches and cached searches for Cases 2, 3 and 4
In the execution times measured throughout the experiment, an average of 0.1 seconds was needed to fetch the articles from the cache memory, which is on average almost 3% of the overall time needed. Another 2% of the time was spent on re-sorting the two or three sets of results according to their relevance to the query, in order to present them to the end user in the right order of relevance. This said, it is expected for
case 1 of our algorithm to achieve an almost 95% speedup of the search. After the first execution of these queries, every subsequent submission of the same request is serviced in under 0.1 seconds. Whenever results are cached for a query, every following identical one that requests articles inside the date period of the cached result will be processed in almost zero time (only the time needed to fetch the results from the cache); no re-sorting is required in this case as we have only one already sorted set of articles. This reduces the computational overhead on the server for time-demanding queries to the cost of the search procedure for only the first time they are executed. Every subsequent time, they are processed through the cache memory and the algorithm operating on it.
4.2 Cache Memory Size
Our second concern was to examine how the number of cached articles per query affects the overall algorithm performance and the size of the database table used to store the cached data. We executed a generic query for several numbers of cached articles, each time increasing the date period over which the caching occurred. The total number of articles for this query was 4782 over a period of 4 months. For this experiment, we tested cases 2, 3 and 4 of our algorithm, so that in every submitted request a part of the results was not contained in the cache and the engine could not rely only on the cached data to create the output.
Figure 3. How the number of cached articles affects the speedup of a new search
From the graphical representation of Figure 3, relating the percentage of execution time speedup to the percentage of cached results over total results, we can notice that the search execution time is reduced by an average of 50% when a little less than 40% of the output has been cached. As the total number of articles in this test covered a period of four months, we can say, statistically, that 40% of the results would be retrieved by a search over a period of less than two months, which is a rather limited date interval for a common search. This means that if a user submitted a search query requesting articles for a period of more than two months, then every subsequent identical request would take at most half the time to be processed. If we add to this the fact that the algorithm updates the cache memory with new results every time an extended (in terms of dates) version of an already cached query is submitted (the percentage of cached results can only increase, never decrease, with every search), we can get even better execution times. Due to the fact that the algorithm stores in the cache, for each query, only a limited set of information about the retrieved articles, such as ids, dates and relevance factors, the size of the cache per record in the server memory is kept at a minimum. As an example, for caching the 4782 results of the above query, which is a rather generic one with many relevant articles, the corresponding row size in the cache database table was measured to be less than 150KB. If we combine the small row size with the periodic deletion of
cached query records that expire, the technique can guarantee low storage space requirements on the server. The selection of the expiration date is discussed in the next section.
4.3 Expiration Date and Result Accuracy
In the last phase of the experiment, we examine the impact of selecting a proper expiration time for the cached records on the accuracy and quality of the final output to the end user. As mentioned in the previous section, the proposed algorithm periodically deletes cached records from the corresponding table in the database. The implementation of such an expiration mechanism is essential not only because it helps in keeping the storage space of the server's cache low, but mainly for keeping the accuracy and quality of the search results at high levels. Our purpose in this last step of the experiment is to examine how extending the expiration time of the cached records degrades the accuracy of the output. For this reason, we created a virtual user and constructed a profile for him with favourite thematic categories and keywords. Having no cached data for this user on the first day of the experiment, we had him submit several queries to the system and we cached the results for some of these queries. For the following days of the month, we had the user navigate inside the system by submitting several different queries every day, this time without caching any of them or expanding the already existing cached results. Among the submitted queries, we included queries identical to the cached ones so that a comparison would be feasible. The personalization mechanism of the portal takes into account the daily behavior of each registered user (articles he reads, articles he rejects, time spent on each article) and dynamically evolves the profile of the user. For example, it is possible for a user to choose sports as his favourite category upon registration, but he may occasionally show an increased interest in science-related news. The personalization system then evolves his profile and starts feeding him scientific news among the sports news, and this evolution has an obvious impact on his search sessions inside the system.
Figure 4. How extending date expiration affects results accuracy
In Figure 4, we can see how the accuracy of the search results degrades as the days pass, when comparing the actual search results with the cached ones. For our virtual user, on the first day the average accuracy is obviously 100%, as it is the day on which the actual queries are cached. Every following day, we obtain the (uncached) results of the actual queries that have relevance over 35% to the submitted requests, and we count, on average, how many of them existed in the cached query results. As time passes, the output of the actual queries changes (according to the user's evolving profile) and the average percentage of the cached results in the actual output decreases. Until the tenth day of the experiment, the accuracy stays close to 90% of the actual results. However, after the first two weeks the accuracy degrades to 70%, and toward the end of the third week it is close to 55%. In other words, if the user were presented at this point with the cached results (cached on the first day) instead of the actual results for his queries, he would see only about half of the results that match his profile as it has changed since the first day.
In conclusion, caching the results of a search for more than two weeks is not a preferable solution for a registered user, as it may produce significantly outdated results that do not match his evolving profile and preferences. However, for unregistered users (guests) of the system, for whom no profile has been formed, an extended expiration date could be used. In our implementation, there is a distinction between registered and unregistered users when checking for cached data, which makes the caching algorithm more flexible.
5. CONCLUSION AND FUTURE WORK
Due to the dynamism of the Web, the content of web pages changes rapidly, especially for a mechanism that fetches more than 1500 articles on a daily basis and presents them, personalized, back to the end user. Personalized portals offer the opportunity for focused results; however, it is crucial to create accurate user profiles. Based on the profiles created by this mechanism, we built a personalized search engine for our web portal system in order to enhance the search procedure for both registered and non-registered users. In this paper, we discussed the caching algorithm of the advanced search sub-system. We presented the algorithmic procedure that we follow in order to cache the results, and showed how the cache is utilized to speed up the search procedure. Finally, we described experimental procedures that demonstrate the aforementioned enhancement in speed. Comparing the results to those of the generic search, it is obvious that the system is able to enhance the search procedure and help users locate the desired results more quickly. In the future, we would like to further enhance the whole system with a more accurate search personalization algorithm supporting “smarter” caching of data, in order to make the whole procedure faster and to omit results that are of very low user interest.
GENETIC ALGORITHM TO DETERMINE RELEVANT FEATURES FOR INTRUSION DETECTION Namita Aggarwal VIPS, Guru Gobind Singh Indraprastha University, Delhi
R K Agrawal School of Computer & Systems Science, Jawaharlal Nehru University, New Delhi
H M Jain Trinity College, Guru Gobind Singh Indraprastha University, Delhi
ABSTRACT Real-time identification of intrusive behavior based on training analysis remains a major issue due to the high dimensionality of the feature set of intrusion data. The original feature set may contain irrelevant or redundant features. There is a need to identify relevant features for better performance of intrusion detection systems in terms of classification accuracy and the computation time required to detect intrusions. In this paper, we propose a wrapper method based on a Genetic Algorithm in conjunction with a Support Vector Machine to identify relevant features for better performance of an intrusion detection system. To achieve this, a new fitness function for the Genetic Algorithm is defined that focuses on selecting the smallest set of relevant features which provides maximum classification accuracy. The proposed method provides better results in comparison to other commonly used feature selection techniques. KEYWORDS
Feature selection, Support Vector Machine, Intrusion Detection, Genetic Algorithm.
1. INTRODUCTION
In the last two decades, networking and Internet technologies have undergone phenomenal growth. This has exposed the computing and networking infrastructure to various risks and malicious activities. The need of the hour is to develop strong security policies which satisfy the main goals of security (Hontanon, 2002), i.e. data confidentiality, data integrity, user authentication and access control, and availability of data and services. The most important factor in all of these is the identification of any form of activity as secure or abusive. Intrusion detection is the art of detecting these malicious, unauthorized or inappropriate activities. Intrusion Detection Systems (IDS) have been broadly classified into two categories (Lee, 1999): misuse detection and anomaly detection. Misuse detection systems identify attacks which follow well-known patterns. Anomaly detection systems are those which have the capability to raise alarms in case of patterns showing deviations from normal usage behavior. Various data mining techniques have been applied to intrusion detection because they have the advantage of discovering useful knowledge that describes a user's or program's behavior from large audit data sets. Artificial Neural Networks (Cho and Park, 2003; Lippmann and Cunningham, 2000), Rule Learning (Lazarevic et al., 2003), Outlier Detection schemes (Han and Cho, 2003), Support Vector Machines (Abraham, 2001; Sung and Mukkamala, 2003), Multivariate Adaptive Regression Splines (Banzhaf et al., 1998) and Linear Genetic Programming (Mukkamala et al., 2004) are the main data mining techniques widely used for anomaly and misuse detection. Most of the research has been carried out on the kddcup DARPA data. The real-time accurate identification of intrusive behavior based on training analysis remains a major issue due to the high dimensionality of the kddcup data (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). Although the data has 41 attributes, not all of them may contribute to the identification of a given attack. In fact, the presence of irrelevant and redundant
features may deteriorate the performance of the classifier and require high computation time and other resources for training and testing. Hence, in order to make the classifier accurate and efficient, we need to identify a set of relevant features for better performance of the IDS. However, feature selection techniques, though extensively studied and widely employed in many real-time applications, have been used scarcely in the intrusion detection domain. Hyvarinen et al. (2001) used PCA/ICA to compress data, which did not yield satisfactory feature reduction from an intrusion detection perspective. Sung and Mukkamala (2003) determined a smaller set of 19 features for the intrusion detection task without compromising the performance in terms of classification accuracy. Chebrolu et al. (2004) determined subsets of 12 and 17 features using feature selection algorithms involving Bayesian networks and Classification and Regression trees respectively. In this paper, we propose a new fitness function for a Genetic Algorithm to identify a relevant feature subset. The proposed method selects the smallest set of relevant features that provides maximum classification accuracy. In order to check the efficacy of the proposed method for feature selection, we compare our experimental results with other commonly used feature selection methods in machine learning and data mining. In Section 2, the various feature selection techniques are discussed. Section 3 deals with the proposed approach to identify relevant features for intrusion detection. Section 4 briefly describes the Multi-class SVM. Section 5 gives a detailed description of the experimental setup and results. Finally, the conclusions and future work are discussed in Section 6.
2. FEATURE SELECTION TECHNIQUES
High dimensional feature sets describing real life data in general contain noisy, irrelevant and redundant features which make them inefficient for machine learning. So it becomes important to identify a small subset of relevant features to improve the efficiency and performance of learning algorithms. Thus, the removal of noisy, irrelevant, and redundant features is a challenging task. There are two major approaches to overcome this: feature selection and feature extraction (Devijver and Kittler, 1982). While feature selection reduces the feature set by eliminating the features inadequate for classification, feature extraction methods build a new feature space from the original features. In the literature, feature selection techniques are classified into two categories: filter methods and wrapper methods (Kohavi and John, 1997; Langley, 1994). The filter model depends on general characteristics of the training data to select a feature set without involving any learning algorithm. Most filter methods have adopted some statistical feature selection criterion to determine the relevant feature subset, which requires less computation time. The search used for feature subset selection cannot be exhaustive; hence one has to settle for a suboptimal solution. However, the choice of criterion for evaluating the feature subsets is a sensitive issue: it has to estimate the usefulness of a subset accurately and economically. In the literature, various suboptimal feature selection search techniques are suggested. Among them, the simplest method of constructing a subset of relevant features is to select the d individually best features present in the original feature set. Features are ranked based on some statistical properties of the given data. Information gain, Gini index and Relief-F are among the most commonly used approaches for evaluating the ranking of features.
2.1 Information Gain
The Information Gain is a measure based on entropy which is popularized in machine learning by Quinlan (1986). It measures the decrease of the weighted average impurity of the partitions compared with the impurity of the complete set of examples. The expected information needed to classify a given sample is calculated as

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log\left(\frac{s_i}{s}\right)    (1)

where s_i (i = 1, 2, \ldots, m) is the number of samples of class i and s is the total number of samples in the training set.
A feature F with values {f_1, f_2, \ldots, f_v} can divide the training set into v subsets {S_1, S_2, \ldots, S_v}, where S_j is the subset which has the value f_j for feature F. Assuming S_j contains s_{ij} samples of class i, the entropy of the feature F is given by

E(F) = \sum_{j=1}^{v} \frac{s_{1j} + \ldots + s_{mj}}{s} I(s_{1j}, \ldots, s_{mj})    (2)

The information gain for feature F can then be calculated as

Gain(F) = I(s_1, \ldots, s_m) - E(F)    (3)
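As a rough illustration of equations (1)–(3), the sketch below computes the information gain of a single discrete feature from class labels; it is our own minimal example, not code from the paper.

import math
from collections import Counter

def entropy(labels):
    """I(s1,...,sm): entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Gain(F) = I(S) - E(F) for one discrete feature."""
    total = len(labels)
    expected = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        expected += (len(subset) / total) * entropy(subset)   # E(F)
    return entropy(labels) - expected

# Toy example: a feature that separates the two classes perfectly has gain 1 bit.
print(information_gain(['tcp', 'tcp', 'udp', 'udp'], ['normal', 'normal', 'dos', 'dos']))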
2.2 Gini Index
The Gini index is another popular measure for feature selection in the field of data mining, proposed by Breiman et al. (1984). It measures the impurity of a given set of training data and can be calculated as

Gini Index = \frac{n_l \cdot Gini_L + n_r \cdot Gini_R}{n}    (4)

where

Gini_L = 1.0 - \sum_{i=1}^{k} \left(\frac{L_i}{n_l}\right)^2  and  Gini_R = 1.0 - \sum_{i=1}^{k} \left(\frac{R_i}{n_r}\right)^2

Gini_L and Gini_R are the Gini indices on the left and right side of the hyperplane respectively, L_i and R_i are the numbers of values belonging to class i in the left and right partition respectively, n_l and n_r are the numbers of values in the left and right partition respectively, k is the total number of classes, and n is the total number of values.
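For illustration only, the following sketch evaluates equation (4) for a binary split of class labels; the helper names are ours and the code is not taken from the paper.

from collections import Counter

def gini(labels):
    """Gini impurity of one partition: 1 - sum((count_i / n)^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(left_labels, right_labels):
    """Weighted Gini index of a split, as in equation (4)."""
    n_l, n_r = len(left_labels), len(right_labels)
    return (n_l * gini(left_labels) + n_r * gini(right_labels)) / (n_l + n_r)

# A split that isolates one class on the left gives a lower (purer) index.
print(split_gini(['dos', 'dos', 'dos'], ['normal', 'normal', 'probe']))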
2.3 Relief-F
The Relief algorithm (Kira and Rendell, 1992) is a feature selection method based on feature estimation. Most feature estimators are unable to detect any form of redundancy, as the features are evaluated individually. However, Relief does not assume conditional independence of features and hence captures their interdependencies, which makes Relief a suitable choice for the task at hand. Relief assigns a positive weight to those features which have different values on pairs that belong to different classes, and a negative weight to those features that have different values on pairs that belong to the same class. Relief randomly selects one tuple and finds its nearest neighbors: one from the same class and the other from a different class. It computes the importance measure of every feature using these neighbors. Relief is limited to two-class problems and cannot deal with noisy and incomplete data, whereas its extension by Robnik and Kononenko (2003) is more robust and can deal with noisy and incomplete data and with multi-class problems. The disadvantage of ranking methods is that the features may be correlated among themselves (Ding and Peng, 2003). Sequential forward feature selection is another simple bottom-up search approach used in the literature, where one feature at a time is added to the current subset of relevant features. The relevance of a feature subset can be measured in terms of some distance metric, e.g. interclass distance, probabilistic distance, etc. The most commonly used metrics in the literature are the Euclidean distance, the Mahalanobis distance and the Inter-Intra distance.
2.4 Euclidean Distance
In this method, we are trying to find a subset of features X, for which X ⊂ Y, where Y is the entire feature set. The subset X is chosen such that it optimizes some criterion J. J(x) can be easily calculated as

J(x) = \max\{(\mu_i - \mu_j)'(\mu_i - \mu_j)\}    (5)

for all classes i, j where i ≠ j, and \mu_i is the mean feature vector of class i.
2.5 Mahalanobis Distance
This method is similar to the Euclidean distance, but for the Mahalanobis distance J(x) is calculated as

J(x) = \max\{(\mu_i - \mu_j)' \Sigma_{ij}^{-1} (\mu_i - \mu_j)\}, \quad \Sigma_{ij} = \pi_i \Sigma_i + \pi_j \Sigma_j    (6)

Here, \pi_k is the a priori probability of class k and \Sigma_k is the variance-covariance matrix of class k for the given feature vector.
2.6 Inter-Intra Distance This feature selection method is based on inter–intra class distance ratio (ICDR). The ICDR is defined as: ICDR = log |SB + SW| / |SW|
(7)
Where SB is the average between-class covariance matrix and SW is the average within-class covariance matrix. These can be estimated from the data set as follows: m
SB =
∑
p(ci) (μ(i) − μ) (μ(i) − μ)t
(8)
p(ci)W(i)
(9)
i=1
m
SW =
∑ i=1
W(i) = 1/ (ni−1)
and
ni
∑
(Xj(i) −μ(i)) (Xj(i) −μ(i))t
(10)
i =1
where p(c_i) is the a priori probability of class i, n_i is the number of patterns in class i, X_j^{(i)} is the j-th pattern vector from class i, \mu^{(i)} is the estimated mean vector of class i, and m is the number of classes. Since the filter approach does not take into account the learning bias introduced by the final learning algorithm, it may not be able to select the most suitable set of features for that algorithm. On the other hand, the wrapper model requires one predetermined learning algorithm in the feature selection process. Features are selected based on how they affect the performance of the learning algorithm. For each new subset of features, the wrapper model needs to learn a classifier. It tends to find features better suited to the predetermined learning algorithm, resulting in superior learning performance. However, the wrapper model tends to be computationally more expensive than the filter model. SVM_RFE (Golub et al., 1999) belongs to the category of wrapper approaches.
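As a minimal sketch of equations (7)–(10), assuming the data are given as a numpy matrix with one row per pattern and an integer class label per row (our own example, not the authors' code):

import numpy as np

def icdr(X, y):
    """Inter-intra class distance ratio: log(|S_B + S_W| / |S_W|)."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c, p in zip(classes, priors):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += p * diff @ diff.T                 # between-class part, eq. (8)
        S_W += p * np.cov(Xc, rowvar=False)      # within-class part, eqs. (9)-(10)
    return np.log(np.linalg.det(S_B + S_W) / np.linalg.det(S_W))

# Toy example with two well-separated Gaussian classes in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(icdr(X, y))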
2.7 SVM_RFE
Rather than using a prior ranking of features, one can determine and eliminate redundant features based on the weight vector of a Support Vector Machine (SVM) classifier. During the training process, when a linear kernel is employed, the weight vector of the SVM is given by

W = \sum_{i=1}^{m} y_i \alpha_i x_i    (11)

where y_i is the class label of the i-th training tuple x_i and \alpha_i is the coefficient of the corresponding support vector. The smallest components of the weight vector have the least influence on the decision function and are therefore the best candidates for removal. The SVM is trained with the current set of features and the best candidate feature for removal is identified via the weight vector. This feature is removed and the process is repeated until termination.
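A hedged sketch of this recursive elimination loop using scikit-learn follows; the library choice, parameter values and synthetic data are our assumptions, since the authors do not state which implementation they used.

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic stand-in for the kddcup data: 41 features, only a few informative.
X, y = make_classification(n_samples=500, n_features=41, n_informative=8,
                           random_state=0)

# RFE repeatedly trains a linear SVM and drops the feature with the smallest
# weight-vector component until the requested number of features remains.
selector = RFE(estimator=LinearSVC(dual=False), n_features_to_select=22, step=1)
selector.fit(X, y)

print("selected feature indices:", [i for i, kept in enumerate(selector.support_) if kept])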
3. PROPOSED APPROACH FOR FEATURE SELECTION
A genetic algorithm (GA) is a search technique used for computing true or approximate solutions to optimization and search problems (Goldberg, 1989). GAs employ evolutionary biology phenomena such as inheritance, mutation, selection, and crossover. Genetic algorithms are implemented as a computer simulation in which a population of abstract representations (called chromosomes) of candidate solutions (called individuals) to an optimization problem evolves toward better solutions. In general, the evolution starts from a population of randomly generated individuals and proceeds in generations. In each generation, the fitness of every individual in the population is evaluated. Based on their fitness, multiple individuals are stochastically selected from the current population and modified (recombined and possibly randomly mutated) to form a new population. The next iteration is carried out with the newly obtained population. Generally, the algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population. If the algorithm has terminated due to a maximum number of generations, a satisfactory solution may or may not have been reached. A typical genetic algorithm requires two things to be defined: (i) a genetic representation of the solution domain and (ii) a fitness function to evaluate the solution domain. In the proposed method, each chromosome represents a set of features selected for classification. It is represented by a sequence of M 0's and 1's: a 1 means that the corresponding feature is selected and a 0 indicates that the corresponding feature is not selected. The initial population is generated randomly. The performance of an IDS can be measured in terms of its classification accuracy. Since at the same time we want to remove irrelevant or redundant features, the problem becomes multi-objective: it is desired to increase the classification accuracy as well as to minimize the number of features. For this we define a new fitness function which takes care of both. The fitness of chromosome x is calculated as

Fitness(x) = A(x) + P / N(x)    (12)

where A(x) is the classification accuracy using chromosome x, N(x) is the size of the feature set present in chromosome x (the number of 1's in chromosome x), P = 100 / (M × number of test samples used in the classifier) and M is the total number of features. The value of P ensures that the number of features is not minimized at the cost of accuracy.
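A minimal sketch of this fitness function and the bit-string encoding follows, assuming a generic evaluate_accuracy(features) callback that trains and tests a classifier on the selected feature columns; the helper is hypothetical and not the authors' code.

import random

M = 41                 # total number of features in the kddcup data
N_TEST = 4031          # number of test samples
P = 100.0 / (M * N_TEST)

def fitness(chromosome, evaluate_accuracy):
    """Fitness(x) = A(x) + P / N(x), as in equation (12)."""
    selected = [i for i, bit in enumerate(chromosome) if bit == 1]
    if not selected:                          # an empty feature set is useless
        return 0.0
    accuracy = evaluate_accuracy(selected)    # classification accuracy in percent
    return accuracy + P / len(selected)

# Random initial population of bit-string chromosomes.
population = [[random.randint(0, 1) for _ in range(M)] for _ in range(10)]
print(fitness(population[0], lambda feats: 80.0))   # dummy accuracy for illustration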
4. MULTI-CLASS SVM
Support Vector Machines, as conceptualized by Vapnik (1995), are basically binary classification models based on the concept of the maximum separation margin between two classes. Extension to the multi-class domain is still an ongoing research issue, although a number of techniques have been proposed. The most popular among them are (Hsu and Lin, 2002): i) One-against-all and ii) One-against-one. The One-against-all method constructs k binary SVMs, where k is the number of classes under consideration. The ith SVM is trained with all the data belonging to the ith class given positive labels and all data belonging to the other classes given negative labels. Given a point X to classify, the binary classifier with the largest output determines the class of X. In One-against-one, the number of binary SVMs is k(k-1)/2, i.e., one model for each combination of two classes from the k given classes. The outputs of the classifiers are aggregated to decide the class label of X.
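For reference, a One-against-all RBF-kernel SVM can be sketched with scikit-learn as below; the library choice, the synthetic data and the parameter values are our assumptions, not the paper's setup.

from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic 5-class stand-in for the normal/probe/DOS/U2R/R2L problem.
X, y = make_classification(n_samples=1000, n_features=41, n_informative=10,
                           n_classes=5, random_state=0)

# One binary RBF SVM per class; the classifier with the largest output wins.
model = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale"))
model.fit(X[:800], y[:800])
print("test accuracy:", model.score(X[800:], y[800:]))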
5. EXPERIMENTAL SETUP
To check the efficacy of our proposed GA approach for IDS, we compare its performance with different filter feature selection methods and one commonly used wrapper approach, i.e. SVM_RFE. We apply the following feature selection methods to the kddcup dataset. The ranking of features is carried out using
Information gain, Gini index, and Relief-F. Euclidean, Mahalanobis and Inter-Intra distance metrics are used in the sequential forward feature selection search method to determine a subset of relevant features. The original kddcup 1999 training data set contains records belonging to one of five major classes: (1) normal, (2) probe, (3) DOS, (4) U2R and (5) R2L. The actual data set contains five million records. Since the classification technique and the feature selection methods require more computational time to learn from a large training set, we sampled the given data for tractable learning. We constructed a training file containing 4043 records randomly drawn from the training data set. A testing file was also created by randomly selecting 4031 records from the testing data set, to check the performance of our classifier on unseen data. The description of our training file and testing file is shown in Table 1. The Genetic Algorithm parameters used for feature subset selection in our experiments are given in Table 2.
Table 1. Description of Tuples in Training File and Testing File
Attack Type     Number of tuples in Training File     Number of tuples in Testing File
Normal          998                                   998
Probe           998                                   998
DOS             998                                   998
U2R             51                                    39
R2L             998                                   998
Total           4043                                  4031
Table 2. Parameters of Genetic Algorithm

Parameter                 Value
Size of population        10
Length of chromosome      41
Number of generations     500
Crossover rate            0.98
Mutation rate             0.02
[Figure 1 plots classification accuracy (%) against the number of attributes (1 to 41) for Info Gain, Gini Index, Relief-F, Euclidean, Mahalanobis, Inter-Intra and SVM_RFE.]
Figure 1. Comparison of Classification Accuracy with Number of Features using Different Methods
An SVM classifier is used to determine the classification accuracy. We have used the One-against-all approach for the Multi-class SVM, with an RBF kernel (K(x,y) = exp[-γ ||x−y||^2]) in our experiments. After determining the feature subsets, we performed experiments to observe the variation of classification accuracy with the number of features for the different feature selection techniques. This variation is shown in Figure 1. The maximum classification accuracy achieved, together with the corresponding minimum number of features for each feature selection method, is given in Table 3. From Table 3, it can be observed that the performance of SVM_RFE is comparatively better than that of the other approaches.
Table 3. Maximum Classification Accuracy with Minimum Features

Method          Classification Accuracy     Minimum number of features
Info Gain       80.92%                      7
Gini Index      74.7%                       3
Relief-F        79.53%                      21
Euclidean       75.07%                      5
Mahalanobis     76.09%                      20
Inter-Intra     78.02%                      35
SVM_RFE         83.08%                      22
Table 4. Classification Accuracy using Genetic Algorithm

S. No     No. of Features     Classification Accuracy
1         19                  84.32
2         22                  82.06
3         13                  83.70
4         18                  85.59
5         21                  85.54
6         16                  85.56
7         23                  85.46
8         14                  85.81
9         17                  84.37
10        20                  85.81
We ran the GA in conjunction with the SVM 50 times. It was observed that the selected feature subset may not be the same in different runs, because the GA is a stochastic method. The distinct results obtained over the different runs are shown in Table 4. We can observe that the variation in classification accuracy among the runs is not significant. It can also be observed that the GA is able to achieve a maximum classification accuracy of 85.81% using only 14 features, which outperforms SVM_RFE both in terms of classification accuracy and number of features. The GA thus achieves better classification accuracy in comparison to the other feature selection methods. It is also observed that the minimum number of features required to achieve maximum classification accuracy is not the same for the different feature selection methods. The minimum number of features required by the GA is 14, which is more than for Information Gain, Gini Index and Euclidean, and less than for Relief-F, Mahalanobis, Inter-Intra and SVM_RFE.
6. CONCLUSIONS
For real-time identification of intrusive behavior and better performance of an IDS, there is a need to identify a set of relevant features, because the original feature set may contain redundant or irrelevant features which degrade the performance of the IDS. In this paper, we have proposed a wrapper method based on a Genetic Algorithm in conjunction with an SVM. A new fitness function for the genetic algorithm is defined that focuses on selecting the smallest set of relevant features that can provide maximum classification accuracy. The performance of the proposed method is evaluated on the kddcup 1999 benchmark dataset. We have compared the performance of the proposed method, in terms of classification accuracy and number of features, with other feature selection methods. From the empirical results, it is observed that our method provides better accuracy in comparison to the other feature selection methods. The results obtained show a lot of promise in modeling IDS. A reduced feature set is desirable to reduce the time requirements of real-time scenarios. Identification of a precise feature set for different attack categories is an open issue. Another direction can be the reduction of the misclassification regions in the multi-class SVM.
REFERENCES Abraham, A 2001, ‘Neuro-fuzzy systems: state-of-the-art modeling techniques, connectionist models of neurons, learning processes, and artificial intelligence’, In: Mira Jose, Prieto Alberto, editors. Lecture notes in computer science. vol. 2084. Germany: Springer-Verlag; pp. 269-76. Granada, Spain. Banzhaf, W et al. 1998, ‘Genetic programming: an introduction on the automatic evolution of computer programs and its applications’, Morgan Kaufmann Publishers Inc. Breiman, L et al. 1984, ‘Classification and Regression Trees’, Wadsworth International Group, Belmont, CA.
Chebrolu, S et al. 2004, ‘Hybrid feature selection for modeling intrusion detection systems’, In: Pal NR, et al, editor, 11th International conference on neural information processing, Lecture Notes in Computer Science. vol. 3316. Germany: Springer Verlag; pp. 1020-5. Cho, SB & Park, HJ 2003, ‘Efficient anomaly detection by modeling privilege flows with hidden Markov model’, Computers and Security, Vol. 22, No. 1, pp. 45-55. Devijver, PA & Kittler, J 1982, ‘Pattern Recognition: A Statistical Approach’, Prentice Hall. Ding, C & Peng, HC 2003, ‘Minimum redundancy feature selection from microarray gene expression data’, In IEEE Computer Society Bioinformatics Conf, pp. 523-528. Goldberg, DE 1989, ‘Genetic algorithm in search, optimization and machine learning’, Addison Wesley. Golub, TR et al. 1999, ‘Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring’, Science, vol. 286, pp. 531-537. Han, SJ & Cho, SB 2003, ‘Detecting intrusion with rule based integration of multiple models’, Computers and Security, Vol. 22, No. 7, pp. 613-23. Hontanon, RJ 2002, ‘Linux Security’, SYBEX Inc. Hsu, CW & Lin, CJ 2002, ‘A Comparison of Methods for the Multi-class Support Vector Machine’, IEEE Transactions on Neural Networks, 13 (2), pp. 415-425. Hyvarinen, A et al. 2001, ‘Independent component analysis’, John Wiley & Sons. Kira, K & Rendell, LA 1992, ‘A Practical Approach to Feature Selection’, In the Proc. of the Ninth International Workshop on Machine Learning, Morgan Kaufmann Publishers Inc, pp. 249-256. Kohavi, R & John, G 1997, ‘Wrapper for feature subset selection’, Artificial Intelligence, Vol. 97, No. 1-2, pp. 273-324. Langley, P 1994, ‘Selection of relevant features in machine learning’, In AAAI Fall Symposium on Relevance. Lazarevic, A et al. 2003, ‘A comparative study of anomaly detection schemes in network intrusion detection’, In: Proceedings of Third SIAM Conference on Data Mining. Lee, W 1999, ‘A Data Mining Framework for Constructing Features Models for Intrusion Detection Systems’, Ph.D. Thesis, Columbia University Press. Lippmann, R & Cunningham, S 2000, ‘Improving intrusion detection performance using keyword selection and neural networks’, Computer Networks, Vol. 34, No. 4, pp. 594-603. Mukkamala, S et al. 2004, ‘Intrusion detection systems using adaptive regression splines’, In: Seruca I, Filipe J, Hammoudi S, Cordeiro J, editors. Sixth international conference on enterprise information systems. ICEIS’04, Portugal, vol. 3. pp. 26-33. Quinlan, JR 1986, ‘Induction of decision trees’, Machine Learning, 1, pp. 81-106. Robnik, M & Kononenko, I 2003, ‘Theoretical and Empirical Analysis of ReliefF and RReliefF’, Machine Learning Journal. Sung, AH & Mukkamala, S 2003, ‘Identifying important features for intrusion detection using support vector machines and neural networks’, In: Proceedings of International symposium on Applications and the Internet. pp. 209-17. Vapnik, VN 1995, ‘The Nature of Statistical Learning Theory’, Springer, Berlin Heidelberg, New York.
ACCURATELY RANKING OUTLIERS IN DATA WITH MIXTURE OF VARIANCES AND NOISE Minh Quoc Nguyen Edward Omiecinski Leo Mark College of Computing Georgia Institute of Technology 801 Atlantic Drive Atlanta, GA 30332-0280
ABSTRACT In this paper, we introduce a bottom-up approach to discover outliers and clusters of outliers in data with a mixture of variances and noise. First, we propose a method to split the outlier score into dimensional scores. We show that if a point is an outlier in a subspace, the score must be high for that point in each dimension of the subspace. We then aggregate the scores to compute the final outlier score for the points in the dataset. We introduce a filter threshold to eliminate the small scores during the aggregation. The experiments show that filtering is effective in improving the outlier detection rate. We also introduce a method to detect clusters of outliers by using our outlier score function. In addition, the outliers can be easily visualized in our approach. KEYWORDS Data Mining, Outlier Detection
1. INTRODUCTION
Outlier detection is an interesting problem in data mining because outliers can be used to discover anomalous activities. Historically, the problem of outlier detection or anomaly detection has been studied extensively in statistics by comparing the probability of data points against the underlying distribution of the data set. The data points with low probability are outliers. However, this approach requires knowledge of the distribution of the dataset in order to detect the outliers, and this distribution is usually unknown. In order to overcome the limitations of the statistical-based approaches, the distance-based (Knorr and Ng, 1998) and density-based (Breunig et al, 2000) approaches were introduced. The points that deviate from the remaining dataset are considered to be outliers (Knorr and Ng, 1998; Breunig et al, 2000). The main advantage of this approach over the statistical-based ones is that knowledge of the distribution of the data set is not required in order to compute the outliers. However, these approaches are ineffective for data with multiple dimensions. Generally, we do not know which features should be used to detect outliers, and by dismissing any feature we may not be able to discover the outliers (Breunig et al, 2000). Unfortunately, the problem of feature selection, i.e. finding the appropriate sets of features for computation, is NP-hard. Thus, it is essential to run the algorithm on the entire feature space to detect outliers. However, this approach may affect the quality of outlier detection because of the problems which we call mixture of variances and accumulated subdimensional variations. In this paper, we split the traditional outlier score (Breunig et al, 2000) into dimensional scores. The splitting allows us to measure the degree of an outlier in each dimension instead of over the entire feature space. Then, we can apply a filter to retain only the strong dimensional scores, so that the outlier will be correctly detected. In the next sections, we will precisely show how the mixture of variances and the accumulation of subdimensional variations affect the quality of outlier detection.
1.1 Mixture of Variances in Multiple Features
We use a dataset with seven data points to illustrate the first problem of using the traditional k-nearest neighbors (L2) approach to detect outliers. The data has three features x, y and z, in which the domain of x and y is the interval [0, 2] and that of z is the interval [0, 8]. Figure 1a shows a 2D plot of the data points for features x and y. According to the figure, the nearest neighbor distance of any point excluding p is less than 0.64. The nearest neighbor distance of p is 1.39. From those two values, we see that p has an unusually high nearest neighbor distance compared with the other points; point p is an outlier in this figure. Figure 1b shows the complete plot of the data points for all three features x, y and z. The range of z is four times that of x and y, which makes the difference in the distance between p and the other points in features x and y insignificant compared with that in feature z. As we can see, the nearest neighbor distance of p is very similar to or less than the average nearest neighbor distance of the six other points in the data. According to this figure, p is a normal point.
[Figure: (a) the 2D plot of features x and y; (b) the 3D plot of features x, y and z]
Figure 1. The 2D Outlier is Suppressed in the 3D Space
These two figures illustrate the problem of using the traditional pairwise distance to detect outliers. One may ask if we can normalize the dataset to solve the problem. However, if these points are taken from a larger dataset in which they are nearest neighbors of each other, the problem still remains. We can generalize the problem to any arbitrary number of features. Let {σ_i} be the variances of the features in a subspace in which point q is an outlier. If there is a feature j with variance σ_j = k_i × σ_i, where k_i is large, q becomes normal in the new subspace that contains feature j. The variances can be computed from the local area of point q or from the entire dataset, which corresponds to the problem of local outlier and global outlier detection respectively. An approach to solve the problem is to compute the outlier scores for the data points for all possible combinations of features separately. If a point is an outlier in a subspace of the entire feature space, the outlier score of the point is high. However, the problem of feature selection is NP-hard.
1.2 Accumulated Subdimensional Variations
Consider three points p, q and r in an n-dimensional space. In this example, p and q are normal points, whereas r is an outlier in an m-dimensional subspace. We denote the i-th feature of a point by subscript i. We assume that the difference between p_i and q_i is \delta_n for all i \in [1, n]. Thus, we have

d(p, q) = \sqrt{\sum_{i=1}^{n} \delta_n^2}    (1)

We further assume that |p_i - r_i| = \delta_m for all i \in [1, m] and |p_i - r_i| = 0 for i \in [m+1, n]. We have

d(p, r) = \sqrt{\sum_{i=1}^{m} \delta_m^2} = \delta_m \sqrt{m}    (2)

If d(p, r) = d(p, q), we have \delta_n \sqrt{n} = \delta_m \sqrt{m} \Rightarrow \frac{\delta_m}{\delta_n} = \sqrt{\frac{n}{m}}, where \delta_n, \delta_m \neq 0    (3)

Let us define r = \frac{\delta_m}{\delta_n}. We obtain the following expression:

r = \sqrt{\frac{n}{m}}    (4)

Expression 4 implies that the ratio of the nearest neighbor distance between an outlier and normal points can be as large as r, so that an outlier in an m-dimensional space will look normal in the n-dimensional space. With n = 100 and m = 4, we have r = \sqrt{100/4} = 5. Hence, outliers whose distance to their nearest normal group of points is within a ratio of 5:1 of the density of the group may not be detected. The number of 5d-subspaces is approximately 7.5 \times 10^7. The problem of not being able to distinguish whether an outlier is a true outlier or a normal point in this example is the problem of accumulated subdimensional variations.
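To make the effect of expression (4) tangible, the short sketch below builds the p, q, r configuration with n = 100 and m = 4 and compares L2 with Chebyshev distances; it is our own illustration, not code from the paper.

import numpy as np

n, m = 100, 4
delta_n, delta_m = 1.0, 5.0          # delta_m / delta_n = sqrt(n / m) = 5

p = np.zeros(n)
q = p + delta_n                      # q differs from p by delta_n in every dimension
r = p.copy()
r[:m] = delta_m                      # r deviates strongly, but only in m dimensions

# In L2, the accumulated small variations make r look no farther than q ...
print("L2:        d(p,q) =", np.linalg.norm(p - q), " d(p,r) =", np.linalg.norm(p - r))
# ... while in L-infinity (Chebyshev) the outlier r clearly stands out.
print("Chebyshev: d(p,q) =", np.max(np.abs(p - q)), " d(p,r) =", np.max(np.abs(p - r)))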
2. OUR APPROACH
2.1 Outlier Criteria in High Dimensions
In this section, we provide concrete, intuitive criteria for what it means to be an outlier in high dimensions. The next sections give precise definitions of our outlier score function based on these criteria. In previous work, the distance between a point and its neighbors is used to define the degree to which a point is an outlier, and the results are based on the Euclidean distance. This approach is self-explanatory and intuitive in low dimensions. However, it is a problem in high dimensions, as shown in the earlier section. Thus, we choose to use the Chebyshev distance, because the variances are not cumulative in high dimensions in L∞ (by definition, the Chebyshev distance between any two points p and q is the maximum of |p_i − q_i|, ∀i ∈ [1, n]). Suppose we have a sample S such that each feature of the points in S follows the distribution N(μ_i, σ), ∀i ∈ [1, n]. With the L2 norm, the distance between two points can vary from 0 to σ√(2n); the variance is proportional to the number of dimensions. However, the range of the difference is limited to the interval [0, 2σ] in L∞ regardless of n. We use an axis-parallel hyper squared rectangle R (or hypercube) to define the local region of a point p, where p is its center in Chebyshev space. The rectangle defines the region within which a new point q is still considered a normal neighbor of p. Point q is an outlier with respect to p in region R with length 2d (the distance between any two parallel sides) if its distance to R is significantly larger than the bounds, denoted by ||q − R|| >> d. To be more precise, we have the following postulate: Postulate 1. Given a boundary hyper squared rectangle R with length 2d of a point p, a point q is an outlier with respect to point p if distance(q, R) > κd for some large κ. Theorem 1. A point q is an outlier with respect to p in region R with length 2d in n-dimensional space iff q is an outlier with respect to p in at least one dimension i, where i ∈ [1, n]. Proof. The projection of the rectangle onto a dimension i is a line segment D_i with p as its center. Since the length of the rectangle is 2d, the length of the line segment is 2d. Since q is an outlier w.r.t. p, we have distance(q, R) > κd. As defined, the distance from a point to a rectangle is the maximum distance from the point to the surfaces of the rectangle in the Chebyshev space. Since the surfaces are orthogonal or parallel to the line segment, ∃i : distance(q_i, D_i) > κd. Thus, q is an outlier in at least one dimension i. Conversely, if q is an outlier w.r.t. p in at least one dimension i, we have distance(q, R) > κd by the Chebyshev distance definition. Therefore, q is an outlier w.r.t. p in the n-dimensional space. We can extend the concept to an outlier with respect to a set of points S. Postulate 2. Given a set of points S, if a point q is an outlier with respect to all points p in S for some rectangle R of p, then q is an outlier in S.
It is straightforward to see that if p is an outlier with respect to all points in S, then p is an outlier with respect to S. However, if p is an outlier only with respect to a few points in S, p is not an outlier with respect to S. From theorem 1, we observe that we can compute the outlier score in each dimension instead of computing the outlier score over all dimensions, so that the dimensions in which a point does not show up as an outlier are not included in its outlier score. Then, we can aggregate all the scores into a final score. This approach prevents small variances from being accumulated. From the problem of mixtures of variances in Figure 1b, we observe that the differences in the variances suppress the outliers: the dimensions with high variances dominate those with low variances. Since outlier detection is based on unsupervised learning, we treat all dimensions as equal. In other words, the rate of deviation is more important than the magnitude of deviation. This suggests that we compute the ratio of the distance to the variance in each dimension of a point, instead of using the distance itself to measure the degree of an outlier. With this approach, the distances in the dimensions are normalized with respect to the variances of the corresponding dimensions. Thus, the problem of mixture of variances is solved. In the following sections, we will discuss how to compute the outlier ratio for each dimension.
2.2 Definitions

We use kthnn(p) to denote the kth nearest neighbor of p in L∞, and kdist(p) (the k-distance) is the distance from p to its kth nearest neighbor. The k-distance measures the relative density of the points in a dataset. Next, we want to compute the density of a point p projected onto each dimension, which we call the dimensional density. The densities are used to construct the boundary hyperrectangle for the point. A simple approach to computing the dimensional densities is to average the local distances from a point p to its neighbors in the dimension under consideration. However, the result depends on the parameter k, which raises the question of how many neighbors we should consider. With small k, the dimensional density is less biased but its variance is high; with large k, it is more biased. Nguyen et al. (2008) introduce a definition of adaptive nearest neighbors which allows us to determine the natural dimensional density in terms of the level of granularity at each point. According to Nguyen et al. (2008), if a point is in a uniformly distributed region, k should be small, since the distance between the point and its few nearest neighbors approximates the local density of the point. Otherwise, the decision boundary and the level of granularity are used to select k. We adapt these concepts to define the local dimensional density.

For each dimension i, we create an ordered list L_i of the nearest neighbors of p, ordered by d_i, where d_i(p, q) = |p_i − q_i|. All q ∈ KNN(p), where KNN(p) is the list of nearest neighbors, whose d_i(p, q) = 0 should be eliminated from the list. To simplify the problem, we assume that there is no q such that d_i(p, q) = 0. Suppose L_i ≡ {q^1, …, q^k}, where q^j is the jth nearest neighbor of p. For each j ∈ [2, k], we compute the ratio

ξ_i^j = (d_i(p, q^j) − d_i(p, q^(j−1))) / d_i(p, q^(j−1)).

If p is in a uniformly distributed region, ξ_i^j will increase uniformly with j; in such cases we can use d_i(p, q^1) to represent the local dimensional density of p in dimension i regardless of the level of granularity. A point at which there is a sharp increase in ξ_i^j is called a decision boundary of the local distance of point p. We measure the sharpness by a parameter λ, i.e. ξ_i^j ≥ λ. The decision boundaries are used to adjust the level of granularity, and we use a parameter z to determine the level of granularity in detecting the outliers. We then define the local dimensional density of a point p with a granularity of level z as follows:

Definition 1. Given that q^(j_z) is the zth decision boundary point of a point p, the local dimensional density of p with granularity level z in dimension i is

γ_i(p) = d_i(p, q^1)      if ξ_i^j < λ for all j ∈ [1, k], or z = 1,
γ_i(p) = d_i(p, q^(j_z))  otherwise.    (5)
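The following Python sketch illustrates one possible reading of Definition 1 (illustrative only: the function name, the use of plain lists, and the exact choice of the zth boundary index are assumptions, not the authors' code).

def local_dimensional_density(di, lam, z):
    """di: distances |p_i - q_i| to the k nearest neighbors of p, sorted ascending,
    with zeros already removed.  lam: sharpness threshold lambda.  z: granularity level."""
    # relative increases xi^j between consecutive neighbor distances (ratio before Definition 1)
    xi = [(di[j] - di[j - 1]) / di[j - 1] for j in range(1, len(di))]
    # indices j at which a sharp increase (a decision boundary) occurs
    boundaries = [j for j, x in zip(range(1, len(di)), xi) if x >= lam]
    if z == 1 or not boundaries:
        return di[0]                     # uniform region, or granularity level 1: first-neighbor distance
    j_z = boundaries[min(z, len(boundaries)) - 1]
    return di[j_z]                       # distance to the z-th decision boundary point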
Next, we compute the average local distance in each dimension for a local region S, where S is a set of points in a local region of the dataset. With |S| large enough, formula (6) estimates the expected mean of the local dimensional densities of the points in the region. In the formula, local distances whose value is zero are removed from the computation.

Definition 2. Dimensional average local distance:

δ_i = (Σ_{q∈S} γ_i(q)) / m,  where m = |{γ_i(q) : q ∈ S ∧ γ_i(q) ≠ 0}|.    (6)

In definition 2, m is the number of points in S whose local distance is not zero.

Definition 3. Dimensional variance ratio:

r_i(p, q) = |p_i − q_i| / δ_i.    (7)

Formula (7) measures the deviation of point p from point q with respect to the average variance of the points in the ith dimension. It follows the outlier criteria, where {2δ_i} are the side lengths of the rectangle of q. On average, the ratio is close to 1 if p is within the proximity of q. In contrast, a ratio r_i much greater than 1 implies that p deviates greatly from the normal local distance in dimension i, i.e., p is an outlier with respect to q in dimension i. Since it has been proved in Theorem 1 that an outlier in an n-dimensional space is an outlier in at least one dimension, formula (7) is sufficient to detect outliers with respect to q in any subspace, as shown in the following theorem.

Theorem 2. Let r(p, q) = max_i {r_i(p, q)}. If r(p, q) > κ for some large κ, then p is an outlier with respect to q.

Proof. We can regard {δ_i} as normalizing constants for all points in region S. Since S is small, we can approximately consider the points within a rectangle R of unit length 2, centered at q, to be normal neighbors of q. Then r(p, q) is the distance from p to rectangle R. Since r(p, q) > κ for some large κ, p is an outlier with respect to q according to postulate 1.

Theorem 3. Given a set S, a point q is an outlier in S if r(p, q) > κ for all p ∈ S.

Proof. The result follows directly from postulate 2 and theorem 2.
Since a point can be an outlier in some subspaces (with respect to its KNNs in the original n-dimensional space), it is natural to aggregate the dimensional variance ratios into one unified metric representing the total deviation of point p. However, a naive aggregation of the ratios over all dimensions can lead to overlooking outliers, as discussed in section 1.2. If the dimensional variance ratios in the sample follow the distribution N(1, ε), the total ratio of a normal point can be as large as roughly (1 + ε)²·n according to formula (8), which is significant when n is large. The ratio is large not because the point deviates from others but because many small dimensional variations are accumulated during the aggregation. Therefore, we introduce a cutoff threshold ρ_0: only ratios greater than ρ_0 are aggregated into the total value.

Definition 4. Aggregated variance ratio:

r(p, q) = Σ_i r_i²(p, q), taken over all i with r_i(p, q) > ρ_0.    (8)

Instead of naively combining all the ratios, we only combine the ratios that are significant. The cutoff threshold ρ_0 acts as a filter that removes the noisy dimensions that do not contribute to the outlier score of point p. Our experiments confirm that the filter is effective in improving the outlier detection rate.
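A minimal Python sketch of formulas (6)–(8) is given below (illustrative only; the NumPy-based helpers and their names are assumptions, not the authors' implementation). It computes the per-dimension average local distance δ_i over a sample, the dimensional variance ratios r_i, and the aggregated ratio in which only values above the cutoff ρ_0 contribute.

import numpy as np

def dimensional_avg_local_distance(gammas):
    """gammas: |S| x n array of local dimensional densities gamma_i(q) for q in S.  (Formula 6)"""
    nz = np.where(gammas != 0, gammas, np.nan)   # ignore zero local distances
    return np.nanmean(nz, axis=0)                # delta_i per dimension

def aggregated_variance_ratio(p, q, delta, rho0):
    """Formulas (7) and (8): sum of squared per-dimension ratios that exceed rho0."""
    r = np.abs(p - q) / delta                    # dimensional variance ratios r_i(p, q)
    significant = r[r > rho0]                    # the cutoff filter
    return float(np.sum(significant ** 2))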
Property 1. If p is an outlier with respect to q, then r(p, q) > ρ_0.

Proof. If p is an outlier with respect to q, there is at least one dimension i such that r_i(p, q) > κ. If we set ρ_0 = κ, then r_i(p, q) > ρ_0. Since r_i(p, q) > ρ_0 and r(p, q) > r_i(p, q), we have r(p, q) > ρ_0.

Property 2. If p is not an outlier with respect to q, then r(p, q) = 0.

Proof. If p is not an outlier with respect to q, then r_i(p, q) ≤ κ, ∀i. If we set ρ_0 = κ, then r_i(p, q) ≤ ρ_0, ∀i. Thus, from formula (8), we have r(p, q) = 0.

According to property 1, if a point is an outlier in some subspace, its aggregated ratio with respect to all points within its proximity is greater than ρ_0. Therefore, we can define a score function measuring the degree to which p is an outlier as follows:

Definition 5. Outlier score:

oscore(p/S) = min_{q∈S} r(p, q).    (9)

Formula (9) aggregates the outlier information for a point from all dimensions. Since the dimensions in which p is not an outlier are excluded, we can guarantee that p is an outlier in S if its oscore is high. In addition, if p is an outlier in any subspace, the value of oscore for p must be high (theorem 3). Thus, oscore is sufficient to measure the degree of outlierness of a point in any subspace.

Formula (9) defines the degree to which a single point in the data set is considered an outlier. It should be noted that points may also appear as a group of outliers; in such cases, the value of oscore will be zero. We observe that every point in a small group C of outliers should have a large oscore if we compute it without considering the other points in its cluster. If there exists a point q in the cluster whose oscore with respect to S − C is zero, the group is in fact a set of normal points: q is normal, and all points that are close to q in terms of the aggregated variance ratio are also normal, so all the points must be normal. Using these observations, we can define a cluster of outliers as follows:

Definition 6. An outlier cluster in a set S is a set of points C such that oscore(p/S − C) > ρ_0, ∀p ∈ C, and r(p, q) = 0, ∀p, q ∈ C.
When the pairwise deviation between the outliers is small with respect to the average local distance in all dimensions, the outliers naturally appear as a cluster. This fact is captured by the second condition in the definition. The degree of an outlier cluster is defined as follows:

Definition 7. Outlier cluster score:

oscore(C/S) = min_{p∈C} oscore(p/S − C).    (10)
Thus far, we have introduced the definitions needed to detect outliers that conform to the intuitive outlier criteria in section 2.1. The rectangles for points in a sample are bounded by {δ_i}. Definition 3 defines the ratio of deviation between any two points with respect to the average local variance in a dimension; we can interpret this as a similarity function between two points relative to the average variance in one dimension. As stated in section 2.1, if a point is dissimilar to all points in at least one dimension, it is an outlier. Definitions 6 and 7 extend the concept of an outlier to an outlier cluster, which provides complete information about the clusters of outliers in a data set. With definition 6, we can discover clusters of outliers, and their degree of being an outlier can be computed by oscore(C/S). A nice feature of this approach is that we can identify the dimensions in which a point is an outlier by using the dimensional ratios, which can then be used to visualize the outliers.
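Building on the previous sketch, the outlier score of Definition 5 and the outlier cluster score of Definition 7 can be expressed as follows (again an illustrative sketch under the same assumptions, not the authors' implementation).

def oscore(p, S, delta, rho0):
    """Definition 5: minimum aggregated variance ratio of p against every q in its local region S."""
    return min(aggregated_variance_ratio(p, q, delta, rho0) for q in S)

def outlier_cluster_score(C, S, delta, rho0):
    """Definition 7: score of a candidate outlier cluster C, computed against S - C."""
    S_minus_C = [q for q in S if not any(q is c for c in C)]
    return min(oscore(p, S_minus_C, delta, rho0) for p in C)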
2.3 Clustering

As discussed above, clusters of outliers can be detected using the outlier score function. We use an edge to represent the link between two points: if the aggregated variance ratio between two points is zero, there is an edge connecting them. A cluster is a set of connected points. When the size of a cluster grows large, we are certain that the points in the cluster are normal, since every point can find at least one point close to it in the graph. If the points are outliers, however, there is no edge connecting them with other points, so the cluster remains small.

We apply the clustering algorithm of Nguyen et al. (2008) to cluster the dataset using the computed aggregated variance ratio values in linear time. First, we put a point p into a stack S and create a new cluster C. Then we take point p, put it in C, and push all of its connected neighbors onto S. For each q in S, we expand C by removing q from S and adding q to C; the connected neighbors of q are then pushed onto S. These steps are repeated until no point can be added to C. We then create a new cluster C′ and repeat the process until all points have been assigned. The pseudocode is shown in algorithm 1.

Algorithm 1. Clustering Pseudocode
1: procedure Cluster(HashSet D)
2:   Stack S
3:   Vector clsSet
4:   HashSet C
5:   while D ≠ ∅ do
6:     p ← remove D
7:     push p → S
8:     C ← new HashSet
9:     add C → clsSet
10:    while S ≠ ∅ do
11:      q ← pop S
12:      add q → C
13:      for r ∈ neighbors(q) ∧ r(q, r) ≡ 0 do
14:        push r → S
15:        remove r from D
16:      end for
17:    end while
18:  end while
19: end procedure

Theorem 4. Let {C_i} be the set of clusters produced by the algorithm. Then C_i contains no outlier with respect to C_i, ∀i.

Proof. Assume that a point r ∈ C_i is an outlier in C_i. Then r(q, r) > ρ_0, ∀q ∈ C_i (property 1 and postulate 2). However, according to lines 10 to 17 of algorithm 1, a neighbor r of a point q is put into C_i iff r(q, r) = 0, which contradicts the condition above. Therefore, C_i contains no outlier with respect to C_i.
Theorem 4 shows that the clusters produced by algorithm 1 do not contain outliers. If a cluster C is large enough, we consider it as the set of normal points. Otherwise, we will compute the outlier cluster score for C. If the score is large, C is a cluster of outliers. Therefore, it is guaranteed that algorithm 1 returns the set of outlier clusters.
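For illustration, the following Python sketch mirrors Algorithm 1 (it assumes points are represented by hashable identifiers such as row indices, and that a `neighbors(q)` function and an `agg_ratio(p, q)` function implementing formula (8) are provided; these names are assumptions, not part of the original paper).

def cluster(D, neighbors, agg_ratio):
    """Group points whose pairwise aggregated variance ratio is zero (Algorithm 1)."""
    remaining = set(D)
    clusters = []
    while remaining:
        p = remaining.pop()
        stack = [p]
        C = set()
        clusters.append(C)
        while stack:
            q = stack.pop()
            C.add(q)
            for r in neighbors(q):
                if r in remaining and agg_ratio(q, r) == 0:
                    stack.append(r)
                    remaining.remove(r)
    # small clusters are candidate outlier clusters; large clusters are treated as normal points
    return clusters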
Table 1. Seven Outliers are Detected

Point                    Score
p1                       7.24
p2                       6.68
p3                       5.98
(generated from N(0,1))  2.97
(generated from N(0,1))  2.92
others                   0
Figure 2. Points p1, p2, p3 and Cluster C1 are Generated Outliers
3. EXPERIMENT

3.1 Synthetic Dataset

We create a small synthetic data set D to illustrate our outlier score function. We use a two-dimensional data set so that we can validate the result of our algorithm by showing that the outliers and groups of outliers are detected. The data consists of 3000 data points following a normal distribution N(0,1). Three individual outliers {p1, p2, p3} and a group C1 of 10 outliers {q1, …, q10} are generated and added to the data set, which is illustrated in figure 2.

First, we compute the oscore for all the points in D with ρ_0 = 2 and α = 0.4. The algorithm detected 5 outliers. Our manually generated outliers appear as the top three; the next two outliers come from the distribution itself, and their scores are low, approximately half of the scores of the manually generated outliers, as shown in table 1.

Next, we run the clustering algorithm based on the computed oscore values as described in the clustering section. The algorithm detected 9 clusters, two of which have a score of zero; thus, seven outlier clusters are detected. Table 2 shows their scores. As we can see, the ten points in the manually generated cluster are detected and correctly grouped into a single cluster, which is also the highest-ranked outlier cluster. A micro-cluster C2 of five outliers is also detected; its low score is due to the fact that it is randomly generated from the normal distribution. This example shows that our algorithm can discover micro-clusters. It should be noted, however, that our algorithm can detect clusters of any size, which also makes it suitable for applications in which the outlier clusters are large but still small relative to the size of the entire dataset.
3.2 KDD CUP ’99 Dataset

In this experiment, we use the KDD CUP ’99 Network Connections Data Set from the UCI repository (Newman and Merz 1998) to test the ability of outlier detection to find attack connections without any prior knowledge about the properties of the network intrusion attacks. The detection is based on the hypothesis that attack connections may behave differently from normal network activities, which makes them outliers. We create a test dataset from the original KDD dataset with 97,476 connections. Each record has 34 continuous attributes representing the statistics of a connection and its associated connection type, e.g., normal or buffer-overflow attack. A very small number of attack connections are randomly selected: there are 22 types of attacks, with sizes varying from 2 to 16, for a total of 198 attack connections, which account for only 0.2% of the dataset.
Figure 3. Detection Curve
Figure 4. Detection Rate for the Algorithm with/without Using the Filter
In this experiment, we run the LOF algorithm as a baseline for our approach, since it is the well-known outlier detection method for detecting density-based outliers. First, we run LOF on the dataset with values of min_pts from 10 to 30; the experiment with min_pts = 20 gives the best result. In this test, no attack is detected in the top 200 outliers. In the next set of outliers, 20 attacks are detected, with rankings distributed from 200 to 1000. In the top 2000 outliers, only 41 attacks are detected.

We then ran our algorithm on the dataset with the same values of ρ_0 and α. Since the KDD data set is larger than the synthetic dataset, the sample size is 100. The algorithm returns the list of outlier clusters ordered by score. The sizes of those clusters are small and most of them are single-outlier clusters. According to the results, one attack is found in the top 10 outlier clusters and 16 attacks are found in the top 50 outlier clusters. Among them, 9 attacks are grouped into one cluster whose ranking is 38; all outliers in this group are warezmaster attacks. Since there are only 12 warezmaster connections in the dataset, the clustering achieves high accuracy for this tiny cluster. In addition, 42 attacks are found in the top 200 outliers and 94 attacks are detected in the top 1000. Compared with the results from LOF, where no attacks are detected in the top 200 and only 20 are detected in the top 1000, our algorithm improves accuracy by an order of magnitude.

Figure 3 shows the detection curve with respect to the number of outliers. In this curve, we show the detection rate for LOF with min_pts = 20 and min_pts = 30. In addition, we show the curves for our algorithm with the ranking in terms of individual outliers and in terms of outlier clusters, where an individual outlier is a cluster of size 1. As we can see, the recall rate of our algorithm is consistently higher than that of LOF. The recall rate of our algorithm is 60% when the size of the outlier set is 0.02% of the dataset, whereas that of LOF is 21%. Given that outlier detection in general can have very high false alarm rates, our method can detect a very small number of attack connections in a large dataset.

Table 2. Nine Outlier Clusters are Detected in the 2D Dataset

Cluster  Size  Score  Items
1        10    7.7    q1, q2, q3, q4, q5, q6, q7, q8, q9, q10 (C1)
2        1     7.24   p1
3        1     6.68   p2
4        1     5.98   p3
5        1     2.98   r1
6        1     2.92   r2
7        5     2.43   r3, r4, r5, r6, r7, r8 (C2)
8        2     0.00   r9, r10
9        1     0.00   r11

Table 3. Detected Attack Connections in the KDD CUP Dataset

Rank   Size  Score    Rank    Size  Score
7th    1     152.6    72nd    1     15.7
30th   1     38.7     79th    6     14.8
32nd   1     34.4     80th    1     14.7
36th   1     32.5     111th   1     11.9
37th   1     32.2     113th   1     11.5
38th   9     32.1     158th   1     8.5
54th   1     22.3     159th   9     8.5
62nd   1     19.4     163rd   1     8.3
Table 3 shows the ranking and the cluster size for the top 200 outlier clusters. According to the table, three clusters of attacks are found. The first cluster, ranked 38th, contains nine warezmaster attacks (recall rate = 75%). The next cluster contains six satan attacks (recall rate = 75%). The last cluster in the table contains 9 neptune attacks (recall rate = 100%). The misclassification rate for those clusters is zero. The recall rate for those attacks is very high given that each attack type accounts for less than 1.2 × 10⁻⁴ of the dataset.
3.3 The Effect of the Filter Parameter

The experiment above shows the result of the algorithm when the filter is applied with ρ_0 = 2.2. The choice of 2.2 means that a point is considered an outlier in a dimension if its deviation with respect to the dimensional variance is greater than 2.2. In this experiment, we study the effect of the filter parameter on the detection rate of our method. We therefore ran the algorithm without the filter by setting ρ_0 = 1, which means that the ratios in all dimensions are aggregated. Figure 4 shows the detection rate for our algorithm with ρ_0 = 2.2 and ρ_0 = 1, and the detection rate for LOF with min_pts = 20. According to the figure, even without the filter parameter our algorithm consistently performs better than LOF; the graph shows that it can discover 27 attacks in the top 200 outliers. This better performance can be explained by the fact that the variances of all dimensions are normalized by the dimensional ratios. However, the algorithm with the filter parameter outperforms the algorithm without it: in the top 200 outliers, the detection rate of the filtered approach is twice that of the unfiltered one. The experiment shows that the filter is effective in eliminating noisy attributes when computing outlier scores, so the quality of detecting true outliers is significantly improved.
3.4 Visualization

Theorem 1 shows that if a point is an outlier in an n-dimensional space, it must be an outlier in at least one dimension. This result implies that we can use lower-dimensional spaces, i.e. 2D and 3D, to visualize the outliers in order to study their significance. We take the results of the KDD experiments to study the outliers. In addition to the ranking of the outliers, our algorithm also returns the dimensions in which a point p becomes an outlier by checking for dimensions i in which r_i(p) > ρ_0. Table 4 shows the dimensional scores for two points, p7 and p36, which are multihop and back attacks respectively. In the table, p7 is an outlier in the 2nd and 29th dimensions, which correspond to the attributes dst_bytes and dst_host_srv_diff_host_rate, whereas p36 is an outlier in the 1st (src_bytes) and 26th (dst_host_same_srv_rate) dimensions.

Figures 5 and 6 show 2D subspaces for point p36 and its nearest neighbors (Chebyshev space). Figure 5 shows two dimensions in which p36 is not an outlier; we cannot distinguish p36 from its neighbors. However, p36 appears as an outlier in the 1st (src_bytes) and 26th (dst_host_same_srv_rate) dimensions, as shown in figure 6, where it is clearly distinct from its surrounding points. Figure 7 shows the distribution of p36's neighbors in this 2D space without point p36. Figures 5 and 6 also explain why p36 is not reported as an outlier by LOF: its LOF score is 2.1 and it ranks 6793rd in the list of outliers. The score implies that its k-dist (Euclidean space) is only about twice the average k-dist of its neighbors. In Chebyshev space, kdist(p36, k = 30) is 0.066 and the average kdist(q_i, k = 30) is 0.04 over the 4 nearest neighbors {q_i} of p36. The k-dist of p36 thus approximates that of its surrounding neighbors in both Euclidean and Chebyshev space, so p36 cannot be detected by the traditional approach, whereas in our subdimensional score aggregation approach, p36 is a strong outlier in the 1st dimension and can therefore be detected.

Table 4. Subspace Outliers

Point  Rank  Total Score  Dimensional Scores
p7     7     152.6        r2 = 152.57, r29 = 2.3
p36    36    32.5         r1 = 32.4, r26 = 2.3
Figure 5. Point p36 is not an Outlier in this 2d-Subspace
Figure 6. Point p36 is an Outlier in this 2d-Subspace
4. RELATED WORKS

Distance-based (Knorr and Ng 1998) and density-based (Breunig et al. 2000) approaches have been introduced to detect outliers in datasets. In these approaches, a point is considered an outlier if the distances between the point and all other points (distance-based) or its neighbors (density-based) are large. Since all dimensions are considered, outliers in subspaces cannot be detected. More recently, Papadimitriou et al. (2003) introduced the use of the local correlation integral to detect outliers; its advantage is that it can compute outliers very quickly, but, like the approaches above, it does not focus on subspace outlier detection.

The problems of feature selection and dimensionality reduction, e.g. PCA, have been studied extensively in classification and clustering in order to select a subset of features for which the loss incurred by eliminating the remaining features is minimized. This approach is inappropriate for outlier detection, since outliers are rare relative to the size of the dataset: the set of features that minimizes the loss function may not contain the features in which the points are outliers, so we may not be able to detect those outliers. Another approach is to randomly select sets of features in which to detect outliers (Lazarevic and Kumar 2005). Since the number of possible subspaces is large, the points may not be outliers in the chosen subspaces, and there is no guarantee that points appearing as outliers only in the remaining subspaces can be detected. Another line of work related to subspace outlier detection is subspace clustering (Agrawal et al. 2005, Aggarwal and Yu 2000, Aggarwal et al. 2005), which focuses on detecting clusters in subspaces by finding the dimensions in which a set of points is dense. Since their primary focus is to cluster the dataset rather than to detect outliers, these methods are not optimized for outlier detection.
5. CONCLUSION

In this paper, we have shown the problems of the mixture of variances and of the accumulation of noise in outlier detection for high-dimensional datasets. We then introduced a bottom-up approach that computes the outlier score from the ratios of deviation in each dimension and aggregates these dimensional scores into one final score, where only dimensions with high ratios are aggregated. Since dimensions with high variances are treated the same as those with low variances, this method solves the mixture-of-variances problem. In addition, we introduced a filter threshold to address random deviation in high-dimensional datasets by preventing many small deviations from accumulating into the final outlier score; according to the experiments, the filter significantly boosts the performance of outlier detection. The method also allows us to visualize the outliers by drawing graphs in the dimensions where the points deviate from others. By studying the graphs, we can eliminate the dimensions in which the outliers are not interesting to us and explain why the remaining outliers are interesting. We also apply the clustering technique of Nguyen et al. (2008) to cluster the outliers, observing that two points whose mutual oscore is zero are close to each other and should be in the same cluster; thus, our method can also produce clusters of outliers. The experiments on the KDD CUP ’99 dataset show that the detection rate of our method is improved compared with that of the traditional density-based outlier detection method.
REFERENCES
Aggarwal, C. and Yu, P. (2000) ‘Finding generalized projected clusters in high dimensional spaces’. SIGMOD Record, 29(2), pp. 70–81.
Aggarwal, C. and Yu, P. (2001) ‘Outlier detection for high dimensional data’. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 37–46.
Aggarwal, C., Han, J., Wang, J., and Yu, P. (2005) ‘On high dimensional projected clustering of data streams’. Data Mining and Knowledge Discovery, 10(3), pp. 251–273.
Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. (2005) ‘Automatic subspace clustering of high dimensional data’. Data Mining and Knowledge Discovery.
Breunig, M., Kriegel, H., Ng, R., and Sander, J. (2000) ‘LOF: identifying density-based local outliers’. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, pp. 93–104.
Chawla, N., Lazarevic, A., Hall, L., and Bowyer, K. (2004) ‘SMOTEBoost: Improving prediction of the minority class in boosting’. Lecture Notes in Computer Science, volume 2838/2003, Springer, Berlin/Heidelberg, Germany.
Das, K. and Schneider, J. (2007) ‘Detecting anomalous records in categorical datasets’. In KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 220–229.
Jarvis, R. and Patrick, E. (1973) ‘Clustering using a similarity measure based on shared near neighbors’. IEEE Transactions on Computers, C-22(11), pp. 1025–1034.
Knorr, E. and Ng, R. (1998) ‘Algorithms for mining distance-based outliers in large datasets’. In VLDB ’98: Proceedings of the 24th International Conference on Very Large Data Bases, San Francisco, CA, USA, pp. 392–403.
Korn, F., Pagel, B., and Faloutsos, C. (2001) ‘On the “dimensionality curse” and the “self-similarity blessing”’. IEEE Transactions on Knowledge and Data Engineering, 13(1), pp. 96–111.
Kriegel, H., Schubert, M., and Zimek, A. (2008) ‘Angle-based outlier detection in high dimensional data’. In KDD ’08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 444–452.
Lazarevic, A. and Kumar, V. (2005) ‘Feature bagging for outlier detection’. In KDD ’05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, New York, NY, USA, pp. 157–166.
Mannila, H., Pavlov, D., and Smyth, P. (1999) ‘Prediction with local patterns using cross-entropy’. In KDD ’99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 357–361.
Newman, C. and Merz, C. (1998) UCI Repository of Machine Learning Databases.
Nguyen, M., Mark, L., and Omiecinski, E. (2008) ‘Unusual Pattern Detection in High Dimensions’. In The Pacific-Asia Conference on Knowledge Discovery and Data Mining.
Papadimitriou, S., Kitagawa, H., Gibbons, P., and Faloutsos, C. (2003) ‘LOCI: Fast outlier detection using the local correlation integral’. In Proceedings of the International Conference on Data Engineering, IEEE Computer Society Press, pp. 315–326.
Shaft, U. and Ramakrishnan, R. (2006) ‘Theory of nearest neighbors indexability’. ACM Transactions on Database Systems, 31(3), pp. 814–838.
Steinwart, I., Hush, D., and Scovel, C. (2005) ‘A classification framework for anomaly detection’. Journal of Machine Learning Research, 6, pp. 211–232.
Subramaniam, S., Palpanas, T., Papadopoulos, D., Kalogeraki, V., and Gunopulos, D. (2006) ‘Online outlier detection in sensor data using non-parametric models’. In VLDB ’06: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 187–198.
TIME SERIES DATA PUBLISHING AND MINING SYSTEM* Ye Zhu, Yongjian Fu Cleveland State University, 2121 Euclid Ave., Cleveland, OH, USA
Huirong Fu Oakland University Rochester, MI 48309, USA
ABSTRACT

Time series data mining poses new challenges to privacy. Through extensive experiments, we find that existing privacy-preserving techniques such as aggregation and adding random noise are insufficient, due to privacy attacks such as the data flow separation attack. We also present a general model for publishing and mining time series data and discuss its privacy issues. Based on the model, we propose a spectrum of privacy-preserving methods. For each method, we study its effects on classification accuracy, aggregation error, and privacy leak. Experiments are conducted to evaluate the performance of the methods. Our results show that the methods can effectively preserve privacy without losing much classification accuracy and within a specified limit of aggregation error.

KEYWORDS

Privacy-preserving data mining, time series data mining.
1. INTRODUCTION

Privacy has been identified as an important issue in data mining. The challenge is to enable data miners to discover knowledge from data while protecting data privacy. On one hand, data miners want to find interesting global patterns; on the other hand, data providers do not want to reveal the identity of individual data. This has led to the study of privacy-preserving data mining (Agrawal & Srikant 2000, Lindell & Pinkas 2000). Two common approaches in privacy-preserving data mining are data perturbation and data partitioning. In data perturbation, the original data is modified by adding noise, aggregating, transforming, obscuring, and so on; privacy is preserved by mining the modified data instead of the original data. In data partitioning, the data is split among multiple parties, who securely compute interesting patterns without sharing data.

However, privacy issues in time series data mining go beyond data identity. In time series data mining, characteristics of the time series can themselves be regarded as private information. Such characteristics can be the trend, peaks and troughs in the time domain, or periodicity in the frequency domain. For example, a company's sales data may show periodicity, which competitors could use to infer promotion periods; certainly, the company does not want to share such data. Moreover, existing approaches to preserving privacy in data mining may not protect privacy in time series data mining. In particular, aggregation and naively adding noise to time series data are prone to privacy attacks.
* This work was partly supported by the National Science Foundation under Grant No. CNS-0716527. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
In this paper, we study privacy issues in time series data mining. The objective of this research is to identify effective privacy-preserving methods for time series data mining. We first present a model for publishing and mining time series data and then discuss potential attacks on privacy. As a counter measure to privacy threat, we propose to add noise into original data to preserve privacy. The effects of noise on preserving privacy and on data mining performance are studied. The data mining task in our study is classification and its performance is measured by classification accuracy. We propose a spectrum of methods for adding noise. For each method, we first explain the intuition behind the idea and then present its algorithm. The methods are implemented and evaluated in terms of their impacts on privacy preservation, classification accuracy, and aggregation error in experiments. Our experiments show that these methods can preserve privacy without seriously sacrificing classification accuracy or increasing aggregation error. The contributions of our paper are: (a) We identify privacy issues in time series data mining and propose a general model for protecting privacy in time series data mining. (b) We propose a set of methods for preserving privacy by adding noise. Their performance is evaluated against real data sets. (c) We analyze the effect of noise on preserving privacy and the impact on data mining performance for striking a balance between the two. The rest of the paper is organized as follows. In section 2, we discuss related work in privacy preserving and time series data mining. A general model for publishing and mining time series data is proposed in Section 3, along with discussion on its privacy concerns. Methods for preserving privacy by adding noise are proposed in Section 4. The effects of noise on privacy preserving, classification accuracy, and aggregation error are studied in Section 5. Section 6 concludes the study and gives a few future research directions.
2. RELATED WORK

Recently, some researchers have studied specifically the topic of privacy in time series data. A privacy-preserving algorithm for mining frequent patterns in time series data has been proposed by da Silva and Klusch (da Silva & Klusch 2007). A frequent pattern is a subsequence that occurs frequently in a time series. The algorithm uses encryption and secure multiparty computing to ensure the privacy of each individual party. Privacy of time series data has also been studied by Papadimitriou et al. (Papadimitriou et al. 2007). They argue that time series data has unique characteristics in terms of privacy. In order to preserve privacy, they propose two perturbation methods based on Fourier and wavelet transformations, and show that white noise perturbation does not preserve privacy while the proposed methods are effective.

We agree with these researchers that time series data poses new challenges to privacy in data publishing and data mining. Unlike previous research on this topic, we present a general model for privacy preservation in time series data publishing and mining. We propose to add noise to preserve privacy instead of using secure multiparty computing as proposed in (da Silva & Klusch 2007); another difference is that our data mining problem is classification rather than frequent patterns as in (da Silva & Klusch 2007). Like (Papadimitriou et al. 2007), we propose to add noise for privacy preservation. Unlike (Papadimitriou et al. 2007), our privacy problem is constrained by classification accuracy and aggregation error, which are beyond the scope of (Papadimitriou et al. 2007). As we will see in Section 3, classification accuracy and aggregation error make privacy preservation more complex.
3. TIME SERIES DATA PUBLISHING AND MINING SYSTEM

In this section we first present a real-world model of a time series Data Publishing and Mining System (DPMS). We then analyze the weakness of the DPMS in preserving the privacy of time series data providers, which motivates us to propose new approaches to preserving privacy in a DPMS.
3.1 System Model

A DPMS consists of data providers and research companies. A data provider is a data source that generates time series data. In a DPMS, data providers are willing to share data with trusted research companies. Research companies in a DPMS have the following two functions:

Publishing data: Research companies aggregate data from different data providers according to different criteria and then publish the aggregate data through public announcements such as web sites or through paid reports such as consumer reports.
Figure 1. An Example of DPMS
Providing data mining solutions: Research companies can generate data mining models from the time series data that they collect from data providers. The generated models can be shared with data providers or other data miners. Since these models are created from global or industry-wide data, they are generally more accurate and reliable than models created from an individual provider's data. One incentive for data providers to share data with research companies is to obtain these models.

An example of a DPMS is shown in Figure 1; it consists of two auto manufacturers as data providers and a set of research companies that publish aggregate sales data of the two manufacturers according to various criteria. The performance of a DPMS is measured by the following three criteria, on which data providers and research companies have conflicting objectives.

Aggregation error: Research companies want to minimize aggregation errors. At the very least, they want to guarantee that the aggregate data is within a certain error limit.

Privacy of data providers: To protect privacy, data providers may add noise to their time series data before sharing it with research companies. Data providers want to add as much noise as possible, but to guarantee the accuracy of the aggregate data, research companies limit the amount of noise that can be added.

Data mining performance: Research companies generate data mining models from the noisy time series data provided by the various providers. The performance of these models depends on the noise added by the data providers. In this paper, we consider classification of time series data, and the performance metric is classification accuracy.

In a DPMS, we assume data providers can trust research companies. Therefore, the privacy of data providers should be protected from outside adversaries, not from research companies. We present the threat model in Section 3.2. It is clear from the model that aggregating and publishing data is one of the main tasks of research companies in a DPMS. Aggregation also serves as a means of preserving data providers' privacy by mixing individual time series data and thus preventing adversaries' direct access to individual time series data. However, aggregation by itself is incapable of protecting privacy, as shown in Section 3.3.
3.2 Threat Model

In this paper we assume adversaries are external to a DPMS. More specifically, adversaries have the following capabilities: (a) Adversaries can obtain aggregate data from research companies for a small fee or for free. (b) Adversaries cannot obtain the data contributed by data providers, because data providers do not trust them. This assumption excludes the possibility of a data provider being a privacy attacker; we do not study the case of a compromised data provider in this paper. Obviously, it is easier to launch privacy attacks if an adversary, being a provider of original data, knows a part of the original data aggregated by the research companies. (c) Adversaries can obtain data aggregated according to different criteria. (d) Research companies have various data providers as their data sources and do not want to disclose the composition of their data sources.
The goal of adversaries is to obtain as much information as possible about data providers through various privacy attacks.
3.3 Privacy in a DPMS

A DPMS must protect the privacy of data providers from external adversaries. Otherwise, external adversaries can recover individual time series from data providers by applying blind source separation algorithms to the aggregate time series data. Before we continue with attacks based on blind source separation algorithms, we introduce the definitions used in this paper.
3.3.1 Definitions

Definition. A data flow F is a time series F = [f1, f2, …, fn], where fi, i = 1, …, n, is a data point in the flow. When the context is clear, we use flow and point for data flow and data point, respectively.

How much privacy of a flow F is leaked by a compromised flow F̂ is determined by their resemblance. Correlation has proven to be a good resemblance measure for time series data.

Definition. Given an original flow F and a compromised flow F̂, the privacy leak between F and F̂ is defined as the correlation between them: pl(F, F̂) = |corr(F, F̂)|, where corr is the linear (Pearson) correlation and the privacy leak is its absolute value. A greater correlation between F and F̂ implies that more information about F is learned from F̂, and therefore a higher privacy leak.

Based on the definition of privacy leak for individual flows, we can define the privacy leak for a set of flows.

Definition. The privacy leak between a set of original flows F = {F1, F2, …, Fn} and a set of compromised flows F̂ = {F̂1, F̂2, …, F̂n} is defined as

pl(F, F̂) = (Σ_{i=1}^{n} max_{j=1,…,n} pl(Fi, F̂j)) / n.
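For concreteness, the privacy-leak measures above can be computed with a few lines of NumPy (an illustrative sketch added here; the function names are assumptions, not the authors' code).

import numpy as np

def privacy_leak(flow, compromised):
    """pl(F, F_hat): absolute Pearson correlation between an original and a compromised flow."""
    return abs(np.corrcoef(flow, compromised)[0, 1])

def set_privacy_leak(original_flows, compromised_flows):
    """Average, over the original flows, of the best match among the compromised flows."""
    leaks = [max(privacy_leak(F, Fh) for Fh in compromised_flows) for F in original_flows]
    return sum(leaks) / len(leaks)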
3.3.2 Blind Source Separation

Blind source separation is a methodology in statistical signal processing for recovering unobserved "source" signals from a set of observed mixtures of the signals. The separation is called "blind" to emphasize that the source signals are not observed and that the mixture is a black box to the observer. While no knowledge is available about the mixture, in many cases it can be safely assumed that the source signals are independent. In its simplest form (Cardoso 1998), the blind source separation model assumes n independent signals F = F1(t), …, Fn(t) and n observations of mixtures O = O1(t), …, On(t), where t is the time and

Oi(t) = Σ_{j=1}^{n} a_ij Fj(t),  i = 1, …, n.

The parameters a_ij are the mixing coefficients. The goal of blind source separation is to reconstruct the source signals F using only the observed data O, under the assumption of independence among the signals in F. A very nice introduction to the statistical principles behind blind source separation is given in (Cardoso 1998). Common methods employed in blind source separation are minimization of mutual information (Comon 1994), maximization of non-Gaussianity (Hyvärinen 1999), and maximization of likelihood (Gaeta & Lacoume 1990). For dependent signals, BSS algorithms based on the time structure of the signals can be used for separation, e.g., (Tong et al. 1991).
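As an illustration of the kind of separation such an attack relies on, the sketch below mixes three synthetic flows and recovers them with FastICA from scikit-learn (illustrative only: the synthetic flows, the mixing matrix, and the choice of FastICA are assumptions; the attack only requires some blind source separation algorithm).

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
t = np.arange(2000)
# three "private" source flows with distinct temporal patterns
F = np.vstack([
    np.sin(0.02 * t),            # periodic flow
    np.sign(np.sin(0.05 * t)),   # square-wave flow
    (t % 97) / 97.0,             # sawtooth flow
])
A = rng.uniform(0.5, 1.5, size=(3, 3))   # unknown mixing (aggregation) matrix
O = A @ F                                # the published aggregate flows

# the adversary only sees O; FastICA recovers signals matching the sources up to scale and order
estimates = FastICA(n_components=3, random_state=0, max_iter=1000).fit_transform(O.T).T
for i, f in enumerate(F):
    best = max(abs(np.corrcoef(f, e)[0, 1]) for e in estimates)
    print(f"source {i}: best correlation with an estimate = {best:.2f}")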
3.3.3 Data Flow Separation as a Blind Source Separation Problem For an attacker who is interested in sensitive information contained in individual data flow, it will be very helpful to separate the individual data flows based on the aggregate data flows. Further attacks such as frequency matching attack (Zhu, et al. 2007) based on separation of data flows can fully disclose sensitive information of data sources. In this paper, we are interested in patterns carried in the time series data. For example, in Figure 1, the attacker can get aggregate data flows O1 from Research Company A, O2 from Research Company B, etc. The attacker's objective is to recover the time series Fi of each individual data flow. Note that an individual
data flow may appear in multiple aggregate flows; e.g., in Figure 1, F3 is contained in both aggregate flows O1 and O2, i.e., O1 = F3 + F6 and O2 = F2 + F3 + F4 + F5. In general, with l observations O1, …, Ol and m individual data flows F1, …, Fm, we can rewrite the problem in vector-matrix notation:

(O1, O2, …, Ol)ᵀ = A_{l×m} (F1, F2, …, Fm)ᵀ,    (1)

where A_{l×m} is called the mixing matrix in blind source separation problems. Data flow separation can be achieved using blind source separation techniques. The individual data flows are independent from each other since they come from different sources. Given the observations O1, …, Ol, blind source separation techniques can estimate the independent individual flows F1, …, Fm by maximizing the independence among the estimated flows. We performed extensive experiments on the data flow separation attack (Zhu et al. 2007); they demonstrated that the attack is very effective at recovering original flows from aggregate flows, and that aggregation alone is ineffective for privacy protection under this attack.
4. METHODS FOR PRESERVING PRIVACY

As presented in Section 3, aggregation alone cannot protect privacy in a DPMS. We propose to add noise to the original time series data to preserve privacy. However, noise adversely affects classification accuracy and aggregation error. In this section, we discuss various approaches to adding noise and their effects on privacy leak, classification accuracy, and aggregation error. Our objective is to identify approaches that preserve privacy with minimal effect on classification accuracy and aggregation error.

Since data flow separation attacks employ blind source separation techniques, which rely on the independence among the original data flows, a countermeasure should add noise that increases the dependence among the noised data flows. According to the dependence change caused by the noise, we classify the approaches into three categories: naive approaches, guided approaches, and the optimal approach. In the naive approaches, data providers add noise independently. In the guided approaches, research companies send guidance to data providers on how to add noise so that the noised data flows from different providers are more dependent than the original flows. In the optimal approach, data providers are willing to let research companies decide how to add noise so as to maximize the dependence among the noised data flows.

We first give two naive approaches in Section 4.1. They are simple methods that do not consider dependence among flows. In Section 4.2, we propose three methods for adding noise that try to increase the dependence among flows; the intuition is that greater dependence makes the aggregate flows harder to separate and therefore improves privacy preservation.

Naive approaches: The first naive approach, random, adds noise to each point in a flow independently of other points in the flow and to each flow independently of other flows. The second naive approach, same noise, adds exactly the same noise to every flow. Due to space limitations, we leave the details of the naive approaches to the companion technical report (Zhu et al. 2007).

Guided approaches: There are two objectives for adding noise, and any method should try to meet both, although they usually conflict. First, to increase dependence among flows, we would like to add noise that is dependent; increasing dependence among flows makes separation harder and the privacy leak lower. Second, adding noise should not significantly affect classification accuracy or aggregation error. To achieve the first objective, we use segments, instead of individual points, as the units for adding noise. A segment is a subsequence of a flow; every flow is broken into segments of the same size, i.e., the same number of points, and similar noise is added to all points in a segment. To achieve the second objective, a threshold is introduced that limits the maximum level of noise that may be added. The noise threshold is expressed as a percentage of a point's value; for example, a noise threshold of 10% lets us change a point whose value is 10 to a value between 9 and 11. Based on these two objectives, three methods for adding noise are proposed to balance privacy
preservation and accuracies. In our discussion, we assume a time series can be separated into segments of equal size. It is straightforward to deal with the case when the last segment has a smaller size. The first method, independent, adds the same level of noise to the points in each segment, i.e., a percentage of a point’s value is computed as noise and added to its value. Each series independently adds its noise. The algorithm for independent is given in Algorithm 1. It is obvious that the naive approach random is a special case of independent, when the segment length is 1.
The second method, conform, is similar to independent in that noise levels are measured as a percentage of a point's value. The difference is that in conform, for each segment, all series add the same level of noise; in other words, the ith segment of every series receives the same noise level. The algorithm for conform is given in Algorithm 2. The third method, smooth, tries to introduce dependence by smoothing the flows. In each segment, the mean value of the segment is calculated; for each point in the segment, if the difference between its value and the mean is within the noise threshold, the point is replaced by the mean, otherwise it is unchanged. The algorithm for smooth is given in Algorithm 3.

Optimal approach: In both the naive and the guided approaches, because data providers control their own noise addition, the dependence among noised data flows cannot be maximized. In this approach, we assume data providers are willing to let research companies decide the noise addition, and all data providers share their original data flows with the research companies. With the knowledge of all data flows, a research company can maximize the dependence among noised data flows to protect privacy, or select an optimal way of adding noise to balance privacy protection and accuracy. The optimal approach can be formulated as a nonlinear programming (NLP) problem with cost function

max_{N1, N2, …, Nn} [ Dep(F1 + N1, F2 + N2, …, Fn + Nn) + Precision(F1 + N1, F2 + N2, …, Fn + Nn) ],

where Fi is the ith original data flow and Ni is the noise vector added to Fi. The function Dep(F1, F2, …, Fn) denotes the total dependence over every pair of data flows F1, F2, …, Fn; the dependence among the data flows determines the performance of data flow separation, i.e., the privacy protection. The function Precision(F1, F2, …, Fn) represents the percentage of flows that are in the same class as their closest neighboring flow; in other words, it is the accuracy of the k-nearest-neighbors classifier (k = 1), which is used in our experiments. The constraint of the NLP problem is

∀i, j:  |O_j^i − O′_j^i| / O_j^i ≤ T,

where T denotes the noise threshold, and O_j^i and O′_j^i denote the jth data point of the ith flow aggregated from the original data flows and from the noised flows, respectively. Note that the linear combinations used to form the aggregate flows Oi and Oi′ are the same.
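Since Algorithms 1–3 are not reproduced in the text, the following Python sketch shows one plausible reading of the three guided methods (illustrative only; the way per-segment noise levels are drawn and the function signature are assumptions). The parameters T and W follow those listed later in Table 2.

import numpy as np

def add_noise(flows, method="independent", T=0.10, W=8, rng=None):
    """flows: m x L array of time series.  T: noise threshold (fraction of a point's value).  W: segment size."""
    rng = rng or np.random.default_rng()
    m, L = flows.shape
    noised = flows.astype(float).copy()
    for start in range(0, L, W):
        seg = slice(start, start + W)
        shared_level = rng.uniform(-T, T)            # used by 'conform': same level for every series
        for i in range(m):
            if method == "independent":
                level = rng.uniform(-T, T)           # each series draws its own level per segment
                noised[i, seg] += level * flows[i, seg]
            elif method == "conform":
                noised[i, seg] += shared_level * flows[i, seg]
            elif method == "smooth":
                mean = flows[i, seg].mean()          # replace points close to the segment mean
                close = np.abs(flows[i, seg] - mean) <= T * np.abs(flows[i, seg])
                noised[i, seg][close] = mean
    return noised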
5. PERFORMANCE EVALUATION

To evaluate the effectiveness of the methods proposed in Section 4, we conduct a set of experiments using the UCR Time Series Classification/Clustering data collection (Keogh et al. 2006). The collection consists of 20 time series data sets, each of which has a training set and a test set. The data sets are of various sizes and the time series are of various lengths. Unless stated otherwise, our experiments are conducted using all 20 data sets and the results are averaged across all data sets.

Table 1. Notations

Symbol     Description
F          a clean flow
F′         a noised flow
F̂          a separated flow from an aggregate flow
F (set)    a set of flows
acc        classification accuracy
err        aggregation error
pl(F, F̂)   privacy leak between F̂ and F
PLN        privacy leak of flows separated from noised aggregate flows
PLC        privacy leak of flows separated from clean aggregate flows (before noise)

Table 2. Parameters

Parameter   Description        Default Value
T           noise threshold    10%
W           segment size       8
In each experiment, noise is added to the training set. To distinguish the two versions of the training data, we call the original flows and training set the clean flows and clean set, and the others the noised flows and noised set, respectively. Each experiment consists of two steps, and we repeat experiments with random noise to minimize randomness in the results. In the first step of an experiment, the test set is classified using kNN (k nearest neighbors) to find the classification accuracy. In our experiments, k is set to 1 and the Euclidean distance is used. For every flow in the test set, kNN finds its closest neighbor in the noised set; if they are from the same class, the test flow is correctly classified. In the second step, 10 noised flows are selected randomly. The selected noised flows and their corresponding clean flows are used to compute the privacy leak and the aggregation error. The noised flows are aggregated and their aggregates are compared to the aggregates of the clean flows to calculate the aggregation error. Next, the aggregate flows are separated using the data flow separation attack described in Section 3.3.3, and the separated flows are compared to the clean flows to find the privacy leaks. For comparison, we also aggregate the clean flows, separate those aggregate flows, and calculate the privacy leak of the resulting separated flows.

Performance metrics: The performance metrics are classification accuracy, aggregation error, and privacy leak. The classification accuracy measures the percentage of flows in the test set that are correctly classified by kNN using the noised set. It is defined as acc = cl/N, where cl is the number of flows in the test set correctly classified by kNN and N is the total number of flows in the test set. The aggregation error measures the difference between aggregate noised flows and aggregate clean flows. Given a set of clean flows F1, F2, …, Fn, their corresponding noised flows F1′, F2′, …, Fn′, and an aggregate function agg, let O and O′ be the aggregate flows from the clean flows and the noised flows, respectively, i.e., O = agg(F1, F2, …, Fn) and O′ = agg(F1′, F2′, …, Fn′). The aggregation error err(O, O′) is defined as

err(O, O′) = (Σ_{i=1}^{L} |O′_i − O_i| / O_i) / L,

where O_i and O′_i are the ith points of O and O′ respectively, and L is the length of the flows. As mentioned, 10 noised flows are selected for aggregation in each run of each experiment, which generates 10 aggregate flows; the aggregation error is averaged over all aggregate flows. To measure the effects on privacy preservation, we calculate the privacy leak between separated flows and noised flows, and between separated flows and clean flows, to evaluate how much privacy is preserved by the noise and how much by the aggregation. For comparison, we also use the clean flows as sources and calculate the privacy preservation achieved by aggregation only. Given a set of noised flows F′ = {F1′, F2′, …, Fn′} and their clean counterparts F = {F1, F2, …, Fn}, we measure the privacy leaks before and after adding noise. That is, the privacy leaks of separated flows from clean
aggregate flows, PLC = pl(F̂, F), and the privacy leaks of separated flows from noised aggregate flows, PLN = pl(F̂′, F). The set-level definition of pl(F̂, F) is given in Section 3.3.1.
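For concreteness, the two accuracy-side metrics can be computed as follows (an illustrative sketch; 1-NN with Euclidean distance as in the experiments, and the flows are assumed to contain no zero-valued points, as required by the aggregation-error formula).

import numpy as np

def aggregation_error(O, O_noised):
    """Mean relative difference between an aggregate clean flow and its noised counterpart."""
    return float(np.mean(np.abs(O_noised - O) / np.abs(O)))

def knn_accuracy(train_flows, train_labels, test_flows, test_labels):
    """1-NN classification accuracy of the test set against the (noised) training set."""
    correct = 0
    for x, y in zip(test_flows, test_labels):
        d = np.linalg.norm(train_flows - x, axis=1)
        correct += int(train_labels[int(np.argmin(d))] == y)
    return correct / len(test_flows)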
Figure 2. Classification Accuracy
Figure 3. Privacy Leak
Figure 4. Comparison
The notations are summarized in Table 1, and Table 2 lists the two parameters used in the naive and guided approaches together with their default values. Due to space limitations, we leave the experiments on the naive approaches, the segment size, and the aggregation error to the companion technical report (Zhu et al. 2007).

Guided approaches and optimal approach: The three segment-based methods for adding noise, independent, conform, and smooth, are compared with respect to different noise thresholds and segment sizes. Figure 2 shows the classification accuracy for the three methods. We observe that smooth is insensitive to the noise threshold, while in conform, and even more so in independent, classification accuracy drops significantly as the noise threshold increases. This means we can add more noise in smooth without hurting classification accuracy. In Figure 3, the privacy leak for various noise thresholds is compared. As expected, the privacy leak decreases as the noise threshold increases for all three methods; here, independent beats conform, which in turn beats smooth. Figure 4 shows the comparison between the optimal approach and the guided approaches. We use a simulated annealing algorithm to solve the NLP problem defined in Section 4.3. We observe that the optimal approach achieves the highest classification accuracy and the lowest privacy leak among all approaches, while the aggregation error is comparable for all approaches.
6. CONCLUSION In this paper, we proposed a spectrum of methods to preserve privacy and evaluated their performance using real datasets. Our experiments show that these methods can preserve privacy without seriously sacrificing classification accuracy or increasing aggregation error. We have also analyzed the effect of noise on privacy preservation, aggregation error, and classification accuracy.
REFERENCES
R. Agrawal & R. Srikant (2000). 'Privacy-Preserving Data Mining'. In SIGMOD Conference, pp. 439–450.
J. Cardoso (1998). 'Blind signal separation: statistical principles'. Proceedings of the IEEE 9(10):2009–2025. Special issue on blind identification and estimation.
P. Comon (1994). 'Independent component analysis, a new concept?'. Signal Process. 36(3):287–314.
J. C. da Silva & M. Klusch (2007). 'Privacy-Preserving Discovery of Frequent Patterns in Time Series'. In Industrial Conference on Data Mining, pp. 318–328.
R. O. Duda, et al. (2000). Pattern Classification. Wiley-Interscience Publication.
M. Gaeta & J.-L. Lacoume (1990). 'Source separation without prior knowledge: the maximum likelihood solution'. In Proc. EUSIPCO'90, pp. 621–624.
A. Hyvärinen (1999). 'Fast and Robust Fixed-Point Algorithms for Independent Component Analysis'. IEEE Transactions on Neural Networks 10(3):626–634.
E. Keogh, et al. (2006). 'The UCR Time Series Classification/Clustering Homepage'. http://www.cs.ucr.edu/~eamonn/time_series_data/.
Y. Lindell & B. Pinkas (2000). 'Privacy Preserving Data Mining'. In CRYPTO, pp. 36–54.
S. Papadimitriou, et al. (2007). 'Time series compressibility and privacy'. In VLDB, pp. 459–470. VLDB Endowment.
L. Tong, et al. (1991). 'Indeterminacy and identifiability of blind identification'. IEEE Transactions on Circuits and Systems 38(5):499–509.
Y. Zhu, et al. (2007). 'On Privacy in Time Series Data Mining'. Electrical and Computer Engineering Technical Report CSU-ECE-TR-07-02, Cleveland State University.
UNIFYING THE SYNTAX OF ASSOCIATION RULES
Michal Burda
Department of Information and Communication Technologies, University of Ostrava
Ceskobratrska 16, Ostrava, Czech Republic
ABSTRACT
The discovery of association rules is one of the most essential disciplines of data mining. This paper studies various types of association rules with a focus on their syntax. Based on that study, a new formalism unifying the syntax and capable of handling a wide range of association rule types is formally established. This logic is intended as a tool for the further study of the theoretical properties of various association rule types.
KEYWORDS
Association rules, data mining, typed relation, formalism.
1. INTRODUCTION
Knowledge Discovery from Databases (or Data Mining) is a discipline at the border of computer science, artificial intelligence and statistics (Han & Kamber 2000). Roughly speaking, its goal is to find something interesting in given data. A part of data mining concentrates on finding potentially useful knowledge in the form of (association) rules. An association rule is a mathematical formula expressing some relationship that (very probably) holds in data. The result of the association rule mining process frequently serves as a tool for understanding the character of the analyzed data.
Association rules have become one of the very well researched areas of data mining. However, scientists are mostly interested in finding new types of association rules or in improving the mining algorithms. The emergence of a new rule type usually leads to the introduction of a new notation. That is, association rules of different types often have completely dissimilar syntax. Such non-uniformity makes a high-level study of rule types, and the uncovering of similarities between them, very hard and unwieldy.
In order to study different rule types deeply, to enable the exploration of similarities and relationships between various rule types, and to infer formal conclusions about the rules, we need a tool for uniform rule notation. That is, we need a formal language capable of expressing association rules of many different types.
1.1 Related Work
The formalization of association rules based on a flat usage of first-order predicate logic quickly reaches a dead end. In the GUHA method (Hajek & Havranek 1978), so-called generalized quantifiers were used to cope with the problem. Generalized quantifiers are a natural generalization of the classical quantifiers ∀ (universal) and ∃ (existential). For example, Rescher's plurality quantifier W(F) says that "most objects satisfy formula F". As a second example, Church's quantifier of implication (not to be confused with the logical connective of implication) ⇒(F1, F2) says that the formula F2 is true for all objects for which formula F1 is true. The authors of the GUHA method (Hajek & Havranek 1978) introduced many such generalized quantifiers to model various relationships. However, even Hajek and Havranek (1978) introduced a specialized calculus for each type of rule rather than a general language capable of handling rules of very different types.
In this paper, I present an alternative approach to the formalism of association rules by establishing a formal logic that is general enough to express very different association rule types. I have tried not to treat association rules as formulae interconnected with quantifiers but rather as pieces of data described with
relational operations and interconnected with predicates. The difference lies in the level of the logical notions at which the knowledge is represented. While GUHA uses predicates simply to denote attributes and quantifiers to model relationships, I have attempted to hide the way the objects figuring in a rule are described inside functional symbols and to use predicates for relationship modeling. As a result, the Probabilistic Logic of Typed Relations (PLTR) is developed.
2. STATE OF THE ART
The following sections briefly describe some important types of association rules as well as various techniques related to the association rule mining process.
2.1 Market Basket Analysis
Association rules are an explicit representation of knowledge about possibly interesting relationships (associations, correlations) that hold in data. They appear mostly in the form of mathematical formulae. There exist many approaches to obtaining such rules from data; statistical methods or sophisticated empirical procedures are used to measure the relevance of a rule. Market basket analysis (Agrawal, et al., 1993) produces probably the best-known rule type: e.g. the evidence that "76 % of customers buying bread purchase milk, too" is symbolically written as follows: bread ⇒ milk (support: 2 %, confidence: 76 %).
(1)
The conditional probability (here 76 %), called confidence, is often accompanied by a characteristic called support (here 2 %), which denotes the relative number of records that satisfy both the left- and right-hand sides of a given rule.
Formally, the problem of market basket analysis is stated in (Agrawal, et al., 1993) and (Agrawal & Srikant 1994) as follows: Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier TID. We say that a transaction T contains X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y. Given a set of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf), respectively. Such rules are called strong.
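As a small illustration of these definitions (not code from the cited papers), the support and confidence of a candidate rule X ⇒ Y can be computed directly from a list of transactions; the toy data and thresholds below are made up.

```python
# Support and confidence of X => Y over transactions represented as sets of items.
def support(D, itemset):
    return sum(1 for T in D if itemset <= T) / len(D)

def confidence(D, X, Y):
    sup_X = support(D, X)
    return support(D, X | Y) / sup_X if sup_X > 0 else 0.0

# Toy data; the minsup and minconf values are arbitrary.
D = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
X, Y = frozenset({"bread"}), frozenset({"milk"})
strong = support(D, X | Y) >= 0.02 and confidence(D, X, Y) >= 0.5
```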
2.2 The GUHA Method
Although the world still considers (Agrawal, et al., 1993) the pioneering paper in the field of association rules, it was in fact not the very first publication to introduce association rules. The GUHA method (General Unary Hypothesis Automaton (Hajek & Havranek 1978)), a unique method developed by Czech scientists in the seventies of the twentieth century, remains practically ignored by the rest of the world. GUHA is a complex system of data analysis methods that systematically apply various statistical tests to source data to generate rules (historically: relevant questions). The main principle is to describe all possible assertions that might be hypotheses, not to verify previously formulated hypotheses (Hajek & Havranek 1978). The rules produced by the GUHA method consist of formulae (F1, F2) interconnected with a generalized quantifier ∼:
F1 ∼ F2 (2)
The quantifier's definition is the core of the rule's meaning and interpretation, since the definition determines the rule's truth value.
As an illustrative result of the GUHA method, consider a database of patients suffering from a certain disease. We can obtain, e.g., the following rule: weight > 100kg & smoker & not(sport) ⇒0.95 heart-failure.
(3)
(Note the similarity to multi-dimensional association rules (Han & Kamber 2000).) The quantifiers are defined using the following contingency table, which summarizes the numbers of objects satisfying certain configurations: e.g. a denotes the number of objects satisfying both F1 and F2, etc.
Table 1. The contingency table for GUHA rules
            F2      not(F2)
F1          a       b
not(F1)     c       d
Definitions of some quantifiers (Hajek & Havranek 1978) follow:
1. A quantifier ⇒p(a, b, c, d), also called founded implication, is defined for 0 < p ≤ 1 as follows: the rule F1 ⇒p F2 is true iff a / (a + b) ≥ p.
2. A quantifier ∼d(a, b, c, d), also called the simple associational quantifier, is defined for d ≥ 0 as follows: the rule F1 ∼d F2 is true iff ad > e^d·bc (in particular, for d = 0 we get ad > bc).
Let us get even more complicated. Hajek and Havranek (1978) further present rules of the form:
X corr Y / C
(4)
saying that "the values of attributes X and Y are correlated when considering only the objects that fulfill the condition C." Rules of this type are well applicable to numeric data. The quantifier corr is called the correlational quantifier. Correlational quantifiers of the GUHA method are based on ranks. Assume we have obtained a set O of objects and have measured two quantitative characteristics, t and u (t_o represents the value of characteristic t of object o, o ∈ O). Let Ri be the number of objects whose characteristic t is lower than t_i; Ri is called the rank of object i according to characteristic t. Similarly, we define Qi to be the rank of object i according to characteristic u. Based on ranks, e.g. Spearman's quantifier s-corrα is defined for 0 < α ≤ 0.5 as follows: s-corrα = 1 iff ∑(i=1..m) Ri·Qi ≥ kα, where kα is a suitable constant. For more information see e.g. Hajek and Havranek (1978).
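The two quantifiers defined above operate only on the four-fold table (a, b, c, d). A direct, illustrative transcription might look as follows; the function names are ours, and the parameter of the simple associational quantifier is renamed delta to avoid clashing with the cell count d.

```python
# GUHA quantifiers evaluated on the contingency table (a, b, c, d) of Table 1.
import math

def founded_implication(a, b, c, d, p):
    """F1 =>_p F2 is true iff a / (a + b) >= p, for 0 < p <= 1."""
    return a / (a + b) >= p

def simple_association(a, b, c, d, delta=0.0):
    """F1 ~_delta F2 is true iff a*d > e**delta * b*c (delta = 0 gives ad > bc)."""
    return a * d > math.exp(delta) * b * c
```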
2.3 Emerging Patterns
Emerging pattern mining is a technique for discovering trends and differences in a database (Dong & Li 1999). Emerging patterns capture significant changes and differences between datasets. They are defined as itemsets whose supports increase significantly from one dataset to another. More specifically, emerging patterns are itemsets whose growth rates (the ratios of the two supports) are larger than a threshold specified by the user. Dong and Li (1999) present an example on a dataset of edible and poisonous mushrooms, where the following emerging pattern was found: odor = none & gill-size = broad & ring-number = 1.
(5)
Such a pattern has a growth rate of "∞" when comparing poisonous and edible mushrooms, because the support in the dataset of poisonous mushrooms was 0 % while for edible mushrooms the support was 63.9 %. According to Dong and Li (1999), the emerging pattern mining problem is defined as follows. Let I (set of items), D (dataset) and T (transaction) be symbols defined as in Section 2.1. A subset X ⊂ I is called a
k-itemset (or simply an itemset), where k = |X|. We say a transaction T contains an itemset X if X ⊆ T. The support of an itemset X in a dataset D is denoted as supD(X). Given a number s > 0, we say an itemset X is s-large in D if supD(X) ≥ s, and X is s-small in D otherwise. Let larges(D) (resp. smalls(D)) denote the collection of all s-large (resp. s-small) itemsets. Assume that we are given an ordered pair of datasets D1 and D2. The growth rate of an itemset X from D1 to D2, denoted as growthrate(X), is defined as
growthrate(X) = 0, if supD1(X) = 0 and supD2(X) = 0;
growthrate(X) = ∞, if supD1(X) = 0 and supD2(X) ≠ 0;
growthrate(X) = supD2(X) / supD1(X), otherwise.    (6)
Given r > 1 as a growth-rate threshold, an itemset X is said to be an r-emerging pattern from D1 to D2, if growthrate(X) ≥ r. Clearly, the emerging pattern mining problem is, for a given growth-rate threshold r, to find all r-emerging patterns. For more information see Dong and Li (1999).
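A direct, illustrative transcription of the growth-rate definition follows; the function names are ours.

```python
# Growth rate and r-emerging pattern test from the definition above.
def sup(D, X):
    """Support of itemset X in dataset D (a list of transactions, each a set)."""
    return sum(1 for T in D if X <= T) / len(D)

def growth_rate(D1, D2, X):
    s1, s2 = sup(D1, X), sup(D2, X)
    if s1 == 0:
        return 0.0 if s2 == 0 else float("inf")
    return s2 / s1

def is_r_emerging(D1, D2, X, r):
    """X is an r-emerging pattern from D1 to D2 iff growthrate(X) >= r (r > 1)."""
    return growth_rate(D1, D2, X) >= r
```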
2.4 Impact Rules
The so-called impact rules are concisely described by Aumann and Lindell (1998), Aumann and Lindell (1999), and Webb (2001). Impact rules utilize statistical tests to identify potentially interesting rules; this approach is similar to the ideas of the GUHA method. Generally, impact rules are of the following form:
population subset ⇒ interesting behaviour,    (7)
where "population subset" is some reasonable condition and "interesting behaviour" is a characteristic that is unusual and rather exceptional when compared between the sample selected by the left-hand condition and the rest of the data. That characteristic could be, e.g., the mean or variance. Below is an example of an impact rule: sex = female ⇒ wage: mean = $7.90/hr (overall mean wage = $9.02).
(8)
Such a rule indicates that the mean wage of women differs significantly from that of the rest of the examined objects. To determine the significance of such rules, Aumann and Lindell (1998) use the Z-test, a two-sample statistical test of differences in means.
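As a rough sketch of such a test (the exact statistic and thresholds used by Aumann and Lindell are not reproduced here), the target attribute in the subset selected by the rule's condition can be compared against the rest of the data with a two-sample Z statistic:

```python
# Two-sample Z statistic for an impact rule: mean of the target attribute in the
# subset selected by the rule's condition vs. the rest of the data.
import math
from statistics import mean, pvariance

def impact_z(subset_values, rest_values):
    m1, m2 = mean(subset_values), mean(rest_values)
    v1, v2 = pvariance(subset_values), pvariance(rest_values)
    n1, n2 = len(subset_values), len(rest_values)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# |z| larger than a critical value (about 1.96 for a two-sided 5 % test) marks
# the difference in means, e.g. women's wages vs. the rest, as significant.
```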
3. PROBABILISTIC LOGIC OF TYPED RELATIONS
The Probabilistic Logic of Typed Relations (PLTR) is a logic intended for use by people interested in association rule research. Its initial intent is to provide a tool for the formal representation of association rules. As we will see later in this paper, it is general enough to express arbitrary association rule types using the same formalism. Thus, it can be used in precise formal definitions and considerations of similarities and other properties of various association rule types. The base of PLTR was developed by Burda, et al. (2005).
PLTR is based on the notion of a typed relation (MacCaull & Orlowska 2004), which corresponds to the intuitive conception of a "data table", and the well-known relational operations of projection and selection. For our purposes, the original definitions of MacCaull and Orlowska (2004) were modified. Specifically, the set Y of objects is added so that the operation of projection holds duplicities. The rest of the definitions (truth values, relationship predicates etc.) as well as the fundamental idea of using this formalism for association rule representation are new.
Definition 1. Let Ω be a set of abstract elements which we will call the attributes. Let each a ∈ Ω have assigned a non-empty set Da called the domain of the attribute a. A type A (of relations) is any finite subset of the set Ω. We denote by TΩ the set of all types.
A type A of relation is, roughly, a description of a data table. It says what attributes (columns) are present in the data table and what data can be stored in those attributes (attribute domains).
Example 1. For instance, let a, b, c ∈ Ω, Da = N, Db = R and Dc be equal to the set of all words made from English letters of length at most 30; then the set A = {a, b, c} is a type.
Definition 2. Let Y be a set of abstract elements which we call the objects. Let A ∈ TΩ be a type. A tuple of type A is a pair ⟨k, l⟩, where k ∈ Y and l is a mapping such that ∀a ∈ A: l(a) ∈ Da. The set of all tuples of type A is denoted by 1A. The set of all tuples of type a (a ∈ Ω) is denoted by 1a. A relation R of type A is a finite subset of 1A, R ⊂ 1A.
Example 2. A tuple of type A is intuitively a representation of a single row of a data table. Consider the type A from Example 1. An exemplary tuple of type A is ⟨k, l⟩ with, e.g., l(a) = 1, l(b) = 0.25 and l(c) = "Tom", where k ∈ Y.
Definition 3. A selection from relation R of type A according to a condition C is a relation R(C) = {u: u ∈ R & C(u)} of type A. The notation C(u) denotes a selection condition and constitutes the fact that condition C holds on a tuple u. A projection of relation R to the type B is a relation R[B] = {u = ⟨k, lu⟩ ∈ 1B: (∃v = ⟨k, lv⟩ ∈ R)(∀b ∈ B)(lu(b) = lv(b))} of type B. The function Orig(R) assigns an original relation to R: Orig(R(C)) = Orig(R), Orig(R[B]) = Orig(R).
Example 3. See Table 2 for an example of a relation R of type A and the results of the selection and projection operations.
Table 2. Concrete Example of Selection and Projection on Data Table R
Data table R
    a   b     c
1   1   0.25  Tom
2   5   0.65  Jack
3   7   0.34  Bill
4   8   0.88  John
5   8   0.25  Tom
6   9   0.11  Tom

R(a > 6)
    a   b     c
3   7   0.34  Bill
4   8   0.88  John
5   8   0.25  Tom
6   9   0.11  Tom

R[c]
    c
1   Tom
2   Jack
3   Bill
4   John
5   Tom
6   Tom
Please note that projection is defined so as to hold duplicities. This is an important difference from MacCaull and Orlowska (2004): we need to keep the duplicities so as not to lose information important for statistical tests. Please also observe the definition of the original relation above. The original relation of a relation R' is the relation R that was used to "compute" R' by applying the operations of selection and projection. That is, if we had a relation X = R(C)[A, B], the original relation of X is Orig(X) = R, and the original relation of R is R itself; Orig(R) = R. This slight complication will allow us to define some types of relationship predicates later in this text.
We can now proceed from these basic definitions and define the general notion of a relationship predicate. A relationship predicate is simply a mapping assigning a truth value to a vector of relations. Since we are building a probabilistic logic, the truth value will be a probability.
Definition 4. The set V of truth values is the set of all real numbers, R. Let A1, A2, …, An ∈ TΩ. Then an n-ary relationship predicate is a mapping p: Dp → V, where Dp ⊆ 1A1 × 1A2 × … × 1An is a set called the domain of the relationship predicate p.
So, a relationship predicate is a mapping that assigns a truth value to certain relations. It is obvious that we can model various relationships that way. The definition presented above assumes the predicate results in a probability. However, we can modify the definitions to suit classical two-valued logic or to create predicates of some other multi-valued logic.
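The following Python sketch (our own illustration, not part of PLTR's formal apparatus) mimics Definitions 2 and 3: tuples carry an object identifier k, so projection keeps duplicities, exactly as in Table 2.

```python
# Typed relation as a list of (object id, attribute mapping) pairs; selection and
# projection keep the object identifier, so duplicates survive projection.
def select(R, condition):
    """R(C): keep the tuples whose mapping satisfies the condition C."""
    return [(k, l) for (k, l) in R if condition(l)]

def project(R, B):
    """R[B]: restrict each mapping to the attributes in B, keeping duplicities."""
    return [(k, {a: l[a] for a in B}) for (k, l) in R]

# Data table R from Table 2.
R = [(1, {"a": 1, "b": 0.25, "c": "Tom"}),
     (2, {"a": 5, "b": 0.65, "c": "Jack"}),
     (3, {"a": 7, "b": 0.34, "c": "Bill"}),
     (4, {"a": 8, "b": 0.88, "c": "John"}),
     (5, {"a": 8, "b": 0.25, "c": "Tom"}),
     (6, {"a": 9, "b": 0.11, "c": "Tom"})]

selected = select(R, lambda l: l["a"] > 6)   # rows 3-6, as in R(a > 6)
projected = project(R, {"c"})                # six rows of c; duplicates kept (R[c])
```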
4. EVALUATION OF PLTR
To show the strength of PLTR, this section uses PLTR for definitions of various association rule types.
4.1 GUHA in PLTR
Definition 5. Let ∼(a,b,c,d) be a GUHA (associational or implicational) quantifier. Then the relationship predicate based on the GUHA quantifier ∼ is a relationship predicate ∼' defined for all X, Y ⊆ 1A as ∼'(X, Y) = ∼(a,b,c,d), where a = |X ∩ Y|, b = |X ∩ (Orig(Y) − Y)|, c = |(Orig(X) − X) ∩ Y|, d = |(Orig(X) − X) ∩ (Orig(Y) − Y)|.
Such a definition can be used for each associational or implicational quantifier of the GUHA method. The relationship predicate ∼' is true iff the generalized quantifier ∼(a,b,c,d) is true, too. For quantifiers that deal directly with a probability (e.g. the founded implication discussed above), we can provide an alternative definition with that probability being the truth value of the relationship predicate:
Definition 6. The relationship predicate of founded implication is a relationship predicate ⇒' defined for all X, Y ⊆ 1A as ⇒'(X, Y) = a/(a+b), where a = |X ∩ Y|, b = |X ∩ (Orig(Y) − Y)|.
For instance, if R was a typed relation containing data, formula (3) would be expressed in PLTR as: ⇒'(R(weight > 100kg & smoker & not(sport)), R(heart-failure))
(9)
or, in infix notation (both with truth value ≥ 0.95): R(weight > 100kg & smoker & not(sport)) ⇒' R(heart-failure)
(10)
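Continuing the sketch above (illustrative code, not from the paper), Definition 6 can be evaluated on sets of object identifiers; the example data below are made up so that the truth value comes out as 0.95, matching rule (3).

```python
# Founded-implication relationship predicate of Definition 6: truth value a/(a+b),
# with X and Y selections from a common original relation (sets of object ids here).
def founded_implication_predicate(X, Y, orig):
    a = len(X & Y)              # a = |X ∩ Y|
    b = len(X & (orig - Y))     # b = |X ∩ (Orig(Y) − Y)|
    return a / (a + b) if (a + b) > 0 else 0.0

orig = set(range(1, 101))                 # 100 objects (made-up data)
X = set(range(1, 41))                     # e.g. R(weight > 100kg & smoker & not(sport))
Y = set(range(1, 39)) | {50, 60}          # e.g. R(heart-failure)
truth = founded_implication_predicate(X, Y, orig)   # 38 / 40 = 0.95
```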
Definition 7. Let t, u ∈ Ω be attributes of domain [0,1] (Dt = Du = [0,1]). Then Spearman's correlational relationship predicate s-corr' is defined for each X ⊆ 1t and Y ⊆ 1u as follows: s-corr'(X, Y) = (1 − p), where p = min({α : s-corrα(X, Y) = 1}).
If R was a typed relation of data about blood pressure and heartbeat frequency, the original GUHA correlational rule pressure s-corr0.05 frequency / man & ill (11) would be equal to the subsequent PLTR rule with truth value equal to 0.95: s-corr'(R(man & ill)[pressure], R(man & ill)[frequency]).
(12)
Infix notation:
R(man & ill)[pressure] s-corr' R(man & ill)[frequency].
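A pragmatic approximation of such a correlational predicate (our illustration; the exact GUHA s-corr quantifier and its constants kα are not implemented here) is to take one minus the p-value of a Spearman rank correlation test. The data below are made up.

```python
# Approximate correlational relationship predicate: truth value 1 - p, where p is
# the p-value of SciPy's Spearman rank correlation test (an approximation of
# Definition 7, not the exact GUHA s-corr quantifier).
from scipy.stats import spearmanr

def s_corr_predicate(x_values, y_values):
    rho, p = spearmanr(x_values, y_values)
    return 1.0 - p

# Made-up pressure/frequency columns of R(man & ill):
pressure = [0.61, 0.72, 0.55, 0.80, 0.66, 0.90, 0.58, 0.75]
frequency = [0.52, 0.70, 0.50, 0.85, 0.60, 0.95, 0.55, 0.72]
truth = s_corr_predicate(pressure, frequency)   # close to 1 for strongly correlated data
```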
4.2 Emerging Patterns in PLTR
Definition 8. The growth-rate based relationship predicate 1, X being an r-emerging pattern is equivalent to a truth value of a rule (14) R[X]