Multiagent Data Warehousing (MADW) and Multiagent Data Mining (MADM) Proceedings of a Workshop held in Conjunction with 2005 IEEE International Conference on Data Mining Houston, USA, November 27, 2005

Edited by Wen-Ran Zhang Georgia Southern University, USA Yanqing Zhang Georgia State University, USA Xiaohua (Tony) Hu Drexel University, USA

ISBN 0-9738918-0-7


The papers appearing in this book reflect the authors’ opinions and are published in the interests of timely dissemination based on review by the program committee or volume editors. Their inclusion in this publication does not necessarily constitute endorsement by the editors. ©2005 by the authors and editors of this book. No part of this work can be reproduced without permission except as indicated by the “Fair Use” clause of the copyright law. Passages, images, or ideas taken from this work must be properly credited in any written or published materials. ISBN 0-9738918-0-7 Printed by Saint Mary’s University, Canada.

CONTENTS Workshop Committee ……………………………………………………………… ii Foreword ………………………………………………………………………………. iii A Multiagent Framework to Integrate and Visualize Gene Expression Information Li Jin, Karl V. Steiner, Carl J. Schmidt, Gang Situ, Sachin Kamboj, Kay T. Hlaing, Morgan Conner, Heebal Kim, Marlene Emara, and Keith S. Decker University of Delaware, USA…………………..………..............……………..............……………....1

Concepts, Challenges, and Prospects on Multiagent Data Warehousing and Multiagent Data Mining Wen-Ran Zhang, Dept. of Computer Science College of Information Technology, Georgia Southern University, USA ……………..............……8

Multi-Party Sequential Pattern Mining Over Private Data
Justin Zhan (1), LiWu Chang (2), and Stan Matwin (3)
(1, 3) School of Information Technology & Engineering, University of Ottawa, Canada; (2) Center for High Assurance Computer Systems, Naval Research Laboratory, USA ……….......18

Privacy-Preserving Decision Tree Classification Over Vertically Partitioned Data
Justin Zhan (1), Stan Matwin (2), and LiWu Chang (3)
(1, 2) School of Information Technology & Engineering, University of Ottawa, Canada; (3) Center for High Assurance Computer Systems, Naval Research Laboratory, USA………........27

Data Mining for Adaptive Web Cache Maintenance Sujaa Rani Mohan, E. K. Park, and Yijie Han, University of Missouri, Kansas City, USA……….36

Temporal Intelligence for Multiagent Data Mining in Wireless Sensor Networks Sungrae Cho, Ardian Greca, Yuming Li, and Wen-Ran Zhang Department of Computer Science, Georgia Southern University, USA……………………………44

A Schema of Multiagent Negative Data Mining Fuhua Jiang, Yan-Qing Zhang, and A.P. Preethy Dept. of Computer Science, Georgia State University, USA ………………………………............49

Distributed Multi-Agent Knowledge Space (DMAKS): A Knowledge Framework Based on MADWH Adrian Gardiner, Dept. of Information Systems, Georgia Southern University, USA…................53

Applying MultiAgent Technology for Distributed Geospatial Information Services

Naijun Zhou 1 and Lixin Li 2 1 Department of Geography, University of Maryland - College Park 2 Department of Computer Sciences, Georgia Southern University ……………………….………62


Workshop Committee Honorary Chair M. N. Huhns, University of South Carolina, USA Workshop Organizers and Program Committee Co-Chairs Wen-Ran Zhang Georgia Southern University, USA [email protected]

Yan-Qing Zhang Georgia State University, USA [email protected]

Xiaohua Tony Hu Drexel University, USA [email protected]

Publicity Chair
Yuchun Tang, Georgia State University, USA, [email protected]

Program Committee
Ajith Abraham (Chung-Ang University, Korea), Nick Cercone (Dalhousie University, Canada), Sungrae Cho (Georgia Southern University, USA), Diane J. Cook (University of Texas – Arlington, USA), Dejing Dou (University of Oregon, USA), Xiaohua Tony Hu (Drexel University, USA), Mark Last (Ben-Gurion University of the Negev, Israel), Vincenzo Loia (Universita di Salerno, Italy), Yi Pan (Georgia State University, USA), Ziyong Pen (Wuhan University, China), Zhongzhi Shi (Chinese Academy of Science, China), Il-Yeol Song (Drexel University, USA), Raj Sunderraman (Georgia State University, USA), Yong Tang (Zhongshan University, China), David Taniar (Monash University, Australia), Juan Vargas (University of South Carolina, USA), Feiyue Wang (University of Arizona, USA), Xindong Wu (University of Vermont, USA), John Yen (Pennsylvania State University, USA), Hao Ying (Wayne State University, USA), Wen-Ran Zhang (Georgia Southern University, USA), Yan-Qing Zhang (Georgia State University, USA), Yi Zhang (Univ. of Electronic Science and Technology of China), Ning Zhong (Maebashi Institute of Technology, Japan)

Website: http://tinman.cs.gsu.edu/~cscyntx/ICDM-MADW-MADM2005.htm


Foreword

The 2005 IEEE-ICDM Workshop on Multiagent Data Warehousing and Multiagent Data Mining (MADW/MADM-2005) is the first international workshop of its kind. The workshop brings together researchers from diverse areas including data mining, data warehousing, multiagent systems, artificial intelligence, computational intelligence, machine learning, robot control, knowledge management, bioinformatics, neuroscience, and other related areas to lay out the foundation for MADW/MADM. Biological systems such as brains have enormous capabilities in information processing and coordinated knowledge discovery. One challenging issue facing data mining and knowledge discovery today is understanding how the enormous amount of radio, audio, spatio-temporal, and bio-information is processed by the massive number of neural or genetic agents of biological systems and how multiple agents can be coordinated for information processing and knowledge discovery at the micro and/or macro system levels. MADWH/MADM can be considered a YinYang pair where the Yin is internal centralization that promotes coordinated computational intelligence (CCI), and the Yang is external distribution that promotes distributed artificial intelligence (DAI). The two sides coexist and reinforce each other in knowledge discovery. Many disciplines including computational intelligence and artificial intelligence can join forces on the common platform of MADWH/MADM.

Technical issues include (but are not limited to):
• Necessity, applicability, and feasibility analysis for MADW/MADM in different domains;
• Coordinated computational intelligence (CCI) and distributed artificial intelligence (DAI);
• Dimension analysis, algorithms, and methods for MADM/MADW;
• Multiagent data mining (MADM) vs. multirelational data mining (MRDM);
• Agent cuboids, schemas, and architectures of MADW;
• Query languages for OLAP and OLAM with MADW and MADM;
• Mining agent association rules in first-order logic;
• Coordination protocols for collaborative knowledge discovery with MADW/MADM;
• Agent discovery, law discovery, self-organization, and reorganization;
• Full autonomy as a result of coordination of semiautonomous agents;
• Reinforced knowledge discovery with the interplay of MADW/MADM;
• Agent similarity and orthogonal MADW;
• MADW/MADM for brain modeling;
• MADW/MADM for applications in security/privacy, bioinformatics, biomedicine, the semantic Web, e-business, Web services, Web mining, grids, wireless networks, mobile networks, ad hoc networks, sensor networks, flexible engineering, robot learning/control, knowledge management, geographical information systems, and other suitable domains.

These proceedings include nine papers in the following categories: Concepts-Challenges-Prospects; Application in Bioinformatics; Privacy/Security in Web-Based MADM; Adaptive Web Cache for MADM; Temporal Intelligence for MADM in Wireless Sensor Networks; MADM for Machine Learning; MADM from Geographical Databases; and MADW/MADM for Knowledge Management. This workshop marks the birth of a new research area. The short-term potential of this emerging area lies in multidimensional reorganizable agent-oriented OLAP and OLAM in business, engineering, and biomedical applications; its long-term impact can be far-reaching in scientific discoveries, especially in knowledge discovery about bio-agents, agent associations, agent organizations, and natural laws. Our thanks go to the authors, the program committee members, Publicity Chair Yuchun Tang, Honorary Chair Michael N. Huhns, and IEEE ICDM05 Workshops Chair Pawan Lingras for their contributions, services, and support.

Co-Organizers: Wen-Ran Zhang Yanqing Zhang Xiaohua (Tony) Hu


A Multiagent Framework to Integrate and Visualize Gene Expression Information Li Jin1, 2, *, Karl V. Steiner 2, 3, *, Carl J. Schmidt 4, Gang Situ1, Sachin Kamboj1, Kay T. Hlaing1, Morgan Conner1, Heebal Kim4, Marlene Emara4, and Keith S. Decker1, * 1 Department of Computer Information Sciences, 2Delaware Biotechnology Institute, 3 Department of Electrical and Computer Engineering, 4Department of Animal and Food Sciences, University of Delaware, Newark, DE, 19711 *Email: 1{jin, decker}@cis.udel.edu, [email protected] Abstract

With rapidly growing amounts of genomic and expression data publicly available, efficient and automated analysis tools are increasingly important for biologists to derive knowledge for a large variety of organisms. Multiagent information gathering methods can be used to retrieve and integrate genomic and expression information from available databases or web sources to generate knowledge databases for organisms of interest. We present a novel, flexible and generalizable bioinformatics multiagent system, called BioMAS, which in this paper is used to gather and annotate genomic data from various databases and web sources and analyze the expression of gene products for a given organism. In this paper, we also present a new approach to visualize complex datasets representing gene expression and pathway models in a hierarchical view space using the Starlight information visualization system (Starlight). This approach is an innovative application of Starlight in the field of comparative genomics.

1. Introduction

1.1. Overview

Since the successful completion of the human genome project in 2003 [1], not only genomic data but also proteomic and expression data of numerous species have been published. Homologs, which are homologous genes that have common origins and share an arbitrary threshold level of similarity determined by alignment of matching bases [2], play an important role in predicting gene products by using the annotation in public databases. Therefore, sequences of different organisms are compared for clues about gene function.

Public databases, such as KEGG [3], GenBank [4], SwissProt [5], and Ensembl [6], provide a huge data resource for genomic information, metabolic pathways and gene expression that can be utilized to study sequences, gene expression, or pathways of closely related species by sequence and functional annotations [7]. However, these public databases are heterogeneous and constantly updated, and new data sources are constantly published on-line. Therefore, a multiagent information gathering system, which satisfies the following basic requirements, should be helpful to advance biological research. (1) It can retrieve and integrate sequence and function information from distributed, heterogeneous and dynamically updated databases. (2) New agents for new data sources or new analysis services can be added easily in the future. (3) It should be easily adjustable for a variety of different organisms. In this paper, we present a general bioinformatics multiagent system, called BioMAS, which meets these three basic requirements. Biologists are familiar with several visual representations of metabolic pathways. Visualizations may aid in understanding the complex relationships between pathway components, extracting important information, and comparing pathways between different organisms. Our goal is not only to retrieve and integrate gene expression information from public data sources into our knowledge databases but also to load all the gene expression information into an information visualization system such that relationships between the datasets can be visualized easily. Our approach is to utilize the Starlight information visualization system [8] to visually analyze organism-specific data within a three-dimensional hierarchical view with existing pathway diagrams and chromosome diagrams.


Starlight was initially created as a visualization system for the military intelligence community, with features such as query tools for data and data mining and tagged reference images and diagrams [9]. This paper describes an innovative application of this program to the field of bioinformatics. In the rest of this paper, we will first review existing approaches for retrieving, integrating, and visualizing biological data. Then we will discuss the details of the processing of the gene expression component in BioMAS, because most of the other components of BioMAS have previously been described in [10]. Next, the visualization of the gene expression data utilizing Starlight will be presented. The results of knowledge about the organism generated from different sources are combined into one visual space to make it convenient for biologists to compare genomic data across different organisms. As a demonstration we will present the visualization of the gene expression comparison of two different organisms – chicken (Gallus gallus) and human (Homo sapiens). Finally, we will discuss the advantages and limitations of our methods and future work.

1.2. Related work

BioMAS is a bioinformatics multiagent system with an increasing number of functions and dedicated agents. The work presented in this paper is a new gene expression processing organization in BioMAS, whose previous work included (1) a basic annotation and query agent organization, (2) a functional annotation agent organization, and (3) an EST (Expressed Sequence Tag) processing agent organization [10]. There are several systems available that are aimed at retrieving, integrating, annotating and visualizing biological data. Our system differs from these systems in that it can respond to dynamic changes of data sources, while systems such as TSIMMIS [11] or InfoSleuth [12] cannot. New sources and analysis methods can be easily integrated into our system, while GeneWeaver [13] is not based on a shared architecture that supports reasoning about secondary user utility [10]. Our approach is not only to retrieve and integrate related gene expression information into databases and publish the information on-line but also to provide a novel method to visualize the relationships among the information utilizing Starlight. There have been several related studies to visualize gene expression data on metabolic networks, such as KEGG, ExPASy [14], and EcoCyc [15]. The type of pathway visualization these web sites provide is static and predefined, i.e., users can obtain the information only by following the hyperlinks embedded in search results. For a dynamic visualization, considerable research effort has been focused on implementing new visualization tools to redraw pathway networks and interactions among proteins automatically. Typical applications are the BioMaze project [16] and the VisANT visualization tool [17]. Although virtual reality tools also provide a way to understand metabolic networks and gene expression [18], visually representing the relationships among the multidimensional information is a complex task. A critical advantage of Starlight is that it can combine data and images such as pathway diagrams into one visual space to enable users to see the position of interesting data on the embedded images.

2. System and methods

The architecture overview of our system is illustrated in Figure 1, using the Gallus Knowledge Base (GallusKB) as an example. The Gallus information is retrieved from distributed sources and integrated into GallusKB using BioMAS. Then the gene expression and pathway data are parsed into XML files as required by Starlight. Finally, user queries can be processed interactively by Starlight and results are presented to the user in a hierarchical structure. This enables users to explore the result space by following associated relationships. The gene product pathways and chromosome positions can be displayed in both the pathway diagram and the chromosome diagram in the Starlight view space.


2.1. Information retrieval and integration

BioMAS is composed of five groups of agent organizations: (1) Sequence Annotation Agents, which integrate gene sequence annotations from various sources, such as NCBI databases, Protein Domains [19], PSort [20], and SwissProt [5]; (2) EST Processing Agents, which are responsible for building chicken contigs (contiguous sequences) and saving the contigs into databases; (3) Functional Annotation Agents, which annotate the function of a gene using Gene Ontology [21] and MeSH (Medical Subject Headings) [22] terms; (4) Query Agents, which facilitate queries from users; and (5) Pathway Agents, which download human pathway information from KEGG, predict pathways for a given organism, and predict organs where gene products are expressed.

Figure 1. Architecture of knowledge information gathering and visualization. [The diagram shows distributed databases and on-line sources (NCBI DB, SwissProt, PSort, Protein Domains, Gene Ontology, MeSH terms, Ensembl DB, KEGG DB) feeding the BioMAS agent organizations (Sequence Annotation Agents, EST Processing Agents, Functional Annotation Agents, Information Extraction Agent, Pathway Image Processing Agent, and Pathway Agents, using BlastX), which populate GallusKB (Chicken Contigs DB and Pathway DB with human and chicken pathway tables and a pathway images table); a Java parser exports the data to the Starlight Visualization System for the user.]

The organizations outlined in (1)-(4) were described previously in [10, 23]. Therefore, this paper only focuses on the Pathway Agents. The Kyoto Encyclopedia of Genes and Genomes (KEGG) provides a web-accessible database of human pathways, which can also be used to predict chicken pathways by blasting the human gene sequences against the chicken contig database. As shown in Figure 1, the Pathway Agents of BioMAS are used to retrieve the KEGG pathway data for human genes, place them in the Pathway Database (Pathway DB), and save all human pathway information in a human pathway table. The Chicken Contigs Database (Chicken Contigs DB) holds 30,214 chicken contigs. BlastX [24] is used to blast the retrieved human gene sequences against the Chicken Contigs DB with an E-value cutoff of 1x10-5 to identify the chicken gene products involved in the individual pathways as annotated by KEGG. All the chicken gene products identified are saved into a chicken pathway table under Pathway DB. In the next step, the pathway maps with highlighted chicken homologs are generated by the Pathway Image Processing Agent. As an example, http://udgenome.ags.udel.edu/gallus/pathway/hsa00010.php shows the pathway map of Glycolysis / Gluconeogenesis of Homo sapiens, where the highlights indicate matches between chicken homologs and human genes. The human genes and chicken contigs are mapped to chromosomal locations using the genomic sequence as determined by Washington University [25] and annotated using Ensembl [6]. In this way the relationship between chromosome positions of human and chicken genes involved in the same pathway can be compared visually. A Java parser is used to transform the data in Pathway DB to XML format data, which serves as the input for Starlight. Users can then use Starlight to conduct data mining and data analysis.
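The homolog-identification step can be illustrated with a minimal, hypothetical sketch; it is not the BioMAS implementation. It assumes BLASTX was run with 12-column tabular output (e.g., the blastall -m 8 format) and keeps, for each human query gene, the best chicken contig hit at or below the 1x10-5 E-value cutoff described above:

# Minimal sketch (not the BioMAS code): filter tabular BLASTX hits of human
# KEGG pathway genes against the chicken contig database by E-value.
E_VALUE_CUTOFF = 1e-5

def parse_blast_tabular(path):
    """Yield (query_id, subject_id, bit_score, e_value) from 12-column
    tab-separated BLAST tabular output (one hit per line)."""
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            # column 10 is the E-value, column 11 the bit score (0-based)
            yield fields[0], fields[1], float(fields[11]), float(fields[10])

def chicken_homologs(blast_output_path, cutoff=E_VALUE_CUTOFF):
    """Map each human gene (query) to its best chicken contig hit below the cutoff."""
    best = {}
    for query, subject, bits, evalue in parse_blast_tabular(blast_output_path):
        if evalue > cutoff:
            continue
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits, evalue)
    return best

# The resulting (human_gene -> chicken_contig, score, e_value) pairs could then
# be written, together with the KEGG pathway id, into the chicken pathway table.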

2.2. Information visualization

A significant effort has been focused on dynamically generating biochemical network diagrams. The manually generated KEGG static diagrams are a popular resource commonly used by biologists. However, to efficiently explore genomic databases, such as Pathway DB, a sophisticated tool is needed to select, search, navigate and analyze the database visually, especially for the comparison of the metabolic pathways in different organisms. With the GallusKB, our intention is to analyze the similarities and differences in metabolic pathways and gene expression between human and chicken. In addition, the corresponding chromosome positions of human genes and their chicken homologs are of interest. For this paper, we use the hsa00010 pathway for Glycolysis / Gluconeogenesis in Homo sapiens provided by KEGG as an example. Our approach is described in detail below for the following aspects: the data model defining the input data, the hierarchical view mode, navigation to query the database in 3D visualization space, and the data presentation.

2.2.1. Data model. The input data format is a flat XML file, as outlined in the example record of Figure 2 extracted from GallusKB.

Figure 2. Flat XML format data of one chicken contig record. [The record lists the organ (Kidney), the pathway (hsa00010, Glycolysis / Gluconeogenesis - Homo sapiens, with its map image hsa00010.gif), the enzyme (ec:1.1.1.1), the human gene hsa:124 (ADH1A, alcohol dehydrogenase 1A (class I), alpha polypeptide) with its protein sequence and chromosome location 4q21-q23, and the matching chicken contig UD.GG.Contig26702 on chicken chromosome 4 with BLAST score 228, E-value 7e-60, and a link to its GallusKB report.]
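For illustration, one such flat record could be assembled with Python's standard library as sketched below. The element and attribute names are placeholders chosen for this example, not the actual GallusKB or Starlight tag set; the values are taken from the Figure 2 record:

# Illustrative only: assemble one flat XML record of the kind described in
# Section 2.2.1. The tag names are hypothetical placeholders.
import xml.etree.ElementTree as ET

def build_record(organ, pathway_id, pathway_name, ec_number,
                 human_gene, gene_description, chicken_contig, score, e_value):
    record = ET.Element("record")
    ET.SubElement(record, "organ").text = organ
    pathway = ET.SubElement(record, "pathway", id=pathway_id)
    pathway.text = pathway_name
    ET.SubElement(record, "enzyme").text = ec_number
    gene = ET.SubElement(record, "humanGene", id=human_gene)
    gene.text = gene_description
    contig = ET.SubElement(record, "chickenContig", id=chicken_contig)
    contig.set("score", str(score))
    contig.set("eValue", str(e_value))
    return record

# Values from the Figure 2 example record:
rec = build_record("Kidney", "hsa00010",
                   "Glycolysis / Gluconeogenesis - Homo sapiens",
                   "ec:1.1.1.1", "hsa:124",
                   "alcohol dehydrogenase 1A (class I), alpha polypeptide",
                   "UD.GG.Contig26702", 228, 7e-60)
print(ET.tostring(rec, encoding="unicode"))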

2.2.2. Hierarchical view mode. Among the various view modes available in Starlight, the hierarchical view mode is selected for visual representation to explore the following research interests:
• Gene expression in several organs for chicken contigs;
• Relationships between human genes, chicken homolog contigs, and their respective chromosome positions;
• Comparison of human genes and chicken homolog contigs in KEGG pathways;
• Visually supported detection of chicken homologs for human gene products.

Figure 3. Hierarchical data structure.

Figure 3 shows the hierarchical data structure defined for the visualization of the data sets. With this hierarchical structure, we can study the relationships between organs, pathways, human genes and chicken contigs. We can also visually compare human and chicken gene expression with pathway diagrams and chromosome maps.

2.2.3. Navigation. With the hierarchical view mode, the dataset is displayed in a 3-D view space as shown in Figure 4. As an example, this figure presents a hierarchical view of pathway hsa00010 - Glycolysis / Gluconeogenesis - Homo sapiens.

Figure 4. Hierarchical view of pathway hsa00010 - Glycolysis / Gluconeogenesis - Homo sapiens.


The navigation through organs, pathways, enzymes, genes, and chicken contigs can be conducted by following the hierarchical tree, as shown in Figure 5. Interactive navigation is easily possible along the path of any organ. Figure 5 shows the result of zooming into the dataset for the heart, hsa00010, ec:1.1.1.1, hsa:124 and CHK:124. By selecting any record, the details of this record can be viewed. This hierarchical view shows how certain genes are expressed within a specific organ, and how the respective chicken homologs are expressed. The details for each specific record can be called up as shown for the example in Figure 6, which is based on UD.GG.Contig26702 of hsa:124. Each of these details can be exported as an HTML page and published on a designated web site via XSLT.

The hierarchy view mode makes it possible to visually query the gene data and compare gene expressions of both human and chicken. The field query generator can be used to create the query and query results can be presented in the link view of the hierarchical level. In this hierarchical view, any relationship query among data in different hierarchy levels can be conducted by selecting the data bar of interest in each level to highlight the resulting links as yellow lines. The corresponding results of the query can be displayed in a 3D environment. Figure 7 presents the hierarchical query view for the kidney. All records expressed in the kidney and their related data are linked with yellow lines between levels.

Figure 7. Hierarchical view of gene expression in the kidney. Yellow lines connect all data points related to the kidney.

Figure 5. Close-up view of heart data in hierarchical view.

Figure 6. Details for item hsa:124 – UD.GG.Contig26702.


One of the unique features of Starlight is its MetaImage tool, where image files can be processed and graphically linked to correlated information within a dataset. Here the Starlight MetaImage tool is used to process pathway and chromosome diagrams such that these diagrams can be visually linked to genetic information in our approach to support genomic visualization. Figure 8 presents a 3-D view of the data with gene, pathway, and chromosome position mapping of genes expressed in the kidney. Pathway diagrams and chromosome images are embedded into the same view space, allowing for simultaneous exploration of these data. In Figure 8, image 1 (left) is the chicken chromosome map, image 2 (center) is the human chromosome map, and image 3 (right) is the pathway map of Glycolysis/Gluconeogenesis of Homo sapiens. The yellow lines highlight where the chicken homologs are matched with the human genes.

Figure 8. 3D view of pathway, human and chicken chromosome maps highlighting the position mapping of genes expressed in the kidney.

3. Results and discussion

BioMAS is a flexible and general multiagent system, which can work for different organisms of interest. The Gallus Knowledge Base and Fungi Knowledge Base (http://udgenome.ags.udel.edu) have been generated using BioMAS. GallusKB has a total of 30,214 chicken contigs in the database, and chicken gene products have been predicted to be involved in a total of 140 pathways based on annotations in the KEGG database. 5851 human proteins in the KEGG database were used to search GallusKB, and 5437 (92%) Gallus gene products were found to have associated pathways. Through a Java parser, data in the databases can be formatted into XML and entered into Starlight. Since images can be loaded into the data view space of Starlight, these images can be very useful for a better understanding of the data. As shown in Figure 8, traditional static pathway diagrams and chromosome maps can be linked to data sets as well. Data can be presented to the user in a hierarchical view to allow the user to explore the view space through the associated relationships while navigating the data sets. The examples used in Section 2.2 show preliminary results of applying Starlight for a genomics evaluation.

4. Conclusion and future work

In this paper, we have presented a multiagent approach to retrieve, integrate and analyze gene expression information and a new approach to visualize gene expression by using the Starlight information visualization system. With the wealth of publicly available genomic information on the web, it is essential to develop increasingly powerful tools to retrieve the target information from different bioinformatics resources and to visualize this information efficiently. Our approach is to apply a multi-agent system (BioMAS) to retrieve data from separate resources and subsequently utilize the Starlight system to visualize correlations between the datasets. The pathway agent of BioMAS can automatically update the human and chicken data in the pathway database whenever the data in the external resources change.


Currently, the view space in Starlight has to be updated manually. However, this could be improved in future work by using a trigger to update the view space automatically whenever the external source data change. The MetaImage tool provided within Starlight can coordinate data with relevant images. This feature is not automated yet either, which can lead to significant lead time when processing large numbers of similar images. Therefore, this is another focus of future work. The present data exchange format is flat XML, and some work should be invested in using a database as the data format of choice. The preliminary results obtained to date have been encouraging in outlining the potential for new visualization techniques to interactively explore genomic and pathway databases.

5. Acknowledgements
This publication was partially supported by awards from the National Science Foundation (NSF 0092336), the US Department of Agriculture (9935205-8228), and the National Center for Research Resources at the National Institutes of Health (2 P20 RR016472-04) under the INBRE program.

6. References
[1] Human Genome Project, http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
[2] Jackson, J.H., Terminologies for Gene & Protein Similarity, Technical Reports & Reviews No. TR 99-01, Michigan State University, http://www.msu.edu/~jhjacksn/Reports/similarity.htm
[3] KEGG: Kyoto Encyclopedia of Genes and Genomes, Kanehisa Laboratory, http://www.genome.ad.jp/kegg/
[4] GenBank, NIH genetic sequence database, http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
[5] Swiss-Prot Protein knowledgebase, TrEMBL Computer-annotated supplement to Swiss-Prot, http://us.expasy.org/sprot
[6] Ensembl Genome Browser, Sanger Institute, The Wellcome Trust, http://www.ensembl.org/
[7] Benson, D.A., et al., GenBank. Nucleic Acids Res., 28:15-18, 2000. http://www.ncbi.nlm.nih.gov
[8] Starlight Information Visualization System, Pacific Northwest National Laboratory, http://starlight.pnl.gov/
[9] Kritzstein, B., Starlight, Military Geospatial Technology Online Archives, Volume 1, Issue 1, http://www.military-geospatialtechnology.com/article.cfm?DocID=339
[10] Decker, K., Khan, S., Schmidt, C., Situ, G., Makkena, R., Michaud, D., BioMAS: a multi-agent system for genomic annotation. International Journal of Cooperative Information Systems, 11(3-4): 265-292, 2002.
[11] Chawathe, S., H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom, The TSIMMIS project: integration of heterogeneous information sources. In Proceedings of the Tenth Anniversary Meeting of the Information Processing Society of Japan, December 1994.
[12] Nodine, M. and A. Unruh, Facilitating open communication in agent systems: the InfoSleuth infrastructure. In M. Singh, A. Rao, and M. Wooldridge, editors, Intelligent Agents IV, pages 281-295, Springer-Verlag, 1998.
[13] Bryson, K., M. Luck, M. Joy, and D.T. Jones, Applying agents to bioinformatics in GeneWeaver. In Proceedings of the Fourth International Workshop on Collaborative Information Agents, 2000.
[14] ExPASy Proteomics Server, Swiss Institute of Bioinformatics, http://us.expasy.org/
[15] EcoCyc: Encyclopedia of Escherichia coli K12 Genes and Metabolism, http://ecocyc.org/
[16] Zimányi, E., S. Skhiri dit Gabouje, Semantic visualization of biochemical databases. In Semantics for GRID Databases: Proc. of the Int. Conf. on Semantics for a Networked World, ICSNW 2004, Paris, France, June 2004.
[17] Hu, Z., J. Mellor, J. Wu, C. DeLisi, VisANT: an online visualization and analysis tool for biological interaction data. BMC Bioinformatics, 5(1): 17, 2004.
[18] Dickerson, J.A., Y. Yang, K. Blom, Using virtual reality to understand complex metabolic networks. Atlantic Symposium Comp Biol Genomic Info Systems Technol, September, 950-953.
[19] Servant, F., Bru, C., Carrere, S., Courcelle, E., Gouzy, J., Peyruc, D., and Kahn, D., ProDom: automated clustering of homologous domains. Brief Bioinform 3(3), 246-51, 2002.
[20] PSort: subcellular localization prediction, Brinkman Laboratory, Simon Fraser University, http://www.psort.org/
[21] Ashburner, M., and Lewis, S., On ontologies for biologists: the Gene Ontology - untangling the web. Novartis Found Symp 247, 66-80; discussion 80-3, 84-90, 244-52, 2002.
[22] Medical Subject Headings, http://www.nlm.nih.gov/mesh/meshhome.html
[23] Decker, K., S. Khan, C. Schmidt, D. Michaud, Extending a multi-agent system for genomic annotation. In Proceedings of the Fifth International Workshop on Cooperative Information Agents, Modena, September 2001. LNAI 2182, Springer-Verlag, 2001.
[24] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J., Basic local alignment search tool. J Mol Biol 215(3), 403-10, 1990.
[25] Contiguous Chromosomal Chicken Sequences, Sequencing Center, Washington University, St. Louis, http://genome.wustl.edu/projects/chicken/index.php?unmasked=1


Concepts, Challenges, and Prospects on Multiagent Data Warehousing and Multiagent Data Mining Wen-Ran Zhang Dept. of Computer Science, Georgia Southern University Statesboro, Georgia 30460-7997, [email protected] Phone: (912)486-7198, Fax: (912)486-7672

Abstract

A hierarchical architecture has been a dominating model for brain and behavior research for many years. Unfortunately, a hierarchical structure alone is too simplistic to be realistic for organizing many billions of neurons and genetic or biological agents into an autonomous system for high-level cognition. MADWH/MADM presents a multidimensional agent-oriented approach for brain modeling and decision making based on the hypothesis that a brain system consists of a society of semiautonomous neural agents and full autonomy is the result of coordination of semiautonomous functionalities. The agent-oriented approach leads to the following concepts, challenges, and prospects: (1) agent laws, agentization, agent discovery, law discovery, self-organization, and reorganization; (2) mining agent association rules in 1st-order logic; (3) modeling full autonomy as the result of coordination of semiautonomous agents; (4) modeling evolving processes like growing and aging; and (5) modeling healthy states as well as unhealthy states of biological systems. The short-term potential of MADWH/MADM is in its commercial value in multidimensional multiagent OLAP and OLAM in business, engineering, and biomedical applications. As a platform for scientific discoveries, its long-term impact can be far-reaching, especially in knowledge discovery about bio-agents.

Keywords: Multidimensional Agent-Oriented Brain Modeling; MADWH and MADM; Intermediate Agent Law; Agentization; Agent Association; Concepts-Challenges-Prospects

1. Introduction

Some abbreviations are listed in the following:
• MADWH – MultiAgent Data WareHouse or Multiagent Data Warehousing [13,14,15]
• MADM – Multiagent Data Mining [13,14,15]
• VA – Virtual Agent [16]
• VC – Virtual Community [16]
• CI – Computational Intelligence [2]
• AI – Artificial Intelligence
• CCI – Coordinated CI [10,11,12]
• DAI – Distributed AI [3,5]
• MAS – MultiAgent System [3,6]
• MAC – MultiAgent Cerebrum and/or Cerebellum [10,11,12]

The term MADWH/MADM as a total package was first proposed in [13,14] and refined in [15]. It was originally used for brain modeling and neurofuzzy control based on the work in [10,11,12]. Here the concept of MADWH is generalized to a data/knowledge system that allows the warehousing of "agentwares" including agent specification, characterization, architecture, knowledge, associations, organizations, or even the agent itself, as well as the data or memory associated with it. An agent here is a data "miner" that is either an autonomous, semiautonomous, computational, virtual or bio agent. While a traditional data warehouse is for business decision support, a MADWH is for agent modeling and knowledge discovery as well as business and engineering decision support. Although both are integrated, time-variant, and nonvolatile, a traditional data warehouse is subject-oriented and data-based; a MADWH is agent-oriented and agent-based. MADWH provides a centralization for reinforced MADM, and MADM provides a distribution of learning activities that can further develop a MADWH. The two together form a YinYang pair for equilibrium or harmony in learning and decision. The agents or miners involved in MADWH/MADM can be heterogeneous or homogeneous. Heterogeneous agents perform different functionalities. Homogeneous agents perform the same functionalities.

Biological systems such as brains have enormous capabilities in information processing and knowledge discovery including information storage, retrieval, sensor fusion, visualization, cognition, and recognition with spatio-temporal patterns. One challenging issue facing machine learning and knowledge discovery today is understanding how the enormous amount of radio, audio, bio- and spatio-temporal data is processed and how knowledge networks are organized in the brain, and how a brain system can act as a coordinator of multiple miners for data mining and knowledge discovery.

For many decades, a hierarchical architecture has been the dominating model in brain-related research. Unfortunately, a hierarchical structure alone is too simplistic to be realistic for organizing many billions of neurons into an autonomous system for high-level cognition. Multiagent data warehousing (MADWH) and multiagent data mining (MADM) provide an alternative approach to brain modeling [10-15]. The new approach is based on a multidimensional agent orientation where agent similarity, agent cuboids, agent community, orthogonality, and reorganization are some basic concepts.

This work presents some concepts, identifies a number of challenges, and provides some prospects on MADWH/MADM with discussions on applicability, feasibility, and architectural design issues for different applications. Section 2 introduces some basic concepts in MADWH and MADM as a package. Section 3 discusses CCI vs. DAI. Section 4 reviews the example in [15] for further discussion. Section 5 presents a comparison between MADM and MRDM. Section 6 introduces an intermediate agent law for agentization and agent discovery. Section 7 discusses feasibility and applicability of orthogonal agent association. Section 8 identifies a number of challenges ahead. Section 9 draws a few conclusions.

2. Basic Concepts

MADWH/MADM focuses on the interplay of the two. With a MADWH, MADM algorithms can be developed in an evolving dynamic environment with autonomous or semiautonomous agents especially CI agents. Instead of mining frequent itemsets from customer transactions or frequent patterns from multiple relations, MADM discovers new agents and mines agent associations in first-order logic for coordination based on agent similarity. The concept of agent similarity leads to the notions of agent cuboid, orthogonal MADWH and MADM.

In MADWH/MADM the brain is considered a society of semiautonomous neural or genetic agents where full autonomy is the result of coordination of semiautonomous functionalities and learning is accomplished with multidimensional and multiagent data mining. It is shown in [10-15] that coordinated knowledge discovery is possible in an evolving dynamic environment with a large number of autonomous or semiautonomous neural agents as “actors” and agent actions as “transactions”.

The novelty of a MADWH lies in its ability to systematically combine neurofuzzy systems, multiagent systems, database systems, machine learning, data mining, information theory, neuroscience, decision, cognition, and control all together into a modern multidimensional information system architecture that is ideal for brain modeling of different animal species with manageable complexity. Although examples in robot control are used to illustrate the basic ideas as in [13,14,15], the new approach is generally suitable for data mining tasks where knowledge can be discovered collectively by a set of similar semiautonomous or autonomous agents from geographically, geometrically, or temporally distributed high-dimensional data environments.

Different from the Apriori algorithm [1], where frequency is used as an a priori threshold for mining item associations, MADWH/MADM uses agent similarity as an a priori threshold for discovering agent associations in first-order logic, which was once considered impossible in traditional data mining. Different from multirelational data mining (MRDM) [7,8], MADM does not assume a static data source or data stream. Instead, relevant data is like a mineral deposit that is to be located through coordinated multiagent exploration or "data outcropping" from an uncertain and dynamic environment before knowledge discovery. A MADWH (or a multiagent data mart) provides a brain model for the coordination of data outcropping and mining [15]. Multidimensional agent orientation in MADWH and MADM provides a joint platform for many areas of research and development. From one point of view, MADWH/MADM is to promote CCI and DAI for scientific, engineering, and business applications including brain and neuroscience research itself, where the two long-term former adversaries of CI and AI can join forces with other areas. From a web-based software engineering perspective, self-organization and reorganization bring up a major challenge in the design, implementation, reuse, and integration of MADWH/MADM systems for personalization, user modeling, P2P, autonomy, and semiautonomy.
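To make the contrast with Apriori concrete, the sketch below is a hypothetical illustration (not an algorithm from [13,14,15]): instead of counting how frequently item combinations co-occur, candidate agent pairs are filtered by a similarity predicate on their take-off configurations. The neighbor predicate (configurations differing in exactly one parameter) and the dictionary layout are assumptions made for this example; the sample configurations are taken from Fig. 6 later in the paper.

# Hypothetical illustration of similarity as the a priori threshold.
def neighbors(config_a, config_b):
    """Two agents are candidate neighbors if their take-off configurations
    (equal-length tuples) differ in exactly one parameter."""
    if len(config_a) != len(config_b):
        return False
    return sum(1 for x, y in zip(config_a, config_b) if x != y) == 1

def candidate_pairs(agents):
    """agents: dict mapping agent name -> take-off configuration tuple.
    Only similarity-filtered pairs are yielded for association testing;
    Apriori would instead prune candidates by itemset frequency."""
    names = sorted(agents)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if neighbors(agents[a], agents[b]):
                yield a, b

agents = {"A": (170, 20, 105), "B": (180, 20, 105), "C": (170, 30, 93)}
print(list(candidate_pairs(agents)))   # [('A', 'B')]: A and B differ only in the first angle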

3. CCI vs. DAI

While multiagent systems (MAS) originated from distributed AI (DAI) research and MADM can be considered distributed data mining, the term "multiagent data warehousing (MADWH)" was coined for brain modeling and neurofuzzy control [13,14,15]. It is a continuing research effort in coordinated computational intelligence (CCI) [10,11,12]. The AI and CI research communities are well-known long-term former adversaries which are to be brought together with the MADWH/MADM platform. While traditional AI systems depend on symbolic reasoning techniques, numerical AI systems (fuzzy, neural, and/or genetic systems) mainly employ numerical computation for learning purposes. Due to its computational characteristics, numerical AI has been renamed computational intelligence (CI) [2]. Since CI components are fine-grained and numerical, their coordination is excluded from distributed artificial intelligence (DAI) research [3]. On the other hand, since CI components are inherently distributed and computational, decomposition and coordination of CI systems have so far been mostly buried in computation, with a few exceptions [10-15].

As an AI subfield, DAI research focuses on coordination and cooperation methodologies among coarse-grained multiagent systems (MASs) [3,5,6]. A MAS is concerned with the behaviors among a collection of autonomous agents and how they can coordinate their knowledge, goals, skills and plans jointly to take actions or to solve problems collectively. Agents in a MAS may be working toward a single global goal, or toward separate but related and sometimes conflicting individual goals. So agents must share knowledge about problems and developing solutions, must resolve their conflicts and reach compromised or optimal global solutions, and must reason about the processes of coordination among the agents [3].

An ideal autonomous DAI agent is a real-world entity that has identity, knowledge, states, behaviors, and learning abilities. While being coordinated through communications, DAI agents form a MAS. A MAS is heterogeneous if the agents in the MAS are of different types; it is homogeneous if all agents are of the same type. A virtual agent (VA) [16] can be defined as a characterization or an image of an autonomous agent. A VA can also be a neural agent if the neural agent is a reflection of another agent in a brain system. Such VAs or neural agents can form a virtual community (VC) [16] that can be modeled as a MADWH.

The notion of CCI follows the hypothesis that the brain system of an autonomous agent consists of a school of cerebral/cerebellar agents which reflect the conceptual and/or physical world including the body states of the agent itself [12,15]. CCI is mainly concerned with: (1) agent-oriented decomposition of a brain system; (2) the coordination of neurofuzzy agents; (3) the formation of a MAC model; (4) the adaptive, incremental, exploratory, and explosive learning behaviors of a MAC model; and (5) self-organization and reorganization of CCI agents.

The notion of "agent" should be central in CCI as well as in DAI. While decision makers, autonomous robots, and networked intelligent information systems are typical DAI agents, it would be very hard to imagine that the biological or artificial neural system of an autonomous agent could function well without a school of intermediate cerebral/cerebellar agents between itself and its memory cells, fuzzy rules, and billions of neurons. Evidently, we have to answer the questions:
(1) What is an agent in CCI?
(2) What are the differences and similarities of CCI and DAI?
(3) How are CCI agents identified and coordinated?
(4) What can CCI offer to autonomous machine learning and control? and
(5) What can CCI offer to legged locomotion?

A cerebral/cerebellar agent in CCI [12] is referred to as a semiautonomous, cognitively identifiable neuro/fuzzy/genetic subsystem that (i) possesses partial states, knowledge, behaviors, learning and decision abilities of an autonomous agent; (ii) can communicate with other agents and forms, together with the others, a multiagent cerebrum/cerebellum (MAC) model of an autonomous agent; and (iii) does not lose its learning and decision abilities because of partial damage to others. A MAC model is homogeneous if all its agents perform the same type of function; otherwise, it is heterogeneous. A MAC model is a centralized model if there is a central coordinator; it is a distributed model if there is no central coordinator and coordination is achieved through communication protocols; it is a federated model if both centralized control and distributed protocols are used.

A cerebral/cerebellar agent is a fuzzy agent if the agent's knowledge representation is based on fuzzy sets and learning and control are accomplished via fuzzy rules and/or fuzzy pattern recognition. A cerebral/cerebellar agent is an associative memory-based agent if the agent's knowledge is represented in table or matrix forms where learning and control are accomplished via table-driven adaptive schemes. A cerebral/cerebellar agent is a neural agent if learning and control are accomplished via neural nets with neurons. An agent is a neurofuzzy agent if both neural and fuzzy techniques are combined.

With the above definitions, low-level fuzzy rules, associative memory cells, or neurons cannot be considered CCI agents because they, individually, do not show any cognitively identifiable agent behaviors or learning abilities. An associative memory module, on the other hand, can be considered a CCI agent if it meets the conditions for a CCI agent. It is interesting to consider the left and right cerebellum subsystems. Apparently, they are not fully autonomous, but they fit well into the category of homogeneous CCI agents. It is not too unusual to see someone who is half paralyzed due to left or right side neural damage but who may still be able to move on one leg with some support. This is a typical example of homogeneous semiautonomy. On the other hand, a MAC system with vision and hearing agents and/or arm and leg control agents is clearly heterogeneous.

As a CI subfield, CCI should share the following common characteristics with DAI: (1) both are defined in an agent-oriented and distributed world; (2) both can use cooperation as well as competition strategies; (3) both need conflict resolution; (4) both need communication; and (5) both use coordination as a key. A dividing line between CCI and DAI can be drawn with the following essential distinctions:
(1) DAI relies on symbolic representation and reasoning; CCI mainly relies on numerical representations and neuro/fuzzy/genetic learning schemes.
(2) A multiagent cerebrum/cerebellum (MAC) model in CCI is defined in a fine-grained semiautonomous neuro/fuzzy/genetic agent world; a multiagent system (MAS) in DAI [3,5] is defined in a coarse-grained autonomous agent world.
(3) DAI agents are mostly loosely-coupled systems that use intercommunications; CCI agents are tightly-coupled subsystems that use intracommunications or brainstorming.
(4) A MAS consists of a collection of autonomous agents which can take actions individually with or without coordination; a MAC model consists of a collection of semiautonomous cerebral/cerebellar agents which can make decisions individually or collectively but normally do not take actions without coordination.
(5) DAI aims at enhancing the effectiveness and efficiency of combined social and physical organizations; CCI searches for an agent-oriented brain architecture for an autonomous agent to emulate human learning and control.
(6) DAI agents adapt to certain social protocols for their coordination; CCI agents adapt to social protocols and common-sense versions of natural motion laws (referred to as cerebellar laws in this paper) for their coordination. Particularly, cerebellar agents rely heavily on cerebellar laws for coordination due to the nature of their motion control tasks.

The interplay of CCI and DAI can be essential in solving complex distributed problems. A MAC system can be an agent of a MAS. On the other hand, the semiautonomous cerebral/cerebellar agents of a MAC system can reflect the states of the autonomous agents of a MAS. Therefore, CCI can be used in the coordination of DAI agents and vice versa.

4. A MADWH and MADM Approach for Brain Modeling and NeuroFuzzy Control

4.1 Agent Cuboids vs. Data Cuboids

In [10-15], it is shown that an agent can be an autonomous or semiautonomous neurofuzzy agent for robot control. Two simulated unipeds are sketched in Fig. 1. The goal is to enable a simulated N-link uniped (whose motion is governed by a set of 2nd-order differential equations that has an infinite number of inverse solutions) to learn gymnastic jumps. Each jump can be characterized with a pair (V, M), where V is a control vector and M is a measure vector as defined in Fig. 1 for a 3- and a 4-link uniped.

Note that the 4-link (foot, lower leg, upper leg, and body) V vector has 10 dimensions and the M vector has 7 dimensions. The angles θ1-θ4 in V determine the take-off configuration of the robot; T1-T3 are torques applied to the three joints for taking off; T4-T6 are torques applied to the joints in flight to configure the robot for proper landing. The torques can be replaced with desired joint angles θd1, θd2, θd3 for landing. H, D, and A define the jump height, distance, and landing angle, and the four angles θL1, θL2, θL3, and θL4 define the landing configuration. The landing configuration can be equivalently determined by (A, LH, θLx, θLy), where A is the landing angle, LH is the landing mass center height, and θLx, θLy are any two different link angles. A 3-link uniped has two joints and needs two take-off torques and two in-flight torques for a jump.

Fig. 1. Control and measure parameters of an action:
3-link: V = (θ1, θ2, θ3, T1, T2, T3, T4) or V = (θ1, θ2, θ3, T1, T2, θd1, θd2); M = (H, D, Am, θL1, θL2, θL3) ≡ (H, D, A, LH, Am, θLx, θLy), x, y = 1, 2, or 3, x ≠ y.
4-link: V = (θ1, θ2, θ3, θ4, T1, T2, T3, T4, T5, T6) or V = (θ1, θ2, θ3, θ4, T1, T2, T3, θd1, θd2, θd3); M = (H, D, Am, θL1, θL2, θL3, θL4)
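As a concrete illustration of these vectors, a jump record for the 3-link uniped can be stored as a simple pair of tuples. This layout is an assumption made for the example (it is not code from [10-15]); the field ordering and the sample values follow the agent A row of Fig. 6 below.

# Illustrative data structure for one jump of a 3-link uniped.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Jump:
    V: Tuple[float, ...]   # control vector: (theta1, theta2, theta3, T1, T2, T3, T4)
    M: Tuple[float, ...]   # measure vector, ordered as in Fig. 6: (D, H, A, thetaL1, thetaL2, thetaL3)

# Agent A's jump from Fig. 6 (angles in degrees, distances in meters):
jump_A = Jump(V=(170, 20, 105, -219, 20, 110, -20),
              M=(0.3, 0.9, -4, 0.29, 0.3, 164))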

In this case, a brain system is hypothesized as a society of semiautonomous neural agents that can be cognitively identified by the take-off configurations of a jumping uniped [10-15] for different long, short, backward, and forward gymnastic jumps. Based on agent identification, agent similarity, and function similarity, we have the concept of agent cuboids. While data cuboids organize relevant data sets into hypercube structures for business decision support, agent cuboids organize relevant or similar (cooperative or competitive) neural agents into hypercube brain structures for coordinated machine learning and control.

For any orthogonal neighborhood, we have the application-specific associations in first-order predicate logic as in Fig. 2. Fig. 3 shows the 4-link uniped configurations of 16 neural agents that can be organized in a 4-D agent cube as in Fig. 4.

Fig. 2. Application-specific association rules (adapted from [15]). Rule 1 (agent interpolation): ∀A1,A2 {action(A1, measure_i, measure_j1) ∧ action(A2, measure_i, measure_j2) ∧ neighbor(A1, A2) ⇒ ∃A3, action(A3, measure_i, (measure_j1 + measure_j2)/2) ∧ neighbor(A1, A3, A2)}. Rule 2 (agent extrapolation): an analogous rule that, given a pair of neighbor agents, projects a new neighbor agent A3 beyond the pair.

Fig. 5 shows the two link-weight matrices of a trained 3-layer BP neural controller with an error rate of 0.000009. It is assumed that an autonomous agent coordinates many semiautonomous neural agents in the MADWH. Whenever an agent is called from the warehouse, its link weights are assigned to the neural controller to generate a V vector for a desired jump measure vector M.

Five jumps by five corner agents of a 3-link uniped are listed in Fig. 6. Evidently, agents A and B are similar based on their jumps because they differ in one corner parameter (θ1) and they are able to make almost the same jump with different joint torques. Agents C and D are similar also, but C can make longer jumps with the same height. Agents D and E are also similar; agent E can make an even longer jump. (Note: angles are in degrees; height and distance are in meters.) Thus, every pair of neighbor agents can be considered similar. Here only one pair of actions is selected. If it is selected from 100 actions, the support is 0.01. Since only one pair of actions is tested successfully, the confidence is 1.0.
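The support and confidence figures quoted above follow from a small calculation. The sketch below only illustrates that arithmetic; the function name and its arguments are invented for this example, with the counts (one selected pair out of 100 actions, one successful test out of one tested pair) taken from the text.

# Illustration of the support/confidence arithmetic described above.
def association_measures(selected_pairs, total_actions, successful_tests, tested_pairs):
    support = selected_pairs / total_actions       # fraction of actions covered by the pair
    confidence = successful_tests / tested_pairs   # fraction of tested pairs that succeed
    return support, confidence

# One similar pair (e.g., agents A and B in Fig. 6) selected from 100 actions,
# and that single tested pair succeeded:
support, confidence = association_measures(selected_pairs=1, total_actions=100,
                                            successful_tests=1, tested_pairs=1)
print(support, confidence)   # 0.01 1.0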

Every pair of neighbor agents in the agent cuboids of Fig. 4 are tested as similar agents with high support and confidence measures. It is then can be concluded that both agent cuboids are orthogonal. From Fig. 6, we can see that the similar action measures are different only on the distance dimension. The application-specific 1st order association rules in this case can be determined as in Fig. 7. Interestingly, the two association rules are evidently dynamic motion laws in predicate logic form. Such laws can be used as meta knowledge for further coordinated data mining or knowledge discovery. Similarity leads to agent interpolation and extrapolation. in a global mining process. MADWH provides an efficient and adequate platform for modeling a brain system in performing coordinated adventures. At the neural network level, interpolation and extrapolation result in weight matrices for a new neural net assuming the same neural architecture. The weight matrices can be used as initial weights for training interpolated or extrapolated neural controllers. This can reduce training time dramatically compared with using random initial link weights. Given two similar BP neural agents A and B with neural weight matrices WA and WB, respectively, based on the dynamic motion laws in Fig. 7 we have

3-link: V = (θ1, θ2, θ3, T1, T2, T3, T4) or V = (θ1, θ2, θ3, T1, T2, θd1,θd2); M = (H, D, Am, θL1, θL2, θL3) ≡ (H, D, A, LH, Am, θLx, θLy), x,y = 1, 2, or 3, x≠y. 4-link: V = (θ1, θ2, θ3, θ4, T1, T2, T3, T4, T5, T6) or V = (θ1, θ2, θ3, θ4, T1, T2, T3, θd1,θd2,θd3) M = (H, D, Am, θL1, θL2, θL3, θL4)

Fig. 1. Control and measure parameters of an action ∀A1,A2,{action(A1, measurei, measurej1) ∧ action(A2, measurei, measurej2) ∧ neighbor(A1, A2) ⇒ ∃A3 , action(A3, measurei, (measurej1+ measurej2)/2)∧neighbor(A1,A3,A2)}; Rule ∀A1,A2,A3,{action(A1, measurei, measurej1) ∧ 2 action(A2, measurei, measurej2) ∧ neighbor(A1, A2, A3) ⇒ action(A3, measurei, (measurej1+ 2×measurej2)/2)}; OR ∀A1,A2,A3,{action(A1, measurei, measurej1) ∧ action(A2, measurei, measurej2) ∧ neighbor(A3, A2) ⇒ action(A3, measurei, A1, (2×measurej1+measurej1)/2}; Fig. 2 Application specific association rules (adapted from [15]) Rule 1

WI ≈ (WA + WB)/2 and WE ≈ 2WA − WB, where WI is the weight matrix of an interpolated neural agent and WE is the weight matrix of an extrapolated neural agent assuming the same neural architecture. Interpolation and extrapolation of this kind constitute learning by agent discovery, and the learning speed of agent discovery is accordingly much higher than that of training each new controller from random initial link weights.
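As a concrete illustration of the weight-matrix interpolation and extrapolation above, the following minimal NumPy sketch derives initial weights for a new neural agent from two similar trained agents. It is not from the original paper; the matrix shapes and random values are hypothetical placeholders for trained controllers.

import numpy as np

# Hypothetical trained weight matrices of two similar BP neural agents.
rng = np.random.default_rng(0)
W_A = rng.normal(size=(6, 7))   # weights of agent A
W_B = rng.normal(size=(6, 7))   # weights of agent B

# Interpolated agent: average of the two similar agents' weights.
W_I = (W_A + W_B) / 2.0

# Extrapolated agent: one step beyond A in the direction away from B.
W_E = 2.0 * W_A - W_B

# W_I and W_E serve as initial weights when training the interpolated or
# extrapolated neural controllers, instead of random initial values.
print(W_I.shape, W_E.shape)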


It should be remarked that traditional data warehouse query languages and utilities can be extended for a MADWH. Agent-oriented drill-down, roll-up, slice, dice, and pivot operations on a MADWH can support brain analysis and cognition at different levels of concentration. Some spatio-temporal patterns in multiagent brain modeling are sketched in Fig. 9. A snapshot of a controlled jump is shown in Fig. 10. A model for mental concentration in gymnastics is sketched in Fig. 11, where the smaller the kernel space, the more concentrated and the more precise the controlled jump.
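For illustration, a roll-up over a toy agent cuboid can be expressed directly on a dictionary keyed by corner parameters. The sketch below is not part of the paper; the dimension values and the aggregated measure (best jump distance) are hypothetical and only mimic the (θ1, θ2, θ3) corner agents discussed earlier.

from collections import defaultdict

# Toy agent cuboid: (theta1, theta2, theta3) -> best jump distance observed.
agent_cuboid = {
    (180, 20, 105): 0.9,
    (170, 20, 105): 0.9,
    (170, 30, 105): 1.3,
    (170, 30, 93):  1.3,
    (170, 20, 93):  1.3,
}

def roll_up(cuboid, keep_dims):
    # Aggregate (max) the measure over the dimensions not in keep_dims.
    rolled = defaultdict(float)
    for key, measure in cuboid.items():
        reduced = tuple(key[d] for d in keep_dims)
        rolled[reduced] = max(rolled[reduced], measure)
    return dict(rolled)

# Roll up to theta1 only: one coarser "concentration" level per theta1 value.
print(roll_up(agent_cuboid, keep_dims=(0,)))   # {(180,): 0.9, (170,): 1.3}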


Fig. 3. 16 corner agents (adapted from [12])

Fig. 8. Coordinated data outcropping and data mining in the long-jump direction: relative to agent B (170, 20, 105), agent A (180, 20, 105) needs a slightly larger torque for the same jump measure, agent C (170, 30, 105) makes a shorter-distance jump, agent D (170, 30, 93) a longer-distance jump, and agent E (170, 20, 93) the longest-distance jump.

Fig. 4. A 4-D base cuboid with 16 corner agents

Fig. 9. A sketch of spatio-temporal patterns in brain modeling for uniped control (adapted from [12])

Fig. 5. Weight matrices of a 3-layer BP neural controller (n1: 6, n2: 6, n3: 5, Err: 0.000009)

Agent   V = (θ1, θ2, θ3, T1, T2, T3, T4)       M = (D, H, A, θL1, θL2, θL3)
A       170, 20, 105, -219, 20, 110, -20       0.3, 0.9, -4, 0.29, 0.3, 164
B       180, 20, 105, -227, 20, 104, -20       0.3, 0.9, -4, 0.29, 0.3, 164
C       170, 30, 93, -291, 20, 83, -20         0.69, 1.3, -3.9, 0.29, 0.45, 164
D       170, 30, 105, -284, 20, 135, -20       0.25, 1.3, -4, 0.29, 0.26, 164
E       170, 20, 93, -309, 20, 118, -20        1.0, 1.3, -4, 0.29, 0.46, 164

Fig. 6. Similar actions by similar agents (adapted from [15])

∀A1,A2, {jump(A1, H=x, D=short) ∧ jump(A2, H=x, D=long) ∧ neighbor(A1, A2) ⇒ ∃A3, jump(A3, H=x, D=medium-long) ∧ neighbor(A1, A3, A2)};
∀A1,A2,A3, {jump(A1, H=x, D=short) ∧ jump(A2, H=x, D=medium-long) ∧ neighbor(A1, A2, A3) ⇒ jump(A3, H=x, D=long)}; OR
∀A1,A2,A3, {jump(A1, H=x, D=medium-long) ∧ jump(A2, H=x, D=long) ∧ neighbor(A3, A1, A2) ⇒ jump(A3, H=x, D=short)}.

Fig. 7. Dynamic motion laws as agent association rules

Fig. 10. Snapshots of a simulated controlled uniped jump


Fig. 11 Neural agents for different levels of mental concentration in gymnastics


5. MADM vs. MRDM – A Comparison

It is interesting to compare multiagent data mining (MADM) with multirelational data mining (MRDM). In [10-15], each neurofuzzy agent for a 3-link uniped has 14 dimensions and each agent for a 4-link uniped has 17 dimensions. There are an infinite number of inverses of the 2nd-order differential equations governing the robot motion, which lead to an infinite number of solutions. Using agent association, there could be many agent cuboids that form an agent community or society for different gymnastic jumps or for jumping control under different gravities or different terrains. The cognitive complexity would be unmanageable with a usual data warehouse. For instance, it would be very difficult, if not impossible, to represent the data and knowledge for visualization and decision making using a table format without the multiagent approach, and it would definitely be impossible to mine dynamic motion laws in 1st-order predicate logic without agent associations.

Here agent association is classified as orthogonal or non-orthogonal. We examine orthogonal agent association and leave non-orthogonal agent association for future study. An orthogonal agent association rule takes the general form

∀A(gent)1, A(gent)2, {P(A1, A2) ⇒ ∃A(gent)3 {Q(A1, A2, A3)}},

which reads "for all Agent1 and Agent2, IF the predicate P(A1, A2) is true THEN there exists some Agent3 SUCH THAT Q(A1, A2, A3) is true." Two agent association rules for agent interpolation and extrapolation in coordinated data outcropping and data mining are listed in Fig. 12.

∀A1,A2, {similar(A1,A2) ⇒ ∃A3 {A3 ≈ (A1+A2)/2 ∧ similar(A3,A1) ∧ similar(A3,A2)}}. – Interpolation Rule

∀A1,A2, {similar(A1,A2) ∧ similar(A2,A3) ∧ (Distance(A2,A3) = d) ∧ (Distance(A1,A3) = 2d)} ⇒ A3 ≈ 2·A2 − A1. – Extrapolation Rule

Fig. 12. Two association rules for uniped control

A close examination of the orthogonal association rules reveals that agent similarity is a key to orthogonality. There are a number of similarities between MADM and MRDM, including:

(1) MADM and MRDM can both be used for machine learning and engineering applications. (2) Both are suitable for knowledge discovery from high-dimensional data environments. (3) Both can be used for extracting association rules in zero- or first-order logic.

MADM can be distinguished from MRDM or graph-based mining [7,8] as follows. (1) Agents are dynamic actors and/or controllers, while relations and graphs are static data sets and structures. (2) Agent association rules are rules governing agent communities, while an item association is a rule about items and a relation association is a pattern regarding some relations that does not possess the cognitive identity, dynamics, learning, decision making, and control ability of an autonomous or semiautonomous agent. (3) Orthogonal agent association is based on agent similarity as a priori knowledge, while item association is based on item frequency as a priori knowledge, and multirelational or graph association may use a relational strength in the real interval [0,1]. (4) Orthogonal agent association may lead to the discovery of new neural, fuzzy, or genetic agents similar to existing agents, while item association and MRDM or graph data mining are not motivated by agent discovery. (5) Agents can be coordinated for collaborative knowledge discovery and decision making, while items and relations cannot be coordinated. (6) Orthogonal agent associations lead to an orthogonal MADWH that resembles a brain system, while relational associations are not agent-oriented. (7) Coordinated multiagent data mining can discover dynamic motion laws that cannot be observed in item and relational association rules. (8) Agent association assumes a dynamic and distributed data environment, while item and relational associations assume a static data source.

6. Intermediate Agent Law

Orthogonal agent association rule mining in 1st-order logic leads to the extension of the mean-value theorem in calculus to a commonsense intermediate agent law for MADWH/MADM.

Intermediate Agent Law. Given any pair of similar biological-system-inspired computational agents or agent communities A1 and A2 defined in a multidimensional space with a measurable distance 2d > 0, a third similar agent or agent community A3 can be discovered or created such that A3 is similar to A1 and A2 with distance d to each of them.


In the above commonsense law, a biological-system-inspired computational agent could be any computer-based system, intelligent or not, autonomous or semiautonomous, mobile or stationary, ground or airborne, neural or genetic, a CI or AI agent, or any other computational agent. Cruise controllers, robot controllers, artificial neurons and neural networks, gene expressions and genetic models, fuzzy controllers and systems, rough-set-based systems, and different memory components are apparently such agents.

The commonsense law provides a basis for the extension of digitization to agentization. The concept of agentization is popular in military operational research. It is adapted into MADWH/MADM in refs. [12,15]. Here the term "agentization" stands for "populating a multidimensional space with virtual or real agents or agent communities." With the intermediate agent law, an orthogonal MADWH can be defined as a virtual non-linear dynamic agentization, and MADM can be defined as a coordinated discovery process for new agents, agent associations, agent organizations, and agent laws.

7. Feasibility and Applicability of Orthogonal Agent Association

Feasibility. From the earlier discussion, two major necessary conditions for orthogonal agent association and an orthogonal MADWH can be derived.

(1) Agent association requires agent identification. Agent identification can be accomplished using the information gain method, the Gini method, or parameter analysis. In almost all robot learning tasks, some physical configurations of the robots (including ground, underwater, and flying robots) are almost certain to be good cognitive identities of neural agents for neural learning and control.

(2) Agent similarity is a key for orthogonal multiagent data warehousing. To define similarity, the corner or key parameters must be identified to define configuration similarity, and agent capabilities must be tested to define function similarity.

Applicability. From the earlier discussion, the following applicability conditions can be derived for orthogonal agent association.

(1) The environment meets the two feasibility conditions;
(2) The learning/decision/control space is geometrically, geographically, conceptually, and/or temporally distributed;
(3) The learning/decision/control task is dynamic in nature;
(4) Many autonomous or semiautonomous agents are needed for the learning/decision/control;
(5) Collective/explorative learning/decision/control is needed; and
(6) Coordination is possible.

It is evident that orthogonal agent association may not be suitable for a static data environment if the data storage is incomplete. It is also evident that orthogonal agent association is a good fit for brain modeling and robot learning/control, because there could be billions of neurons in a brain system, which has to be a large community or society of semiautonomous neural agents for the exploration of different dynamic data environments.

8. Challenges

MADWH/MADM as an emerging research area is still in its infancy. Many tough challenges lie ahead. We enumerate some challenging issues as follows.

• Bring CCI and DAI together. The first and foremost challenge is how to bring CCI and DAI together for the interplay of MADWH and MADM. It is expected that CCI and DAI will play a major role in MADWH/MADM. However, many questions are yet to be answered in this direction of research.

• Warehousability, agent identification, agent-oriented decomposition, and agentization. The concepts of semiautonomy, full autonomy, agent cuboids, and agent society are for dealing with the complexity in brain modeling and agentization. It is suggested that all bio-like robots or devices can be decomposed into semi-autonomous agents based on their physical configurations. However, some agents are better fitted for a MADWH and some are not. In general, the architecture and the embedded knowledge of neural networks and fuzzy controllers [9] can be almost completely characterized and stored in a MADWH. Therefore, it is easier to adapt MADWH/MADM to scientific and engineering applications. On the other hand, a mobile software agent can be stored in a warehouse for dispatching or "agentization", but it might be difficult to identify its dimensions. Many research efforts are needed for agent identification, agent-oriented decomposition, and agentization in different application domains, especially for web applications.

• Heterogeneity. In the robot control example, the agents are homogeneous because all are for gymnastic jumping. How to organize heterogeneous agents into a MADWH/MADM framework is a great challenge. A typical heterogeneous example in data mining is to warehouse all agents for data cleaning/integration, selection/transformation, mining/discovery, pattern evaluation, and visualization such that different matching sequences can be selected and optimized for different data mining tasks. Another typical example is to combine radio, audio, and motor control agents in brain modeling for autonomous learning/control. MADWH/MADM is not meant to provide the final solutions for such tough challenges but to provide enabling technologies that lead to evolving better solutions.

• Schema design complexity. Multidimensional agent orientation is evidently a complex concept. Agent orientation can be considered an extension of object orientation, and object-oriented database design techniques can be borrowed. However, agent orientation has to fit into multiple dimensions.

• Query language design complexity. To the author's best knowledge, no such query languages have been developed yet. It is expected that agent orientation can be mounted onto SQL-based data mining query languages [4] like DMQL.

• Complexity in agent discovery and law discovery. In [10-15], agent discovery is illustrated with interpolation and extrapolation, and law discovery is illustrated with mining agent associations in first-order logic for the specific application of neurofuzzy control. Similar discovery has not been fully researched in many other application areas.

• Complexity in agent-oriented self-organization and reorganization. Although self-organization has been a hot topic in neural network research, self-organization and reorganization have not been fully addressed at the autonomous and semiautonomous agent levels. It is shown in [12] that self-organization and reorganization are possible in multiagent brain modeling. It is a challenging and interesting task to address the self-organization and reorganization issues at different levels of agent granularity, for instance, at the macro-, micro-, neuron, genetic, and/or nano-levels.

• Reinforced knowledge discovery with the interplay between MADWH and MADM. Although this seems to be a big challenge, it could be the most enjoyable step once the other difficulties have been resolved. A MADWH provides the centralization of agent-oriented data, knowledge, and brain-storming algorithms; MADM provides distributed mechanisms and methods for reinforced knowledge discovery. A MADWH can enhance MADM, and MADM can further develop and refine a MADWH [15]. The two can be considered a YinYang pair for equilibrium and harmony.

• MADWH/MADM for neuroscience, bioinformatics, and biomedical research. Evidently, this is a forever challenging and forever promising area of research.

• MADWH/MADM for wireless sensor networks and the semantic web. Many research topics remain untouched in these areas.

• MADWH/MADM for knowledge management. This opens a new avenue in management information systems research in addition to P2P and B2B business models.

• Applying MADWH/MADM to different engineering, scientific, government, military, and business applications. These applications are evidently domain-specific, with the common agent-oriented approach to brain modeling. Since multiagent brain modeling and research will never end, the application of MADWH/MADM does not seem to have a boundary.

• MADWH/MADM and the agent-oriented software engineering paradigm. MADWH/MADM adds new challenges for agent-oriented software engineering. Agent-oriented self-organization and reorganization for coordinated data mining and knowledge discovery is a major challenge. Integration and optimization of different agents and subtasks for data cleaning, integration, selection, transformation, mining/discovery, pattern evaluation, and visualization is a typical example.

9. Conclusions

Some basic concepts have been introduced for MADWH and MADM. A comparison has been provided between a traditional data warehouse and a MADWH, and between MADM and MRDM. The roles of CCI and DAI in MADWH and MADM have been discussed. A commonsense intermediate agent law has been proposed. A number of challenges have been identified. Despite the great challenges, it can be concluded that

(1) MADWH/MADM provides a joint platform for many different research areas including CCI and DAI;
(2) It enables agent discovery, agent law discovery, self-organization, and reorganization.
(3) It enables full autonomy as the result of coordination of semiautonomous functionalities.
(4) It enables the modeling of evolving processes like growing and aging.


(5) The short-term potential of MADWH/MADM lies in its commercial value in multidimensional agent-oriented OLAP and OLAM; and
(6) Its long-term impact is far-reaching, because its potential in supporting scientific discoveries as well as business decision support, especially in discoveries about bio-agents and bio-inspired agents and laws at the macro-, micro-, and/or nano-levels, is forever promising and challenging.

References

[1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, "Fast discovery of association rules," in U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, Cambridge, MA: AAAI/MIT Press, 1996.

[2] J. C. Bezdek, "On the relationship between neural networks, pattern recognition, and intelligence," Int'l J. of Approximate Reasoning, Vol. 6, 1992, 85-107.

[3] A. H. Bond and L. Gasser, "An Analysis of Problems and Research in DAI," in Readings in Distributed Artificial Intelligence, eds. A. H. Bond and L. Gasser, Morgan Kaufmann, 1998, 3-35.

[4] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

[5] M. N. Huhns, editor, Distributed Artificial Intelligence, Pitman, London, 1987.

[6] M. N. Huhns and M. P. Singh, Readings in Agents, Morgan Kaufmann, 1997.

[7] MRDM'01: Workshop on multi-relational data mining, in conjunction with PKDD'01 and ECML'01, 2002. http://www.kiminkii.com/mrdm/.

[8] H. Toivonen and L. Dehaspe, "Discovery of Frequent Datalog Patterns," Data Mining and Knowledge Discovery, 3(1):7-16, 1999.

[9] H. Ying, Fuzzy Control and Modeling: Analytical Foundations and Applications, IEEE Press, 2000.

[10] W. Zhang, "MAC-J: A Self-Organizing Multiagent Cerebellar Model for Fuzzy-Neural Control of Uniped Robot Locomotion," Int'l J. on Intelligent Control and Systems, Vol. 1, No. 3, 1996, 339-354.

[11] W. Zhang, "Neurofuzzy Agents and Neurofuzzy Laws for Autonomous Machine Learning and Control," IEEE Int'l Conf. on Neural Networks, Houston, TX, June 1997, 1732-1737.

[12] W. Zhang, "Nesting, Safety, Layering, and Autonomy: A Reorganizable Multiagent Cerebellar Architecture for Intelligent Control – with Application in Legged Locomotion and Gymnastics," IEEE Trans. on Systems, Man, and Cybernetics, Part B, Vol. 28, No. 3, June 1998, 357-375.

[13] W. Zhang, "Modeling a Cerebrum/Cerebellum System as an Evolving Multiagent Data Warehouse," Proc. of the 6th Joint Conf. on Information Sciences (JCIS) – CIN, March 8-13, 2002, Duke University, NC, 541-544.

[14] W. Zhang, "A Multiagent Data Warehousing and Multiagent Data Mining Approach to Cerebrum/Cerebellum Modeling," Proc. of the SPIE Int'l Conf. on Data Mining and Knowledge Discovery, April 2002, Orlando, FL, 261-271.

[15] W. Zhang and L. Zhang, "A Multiagent Data Warehousing (MADWH) and Multiagent Data Mining (MADM) Approach to Brain Modeling and NeuroFuzzy Control," Information Sciences, 167 (2004), 109-127.

[16] W. Zhang and M. Cheng, "Virtual Agents and Virtual Communities: An Agent-Oriented Software and Knowledge Engineering Paradigm for Distributed Cooperative Systems," Proc. of the 5th Int'l Conf. on Software and Knowledge Engineering, San Francisco, June 1993, 207-214.

Acknowledgement: This work has been partially supported by a grant for Faculty Development from Georgia Southern University, Statesboro, GA.



Multi-Party Sequential Pattern Mining Over Private Data

Justin Zhan (1), LiWu Chang (2), and Stan Matwin (3)
(1,3) School of Information Technology & Engineering, University of Ottawa, Canada
(2) Center for High Assurance Computer Systems, Naval Research Laboratory, USA

Abstract

Privacy-preserving data mining in distributed environments is an important issue in the field of data mining. In this paper, we study how to conduct sequential pattern mining, one of the core data mining computations, on private data in the following scenario: multiple parties, each having a private data set, want to jointly conduct sequential pattern mining. Since no party wants to disclose its private data to the other parties, a secure method needs to be provided to make such a computation feasible. We develop a practical solution to this problem in this paper.

Keywords: privacy, security, sequential pattern mining.

1 Introduction

Data mining and knowledge discovery in databases is an important research area that investigates the automatic extraction of previously unknown patterns from large amounts of data. It connects the three worlds of databases, artificial intelligence, and statistics. The information age has enabled many organizations to gather large volumes of data. However, the usefulness of these data is negligible if meaningful information or knowledge cannot be extracted from them. Data mining and knowledge discovery attempts to answer this need. In contrast to standard statistical methods, data mining techniques search for interesting information without demanding a priori hypotheses. As a field, it has introduced new concepts and is becoming more and more popular with time. One of the important computations is sequential pattern mining [1, 8, 2, 7, 3], which is concerned with inducing rules from a set of sequences of ordered items. The main computation in sequential pattern mining is to calculate the support measures of sequences by iteratively joining those subsequences whose supports exceed a given threshold. In each of the above works, an algorithm is provided to conduct such a computation assuming that the original data are available. However, conducting such mining without knowing the original data is challenging. Generic solutions for any kind of secure collaborative computing exist in the literature [4, 5, 6]. These solutions are the results of studies of the secure multi-party computation problem [9, 5, 6, 4], which is a more general form of secure collaborative computing. However, the proposed generic solutions are usually impractical: they are not scalable and cannot handle large-scale data sets because of the prohibitive extra cost of protecting data secrecy. Therefore, practical solutions need to be developed. This need underlies the rationale for our research.


2 Mining Sequential Patterns On Private Data

2.1 Background

Data mining includes a number of different tasks. This paper studies the sequential pattern mining problem. Since its introduction in 1995 [1], sequential pattern mining has received a great deal of attention. It is still one of the most popular pattern-discovery methods in the field of knowledge discovery. Sequential pattern mining provides a means for discovering meaningful sequential patterns among a large quantity of data. For example, consider the sales database of a bookstore. A discovered sequential pattern could be "70% of people who bought Harry Potter also bought Lord of the Rings at a later time". The bookstore can use this information for shelf placement, promotions, etc. In sequential pattern mining, we are given a database D of customer transactions. Each transaction consists of the following fields: customer-ID, transaction-time, and the items purchased in the transaction. No customer has more than one transaction with the same transaction-time. We do not consider quantities of items bought in a transaction: each item is a binary variable representing whether an item was bought or not. An itemset is a non-empty set of items. A sequence is an ordered list of itemsets. A customer supports a sequence s if s is contained in the customer-sequence for this customer. The support for a sequence is defined as the fraction of total customers who support this sequence. Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern.

2.2 Problem Definition

We consider the scenario where multiple parties, each having a private data set (denoted by D1, D2, · · ·, and Dn respectively), want to collaboratively conduct sequential pattern mining on the union of their data sets. Because they are concerned about data privacy, no party is willing to disclose its raw data set to the others. Without loss of generality, we make the following assumptions on the data sets (the assumptions can be achieved by pre-processing the data sets D1, D2, · · ·, and Dn, and such pre-processing does not require one party to send its private data set to other parties):

1. D1, D2, · · ·, and Dn are datasets owned by party 1, party 2, · · ·, and party n respectively, where each dataset consists of the customer-ID, transaction-time, and the items purchased in each transaction.

2. D1, D2, · · ·, and Dn contain different types of items (e.g., they come from different types of markets).

3. The identities of the transactions in D1, D2, · · ·, and Dn are the same.

4. The customer-ID and the customer's transaction time can be shared among the parties, but the items that a customer actually bought are confidential.

Mining Sequential Patterns On Private Data problem: Party 1 has a private data set D1, party 2 has a private data set D2, · · ·, and party n has a private data set Dn; the data set [D1 ∪ D2 ∪ · · · ∪ Dn] is the union of D1, D2, · · ·, and Dn (obtained by vertically putting D1, D2, · · ·, and Dn together).(1) Let N be a set of transactions with Nk representing the kth transaction. These n parties want to conduct sequential pattern mining on [D1 ∪ D2 ∪ · · · ∪ Dn] and to find the sequential patterns with support greater than the given threshold, but they do not want to share their private data sets with each other. We say that a sequential pattern xi ≤ yj, where xi occurs before or at the same time as yj, has support s in [D1 ∪ D2 ∪ · · · ∪ Dn] if s% of the transactions in [D1 ∪ D2 ∪ · · · ∪ Dn] contain both xi and yj with xi happening before or at the same time as yj (namely, s% = Pr(xi ≤ yj)).

2.3 Sequential Pattern Mining Procedure

The procedure for mining sequential patterns contains the following steps:

Step I: Sorting. The database [D1 ∪ D2 ∪ · · · ∪ Dn] is sorted, with customer-ID as the major key and transaction

(1) Vertically partitioned datasets are also called heterogeneously partitioned datasets, where different datasets contain different types of items while the customer-IDs are identical for each transaction.



time as the minor key. This step implicitly converts the original transaction database into a database of customer sequences. As a result, the transactions of a customer may appear in more than one row, where each row contains a customer-ID, a particular transaction time, and the items bought at this transaction time. For example, suppose that the datasets, after being sorted by their customer-ID numbers, are as shown in Fig. 1. Then, after being sorted by transaction time, the data tables of Fig. 1 become those of Fig. 2.

Step II: Mapping. Each item of a row is considered as an attribute. We map each item of a row (i.e., an attribute) to an integer in increasing order and repeat this for all rows. A re-occurrence of an item is mapped to the same integer. As a result, each item becomes an attribute and all attributes are binary-valued. For instance, the sequence < B, (A, C) >, indicating that transaction B occurs prior to transaction (A, C) with A and C occurring together, will be mapped to integers in the order B → 1, A → 2, C → 3, (A, C) → 4. During the mapping, the corresponding transaction time is kept. For instance, based on the sorted dataset of Fig. 2, we may construct the mapping table shown in Fig. 3. After the mapping, the mapped datasets are as shown in Fig. 4.

Step III: Mining. Our mining procedure is based on the mapped dataset. The general sequential pattern mining procedure contains multiple passes over the data. In each pass, we start with a seed set of large sequences, where a large sequence refers to a sequence whose itemsets all satisfy the minimum support. We utilize the seed set to generate new potentially large sequences, called candidate sequences. We find the support for these candidate sequences during the pass over the data. At the end of each pass, we determine which of the candidate sequences are actually large. These large candidates become the seed for the next pass. The following is the procedure for mining sequential patterns on [D1 ∪ D2 ∪ · · · ∪ Dn].

1. L1 = {large 1-sequences}
2. for (k = 2; Lk−1 ≠ ∅; k++) do {
3.   Ck = apriori-generate(Lk−1)
4.   for all candidates c ∈ Ck do {
5.     compute c.count (Section 2.4 shows how to compute this count on private data)
6.     Lk = Lk ∪ {c ∈ Ck | c.count ≥ minsup}
7.   }
8. }
9. return ∪k Lk

where Lk stands for the set of large sequences with k itemsets and Ck stands for the collection of candidate k-sequences. The procedure apriori-generate is described as follows. First, join Lk−1 with Lk−1:

1. insert into Ck
2. select p.litemset1, · · ·, p.litemsetk−1, q.litemsetk−1
3. from Lk−1 p, Lk−1 q
   where p.litemset1 = q.litemset1, · · ·, p.litemsetk−2 = q.litemsetk−2.

Second, delete all sequences c ∈ Ck such that some (k−1)-subsequence of c is not in Lk−1.

Step IV: Maximization. Having found the set S of all large sequences, we use the following procedure to find the maximal sequences (where m is the length of the longest sequence in S):

1. for (k = m; k > 1; k--) do
2.   for each k-sequence sk do
3.     delete all subsequences of sk from S

Step V: Converting. The items in the final large sequences are converted back to the original item representations used before the mapping step. For example, if 1A belongs to some large sequential pattern, then 1A will be converted to item 30, according to the mapping table, in the final large sequential patterns.
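For readers who prefer running code, a minimal sketch of the join-and-prune step (apriori-generate) on plainly visible data is given below. It assumes sequences are represented as tuples of frozensets and is an illustration only; it is not the privacy-preserving version developed in Section 2.4.

def apriori_generate(large_prev):
    # Join L(k-1) with L(k-1), then prune candidates containing an
    # infrequent (k-1)-subsequence. Sequences are tuples of itemsets.
    large_prev = set(large_prev)
    candidates = set()
    for p in large_prev:
        for q in large_prev:
            if p[:-1] == q[:-1]:              # first k-2 itemsets agree
                candidates.add(p + (q[-1],))  # append q's last itemset
    pruned = set()
    for c in candidates:
        subs = (c[:i] + c[i + 1:] for i in range(len(c)))
        if all(s in large_prev for s in subs):
            pruned.add(c)
    return pruned

L2 = {(frozenset({'B'}), frozenset({'A', 'C'})),
      (frozenset({'B'}), frozenset({'D'})),
      (frozenset({'A', 'C'}), frozenset({'D'}))}
print(apriori_generate(L2))   # {(B, (A,C), D)} under these toy inputs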




Figure 1. Raw Data Sorted By Customer ID tries in the vector where their values are 0’s. They then transform their values in time-vector into real numbers so that if transaction tr1 happens earlier than the transaction tr2 , then the real number to denote tr1 should smaller than the number that denotes tr2 . For instance, ”06/30/2003” and ”06/18/2003” can be transform to 363 and 361.8 respectively. The purpose of transformation is that we will securely compare them based their real number denotation. Next, we will present a secure protocol that allows n parties to compare their transaction time. The goal of our privacy-preserving classification system is to disclose no private data in every step. We firstly select a key generator who produces the encryption and decryption key pairs. The computation of the whole system is under encryption. For the purpose of illustration, let’s assume that Pn is the key generator who generates a homomorphic encryption key pair (e, d). Next, we will show how to conduct each step.

2.4 How to compute c.count To compute c.count, in other words, to compute the support for some candidate pattern (e.g., P (xi ∩ yi ∩ zi |xi ≥ yi ≥ zi )), we need to conduct two steps: one is to deal with the condition part where zi occurs before yi and both of them occur before xi ; the other is to compute the actual counts for this sequential pattern. If all the candidates belong to one party, then c.count, which refers to the frequency counts for candidates, can be computed by this party since this party has all the information needed to compute it. However, if the candidates belong to different parties, it is a non-trivial task to conduct the joint frequency counts while protecting the security of data. We provide the following steps to conduct this cross-parties’ computation. 2.4.1

Vector Construction

The parties construct vectors for their own attributes (mapped-ID). In each vector constructed from the mapped dataset, there are two components: one consists of the binary values (called the value-vector); the other consists of the transaction time (called the transaction time-vector). Suppose we want to compute the c.count for 2A ≥ 2B ≥ 6C in Fig. 4. We construct three vectors: 2A, 2B and 6C depicted in Fig. 5. 2.4.2
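The value-vector / time-vector pair can be pictured with a small sketch; the dates below are made up and only mirror the 2A and 2B vectors of Fig. 5 rather than reproducing them exactly.

from datetime import date

# For one mapped attribute, a party builds a value-vector (did the customer
# buy it?) and a time-vector (when), both indexed by customer-ID.
def build_vectors(purchases, n_customers):
    # purchases: {customer_id: transaction_date} for one mapped item.
    values = [0] * n_customers
    times = [None] * n_customers
    for cid, when in purchases.items():
        values[cid - 1] = 1
        times[cid - 1] = when
    return values, times

v_2A, t_2A = build_vectors({2: date(2003, 6, 10)}, 3)
v_2B, t_2B = build_vectors({2: date(2003, 6, 15)}, 3)
print(v_2A, t_2A)
print(v_2B, t_2B)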

2.4.3

of

Transaction

Without loss of generality, assuming there are k transaction time: e(g1 ), e(g2 ), · · ·, and e(gk ), with each corresponding to a transaction of a particular party. Protocol 1. . 1. Pn−1 computes e(gi )×e(gj )−1 = e(gi −gj ) for all i, j ∈ [1, k], i > j, and sends the sequence denoted by ϕ to Pn in a random order.

Transaction time comparison

To compare the transaction time, each time-vector should have a value. We let all the parties randomly generate a set of transaction time for en-


The Comparison Time

2. Pn decrypts each element in the sequence ϕ.




Figure 2. Raw Data Sorted By Customer ID and Transaction Time


Note that, in Alice's dataset, items 30 and 10 re-occur, so they are mapped to the same mapped-IDs.

Figure 3. Mapping Table
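A sketch of the Step II mapping that produces such a table is shown below; the item values are illustrative only, and re-occurring items reuse their existing mapped-ID as the text describes.

def build_mapping(rows):
    # rows: list of item lists, one per sorted transaction row.
    # Returns {item: mapped_id}, assigning ids in order of first occurrence.
    mapping = {}
    next_id = 1
    for row in rows:
        for item in row:
            if item not in mapping:          # re-occurrence keeps the old id
                mapping[item] = next_id
                next_id += 1
    return mapping

alice_rows = [[30], [10, 20], [9, 15], [30], [5, 10]]
print(build_mapping(alice_rows))   # {30: 1, 10: 2, 20: 3, 9: 4, 15: 5, 5: 6}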

He assigns the element +1 if the result of decryption is not less than 0, and −1, otherwise. Finally, he obtains a +1/−1 sequence denoted by ϕ0 .

Proof. Pn−1 is able to remove permutation effects from ϕ0 (the resultant sequence is denoted by ϕ00 ) since she has the permutation function that she used to permute ϕ, so that the elements in ϕ and ϕ00 have the same order. It means that if the qth position in sequence ϕ denotes e(gi − gj ), then the qth position in sequence ϕ00 denotes the evaluation results of gi − gj . We encode it as +1 if gi ≥ gj , and as -1 otherwise. Pn−1 has two sequences: one is the ϕ, the sequence of e(gi − gj ), for i, j ∈ [1, k](i > j), and the other is ϕ00 , the sequence of +1/−1. The two sequences have the same number of elements. Pn−1 knows whether or not gi is larger than gj by checking the corresponding value in the

3. Pn sends the +1/−1 sequence ϕ′ to Pn−1. 4. Pn−1 compares the transaction time of each entry of vectors such as 2A, 2B, and 6C in our example. She makes a temporary vector T. If the transaction time does not satisfy the requirement of 2A ≥ 2B ≥ 6C, she sets the corresponding entries of T to 0's; otherwise, she copies the original values in 6C to T (Fig. 5). Theorem 1. (Correctness). Protocol 1 correctly sorts the transaction time.
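To make the encrypted comparison concrete, here is a toy sketch using Paillier encryption with tiny, insecure parameters. The paper only assumes an additively homomorphic scheme and does not prescribe Paillier, so the key setup, the specific primes, and the integer scaling of transaction times below are assumptions made purely for illustration; a decrypted difference above n/2 is read as negative.

from math import gcd

# Toy Paillier key (tiny primes for illustration only -- not secure).
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                           # valid because g = n + 1

def enc(m, r):
    return (pow(g, m % n, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

def signed(m):
    # Interpret residues above n/2 as negative differences.
    return m if m < n // 2 else m - n

# Step 1: P_{n-1} forms e(g_i - g_j) = e(g_i) * e(g_j)^{-1} for all i > j.
g_vals = [363, 361, 366, 360]        # transaction times scaled to integers (assumption)
cts = [enc(v, 23 + i) for i, v in enumerate(g_vals)]
pairs = [(i, j) for i in range(len(g_vals)) for j in range(i)]
diffs = [cts[i] * pow(cts[j], -1, n2) % n2 for i, j in pairs]

# Step 2: P_n decrypts each difference and reveals only its sign (+1 / -1).
signs = [+1 if signed(dec(c)) >= 0 else -1 for c in diffs]
print(list(zip(pairs, signs)))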




06/26/03

N/A : The information is not available.

Figure 4. Data After Being Mapped

g1 g2 g3 ··· gk

g1 +1 -1 +1 ··· +1

g2 +1 +1 +1 ··· +1

g3 -1 -1 +1 ··· -1

··· ··· ··· ··· ··· ···

gk -1 -1 +1 ··· +1

S1 S2 S3 S4

S1 +1 +1 +1 +1

S2 -1 +1 +1 -1

S3 -1 -1 +1 -1

S4 -1 +1 +1 +1

Weight -2 +2 +4 0

Table 2.

Table 1. ϕ00

ments with g1 < g4 < g2 < g3 ; (2) the sequence ϕ is [e(g1 − g2 ), e(g1 − g3 ), e(g1 − g4 ), e(g2 − g3 ), e(g2 − g4 ), e(g3 − g4 )]. The sequence ϕ00 will be [−1, −1, −1, −1, +1, +1]. According to ϕ and ϕ00 , Pn−1 builds the Table 2. From the table, Pn−1 knows g3 > g2 > g4 > g1 since g3 has the largest weight, g2 has the second largest weight, g4 has the third largest weight, g1 has the smallest weight.

ϕ00

sequence. For example, if the first element is −1, Pn−1 concludes gi < gj . Pn−1 examines the two sequences and constructs the index table (Table 1) to compute the largest element. In Table 1, +1 in entry ij indicates that the value of the row (e.g., gi of the ith row) is not less than the value of a column (e.g., gj of the jth column); -1, otherwise. Pn−1 sums the index values of each row and uses this number as the weight of the information gain in that row. She then sorts the sequence according the weight. To make it clearer, let’s illustrate it by an example. Assume that: (1) there are 4 ele-


Theorem 2. (Privacy-Preserving). Assuming the parties follow the protocol, the private data are securely protected. Proof. We need prove it from two aspects: (1)



Step III Secure Number Product Protocol

c.count

Figure 5. A Protocol To Compute c.count

2.4.4

The Computation of c.count

Protocol 2. Privacy-Preserving Number Product Protocol

Theorem 3. (Efficiency). The computation of protocol 1 is efficient from both computation and communication point of view.

1. Pn sends e(xn1 ) to P1 . 2. P1 computes e(xn1 )x11 = e(xn1 x11 ), then sends it to P2 .

Proof. The total communication cost is upper bounded by αm2 . The total computation cost is upper bounded by m2 + m + 1. Therefore, the protocols are very fast.

3. P2 computes e(xn1 x11 )x21 = e(xn1 x11 x21 ). 4. Continue until Pn−1 obtains e(x11 x21 · · · xn1 ).

After the above step, they need to compute c.count based their value-vector. For example, to obtain c.count for 2A ≥ 2B ≥ 6C in Fig. 5, P they need to compute N i=1 2A[i] · 2B[i] · T [i] =


P

= 3i=1 2A[i]·2B[i]·T [i] = 0, where N is the total number of values in each vector. In general, let’s assume the value-vectors for P1 , · · ·, Pn are x1 , · · ·, xn respectively. Note that P1 ’s vector is T . For the purpose of illustration, we denote T by xn−1 . Next, we will show how n parties compute this count. without revealing their private data to each other. i=1 2A[i]·2B[i]·T [i]

Pn−1 doesn’t get transaction time (e.g.,gi ) for each vector. What Pn−1 gets are e(gi − gj ) for all i, j ∈ [1, k], i > j and +1/ − 1 sequence. By e(gi −gj ), Pn−1 cannot know each transaction time since it is encrypted. By +1/ − 1 sequence, Pn−1 can only know whether or not gi is greater than Pj . (2) Pn doesn’t obtain transaction time for each vector either. Since the sequence of e(gi − gj ) is randomized before being send to Pn who can only know the sequence of gi − gj , he can’t get each individual transaction time. Thus private data are not revealed.

5. Repeat all the above steps for x1i , x2i , · · ·, and xni until Pn−1 gets e(x1i x2i · · · xni ) for all i ∈ [1, N ].


time. In Section 2.4.4, we present protocols to compute c.count. We discussed the correctness of the computation in each section. As for the privacy protection, all the communications between the parties are encrypted, therefore, the parties who has no decryption key cannot gain anything out of the communication. On the other hand, there are some communication between the key generator and other parties. Although the communications are still encrypted, the key generator may gain some useful information. However, we guarantee that the key generator cannot gain the private data by adding random numbers in the original encrypted data so that even if the key generator get the intermediate results, there is little possibility that he can know the intermediate results. Therefore, the private data are securely protected with overwhelming probability. In summary, we provide a novel solution for sequential pattern mining over vertically partitioned private data. Instead of using data transformation, we define a protocol using homomorphic encryption to exchange the data while keeping it private. Our mining system is quite efficient that can be envisioned by the communication and computation complexity. The total communication complexity is upper bounded by α(nN + m2 − N ). The computation complexity is upper bounded by m2 + m + 5t + 4.

6. Pn−1 computes e(x11 x21 · · · xn1 ) × e(x12 x22 · · · xn2 ) × · · · × e(x1N x2N · · · xnN ) = e(x11 x21 · · · xn1 + x12 x22 · · · xn2 + · · · + x1N x2N · · · xnN ) = c.count. Theorem 4. (Correctness). Protocol 2 correctly compute c.count. Proof. In step 2, P1 obtains e(xn1 ). He then computes e(xn1 x11 ). In step 3, P2 computes e(xn1 x11 x21 ). Finally, in step 5, Pn−1 gets e(x1i x2i · · · xni ). He then computes e(x11 x21 · · · xn1 ) × e(x12 x22 · · · xn2 ) × · · · × e(x1N x2N · · · xnN ) = e(x11 x21 · · · xn1 + x12 x22 · · · xn2 + · · · + x1N x2N · · · xnN ) which is equal to c.count. Theorem 5. (Privacy-Preserving). Assuming the parties follow the protocol, the private data are securely protected. Proof. In protocol 2, all the data transmission are hidden under encryption. The parties who are not the key generator can’t see other parties’ private data. On the other hand, the key generator doesn’t obtain the encryption of other parties’s private data. Therefore, protocol 2 discloses no private data. Theorem 6. (Efficiency). The computation of c.count is efficient from both computation and communication point of view.
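Stripped of the encryption layer, the quantity the parties jointly obtain is simply a sum of per-record products of their private bits. The sketch below, with made-up vectors, only shows what is being computed, not how Protocol 2 hides the individual vectors.

from math import prod

# Value-vectors held by the parties for one candidate pattern; a 1 means the
# item was bought and (for T) the transaction-time condition is satisfied.
x = [
    [0, 1, 0],   # one party's vector (T in the running example)
    [0, 1, 0],   # another party's vector
    [0, 1, 1],   # another party's vector
]

N = len(x[0])
c_count = sum(prod(vec[i] for vec in x) for i in range(N))
print(c_count)   # number of transactions supporting the candidate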

References [1] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Philip S. Yu and Arbee S. P. Chen, editors, Eleventh International Conference on Data Engineering, pages 3–14, Taipei, Taiwan, 1995. IEEE Computer Society Press.

Proof. To prove the efficiency, we need conduct complexity analysis of the protocol. The bit-wise communication cost of protocol 2 is α(n − 1)N . The computation cost of protocol 2 is nN , of protocol 2 is 5t + 3. The total computation cost is upper bounded by nN − 1. Therefore, the protocols are sufficient fast.

3

[2] Jay Ayres, Johannes Gehrke, Tomi Yiu, and Jason Flannick. Sequential pattern mining using a bitmap representation. [3] G. Chirn. Pattern discovery in sequence databases: Algorithms and applications to DNA/protein classification. PhD thesis, Department of Computer and Information Science, New Jersey Institute of Technology, 1996.

Overall Discussion

Our privacy-preserving classification system contains several components. In Section 2.4.3, we show how to correctly compare the transaction


[4] O. party


Goldreich. computation

Secure (working

multidraft).

http://www.wisdom.weizmann.ac.il /home/oded/public html/foc.html, 1998. [5] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pages 218–229, 1987. [6] S. Goldwasser. Multi-party computations: Past and present. In Proceedings of the 16th Annual ACM Symposium on Principles of Distributed Computing, Santa Barbara, CA USA, August 21-24 1997. [7] H. Kum, J. Pei, W. Wang, and D. Duncan. ApproxMAP: Approximate mining of consensus sequential patterns. Technical Report TR02-031, UNC-CH, 2002. [8] Ramakrishnan Srikant and Rakesh Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Peter M. G. Apers, Mokrane Bouzeghoub, and Georges Gardarin, editors, Proc. 5th Int. Conf. Extending Database Technology, EDBT, volume 1057, pages 3–17. SpringerVerlag, 25–29 1996. [9] A. C. Yao. Protocols for secure computations. In Proceedings of the 23rd Annual IEEE Symposium on Foundations of Computer Science, 1982.



Privacy-Preserving Decision Tree Classification Over Vertically Partitioned Data Justin Zhan1 , Stan Matwin2 , and LiWu Chang3 School of Information Technology & Engineering, University of Ottawa, Canada 3 Center for High Assurance Computer Systems, Naval Research Laboratory, USA {zhizhan, stan}@site.uottawa.ca, [email protected]

1,2

Abstract

Protection of privacy is one of the important problems in data mining. The unwillingness to share data frequently results in the failure of collaborative data mining. This paper studies how to build a decision tree classifier under the following scenario: a database is vertically partitioned into multiple pieces, with each piece owned by a particular party. All the parties want to build a decision tree classifier based on such a database, but, due to the privacy constraints, none of them wants to disclose its private piece. We build a privacy-preserving system, including a set of secure protocols, that allows the parties to construct such a classifier. We guarantee that the private data are securely protected.

Keywords: privacy, decision tree, classification.

1 Introduction

Business success often relies on collaboration. Collaboration is even more critical in the modern business world, not only because of the mutual benefit it brings but also because a coalition of multiple partners is more competitive than each individual party. Assuming the parties trust each other to a degree that they can share their private data, collaboration becomes straightforward. However, in many scenarios, sharing data is impossible because of privacy concerns. Thus, collaboration without sharing private data becomes extremely important.

In this paper, we study a prevalent collaboration scenario, a collaboration involving a data mining task: multiple parties, each having a private data set, want to conduct data mining on the joint data set that is the union of all individual data sets; however, because of privacy constraints, no party wants to disclose its private data set to the others. The objective of this paper is to develop efficient methods that enable this type of computation while minimizing the amount of private information that each party has to disclose. Data mining includes various algorithms such as classification, association rule mining, and clustering. In this paper, we focus on classification. There are two types of classification between collaborative parties: Figure 1(a) shows data classification on horizontally partitioned data, and Figure 1(b) shows data classification on vertically partitioned data. To use the existing data classification algorithms on horizontally partitioned data, all the parties need to exchange some information, but they do not necessarily need to exchange every single record of their data sets. However, for vertically partitioned data, the situation is different. A direct use of the existing data classification algorithms requires one party to collect all other parties' data to conduct the computation. In situations where the data records contain private information, such a practice is infeasible. We study classification on vertically partitioned data; in particular, we study how to build a decision tree classifier on private data. In this




Figure 1. (a) Horizontally partitioned data; (b) vertically partitioned data

problem, each single record is divided into multiple pieces, with each party knowing one piece. We have developed a privacy-preserving system that allows them to build a decision tree classifier based on their joint data.

2

Privacy-Preserving Classification

Next, we give the notations that we will follow.

2.1 Notations • e: public key.

Decision-Tree

• d: private key. • Pi : the ith party.

Classification is an important problem in the field of data mining. In classification, we are given a set of example records, called the training data set, with each record consisting of several attributes. One of the categorical attributes, called the class label, indicates the class to which each record belongs. The objective of classification is to use the training data set to build a model of the class label such that it can be used to classify new data whose class labels are unknown. Many types of models have been built for classification, such as neural networks, statistical models, genetic models, and decision tree models. The decision tree models are found to be the most useful in the domain of data mining since they obtain reasonable accuracy and they are relatively inexpensive to compute. We define our problem as follows:

• n: the total number of parties. Assuming n > 2. • m: the total number of class. • xij : the jth element in Pi ’s private attribute. • α is the number of bits for each transmitted element in the privacy-preserving protocols. • N: the total number of records.

2.2 Decision Tree Classification Algorithm Classification is one of the forms for data analysis that can be used to extract models describing important data classes or to predict future data. It has been studied extensively by the community in machine learning, expert system, and statistics as a possible solution to knowledge discovery problems. The decision tree is one of the classification methods. A decision tree is a class discriminator that recursively partitions the training set until each partition entirely or dominantly consists of examples from one class. A well known algorithm

Problem 1. We consider the scenario where n parties, each having a private data set (denoted by S1, S2, · · ·, and Sn respectively), want to collaboratively conduct decision tree classification on the union of their data sets. The data sets are assumed to be vertically partitioned. Because they are concerned about data privacy, neither party is willing to disclose its raw data set to the others.



is the number of elements in S. To find the best split for a tree node, we compute information gain for each attribute. We then use the attribute with the largest information gain to split the node.
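A plain, non-private sketch of this split selection may help; the attribute and class values below are hypothetical, and the computation simply follows the entropy and information-gain formulas used by ID3.

from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # rows: list of dicts (attribute -> value), labels: class label per row.
    base = entropy(labels)
    by_value = Counter(r[attr] for r in rows)
    remainder = 0.0
    for v, count in by_value.items():
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        remainder += (count / len(rows)) * entropy(subset)
    return base - remainder

rows = [{'outlook': 'sunny'}, {'outlook': 'sunny'},
        {'outlook': 'rain'}, {'outlook': 'rain'}]
labels = ['no', 'no', 'yes', 'yes']
best = max(rows[0], key=lambda a: info_gain(rows, labels, a))
print(best, info_gain(rows, labels, 'outlook'))   # outlook 1.0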

for building decision tree classifiers is ID3 [13]. We describe the algorithm below where S represents the training samples and AL represents the attribute list:

2.3 Cryptography Tools ID3(S, AL) In this paper, we use the concept of homomorphic encryption which was originally proposed in [18]. Since then, many such systems have been proposed [3, 15, 16, 17]. We observe that some homomorphic encryption schemes, such as [4], are not robust against chosen cleartext attacks. However, we base our secure protocols on [17], which is semantically secure [9]. In our secure protocols, we use additive homomorphism offered by [17]. In particular, we utilize the following characterizer of the homomorphic encryption functions: e(a1 ) × e(a2 ) = e(a1 + a2 ) where e is an encryption function; a1 and a2 are the data to be encrypted. Because of the property of associativity, e(a1 + a2 + .. + an ) can be computed as e(a1 )×e(a2 )×· · ·×e(an ) where e(ai ) 6= 0. That is

1. Create a node V. 2. If S consists of samples with all the same class C then return V as a leaf node labelled with class C. 3. If AL is empty, then return V as a leaf-node with the majority class in S. 4. Select test attribute (T A) among the AL with the highest information gain. 5. Label node V with T A. 6. For each known value ai of T A (a) Grow a branch from node V for the condition T A = ai . (b) Let si be the set of samples in S for which T A = ai . (c) If si is empty then attach a leaf labelled with the majority class in S. (d) Else attach the node returned by ID3(si , AL − T A). According to ID3 algorithm, each non-leaf node of the tree contains a splitting point, and the main task for building a decision tree is to identify an attribute for the splitting point based on the information gain. Information gain can be computed using entropy. In the following, we assume there are m classes in the whole training data set. Entropy(S) is defined as follows:

d(e(α1 )α2 ) = d(e(α1 α2 ))

(4)

The privacy-preserving classification system contains several secure protocols that multiple parties need follow. There are five critical steps:

X Qj log Qj ,

(3)

2.4 Privacy-Preserving Decision Tree Classification System

m

Entropy(S) = −

d(e(a1 + a2 + · · · + an )) = d(e(a1 ) × e(a2 ) × · · · × e(an ))

(1)

j=1

• To compute Entropy(Sv ).

where Qj is the relative frequency of class j in S. Based on the entropy, we can compute the information gain for any candidate attribute A if it is used to partition S: Gain(S, A) = Entropy(S) −

X |Sv | (

v∈A

|S|

Entropy(Sv )),

(2)

|Sv | |S| .

• To compute

|Sv | |S| Entropy(Sv ).

• To compute information gain for each candidate attribute.

where v represents any possible values of attribute A; Sv is the subset of S for which attribute A has value v; |Sv | is the number of elements in Sv ; |S|

2005 IEEE ICDM Workshop on MADW & MADM

• To compute

• To compute the attribute with the largest information gain.

29

2.4.1 Computation of e(Entropy(Sv))

The goal of our privacy-preserving classification system is to disclose no private data in any step. We first select a key generator who produces the encryption and decryption key pair. The computation of the whole system is carried out under encryption. For the purpose of illustration, let us assume that Pn is the key generator who generates a homomorphic encryption key pair (e, d). Next, we show how to conduct each step.

Protocol 1. To compute e(Qj)
1. Pn sends e(xn1) to P1.
2. P1 computes e(xn1)^x11 = e(xn1 x11), then sends it to P2.
3. P2 computes e(xn1 x11)^x21 = e(xn1 x11 x21).
4. Continue until Pn-1 obtains e(x11 x21 · · · xn1).
5. Repeat all the above steps for x1i, x2i, · · ·, and xni until Pn-1 gets e(x1i x2i · · · xni) for all i ∈ [1, N].
6. Pn-1 computes e(x11 x21 · · · xn1) × e(x12 x22 · · · xn2) × · · · × e(x1N x2N · · · xnN) = e(x11 x21 · · · xn1 + x12 x22 · · · xn2 + · · · + x1N x2N · · · xnN).
7. Pn-1 computes e(x11 x21 · · · xn1 + x12 x22 · · · xn2 + · · · + x1N x2N · · · xnN)^(1/N) = e(Qj).

Protocol 2. To compute e(Qj log(Qj))
1. Pn-1 generates a set of random numbers r1, r2, · · ·, rt.
2. Pn-1 sends the sequence e(Qj), e(r1), e(r2), · · ·, e(rt) to Pn in a random order.
3. Pn decrypts each element in the sequence and sends log(Qj), log(r1), log(r2), · · ·, log(rt) to P1 in the same order as Pn-1 did.
4. P1 adds a random number R to each of the elements, then sends them to Pn-1.
5. Pn-1 obtains log(Qj) + R and computes e(Qj)^(log(Qj)+R) = e(Qj log(Qj) + R Qj).
6. Pn-1 sends e(Qj) to P1.
7. P1 computes e(Qj)^(-R) = e(-R Qj) and sends it to Pn-1.
8. Pn-1 computes e(Qj log(Qj) + R Qj) × e(-R Qj) = e(Qj log(Qj)).

Protocol 3. To compute e(Entropy(Sv))
1. Repeat Protocols 1-2 to compute e(Qj log(Qj)) for all j's.
2. Pn-1 computes e(Entropy(Sv)) = Π_j e(Qj log(Qj)) = e(Σ_j Qj log(Qj)).

Theorem 1. (Correctness). Protocols 1-3 correctly compute the entropy.

Proof. In Protocol 1, Pn-1 obtains e(Qj). In Protocol 2, Pn-1 gets e(Qj log(Qj)). These two protocols are used repeatedly until Pn-1 obtains e(Qj log(Qj)) for all j's. In Protocol 3, Pn-1 computes the entropy from all the terms previously obtained. Notice that although we use Entropy(Sv) for illustration, Entropy(S) can be computed following the above protocols with different input attributes.

Theorem 2. (Privacy-Preserving). Assuming the parties follow the protocol, the private data are securely protected.

Proof. In Protocol 1, all the data transmissions are hidden under encryption. The parties who are not the key generator cannot see other parties' private data. On the other hand, the key generator does not obtain the encryptions of other parties' private data. Therefore, Protocol 1 discloses no private data. In Protocol 2, although Pn-1 sends e(Qj) to Pn, Qj is hidden among a set of random numbers known only by Pn-1. Thus private data are not revealed. In Protocol 3, the computations are still under encryption, so no private data are disclosed either.
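Protocols 1-3 rely only on the two homomorphic properties stated in equations (3) and (4). The paper does not commit to one particular cryptosystem; the sketch below uses textbook Paillier encryption (one of the schemes cited in the references) with deliberately tiny toy parameters, purely to make the two properties concrete:

```python
import math, random

def lcm(a, b):
    return a * b // math.gcd(a, b)

def L(u, n):
    # Paillier's "L-function": L(u) = (u - 1) / n
    return (u - 1) // n

def keygen(p, q):
    # p, q: distinct primes; toy sizes here, at least 1024-bit primes in practice
    n = p * q
    n2 = n * n
    lam = lcm(p - 1, q - 1)
    g = n + 1                               # standard simplified generator
    mu = pow(L(pow(g, lam, n2), n), -1, n)  # modular inverse (Python 3.8+)
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (L(pow(c, lam, n * n), n) * mu) % n

pk, sk = keygen(61, 53)          # toy primes only
n, n2 = pk[0], pk[0] ** 2
a, b, k = 7, 12, 5
ca, cb = encrypt(pk, a), encrypt(pk, b)

# Property (3): the product of ciphertexts decrypts to the sum of the plaintexts.
assert decrypt(pk, sk, (ca * cb) % n2) == a + b
# Property (4): raising a ciphertext to a constant decrypts to the scaled plaintext.
assert decrypt(pk, sk, pow(ca, k, n2)) == (a * k) % n
print("homomorphic properties hold")
```

Any scheme with these two properties could be substituted without changing the protocols.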

Theorem 3. (Efficiency). The computation of the entropy is efficient from both the computation and communication points of view.

Proof. To prove the efficiency, we need to conduct a complexity analysis of the protocols. The bit-wise communication cost of Protocol 1 is α(n-1)N, and that of Protocol 2 is α(3t+5). The total communication cost has the upper bound αm(nN + 3t - N + 5). The computation cost of Protocol 1 is nN, and that of Protocol 2 is 5t + 3. The total computation cost is upper bounded by mnN + 5mt + 4m. Therefore, the protocols are sufficiently fast.

2.4.2 The Computation of (|Sv|/|S|) Entropy(Sv)

Protocol 4. To compute |Sv|/|S|
1. Pn-1 sends e(|Sv|) to the party (e.g., Pi) who holds the parent node.
2. Pi computes e(|Sv|)^(1/|S|) = e(|Sv|/|S|), then sends it to Pn-1.

Up to now, Pn-1 has obtained e(|Sv|/|S|) and e(Entropy(Sv)). Next, we discuss how to compute (|Sv|/|S|) Entropy(Sv).

Protocol 5. To compute (|Sv|/|S|) Entropy(Sv)
1. Pn-1 sends e(|Sv|/|S|) to P1.
2. P1 computes e(|Sv|/|S|) × e(R') = e(|Sv|/|S| + R'), where R' is a random number known only by P1, then sends e(|Sv|/|S| + R') to Pn.
3. Pn decrypts it and sends |Sv|/|S| + R' to Pn-1.
4. Pn-1 computes e(Entropy(Sv))^(|Sv|/|S| + R') = e((|Sv|/|S|) Entropy(Sv) + R' Entropy(Sv)).
5. Pn-1 sends e(Entropy(Sv)) to P1.
6. P1 computes e(Entropy(Sv))^(-R') = e(-R' Entropy(Sv)), and sends it to Pn-1.
7. Pn-1 computes e((|Sv|/|S|) Entropy(Sv) + R' Entropy(Sv)) × e(-R' Entropy(Sv)) = e((|Sv|/|S|) Entropy(Sv)).

Theorem 4. (Correctness). Protocols 4-5 correctly compute (|Sv|/|S|) Entropy(Sv).

Proof. In Protocol 4, Pn-1 obtains e(|Sv|/|S|). In Protocol 5, Pn-1 gets e((|Sv|/|S|) Entropy(Sv)). The computation uses both properties of homomorphic encryption.

Theorem 5. (Privacy-Preserving). Assuming the parties follow the protocol, the private data are securely protected.

Proof. In Protocol 4, all the data communication is hidden under encryption. The key generator does not receive any data. The parties who are not the key generator cannot see other parties' private data. Therefore, Protocol 4 discloses no private data. In Protocol 5, although P1 sends e(|Sv|/|S| + R') to Pn, |Sv|/|S| is hidden by a random number known only by P1. Thus private data are not revealed.

Theorem 6. (Efficiency). The computation of Protocols 4 and 5 is efficient from both the computation and communication points of view.

Proof. To prove the efficiency, we need to conduct a complexity analysis of the protocols. The total communication cost is 7α. The total computation cost is 8. Therefore, the protocols are very efficient.

2.4.3 The Computation of the Attribute with the Largest Information Gain

Following the above protocols, we can compute e(Entropy(S)) and e((|Sv|/|S|) Entropy(Sv)). What is left is to compute the information gain for each attribute and select the attribute with the largest information gain.

Protocol 6. To Compute the Information Gain for an Attribute
1. Pn-1 computes Π_{v∈A} e((|Sv|/|S|) Entropy(Sv)) = e(Σ_{v∈A} (|Sv|/|S|) Entropy(Sv)).
2. He computes e(Σ_{v∈A} (|Sv|/|S|) Entropy(Sv))^(-1) = e(-Σ_{v∈A} (|Sv|/|S|) Entropy(Sv)).
3. He computes e(Gain(S, A)) = e(Entropy(S)) × e(-Σ_{v∈A} (|Sv|/|S|) Entropy(Sv)).
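To make the ciphertext manipulations in Protocol 6 concrete, the fragment below continues the earlier Paillier sketch (same hypothetical keygen/encrypt/decrypt helpers). The plaintexts are small integers standing in for the scaled entropy terms; the paper does not spell out how the fractional values are encoded, so treat this purely as an illustration of steps 1-3:

```python
# Continues the Paillier sketch above (hypothetical helpers, integer stand-ins).
pk, sk = keygen(61, 53)
n, n2 = pk[0], pk[0] ** 2

weighted = [encrypt(pk, w) for w in (3, 5, 2)]   # stand-ins for e((|Sv|/|S|) Entropy(Sv)), v in A
c_entropy = encrypt(pk, 17)                      # stand-in for e(Entropy(S))

# Step 1: multiplying the ciphertexts yields the encryption of the sum over v in A.
c_sum = 1
for c in weighted:
    c_sum = (c_sum * c) % n2

# Step 2: raising to -1 (a modular inverse) yields the encryption of the negated sum.
c_neg = pow(c_sum, -1, n2)

# Step 3: e(Gain(S, A)) = e(Entropy(S)) x e(-sum).
c_gain = (c_entropy * c_neg) % n2
print(decrypt(pk, sk, c_gain))                   # 17 - (3 + 5 + 2) = 7
```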

Once we compute the information gain for each candidate attribute, we then compute the attribute with the largest information gain. Without loss of generality, assume there are k information gains e(g1), e(g2), · · ·, e(gk), each corresponding to a particular attribute.

Protocol 7. To Compute the Largest Information Gain
1. Pn-1 computes e(gi) × e(gj)^(-1) = e(gi - gj) for all i, j ∈ [1, k], i > j, and sends the sequence, denoted by ϕ, to Pn in a random order.
2. Pn decrypts each element in the sequence ϕ. He assigns the element +1 if the result of decryption is not less than 0, and -1 otherwise. Finally, he obtains a +1/-1 sequence denoted by ϕ'.
3. Pn sends the +1/-1 sequence ϕ' to Pn-1, who computes the largest element.

Theorem 7. (Correctness). Protocols 6-7 correctly compute the attribute with the largest information gain.

Proof. In Protocol 6, Pn-1 obtains e(Gain(S, A)). In Protocol 7, Pn-1 gets the attribute with the largest information gain. We discuss the details as follows. Pn-1 is able to remove the permutation effects from ϕ' (the resultant sequence is denoted by ϕ'') since she has the permutation function that she used to permute ϕ, so that the elements in ϕ and ϕ'' have the same order. It means that if the qth position in sequence ϕ denotes e(gi - gj), then the qth position in sequence ϕ'' denotes the evaluation result of gi - gj. We encode it as +1 if gi ≥ gj, and as -1 otherwise. Pn-1 has two sequences: one is ϕ, the sequence of e(gi - gj) for i, j ∈ [1, k] (i > j), and the other is ϕ'', the sequence of +1/-1 values. The two sequences have the same number of elements. Pn-1 knows whether or not gi is larger than gj by checking the corresponding value in the ϕ'' sequence. For example, if the first element of ϕ'' is -1, Pn-1 concludes gi < gj. Pn-1 examines the two sequences and constructs the index table (Table 1) to compute the largest element.

        g1   g2   g3   ···  gk
  g1    +1   +1   -1   ···  -1
  g2    -1   +1   -1   ···  -1
  g3    +1   +1   +1   ···  +1
  ···   ···  ···  ···  ···  ···
  gk    +1   +1   -1   ···  +1

  Table 1.

In Table 1, +1 in entry ij indicates that the information gain of the row (e.g., gi of the ith row) is not less than the information gain of the column (e.g., gj of the jth column); -1, otherwise. Pn-1 sums the index values of each row and uses this number as the weight of the information gain in that row. She then selects the one that corresponds to the largest weight. To make it clearer, let us illustrate it by an example. Assume that: (1) there are 4 information gains with g1 < g4 < g2 < g3; (2) the sequence ϕ is [e(g1 - g2), e(g1 - g3), e(g1 - g4), e(g2 - g3), e(g2 - g4), e(g3 - g4)]. The sequence ϕ'' will be [-1, -1, -1, -1, +1, +1]. According to ϕ and ϕ'', Pn-1 builds Table 2. From the table, Pn-1 knows g3 is the largest element since its weight, +4, is the largest.

        S1   S2   S3   S4   Weight
  S1    +1   -1   -1   -1   -2
  S2    +1   +1   -1   +1   +2
  S3    +1   +1   +1   +1   +4
  S4    +1   -1   -1   +1    0

  Table 2.

Theorem 8. (Privacy-Preserving). Assuming the parties follow the protocol, the private data are securely protected.

Proof. In Protocol 6, there is no data transmission. In Protocol 7, we need to prove it from two aspects: (1) Pn-1 does not get the information gain (e.g., gi) for each attribute. What Pn-1 gets are e(gi - gj) for all i, j ∈ [1, k], i > j, and the +1/-1 sequence. From e(gi - gj), Pn-1 cannot learn each information gain since it is encrypted. From the +1/-1 sequence, Pn-1 can only learn whether or not gi is greater than gj. (2) Pn does not obtain the information gain for each attribute either. Since the sequence of e(gi - gj) is randomized before being sent to Pn, who can only learn the sequence of gi - gj values, he cannot get each individual information gain. Thus private data are not revealed.
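The table-and-weight selection that Pn-1 performs on the ϕ'' sequence is simple enough to state in a few lines. The sketch below (plain Python, hypothetical names, working on already-decrypted comparison signs rather than ciphertexts) rebuilds the row weights of Table 1 and reproduces the g3 example of Table 2:

```python
from itertools import combinations

def largest_by_pairwise_signs(k, signs):
    # signs[(j, i)] = +1 if g_j >= g_i else -1, for every pair with j > i (the phi'' values)
    weight = [0] * (k + 1)                  # 1-based: weight[i] is the row weight of g_i
    for i in range(1, k + 1):
        weight[i] += 1                      # diagonal entry: g_i >= g_i
    for i, j in combinations(range(1, k + 1), 2):   # i < j
        s = signs[(j, i)]
        weight[j] += s                      # row g_j versus column g_i
        weight[i] -= s                      # row g_i versus column g_j (antisymmetric)
    best = max(range(1, k + 1), key=lambda i: weight[i])
    return best, weight[1:]

# Example with g1 < g4 < g2 < g3, as in the text.
g = {1: 0.1, 2: 0.6, 3: 0.9, 4: 0.3}
signs = {(j, i): (1 if g[j] >= g[i] else -1)
         for i, j in combinations(range(1, 5), 2)}
print(largest_by_pairwise_signs(4, signs))  # (3, [-2, 2, 4, 0])
```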

Theorem 9. (Efficiency). The computation of Protocols 6 and 7 is efficient from both the computation and communication points of view.

Proof. The total communication cost is upper bounded by αm². The total computation cost is upper bounded by m² + m + 1. Therefore, the protocols are very fast.

3 Overall Discussion

Our privacy-preserving classification system contains several components. In Section 2.4.1, we show how to correctly compute e(Entropy(Sv)). In Section 2.4.2, we present protocols to compute (|Sv|/|S|) Entropy(Sv). In Section 2.4.3, we show how to compute the information gain for each candidate attribute; we then describe how to obtain the attribute with the largest information gain. We discussed the correctness of the computation in each section, and overall correctness is also guaranteed. As for privacy protection, all the communications between the parties are encrypted; therefore, the parties who have no decryption key cannot gain anything from the communication. On the other hand, there is some communication between the key generator and the other parties. Although these communications are still encrypted, the key generator might gain some useful information. However, we guarantee that the key generator cannot obtain the private data, because random numbers are added to the original encrypted data, so that even if the key generator gets the intermediate results there is little possibility that he can learn them. Therefore, the private data are securely protected with overwhelming probability.

4 Conclusion

Prior to concluding this paper, we describe the most related works. In early work on privacy-preserving data mining, Lindell and Pinkas [14] propose a solution to the privacy-preserving classification problem using an oblivious transfer protocol, a powerful tool developed by secure multiparty computation (SMC) research [22, 10]. Techniques based on SMC for efficiently dealing with large data sets have been addressed in [21]. Randomization approaches were first proposed by Agrawal and Srikant [2] to solve the privacy-preserving data mining problem. Researchers then proposed further random perturbation-based techniques to tackle these problems (e.g., [5, 19, 7]). In addition to perturbation, aggregation of data values [20] provides another alternative to mask the actual data values. In [1], the authors studied the problem of computing the kth-ranked element. Dwork and Nissim [6] showed how to learn certain types of boolean functions from statistical databases in terms of a measure of probability difference with respect to probabilistic implication, where data are perturbed with noise for the release of statistics. The problem we are studying is actually a special case of a more general problem, the Secure Multi-party Computation (SMC) problem. Briefly, an SMC problem deals with computing any function on any input, in a distributed network where each participant holds one of the inputs, while ensuring that no more information is revealed to a participant in the computation than can be inferred from that participant's input and output [12]. The SMC literature is extensive, having been introduced by Yao [22] and expanded by Goldreich, Micali, and Wigderson [11] and others [8]. It has been proved that for any function, there is a secure multi-party computation solution [10]. The approach used is as follows: the function F to be computed is first represented as a combinatorial circuit, and then the parties run a short protocol for every gate in the circuit. Every participant gets corresponding shares of the input wires and the output wires for every gate. This approach, though appealing in its generality and simplicity, means that the size of the protocol depends on the size of the circuit, which depends on the size of the input. This is highly inefficient for large inputs, as in data mining. It has been well accepted that for special cases of computation, special solutions should be developed for efficiency reasons.

In this paper, we provide a novel solution for decision tree classification over vertically partitioned private data. Instead of using data transformation, we define a protocol using homomorphic encryption to exchange the data while keeping it private. The efficiency of our classification system can be seen from its communication and computation complexity: the total communication complexity is upper bounded by α(mnN + 3mt - mN + m² + 12), and the computation complexity is upper bounded by mnN + 5mt + m² + 5m + 9.

References

[1] G. Aggarwal, N. Mishra, and B. Pinkas. Secure computation of the kth-ranked element. In EUROCRYPT 2004, pages 40-55, 2004.

[2] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 439-450. ACM Press, May 2000.

[3] J. Benaloh. Dense probabilistic encryption. In Proceedings of the Workshop on Selected Areas of Cryptography, pages 120-128, Kingston, Ontario, May 1994.

[4] J. Domingo-Ferrer. A provably secure additive and multiplicative privacy homomorphism. In Information Security Conference, pages 471-483, 2002.

[5] W. Du and Z. Zhan. Using randomized response techniques for privacy-preserving data mining. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24-27, 2003.

[6] C. Dwork and K. Nissim. Privacy-preserving datamining on vertically partitioned databases. In CRYPTO 2004, pages 528-544, 2004.

[7] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the Twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 211-222, San Diego, CA, June 9-12, 2003.

[8] M. Franklin, Z. Galil, and M. Yung. An overview of secure distributed computing. Technical Report TR CUCS-00892, Department of Computer Science, Columbia University, 1992.

[9] B. Goethals, S. Laur, H. Lipmaa, and T. Mielikainen. On secure scalar product computation for privacy-preserving data mining. In Proceedings of the 7th Annual International Conference on Information Security and Cryptology (ICISC 2004), volume 3506 of Lecture Notes in Computer Science, pages 104-120, Seoul, Korea, December 2-3, 2004. Springer-Verlag, 2004.

[10] O. Goldreich. Secure multiparty computation (working draft). http://www.wisdom.weizmann.ac.il/home/oded/public_html/foc.html, 1998.

[11] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pages 218-229, 1987.

[12] S. Goldwasser. Multi-party computations: Past and present. In Proceedings of the 16th Annual ACM Symposium on Principles of Distributed Computing, Santa Barbara, CA, USA, August 21-24, 1997.

[13] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.

[14] Y. Lindell and B. Pinkas. Privacy preserving data mining. In Advances in Cryptology - CRYPTO 2000, volume 1880 of Lecture Notes in Computer Science, 2000.

[15] D. Naccache and J. Stern. A new public key cryptosystem based on higher residues. In Proceedings of the 5th ACM Conference on Computer and Communications Security, pages 59-66, San Francisco, California, USA, 1998.

[16] T. Okamoto and S. Uchiyama. A new public-key cryptosystem as secure as factoring. In Eurocrypt '98, LNCS 1403, pages 308-318, 1998.

[17] P. Paillier. Public key cryptosystems based on composite degree residuosity classes. In Advances in Cryptology - Eurocrypt '99, LNCS 1592, pages 223-238. Springer-Verlag, 1999.

[18] R. Rivest, L. Adleman, and M. Dertouzos. On data banks and privacy homomorphisms. In Foundations of Secure Computation, eds. R. A. DeMillo et al., Academic Press, pages 169-179, 1978.

[19] S. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.

[20] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557-570, 2002.

[21] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 639-644, Edmonton, Alberta, Canada, July 23-26, 2002.

[22] A. C. Yao. Protocols for secure computations. In Proceedings of the 23rd Annual IEEE Symposium on Foundations of Computer Science, 1982.


Data Mining for Adaptive Web Cache Maintenance
Sujaa Rani Mohan, E.K. Park, Yijie Han
University of Missouri, Kansas City
{srmhv7 | ekpark | hanyij }@umkc.edu

Abstract
Proxy web caching is commonly implemented to decrease web access latency, internet bandwidth costs and origin web server load. Data mining techniques such as URL (Uniform Resource Locator) text mining, web-content mining, etc. improve web-cache usage but add overhead to clients and/or routers. Moreover, changing web access patterns dictate what needs to be cached in the web proxy caches. Designing the configuration settings that maintain optimal performance for a proxy cache system requires an adaptive cache configuration mechanism. A transparent, shareable proxy caching system is proposed in which the system of proxy caches adapts itself to changing web access patterns. We propose an algorithm that mines client web access patterns to classify clients, reconfigure proxy caches, and assign clients to proxy caches. Four lightweight agents, in a system of proxy caches and a web cache server, ensure optimal use of computing resources and significantly increased cache performance.

1. Introduction

Internet Service Providers (ISPs) are the commercial service companies that provide internet access to individuals and/or enterprises. Usually a client connects to an ISP through a modem/cable after establishing an account with them. Some providers only offer a basic connection to the Internet while others provide standard services. With the number of ISPs growing on a daily basis around the world, there is room for plenty of competition for acquiring more subscribers. The key to getting more clients is being able to provide cheaper subscription rates with better accessibility. Web caching, among other techniques, is commonly used to significantly reduce internet costs. In [2], the authors study various trends and techniques in web caching. The proxy caches themselves are located at the backbone routers which switch incoming requests from clients, either randomly or, more recently, based on data mining techniques [2] such as URL (Uniform Resource Locator) text mining, web content mining, etc. Consider the general scenario in which caches and clients are deployed by a standard ISP (see Figure 1). The clients of an ISP are connected through a mesh of interception routers to the backbone router. This backbone router switches the client's requests to the appropriate web cache, usually the nearest (least cost) one. If this cache is not able to serve the request then the request is broadcast to its neighbor caches. If the neighbor caches still cannot serve the request then the request is sent to the origin web server. This helps in saving internet bandwidth and decreases the web access latency by avoiding repeated calls to the origin web server.

2. Current and related work

One of the main concerns for the ISPs is to find an optimal configuration for the proxy caches to best utilize the available resources. Proxy web caches are temporary stores of frequently accessed web objects at an intermediate place between the origin web server and the client, as opposed to browser caches, which are present on the client machine. Web objects can be data or image files downloaded from the origin web server. Also, large files can be downloaded and stored in the cache, thereby decreasing the time to download the file from the web. The size of a proxy web cache is however limited. There are many efficient techniques such as web prefetching [9] and cache replacement [11] which help in deciding what needs to be kept in a cache and what needs to be replaced. In [4], an efficient cache replacement algorithm for the integration of web caching and prefetching is proposed. A web cache server maintains the system of proxy caches and is responsible for their smooth operation. It helps configure the proxy caches based on the type of pre-fetching technique used, replacement strategy followed, etc. This is a tedious process and there is no single configuration policy that exists to maintain optimal performance as the

performance highly depends on user web access patterns. An adaptive mechanism is hence required.

2.1. Data mining

Data mining is the analysis of data to establish relationships and identify hidden patterns of data which would otherwise go unnoticed. Web usage patterns have been mined for site evaluations [14]. These approaches however overload the caches and/or the backbone routers. Backbone routers form the major connection to the outside networks, and hence overloading the backbone routers will noticeably slow down the network. Data mining has been extensively used in improving the overall web experience based on the web usage patterns it identifies. In [9], the authors employ an efficient pattern based approach for prefetching web objects. In [19], data mining is used to mine web log data to perform predictive prefetching of URLs that a client may request in the future based on past requests.

[Figure 1. General proxy cache deployment: client machines 1...n connect through the ISP's backbone router to sharable proxy web caches 1...k, managed by a web cache server, which fetch from the origin web servers.]

2.2. Caching Architecture

A common caching system comprises a cache server which manages a group of proxy caches working together to serve a number of clients. There are several caching architectures, each with its own advantages and disadvantages. For example, though hierarchical caching has configuration hassles and unnecessary delays, including issues with security, it has shorter connection times and lower bandwidth usage than distributed caching [13]. [16] discusses using cluster caches, wherein a URL is placed in any one of a cluster of caches. This also involves configuration problems. Hybrid caches are more favored as they combine the best features among the existing architectures, but are difficult to implement and maintain [2]. Web caching also involves different strategies for prefetching URLs, replacement of stale objects in the cache, etc. Many cache servers support different types of architectures and also allow following different cache replacement and prefetching strategies based on how the configurations are set. Such static configurations however fail to maintain a consistent cache performance when overloaded or for varying users' web access patterns. Web caching performance also depends on resource availability and cost, etc., which was not considered by them. There have also been some approaches which dynamically change the way caching is performed. In [7], a web caching agent acquires knowledge about web objects to deduce effective caching strategies dynamically. The cache hit ratio increased by 20% on average when their adaptive admittance algorithm was used as opposed to traditional replacement methods. A prototypical system was designed and developed to support data warehousing of web log data, extraction of data mining models and simulation of web caching algorithms [3]. They used an external agent to update their data mining model dynamically whenever the performance declines. Their results show that their proposed method gives a 50-75% increase in performance when compared to traditional replacement algorithms such as LRU. An association rule based approach is used to predict the web objects to retain in the cache [18]. Since the algorithm runs on every web object requested, there is a substantial overhead involved. Such approaches however neither make the best use of available resources nor preserve an optimal cache performance. An adaptive approach which dynamically changes the policies used to manage the proxy caches is needed. Our paper proposes a framework for such an approach to personalize the configuration of proxy caches based on users' varying web access patterns, thereby achieving both optimal resource utilization and optimal cache performance. In our approach, we have loosely extended the web personalization concept to web caches to personalize what the web cache contains and how it manages the data in accordance with clients' changing needs. In more specific terms, it regulates the use of cache resources, the cache replacement strategy, the prefetching technique used, etc., as per the clients' web access patterns. Usually proxy caches are configurable and the network administrator can specify the cache parameters. Deciding the most suitable parameter options is difficult, especially if the users' web access patterns cannot be predicted. Our paper uses agents to set up the proxy cache system and maintain their cache parameters for best utilization. The paper is organized as follows. Section 3 presents our approach, which includes a detailed description of the association rule based rule set definition for identifying configuration settings necessary for the proxy caches for optimal cache performance. A brief description of the caching architecture and the multi-agent system, including

an algorithm showing the collaborative working of the various agents, is also given. Section 4 gives a verification of the feasibility of our approach and a brief discussion of the results. We conclude with our ongoing work and a discussion on the scalability of this approach to different caching architectures in Section 5. In this paper, we use the terms client and user, and itemset and rule set, interchangeably.

3. Proposed Approach

Our approach studies the user web access patterns and customizes how the cache is configured for cache replacements, web object prefetching, etc. in a personalized way so that there is a balance between effective network usage and the given resources. This is based on the framework suggested in [15]. Our approach has been developed with the following design considerations in mind:
a. Use an existing caching architecture so that ISPs do not need to change the existing costly deployment of proxy caches.
b. Use a data mining technique to mine for web patterns that will not overload available resources.
c. There should be no re-configurations necessary for either the clients or the routers (transparency).
A multi-agent system employs classification algorithms and data mining techniques to study the client's web access patterns and configure caches to best serve the clients with available resources. Also, a performance monitor maintains the web cache performance within acceptable limits at all times. The proposed solution uses intelligent agents and a variation of the Apriori algorithm [1] for web access pattern mining, cache and performance maintenance. Some basic assumptions made are given below:
a. The clients have varying web access patterns, and using a single cache configuration for all proxy caches would degrade cache performance drastically.
b. There may be other configuration rules based on which a cache is set up besides those considered here. Usually such rules are rigid and cannot be changed based on web access patterns.
c. The proxy caches make their web objects shareable among themselves. Also, the web cache server, which is one selected among the various proxy servers, should be able to have all rights on the remaining proxy caches.
d. The proxy caches may be located anywhere on the network based on other parameters such as cost and location.
We have approached this problem as a pattern matching problem of client patterns and proxy cache settings.

The concept of association rules also works similarly. An association rule is an implication of the form A -> B, where A and B are two sets of items that occur together in many instances. In our scenario, A will correspond to cache configuration settings and B will correspond to user web access pattern items, and there will be an implication A -> B if and only if a setting A allows a hit for a request of type B. Every client request goes to one of the many proxy caches and checks for the presence of a fresh copy in that cache, failing which the requested page is fetched from the origin web server. We define a client request type based on many parameters such as the size of the file requested, the number of requests in unit time, etc. Cache settings are properties that define cache operations such as memory usage, pre-fetching technique, allowed upload file size, the number of DNS lookup processes allotted for this type of clientele, etc. The Apriori algorithm has become a well known standard for identifying patterns using association rules [Agrawal et al., 1996]. Its main disadvantage is that, if a pattern of length n is needed, then n passes are needed through the items. This can become a large overhead for our current application. In [6], the authors describe an efficient approach to perform incremental mining of frequent sequence patterns in web logs. The approach we have used in this paper is a variation of the Apriori algorithm [Agrawal et al., 1994] which identifies frequent rule sets using a pattern repository in linear time [12]. The main advantage of this approach is the ease of updating the rule set and of scaling. New frequent rule sets added to the repository can be used immediately. We extend this approach for identifying frequent rule sets for proxy cache configuration settings (each unique proxy cache setting is obtained from a frequent rule set). The Configuration Repository contains all the frequent rule sets. A variation of their approach extended to our application is explained in detail below.
Initial set-up:
1. A list of relevant cache settings (settings irrelevant to user access patterns need not be considered here) is obtained from the cache server/proxy cache system used (the SQUID cache server was used in our case) and the default setting options for each parameter are defined.
2. A list of tokens is defined from the proxy cache settings (e.g., 1 - File Size 100KB, 80KB, etc.): 1a, 1b, 2a, 3a, 3b, 3c. Each token has two characters XY (X - unique identifier of the setting, Y - a value from a-z uniquely identifying a value with which the setting can be set). Each token is also associated with a support value ranging from 0 to 1. The default value that a setting takes is a.
3. All proxy caches are configured using default values for all settings and n clients are assigned to proxy caches at random.

Once the web cache logs have about 10,000 or more entries, the initial rule set definition is done as follows:
4. The proxy caches log every client request as a hit or a miss, including information regarding the type of hit or miss and time stamps. These logs are initially collected by setting all proxy caches to the default setting and running the cache system for a period of time. From these logs we can identify the settings relevant to each client based on a simple decision tree, i.e., if client A requests web pages with high graphic content, a cache with larger memory is more suitable. From the web cache logs, n client-sets are identified. Each client-set gives a list of items that are most closely associated with that client's web access behavior. A direct association between client behavior and items can be seen in Table 1. In a single pass through these client-sets, a count of the number of times each setting-option occurred is obtained. The setting-options are ordered in descending order of the counts.
5. An itemset is identified for each cache setting. These itemsets initially contain only one item (the item they are defined for). We use the term item to represent a single cache setting and itemset to represent a set of cache configuration settings. Items are added to the itemset based on the minimum support level, which is defined as the ratio of the number of times the item occurs in the set to the number of client-sets. Passes are made through client-sets to identify maximum length frequent itemsets. The number of passes will at most be the number of cache settings taken into consideration. At each pass a decision is made to either add or reject a new item. Once rejected, it can no longer be added to the itemset. (For example, item 1a's support with all other settings 2s, 3s, etc. is checked, then 1b with all other settings, etc.) Only those itemsets whose support is greater than the minimum support level are retained in the list for the next pass. At the end of the n or fewer passes we will be able to get the maximum length rule sets which have the minimum support level. Default item options are added to those frequent rule sets that are not complete, i.e., that do not have a token defined.
6. The rule sets are then filtered against a list of invalid itemsets (sets of items that cannot occur together, defined by the Internet service provider at configuration time as and when resources change). Also, the number of proxy caches available (MaxProxyCount) is set by the Internet service provider. If MaxProxyCount is greater than the number of frequent rule sets identified, then the MaxProxyCount rule sets with the highest support become the resulting Client Type Configurations (CTC). This final list of CTCs is stored in a repository called the Configuration Repository. Each proxy cache is configured with respect to a CTC in the repository. More than one proxy cache may be configured with the same rule set depending upon the number of clients. As can be seen, there will be at most as many frequent rule sets as there are proxy caches.
Defining the client token set and updating the Configuration Repository: this step is initially done to identify client user patterns and to match them to the configuration rule set best suited to them. Once the system is up and running, new rule sets are added to the Configuration Repository by running the following step whenever performance declines beyond the threshold limit.
7. Each client's client-set is then used to identify the CTC it most closely adheres to. Clients are assigned to appropriate proxy caches based on the CTC. Now that we have a rule set for every client and valid frequent rule sets, we can easily match clients with the best suitable proxy cache. A switch list records the client and the cache it is connected to. This is sent to the backbone router. Also, the token sets defined by clients are continuously monitored for any frequently occurring rule set which is not present in the Configuration Repository. Updating this repository occurs by monitoring new frequent rule sets identified from the logs and ranking them against those existing in the repository. Only the highest ranked frequent rule sets define the CTCs with which the proxy caches are configured. The following section explains the actual deployment of this association rule based algorithm in the form of lightweight agents in a system of proxy caches and the Master Cache server.
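To make the support computation in steps 4-6 concrete, the sketch below (plain Python; the token strings follow the paper's "Xa" convention, but the helper itself is a hypothetical, brute-force illustration rather than the pattern-repository variant actually used) counts how often each candidate combination of setting options is supported by the client-sets and keeps those above the minimum support level:

```python
from itertools import combinations

def frequent_rule_sets(client_sets, min_support, max_len=3):
    """client_sets: one set of tokens per client, e.g. {'1a', '2a', '6b'}."""
    n = len(client_sets)
    tokens = sorted({t for cs in client_sets for t in cs})
    frequent = {}
    for size in range(1, max_len + 1):
        for combo in combinations(tokens, size):
            # support = fraction of client-sets containing every token of the candidate
            support = sum(1 for cs in client_sets if set(combo) <= cs) / n
            if support >= min_support:
                frequent[combo] = support
    return frequent

client_sets = [{'1a', '8a'}, {'1a', '8a'}, {'6a', '8a'}, {'1a', '2a', '6b', '7a'}]
print(frequent_rule_sets(client_sets, min_support=0.25))
```

The highest-support results of such a pass would then be ranked and trimmed to at most MaxProxyCount CTCs, as described above.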


3.1. Caching Architecture In this paper we use a distributed caching architecture to implement our personalized caching approach. Figure 2 gives the system deployment for our approach. A brief description of the components is given below. 3.1.1. Web cache server (Master Cache). The web cache server does not cache anything. It manages the system of proxy web caches. It also contains the Configuration Repository explained above, a temporary store of preprocessed log data in the form of rule sets for each client as received from individual proxy caches and a client to proxy cache switch list that it maintains and updates the routers with. The classifying agent and the cache


maintenance agent are deployed here. These agents manage the switch list (used by the backbone router to route client requests), which contains the information of which client's request needs to be redirected to which proxy cache. They are responsible for identifying the need for and reconfiguring proxy caches whenever the performance declines. The web cache server is located at the same level as the shareable proxy caches but does not cache any new web objects and does not have any clients attached to it.

3.1.4. Web Server. This is the origin WWW server that stores web documents and manages the delivery of web pages over the internet.
Transparent inter-proxy cache cooperation. Cache communication is via simple IP multicasts. All caches (1...k) work by inter-proxy cooperation. The caches are set up at a core backbone proxy that can potentially serve all ISP clients, and this is the farthest position where caches can be located for the maximum benefit. Locating these caches at the focal point is not necessary as it is too far away from the outside network and not much can be cached [5] [8]. All client requests are routed through the proxy caches. This occurs transparently to the clients. The client requests are sent to the appropriate proxies through interception at the backbone routers based on their IP addresses. The switch list of clients to proxy cache mapping is prepared by the classifying agent. This arrangement allows for complete transparency between clients and the working of the proxy caches. For example, if client X configured to cache A requests a URL cached in cache C (this is found through inter-proxy cache cooperation using multicasts), the data is sent to client X from cache A through the switch (routed to), transparent to client X.

[Figure 2 graphic: the log data pre-processing and cache performance agents reside at the sharable proxy web caches 1...k; the classifying and cache maintenance agents reside at the web cache server; all sit behind the ISP backbone router that serves client machines 1...n and connects to the origin web servers.]

3.2. Multi-Agent System Client Machines 1…….n

Agents perceive and act based on their environment. Our approach uses light-weight agents which allow run time addition of new capabilities to the system. This will help in adding additional security and performance criteria checks later on. These agents help identify client web access patterns, map the patterns to cache configuration rules, allow adaptive cache re-configurations and allow an assignment of clients to cache which shall give an optimal performance of given resources at all times. These agents work in the background and do not hinder the caching protocol at any stage even when the caches are reconfigured. They are simply the core backbone for personalizing the cache usage for each client. The proxy caches are initialized with default settings. After the initial set up, the agents begin to work in the background. The Multi-agent system comprises of automatic agents that run continuously and semi-automatic agents which only run when another agent triggers them. An overall working of the various agents is shown in the algorithm in Figure 3. A brief description of their features and how they work together is described below.

Figure 2. Multi-agent deployment in proxy cache set up 3.1.2. Proxy Web Caches. Initially all proxy caches are configured using default settings and client requests are randomly assigned to a proxy cache. Once the initial set up is run and the base rule set is formed the proxy caches are reconfigured based on the rule sets and clients are reassigned accordingly. The performance is continuously monitored and changes to client assignment and cache configuration are made by the Master Cache server. The proxy caches manned by a Master Cache server are shareable among themselves. The log data pre-processing agent and cache performance agent are deployed at the proxy web caches and are responsible to trigger the Master Cache with any performance declines. Most of the processing occurs in the system of proxy web caches and web cache server. 3.1.3. Clients. Clients may be a single point of access such as standard home computers or a group of computers like a small office building or apartment. The machines by themselves do not require changes in their browser settings or additional computer resources for our approach. The routing switch will take care of any client-to-proxy cache redirections. Hence there is no overhead at the client.

2005 IEEE ICDM Workshop on MADW & MADM

3.2.1. Automatic agents. a. Log data pre-processing agent. Every proxy cache has a log of all the requests that came in and how the request was served (whether the requested Web object was available in the cache: Cache_Hit, Cache_Miss, Cache_Refresh_Hit, whether the requested web object was denied though it was present in the cache, etc.). A lot of useful user features can


be mined from these logs. The agent continuously processes the log data into the current rule set for the client and sends the current rule set list for all clients to the Master Cache on demand. This is used by the cache maintenance agent to update the Configuration Repository and the client-to-proxy-cache assignment.
b. Cache performance agent. The performance of a cache is measured by its ability to serve a client's request as efficiently as the origin web server. The cache performance agent is deployed at every proxy cache, where the performance is measured for every cache cycle. This agent triggers the classifying agent or cache maintenance agent based on the results it obtains. For every pre-determined cache cycle, various performance parameters such as the throughput, mean response time, error rate, queuing of requests, connection length, object size, cache run-time hit ratio and other parameters are monitored by studying the cache logs. It must be noted that the main performance measure, the cache run-time hit ratio (%), is similar to the cache hit ratio, which measures the amount of requests served from the cache as a percentage of total successful requests serviced, but only over the most recent (about 10,000) requests serviced. This gives a more accurate estimate since the cache content changes dynamically. A steady decline in the cache performance (random fluctuations in performance are ignored/filtered as they are usually not the cause of poor cache performance) triggers the classifying agent to modify the re-assignment of clients to caches based on the clients' current web access patterns. If this still does not improve performance, then the cache maintenance agent is called to update the Configuration Repository and to re-configure the proxy caches based on the updated rule sets. An algorithm snippet for the cache performance agent follows:
1. For every cache cycle
2.   Check performance parameters
3.   if performance declines below threshold
4.     if time since last_call of classifying_agent < min_time_call
5.       call cache_maintenance_agent()
6.     else if number of clients for each cache is not within acceptable range
7.       call cache_maintenance_agent()
8.     else
9.       call classifying_agent()
10. End For
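The run-time hit ratio over the most recent requests, and the threshold check that decides which agent to trigger, can be sketched as follows (plain Python; the window size, threshold value, and agent callbacks are hypothetical placeholders, not values prescribed by the paper):

```python
from collections import deque

class CachePerformanceMonitor:
    def __init__(self, window=10000, threshold=0.85):
        self.recent = deque(maxlen=window)   # True for a cache hit, False for a miss
        self.threshold = threshold

    def record(self, was_hit):
        self.recent.append(was_hit)

    def run_time_hit_ratio(self):
        # hit ratio computed only over the most recent `window` requests
        return sum(self.recent) / len(self.recent) if self.recent else 1.0

    def end_of_cycle(self, classify_agent, maintain_agent, recently_classified):
        # steady decline below threshold: re-assign clients first, reconfigure if that fails
        if self.run_time_hit_ratio() < self.threshold:
            if recently_classified:
                maintain_agent()
            else:
                classify_agent()
```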

An algorithm for the classifying agent follows:
1. for every client i = 1...n
2.   cache[i] = find best cache match from cache 1...k using binary search decision trees
3. Send new switch list cache[] to the web cache server to update the backbone router
b. Cache maintenance agent. This agent, deployed at the Master Cache, is triggered by the cache performance agent and has access to all the proxy cache pre-processed logs containing the most recent rule sets of clients. The purpose of this agent is to update the Configuration Repository with new rule sets as the need arises. This is done as explained above.
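One simple way to realize the "find best cache match" step is to score each Client Type Configuration by how many of the client's tokens it covers; the sketch below (hypothetical, plain Python; the paper itself uses binary search decision trees for this step) illustrates the idea:

```python
def best_ctc(client_tokens, ctcs):
    """ctcs: mapping from proxy-cache id to its CTC (a set of setting tokens)."""
    def score(cache_id):
        return len(client_tokens & ctcs[cache_id])   # tokens the configuration covers
    return max(ctcs, key=score)

ctcs = {'proxy1': {'1a', '8a'}, 'proxy2': {'6a', '8a'}, 'proxy3': {'1a', '2a', '6b', '7a'}}
print(best_ctc({'1a', '2a', '7a'}, ctcs))            # -> 'proxy3'
```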


4. Verification In this paper we show the correctness of our approach by analyzing log traces obtained from [10] and show how the performance of a typical caching mechanism can be enhanced by dynamically varying the cache parameters using our framework. For simplicity, we simulated the scenario using only 3 configuration settings (A,B,C). Item A can have 2 values each of which can be represented as a token (1a,1b) and item B can take 5 different settings which are represented by options(2a,2b,2c,2d,2e) and item C can have 2 values. The support level for frequent items was set at 25% of the number of clients in our case 1000. Theoretically this means, one can configure the various proxy caches in (2! * 5! * 2!) ways. But usually not all combinations of configuration options are valid or have minimum support. Once the CRTs have been identified, and stored in the Configuration Repository, the switch list

3.2.2. Semi-automatic agents. a. Classifying agent. This agent is triggered by the cache performance agent to check the most recent rule set list sent by the proxy caches of its clients and re-assign the clients to different proxy caches as needed. This agent also prepares the new switch list and sends it to the backbone router to handle re-route future client requests.


Table 1. Sample items
Item | Description of unique client web access feature
1a   | requests for greater object size
2a   | how many objects the client requests on average (basically total size of all objects cached for the client in a given period of time)
3a   | number of web objects with size > 4 MB
4a   | latency time is higher for a request, i.e., loaded caches - more DNS lookup processes need to be spawned
5a   | file upload sizes greater than 100 KB
6a   | refresh pattern: pages expire too fast but still need to be cached
6b   | refresh pattern: pages need to be refreshed even when they were refreshed only recently
6c   | refresh pattern: a new trend is identified
7a   | clients which have a longer time interval between requests
8a   | haphazard clients that overload the cache with too many requests in a given time interval


is created by matching the proxy caches to the clients. Table 1 and Table 2 show some sample items and itemsets.

configuration for the web proxy caches. We used a SQUID proxy cache server [17] and ran 2 proxy caches. We then simulated the configuration changes obtained as a result of running the rule set identification algorithm on the traces and applied the changes dynamically. There was no relevant downtime for the proxy caches, the run time hit ratio or the cache performance was not affected significantly. Nevertheless we believe that an adaptive approach is the best solution for improving web cache performance and optimally using memory and network resources.

Among a set of proxy caches (k+1) located at the backbone router of the ISP, choose one to be the server (Master Cache)
n: number of clients for an ISP (assume there is only one backbone router for this ISP)
Initially assign ceil(n/k) clients to each proxy web cache
Wait for one cache cycle
Start log_data_pre-processing_agent to continuously obtain client rule sets
Start cache_performance_agent to continuously monitor cache performance
cache_maintenance_agent is initialized to set the default configuration for proxy 1...k
for every new cache cycle do
  If cache performance is below threshold then
    if classifying agent last_call_time < x milliseconds
      Call the cache_maintenance_agent
    else
      Call the classifying_agent
  else
    continue

5. Conclusion This paper suggests a framework for dynamic cache maintenance and the results will vary for different set ups. Also as mentioned before there may be network issues depending on the way it is set up. This framework once set up is shown to prove advantageous for both static and dynamically varying client web access patterns. It ensures that all resources are considered for best utilization with little overhead.

Figure 3 Multi Agent system algorithm

The major advantages of our approach are enumerated below: a. A significant increase in cache run time hit ratio (optimal performance depending on user web access trends) b. No client side re-configuration needed. c. No added overhead at the backbone routers d. Light weight agents (stationary) –allows run time addition of new capabilities to the system e. Allows scalable web caching f. Agents work with predictable run time g. Though client requests are re-routed to different caches, multiple copies of the same URL are not found in different caches because caches are shareable h. ISPs can propose different service packs to clients. For example, they can provide high-speed internet with faster response time at higher cost (due to more Domain Name System (DNS) lookup spawns, larger caches, more bandwidth, etc.). i. Automatic stabilization of the proxy cache system when new hardware is added, one or more proxy caches are added or removed for maintenance etc. with minimal manual intervention One of the possible problems of our approach is that the switch list might conflict with the routing protocol in which case the router would just ignore the switch list. The switch list should be configured not to overload any proxy caches etc.

Table 2. Sample frequent rule sets
Sample Rule Sets*
1a, 8a
6a, 8a
1a, 2a, 6b, 7a
* Settings that are not shown retain default values

We calculated the run-time hit ratio from the cache logs, first on the raw data and then on the data processed by our approach. We count a hit if the request could be satisfied by the way a cache configuration has been decided after the algorithm is run. Case 1: In a general scenario, all proxy caches are set up using the one rule set which has the highest support. We calculated the run-time hit ratio from cache logs obtained for two separate weeks from a busy Internet service provider WWW server available from [10]. Using a default configuration setting for all proxy caches we calculated an average of 88% hits for different trace sets. We only used the requests and the type of requests as our input. We then applied our approach, computed the CTCs and calculated the run-time hit ratio for the same traces. We ran the tests a number of times and obtained a best case of 96% and an average case of 94%. Case 2: We then changed the users' web access patterns using a biased randomizer and found the performance degraded to as low as 66%. By applying our approach, the same requests resulted in a >90% run-time hit ratio in all test runs. Since we assume a higher limit for resources there can be a margin of error possible. We also verified the effectiveness of the agents by dynamically changing the


The proposed web caching framework can be adopted for almost any type of caching architecture. This approach can also be scaled to a higher level in which the web cache


servers for different sets of proxy caches talk to each other and share optimal configuration patterns. Security on the Internet has become a major concern, especially for large enterprises, which prefer not to cache the web objects they access rather than have their caches sniffed by others. Our approach can help configure proxy caches to permit authenticated access to certain web cache objects by increasing the security level setting, thereby regulating web cache usage.

We are also extending this adaptive rule set identification approach to detect malicious transactions from database transaction logs in large databases, commonly accessed via web applications, over a network. Malicious activity occurs primarily due to dynamically changing access roles and poorly configured database caches.

6. References

[1] Rakesh Agrawal, Andreas Arning, Toni Bollinger, Manish Mehta, John Shafer, Srikant Ramakrishnan, "The Quest Data Mining System", Proc. of the 2nd Int'l ACM Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996, pp. 244-249.

[2] Greg Barish and Kathia Obraczka, "World Wide Web caching: trends and techniques", IEEE Communications Magazine, May 2000, 38(5):178-184.

[3] Francesco Bonchi, Fosca Giannotti, Cristian Gozzi, Giuseppe Manco, Mirco Nanni, Dino Pedreschi, C. Renso, Salvatore Ruggieri, "Web log data warehousing and mining for intelligent web caching", Data and Knowledge Engineering (DKE), Elsevier, October 2001, 32(2):165-189.

[4] Cheng-Yue Chang and Ming-Syan Chen, "A new cache replacement algorithm for the integration of web caching and prefetching", Proc. of the Eleventh International ACM Conference on Information and Knowledge Management, November 2002, pp. 632-634.

[5] Bradley M. Duka, David Marwood, and Michael J. Feeley, "The Measured Access Characteristics of World-Wide-Web Client Proxy Caches", Proc. of the 1997 USENIX Symposium on Internet Technologies and Systems, Monterey, CA, Technical Report TR-97-16, December 1997.

[6] Maged El-Sayed, Carolina Ruiz, and Elke A. Rundensteiner, "FS-Miner: Efficient and Incremental Mining of Frequent Sequence Patterns in Web logs", Proc. of the ACM WIDM'04, Washington, DC, November 2004, pp. 12-13.

[7] Annie P. Foong, Yu-Hen Hu, and Dennis M. Heisey, "Adaptive Web caching using logistic regression", Proceedings of the 1999 IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing IX, August 1999, pp. 515-524.

[8] Steven D. Gribble, Eric A. Brewer, "System Design Issues for Internet Middleware Services: Deductions from a Large Client Trace", Proc. of the 1997 USENIX Symposium on Internet Technologies and Systems, Monterey, California, USA, December 1997, pp. 207-218.

[9] Jaeeun Jeon, Gunhoon Lee, Haengrae Cho, and Byoungchul Ahn, "A prefetching Web caching method using adaptive search patterns", IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), August 2003, Vol. 1, pp. 37-40.

[10] National Laboratory for Applied Network Research. Anonymized access logs.

[11] Stefan Podlipnig and Laszlo Böszörmenyi, "A survey of Web cache replacement strategies", ACM Computing Surveys (CSUR), December 2003, 35(4):374-398.

[12] Richard Relue and Xindong Wu, "Rule generation with the pattern repository", Proc. of the IEEE International Conference on Artificial Intelligence Systems, September 2002, pp. 186-191.

[13] Pablo Rodriguez, Christian Spanner, and Ernst W. Biersack, "Analysis of Web Caching Architectures: Hierarchical and Distributed Caching", IEEE/ACM Transactions on Networking, August 2001, 9(4):404-418.

[14] Myra Spiliopoulou, "Web usage mining for Web site evaluation", Communications of the ACM, August 2000, 43(8):127-134.

[15] Sujaa Rani Mohan, E. K. Park, Yijie Han, "Association Rule Based Data Mining Agents for Personalized Web Caching", Proc. of the 29th Annual International Computer Software and Applications Conference, IEEE, Edinburgh, Scotland, July 2005.

[16] Jia Wang, "A Survey of Web Caching Schemes for the Internet", ACM SIGCOMM Computer Communication Review, October 1999, 29(5):36-46.

[17] Duane Wessels et al., "Squid Internet Object Cache", National Laboratory for Applied Network Research.

[18] Qiang Yang and Haining Henry Zhang, "Web-Log Mining for Predictive Web Caching", IEEE Transactions on Knowledge and Data Engineering, August 2003, 15(4):1050-1053.

[19] Venkata N. Padmanabhan, Jeffrey C. Mogul, "Using predictive prefetching to improve World Wide Web latency", ACM SIGCOMM Computer Communication Review, July 1996, 26(3):22-36.



Temporal Intelligence for Multi-Agent Data Mining in Wireless Sensor Networks
Sungrae Cho, Ardian Greca, Youming Li, and Wen-Ren Zhang
Department of Computer Sciences, Georgia Southern University, Statesboro, GA 30460
[email protected]
Tel: 912-486-7375; Fax: 912-486-7672

Abstract— In wireless sensor networks, sensor nodes function as autonomous, self-organizing multi-agents to provide useful information to users. Yet it is a challenging issue how autonomous but resource-limited agents should be designed to make them capable of helping each other in their data mining tasks. The identified limitations for sensor agents include power consumption and scalability. In this paper, we define wireless sensor networks from the perspective of multi-agent data mining and warehousing, and propose a temporal intelligent coordination protocol to reduce power consumption and to provide scalability. Simulation results show that the temporal intelligent coordination protocol significantly lowers power consumption and thus maximizes the network lifetime.

I. INTRODUCTION

Wireless sensor networks have recently drawn immense attention from industry and research institutions as an enabling technology for the invisible ubiquitous computing arena [10]. Spurred by the rapid convergence of key technologies such as digital circuitry, wireless communications, and microelectromechanical systems (MEMS), a number of components in a sensor node can be integrated into a single chip with reductions in size, power consumption, and cost [1]. These small sensor nodes can be deployed in home, military, science, and industry applications such as transportation, health care, disaster recovery, warfare, security, industrial and building automation, and even space exploration. By connecting these small sensor nodes with radio links, they can perform tasks that traditional sensors would be hard pressed to match. Although the applications enabled by wireless sensor networks are very attractive, one of the most frequently used


functions would be data mining. In a wireless sensor network, sensor nodes are expected to operate as agents that gather useful information for remote users. This multi-agent system, however, has to overcome several technical challenges. The identified challenges include (1) scalability, (2) adaptability, (3) addressing, and (4) energy efficiency. Since sensor networks consist of a large number of sensor nodes and thus produce a large amount of data, large-scale data mining and warehousing techniques are needed. Also, user constraints and environmental conditions, such as ambient noise, topology changes, and event arrival rates, can be time-varying in wireless sensor networks; the system should therefore be able to adapt to these time-varying conditions. Furthermore, sensor nodes may not have global identification because of the associated overhead and the large number of sensor nodes, so naming or addressing is a challenging issue in wireless sensor networks. In addition to these challenges, the energy consumption of the underlying hardware and protocols is of paramount importance. Wireless sensor nodes are expected to be operated by battery. Because of the requirement of unattended operation in remote or even potentially hostile locations, sensor networks are extremely energy-limited. Energy optimization in sensor networks is much more complex, since it involves not only reducing the energy consumption of a single sensor node but also maximizing the lifetime of the entire network. The network lifetime can be maximized by incorporating energy awareness into every stage of wireless sensor network design and operation, thus empowering the


system with the ability to make dynamic tradeoffs between energy consumption, system performance, and operational fidelity [9]. Since various sensor nodes often detect common phenomena, there is likely to be some redundancy in the sensory data that the sources generate. In-network filtering and processing techniques can therefore help to conserve the scarce energy resources. Data aggregation, or data fusion, has been identified as an essential paradigm for wireless routing in sensor networks [7]. The idea is to combine the data coming from different sources en route, eliminating redundancy and minimizing the number of transmissions, and thus saving energy. In this paper, we apply the multi-agent data mining concept to wireless sensor networks and propose a temporal intelligent coordination protocol as a knowledge and coordination mechanism. The objective of the protocol is to further reduce energy consumption when data aggregation is involved. To reduce the energy consumption, our protocol employs intelligent decision logic in the sensor agent that defers or deactivates the transmission of its response. To the best of our knowledge, this is the first research incorporating the multi-agent data mining concept into wireless sensor networks and providing a temporal intelligent coordination protocol for sensor agents. The remainder of this paper is organized as follows. The multi-agent system methodology for wireless sensor networks is explained in Section II. In Section III, the proposed temporal intelligent coordination protocol is described. In Section IV, we compare the energy-efficiency performance of data aggregation with and without our protocol. Finally, contributions and future work are discussed in Section V.

II. MULTI-AGENT SYSTEM METHODOLOGY

A number of methods have been proposed for modeling agents in a distributed environment [4], [6]. The most widely used approach is to model agents based on BDI (belief, desire, and intention) [6]. The sensor agent can be a miner, a decision maker, a controller, or an actor that has local or partial learning and decision capabilities, can manage and use its local data and knowledge, and can cooperate or be coordinated with other agents for collective monitoring, learning, and decision-making. The following sensor agent activities can be identified [13]:

• identifying sensor agents and agent communities, e.g., a sensor agent monitoring temperature,
• training new sensor agents using task assignment,
• dispatching sensor agents to their posts,
• deploying knowledge and coordination protocols,
• mining new knowledge, including new coordination protocols.

In the BDI approach [6], the problem is seen from two perspectives: an external and an internal view. The external view breaks the problem into two main components: the agents themselves (agent model) and their collaboration or coordination. The internal view uses three models for an agent class: an agent model for defining relationships between agents, a goal model for describing goals, and planning and scheduling models to achieve agent goals. In any distributed environment, agents can be classified into particular roles according to their capability description [4]. Agents may have persistent roles (long-term assignments) as well as task-specific roles (short-term assignments). From these two points of view, we can organize the multi-sensor-agent-based organization into two main models: an agent/role model (agents' capability and behavior) and an agent/role interaction model. Note that the agent/role interaction model can be defined down to the level of individual query-response and associated data. To perform an appropriate response, a role can be defined with four general attributes: responsibility, permission, activities, and protocols [4].


• Responsibility: sensor agent/role functionality can be measured by the responsibility assigned to it, which can be divided into two categories: a timeliness property and a safety property. The timeliness property ensures that the task will be done by performing certain actions. To illustrate, consider the monitoring responsibility of a sensor agent/role. The timeliness property in this case is to inform the relevant agent of any updates in the data resources. In this context, an example of a sensor agent responsibility might be DataMonitor = {Monitor.DataCollectionAgent, CheckTemperature.AwaitUpdate}.

Fig. 1. Phenomenon gathering: sensor agents in a sensor field receive an interest (role) from the sink and route data back to it; users access the sink via a wide area network.

This expression states that DataMonitor consists of the execution protocol Monitor, followed by the protocol DataCollectionAgent, followed by the activity CheckTemperature and the protocol AwaitUpdate. In this case, the sensor agent is also required to ensure that the temperature satisfies a certain limitation, called its safety property, e.g., 70 ≤ temperature ≤ 76.
• Permissions: the rights associated with a role that allow it to realize its responsibility. This specification shows that the sensor agent that carries out the role has permission to access, read, and modify the data source.
• Activities: the private actions or computational functionality associated with a role.
• Protocols: define the mechanisms by which roles interact or communicate with each other.
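As a rough illustration only, the sketch below shows one way such a role could be expressed in code, with its safety property, permissions, an activity, and a protocol hook; the class and method names (DataMonitorRole, awaitUpdate, and the relevant-agent callback) are assumptions for illustration and are not part of the paper.

```java
import java.util.Set;

// Illustrative sketch (assumed names): a role with responsibility, permissions, activities, and protocols.
class DataMonitorRole {
    // Responsibility -- safety property: 70 <= temperature <= 76.
    static boolean safetyPropertyHolds(double temperature) {
        return temperature >= 70.0 && temperature <= 76.0;
    }

    // Permissions: rights the role holds over the data source.
    static final Set<String> PERMISSIONS = Set.of("access", "read", "modify");

    // Activity (private computation): check the latest temperature reading.
    double checkTemperature(double rawReading) {
        return Math.round(rawReading * 10.0) / 10.0;  // e.g., simple local rounding
    }

    // Protocol: await an update from the data collection agent and, as the timeliness
    // property requires, inform the relevant agent when the reading leaves the safe band.
    void awaitUpdate(double newReading, java.util.function.DoubleConsumer relevantAgent) {
        double reading = checkTemperature(newReading);
        if (!safetyPropertyHolds(reading)) {
            relevantAgent.accept(reading);  // report out-of-range temperature
        }
    }
}
```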

In the following, we discuss how roles interact and communicate, and how power consumption is minimized, from the perspective of the agent/role interaction model.

III. MULTI-AGENT TEMPORAL INTELLIGENT COORDINATION PROTOCOL

A. Background

Sensor nodes are scattered densely in a sensor field as in Fig. 1. A node called the sink requests sensory information by sending a query throughout the sensor field. This query is received at sensor agents (or sources). When an agent finds data matching the role (or query), the data (or response) is routed back to the sink over a multihop, infrastructureless network of sensors. The information gathered at the sink agent can be accessed by users via existing wide area networks such as the Internet or satellite networks [1].

This role dissemination and sensory data gathering could be performed by the traditional address-centric approach, in which the shortest path is found based on a physical end address, as in the IP world. The cost of an address in wireless sensor networks can be considered high if the address space is underutilized and the addresses occupy a large portion of the total bits transmitted. Globally unique addresses would need to be very large compared to the typical size of the data attached to them. Maintaining local addresses would also be inefficient, because more work is required to keep addresses locally unique as the network topology changes dynamically. In wireless sensor networks, a more favorable approach is data-centric routing. In the data-centric approach, role dissemination is performed to assign sensing tasks to the sensor agents [1]. Data-centric routing requires attribute-based naming [1], [8]. With attribute-based naming, users are more interested in querying an attribute of the phenomenon than in querying an individual agent. For instance, "are there any agents where the temperature is over 70 degrees?" is a more common query than "what is the temperature measured by a certain agent?" Attribute-based naming is used to carry out queries by using the attributes of the phenomenon. The coverage of deployed sensors will overlap to ensure robust sensing, so one event will likely trigger multiple sensors observing the same phenomenon. In this case, multiple identical copies of a sensory datum are likely to be received. Also, some roles inherently involve redundant responses, as follows:




• Max: The sink is interested in gathering the maximum value from the sensor field. In this case, values less than the maximum are redundant.
• Min: The sink is interested in gathering the minimum value from the sensor field. In this case, values greater than the minimum are redundant.
• Existence: Some applications need to identify the existence of a target object. For example, in directed diffusion [5], an initial query dissemination is used to determine whether there are indeed any sensor agents that detect the object of interest.

We refer to a role of the above types as a singular role, one that expects only one response from the source agents. Redundant and unnecessary responses will generate unnecessary



transmission at the underlying layers. For example, unnecessary responses cause a high duty cycle at the medium access control (MAC) layer, which in turn generates high contention among multiple agents. Consequently, sensor agents suffer from unnecessary energy consumption.


B. Temporal Data Suppression

In this section, a temporal intelligent coordination protocol is proposed for the situation in which sensor agents collectively gather sensor data and report them to the sink agent. The objective of the scheme is to further reduce the energy consumption when data aggregation is involved. To reduce the energy consumption, our scheme employs intelligent decision logic in the sensor agent that defers or deactivates the transmission of its response. The temporal intelligent coordination protocol is performed as follows:





• The sink disseminates a role to its child agents with (1) the role type, (2) the depth of the tree D, and (3) a timer parameter T. The sink then waits DT for responses from its child agents.
• Each agent simply forwards the role and waits for (D − d)T, where d is the depth of the agent. By permitting this waiting time, each agent is able to aggregate all the responses from its child agents, and agents at the same depth can be synchronized.
• When responses are received at source agent i from its child agents during (D − d)T, it looks up the role type. If the role type is not singular, the agent immediately sends its response back to its parent after (D − d)T.
• If the role type is singular, it activates a timer after (D − d)T with timer value Bi, where Bi is derived from the received timer parameter T. When the timer expires, the source agent transmits its response.
• If a response from another source agent is received prior to timer expiration, agent i compares the received response with its own. If agent i finds that its response is redundant, it deactivates its timer.
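To make the timer-based deferral and suppression concrete, the sketch below shows one possible per-agent implementation for a MAX role; the class structure, the choice of Bi as a uniform random backoff in [0, T), and the redundancy test are illustrative assumptions, not the authors' implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch (assumed names) of the temporal data-suppression logic for a singular (MAX) role.
class SensorAgent {
    private final int depth;            // d: depth of this agent in the routing tree
    private final int treeDepth;        // D: depth of the tree, received with the role
    private final double timerParamT;   // T: timer parameter, received with the role
    private final List<Double> childResponses = new ArrayList<>();
    private boolean timerActive = false;

    SensorAgent(int depth, int treeDepth, double timerParamT) {
        this.depth = depth;
        this.treeDepth = treeDepth;
        this.timerParamT = timerParamT;
    }

    /** Aggregation window (D - d)T used to collect child responses. */
    double aggregationWait() {
        return (treeDepth - depth) * timerParamT;
    }

    /** A child response arrives during the aggregation window. */
    void onChildResponse(double value) {
        childResponses.add(value);
    }

    /** The aggregation window has elapsed: respond immediately or start the suppression timer Bi. */
    double onAggregationWindowEnd(boolean singularRole, double ownReading) {
        if (!singularRole) {
            transmit(aggregateMax(ownReading));            // non-singular role: send right away
            return 0.0;
        }
        timerActive = true;
        return new Random().nextDouble() * timerParamT;    // Bi derived from T (assumption: uniform backoff)
    }

    /** A response from another source agent is overheard before the timer expires. */
    void onOverheardResponse(double otherValue, double ownReading) {
        if (timerActive && otherValue >= ownReading) {
            timerActive = false;                            // redundant for a MAX role: suppress our response
        }
    }

    /** The suppression timer Bi expires. */
    void onTimerExpired(double ownReading) {
        if (timerActive) {
            transmit(aggregateMax(ownReading));
            timerActive = false;
        }
    }

    private double aggregateMax(double ownReading) {
        double max = ownReading;
        for (double v : childResponses) max = Math.max(max, v);
        return max;                                         // MAX role: forward only the largest value seen
    }

    private void transmit(double value) {
        System.out.println("agent at depth " + depth + " sends " + value);
    }
}
```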

IV. PERFORMANCE EVALUATION

To evaluate the energy-efficiency performance of the temporal intelligent coordination protocol, we developed an event-driven simulator in Java. The simulator generates a random topology as follows. We assume that the



sensors have a fixed radio range and are placed randomly in a square area. The sensors form a network routing tree, which is built from the proximity metric of each agent using a breadth-first search tree [3]. The root of the tree (the sink) is randomly selected in the simulator. When we vary the number of sensors, we vary the size of the area over which they are distributed so as to keep the density of sensors constant. For instance, we use a 1000 × 1000 area for 1000 sensors; for 4000 sensors, the dimensions are enlarged to 2000 × 2000. Based on the tree formed, the sink disseminates a role (with all role parameters as described in Section III) to its child agents, which forward the role to their children. This process continues until the role reaches the deepest agents. The depth of the tree is computed from the tree formed and is used for the response waiting time ((D − d)T) at each agent. When the role reaches the deepest agents, they respond with their sensor readings, which are aggregated at their parent agents, and so on towards the sink. The sensor reading values are generated uniformly in the range [10, 90], where we assume the minimum and maximum possible sensor readings are 1 and 100, respectively. In other words, the sink expects sensor readings from 1 to 100, but the actual readings are between 10 and 90. The choice of the range is rather arbitrary, but we observed that expanding the reading range does not affect the performance when we also increase the agent density.
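A minimal sketch of this topology-generation step is shown below: nodes are placed uniformly in a square whose side scales with the node count (constant density), and a breadth-first-search routing tree is grown from a randomly chosen sink over the fixed radio range. The class and method names are assumptions for illustration, not the authors' simulator code.

```java
import java.util.*;

// Sketch: random constant-density placement plus a BFS routing tree rooted at a random sink.
class TopologySketch {
    record Node(int id, double x, double y) {}

    static Map<Integer, Integer> buildRoutingTree(int n, double radioRange, long seed) {
        Random rnd = new Random(seed);
        double side = 1000.0 * Math.sqrt(n / 1000.0);    // 1000x1000 for 1000 nodes, 2000x2000 for 4000
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < n; i++) nodes.add(new Node(i, rnd.nextDouble() * side, rnd.nextDouble() * side));

        int sink = rnd.nextInt(n);                        // sink chosen at random
        Map<Integer, Integer> parent = new HashMap<>();   // child id -> parent id
        Deque<Integer> queue = new ArrayDeque<>(List.of(sink));
        Set<Integer> visited = new HashSet<>(List.of(sink));

        while (!queue.isEmpty()) {                        // BFS: each node attaches to its first discoverer in range
            int u = queue.poll();
            for (Node v : nodes) {
                if (!visited.contains(v.id()) && dist(nodes.get(u), v) <= radioRange) {
                    visited.add(v.id());
                    parent.put(v.id(), u);
                    queue.add(v.id());
                }
            }
        }
        return parent;
    }

    static double dist(Node a, Node b) {
        return Math.hypot(a.x() - b.x(), a.y() - b.y());
    }
}
```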


Fig. 2. The number of responses vs. R (expected number of responses for T = 5, 10, and 20).

In Fig. 2, we show the number of responses versus the number of agents R for different parameters T = 5, 10, and 20 time units when the sink sends out the MAX role. We observe that a very small number of responses is obtained for the different values of T even for a large number of source agents R (at most 101 responses at T = 5 and R = 1000). We show that with the temporal intelligent coordination protocol parameter T the number of responses can be significantly reduced, i.e., energy efficiency can be significantly improved. In particular, the larger T is, the more improvement in energy efficiency we achieve. However, if we use a large T, the total latency in role processing increases. Therefore, the recommended choice of T would be arg max t0.

The part of the hyper-surface in the rectangle area needs to be repaired.

NDDCHA uses a base hyper-surface to approximate the main outline of the underlying function; this base hyper-surface is created by a base learning algorithm. A number of patching hyper-surfaces are then overlapped onto the base hyper-surface to form a new hyper-surface. The patching information comes from the negative data set. The main problem in compensating the base hyper-surface is determining which area of the hyper-surface needs to be patched. In the Boosting and Bagging algorithms [7, 8], a voting strategy is applied to combine all hypotheses. This approach does not work here because the patching hypothesis depends on the base learning hypothesis. The patching hypothesis uses negative data, which is the complement of the positive data set. The training data set is denoted S, and (x, y) is an example of S, (x, y) ∈ S, where x is an n-dimensional vector in real space and y is the label. In binary classification, either y = +1 or y = −1. If an example (x, y) is correctly classified according to the classifier or separator h(x) = 0, (x, y) is said to be in the consistent subset CS ⊆ S, (x, y) ∈ CS; otherwise (x, y) is in the inconsistent subset IS = S − CS, (x, y) ∈ IS. The function of the partitioner p(h, x, y) is to return either true or false so that it can partition a data set; a partitioner accepts a hypothesis and a specific example as input. One simple example of a partitioner in classification is the crisp boundary p(h, x, y): ||y − h(x)|| ≤ ε, ε ∈ [0, 0.5]; the not-well-separated data is then NS = {(x, y) | (x, y) ∈ S, ||y − h(x)|| ≤ ε, ε ∈ [0, 0.5]}. The boundary between the positive data set and the negative data set is called the border. We can say the positive …

… is h(x, k) = Σ_{i=0}^{k} h^(i)(x), for k > 0. The training data sets are defined as follows:

S_0 = S
S_i = {(x, Δy_i) ∈ S_{i−1} | x ∈ X, d^(i−1)(h^(i−1), x, Δy_{i−1}), Δy_i = Δy_{i−1} − h(x, i−1), and Δy_0 ∈ Y}
S_i^# = S_{i−1} − S_i,  for i = 1..k

The labels on the training subset S_i are the differences between the predicted labels and the expected labels. The hypothesis is produced by training on the residual data, since the idea of NDDCHA is to compensate the base hypothesis each time. Since the above algorithm is iterated k times, the base algorithm has to be a regression learning algorithm. There are a total of 1 + k passes in this algorithm. S_i^# is the i-th positive data subset and does not change during training. The final positive data set is ∪_{i=1}^{k+1} S_i^#.
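A compact sketch of this compensation idea is given below: a base regression hypothesis h^(0) is trained on S, each subsequent hypothesis h^(i) is trained on the residuals of the not-well-separated examples, and the final prediction is the sum h(x, k). The Hypothesis and Learner interfaces and the ε-based partitioner are assumptions used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: assumed interfaces for a regression learner and a hypothesis.
interface Hypothesis { double predict(double[] x); }
interface Learner { Hypothesis train(List<double[]> xs, List<Double> ys); }

class NddchaSketch {
    /** Train a base hypothesis plus up to k patching hypotheses on residuals of not-well-separated examples. */
    static List<Hypothesis> train(List<double[]> xs, List<Double> ys, Learner learner, int k, double eps) {
        List<Hypothesis> hs = new ArrayList<>();
        List<double[]> curX = xs;
        List<Double> curY = ys;                       // Δy_0 = y
        for (int i = 0; i <= k && !curX.isEmpty(); i++) {
            Hypothesis h = learner.train(curX, curY); // h^(i) trained on the current residual set S_i
            hs.add(h);
            List<double[]> nextX = new ArrayList<>();
            List<Double> nextY = new ArrayList<>();
            for (int j = 0; j < curX.size(); j++) {
                double residual = curY.get(j) - h.predict(curX.get(j));
                if (Math.abs(residual) > eps) {       // not well separated: keep for the next patch (negative data)
                    nextX.add(curX.get(j));
                    nextY.add(residual);              // Δy_{i+1} = Δy_i − h(x, i)
                }
            }
            curX = nextX;
            curY = nextY;
        }
        return hs;
    }

    /** Compensated prediction h(x, k) = Σ_i h^(i)(x). */
    static double predict(List<Hypothesis> hs, double[] x) {
        double sum = 0.0;
        for (Hypothesis h : hs) sum += h.predict(x);
        return sum;
    }
}
```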

In the testing phase, the hypotheses from training are used to create patching data to compensate the base hypothesis. The key point in the testing phase is to determine the suitable patching hypothesis. The vector set similarity function VS accepts two data sets, S_i from the training data and T_{i−1} from the testing data, and generates a subset T_i of T_{i−1}. As a result, each vector x1 ∈ T_{i−1} becomes similar to at least one vector x2 ∈ S_i, denoted vs(x1, x2) ≥ δ, where δ ∈ [0, 1] is the degree of similarity. If


x1 = x2, then vs(x1, x2) = 1, whereas if x1 ≠ x2, vs(x1, x2) = 0. Here P_i predicts the labels on the negative data set T_i:

T_0 = T
T_i = VS(S_i, T_{i−1}), if ∀x1 ∈ T_{i−1}, ∃x2 ∈ S_i: vs(x1, x2) ≥ δ, δ ∈ [0, 1], i = 1..k

P_0 = {(x, y) | x ∈ T_0, y = h^(0)(x)}
P_i = {(x1, y) | x1 ∈ T_{i−1}, x2 ∈ S_i, vs(x1, x2) ≥ δ, y = h(x1, i)}, i = 1..k

In the above expressions, δ is the regulating parameter that controls the degree of similarity of two vectors. It can be seen that T_i is similar to S_i, so that h^(i)(x) can be used for testing T_i to generate the values of P_i. These values are overlapped to compensate the labels as P_i^# = OV(P_i^#, P_{i−1}^#). The final output P_k^# is the predicted label set. The output labels, which are the compensated values, are given as follows:

P_0^# = P_0
P_i^# = OV(P_{i−1}^#, P_i), for i = 1..k

It can be seen that in the training phase the learner uses the hypotheses h(x) = h(x, i) together with the partitioning functions p^(i)(h^(i), x, y) as dividers, generating a positive group S^#(k) = ∪_{i=1}^{k+1} S_i^# and a negative group S_{k+1} = S − S^#(k). We find a subset of the testing data, T_i, which is similar to S_i, and use the hypotheses produced on S_i for testing T_i. The '+' operation is one case of the OV function, and hence the final testing result is treated as a summation of the overlapping function: P_k^# = Σ_{i=1}^{k} P_i^#.

In data mining, the issue of concern is the knowledge data, which is a series of hypotheses as shown in Fig. 2. From the user's perspective, the inputs of NDDCHA are the data set, the stop criteria, the base learning algorithm, and the negative data learning method. The stop criteria could be an empty negative data set, a maximum number of iterations, etc.

III. SCHEMA OF MULTIAGENT NEGATIVE DATA MINING

Fig. 2. The schema of multiagent negative data mining.

Based on the understanding of negative data, the negative data concept, and NDDCHA, a schema is proposed to perform the data partitioning. Data from the data warehouse are fed into the base learning agent. The output of the base learning agent is positive and negative data. The positive data contains the knowledge data, whereas the negative data does not. The negative data agent learns a model from the negative data set and transmits the positive data to the assembly agent to form the whole knowledge picture. Since the negative data involves useful information, the negative data agent recursively splits the negative data into sub-positive and sub-negative data. Its iteration generates a pair of positive and negative data in each round. The iteration does not stop until the stop criteria are reached. The scheduling agent is the representative of the whole system, interfacing with customers. The interior of the rectangle in Fig. 2 is the multiagent implementation of NDDCHA; the exterior represents the input and output of the system. The outputs fall into two categories: one is the knowledge data discovered together with the partitioned data subsets, and the other is a termination signal notifying agents beyond the NDDCHA system. There are four agents: the base learning agent ab, the negative agent an, the assembly agent aa, and the partitioning agent ap. When the scheduling agent receives a



mining job, it delegates the job to ab; ab then creates a hypothesis, submits it to the knowledge data base, and relays the data to agent ap1 to form negative data. Next comes a loop of negative data learning carried out by agent an. This loop is the kernel of negative learning: each round, it creates a hypothesis and a negative data subset. The system is flexible because the agents can be replaced easily. For example, we can choose either a support vector machine or a neural network algorithm as the algorithm of a learning agent, and we can choose a fuzzy border or Euclidean distance as the partitioner. The assembly agent executes the combination operation for two different data sets. The diamond-shaped branch that decides whether to stop learning is part of the work of the scheduling agent; the stop event to the customer is therefore emitted by the scheduling agent. This event only tells the customer that data mining has finished for this specific data set; the whole system keeps running to respond to changes in the environment. There are two data flows in the system: one is the classifying data set from the data warehouse and the other is the knowledge data (hypotheses). The protocol for the classifying data set is defined below in a BNF-like representation:

= +1 | -1
= integer (>=1)
= real
= :
= +
=
= +

This protocol could be extended to an XML format to exchange the data easily; the drawback is that an XML representation makes the data files larger. The protocol for knowledge data is a descriptive text file, because different learning algorithms have different model representations. For example, the model in an SVM mainly consists of a kernel definition and a list of support vectors, while the model in a neural network is the topology of the network and the weights of its edges. Rule-based knowledge data, such as rough sets and association rules, is also commonly used. Although model-based knowledge data is hard for a human to interpret, it contains full coverage of the information from the classifying data set. The cooperation of agents uses the strategy of transporting control events into a centralized registry table; this is a partial software implementation of the MOST network [9]. Each function or interface of an agent registers in the registry during the system initialization phase, and the system also maintains an event notification matrix for each property of an agent. The behavior is similar to an event listener in Java, which uses a container to register an event entry. Once a property of an agent is changed, it sends a notification matrix to a manager, and the manager dispatches the event to all registered agents. The system contains two channels, the data channel and the control channel. The data channel is responsible for classifying data transportation, while the control channel dispatches or broadcasts the message and event. The data channel has synchronous and asynchronous types; asynchronous data is used for high-bandwidth packets, which are necessary for large data sets. Based on the MOST architecture, the system could also work as a distributed system.
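The registry-and-notification mechanism described above resembles the listener pattern; the sketch below shows one way it could look, with a central manager that lets agents register interest in properties and dispatches change events to them over the control channel. The class and method names are illustrative assumptions, not the paper's implementation.

```java
import java.util.*;

// Sketch (assumed names): a centralized registry that dispatches property-change events to registered agents.
interface Agent {
    String name();
    void onEvent(String property, Object newValue);   // control-channel notification
}

class RegistryManager {
    // property name -> agents registered for notifications about that property
    private final Map<String, List<Agent>> registry = new HashMap<>();

    /** Called during system initialization: an agent registers interest in a property. */
    void register(String property, Agent agent) {
        registry.computeIfAbsent(property, p -> new ArrayList<>()).add(agent);
    }

    /** Called when a property of some agent changes: dispatch the event to all registered agents. */
    void propertyChanged(String property, Object newValue) {
        for (Agent a : registry.getOrDefault(property, List.of())) {
            a.onEvent(property, newValue);
        }
    }
}
```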

IV. CONCLUSION

NDDCHA improves learning performance by compensating the base hypothesis using the negative data set. Useful information in the negative data is mined to benefit the model of an application. This approach expands the hypothesis space to come closer to the target space, so that the approximation error is reduced. The multi-agent technique extends the application of NDDCHA to mining large data sets and provides a means to cluster the data by partitioning the negative and positive data sets. The proposed schema offers a flexible architecture that enables users to choose a learning algorithm freely without breaking the software system. The schema described here is for binary classification. It could be extended to multi-class data mining and to regression, because the positive and negative data concepts are also valid in these scenarios. Therefore, as future work, we will study these scenarios and other efficient data protocols. Future work also includes investigating the relationship between the partitioning method and the data distribution, and improving the efficiency of data transportation.


REFERENCES
[1] M. P. Singh and M. N. Huhns, Multiagent Systems: A Theoretical Framework for Intentions, Know-How, and Communications. Springer-Verlag, 1994.
[2] M. Klusch, S. Lodi, and M. Gianluca, "The role of agents in distributed data mining: issues and benefits," 2003.
[3] C. Ming-Syan, H. Jiawei, and P. S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 8, pp. 866, 1996.
[4] M. Wooldridge, An Introduction to MultiAgent Systems. John Wiley & Sons, 2002.
[5] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[6] V. N. Vapnik, The Nature of Statistical Learning Theory, 2nd ed. Springer, 2000.
[7] R. E. Schapire, "The Boosting Approach to Machine Learning: An Overview," presented at the MSRI Workshop on Nonlinear Estimation and Classification, Berkeley, CA, 2001.
[8] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, 1996.
[9] MOST Cooperation, "MOST Specifications version 2.5," http://www.mostnet.de/downloads/Specifications/MOST%20Specifications/MOSTSpecification.pdf, 2005.



Distributed Multi-Agent Knowledge Space (DMAKS): A Knowledge Framework Based on MADWH

Adrian Gardiner
Dept. of Information Systems, Georgia Southern University
[email protected]

Abstract

Business data warehousing, an information architecture built upon the principles of data centralization and objectivity, is poorly equipped for environments in which knowledge resources are highly distributed, locally managed, and increasingly feature less-structured data types. To address the limitations of traditional data architectures and to broaden the scope of information management, information architectures called distributed knowledge management (DKM) have been proposed that focus on managing the creation of local knowledge within autonomous groups and exchanging knowledge across them [5]. Commonly, DKM approaches advocate the use of peer-to-peer agent networks as the deployment methodology. In such networks, peers are either forced to group, or can group spontaneously [3]. In this paper, we propose an alternative approach to self-organizing peer grouping called the Distributed Multi-agent Knowledge Space (DMAKS), which is a distributed knowledge repository modeled on the MADWH concept devised by [35]. The strengths of this approach are numerous, including: self-organization is based on functionality, rather than purely on semantics; self-organization is addressed at both a macro (society) level and a micro (local) level through dynamical hierarchies; evolution in functional states and knowledge is possible; different levels of abstraction of both knowledge and functionality are available; and the native support of functional decomposition enables the construction of dynamic aspect systems.

Key Terms - Data warehouse and repository, Knowledge management architecture, Multiagent systems, Peer-to-peer networks, Self-organizing systems.

1. Traditional Data Warehouse Architecture

Many companies have recognized the strategic importance of the knowledge hidden in their large databases and have built data warehouses [13]. A data warehouse is a centralized collection of data and metadata from multiple sources, integrated into a common repository and extended by summary information (such as aggregate views) that is used primarily to support organizational decision making [21]. Traditional data warehouses are designed as permanent data repositories, where data is maintained by building stovepipe systems that periodically feed in new data. Data within a data warehouse is normally a duplication of existing data sets located either inside or external to the organization, but integrated and transformed for the purpose of supporting decision making [23]. Within the data warehouse paradigm, knowledge is generated primarily in two ways: from analyst-directed interactions with the data repository, such as through visualization and OLAP tools, and from the identification of interesting facts and patterns through the application of data mining algorithms. Typically, the data cube is used as the data model of the data warehouse, given the multi-dimensionality of the stored data structures [13]. A data cube consists of several independent attributes, grouped into dimensions, and some dependent attributes, which are called measures. A data cube can be viewed as a d-dimensional array, with each cell containing the measures for the respective sub-cube [13]. A data model suitable for multidimensional analysis should provide means to define: 1. dimension levels, 2. grouping/classification relationships (that link those levels), and 3. analysis paths [11].



Dimensions within data cubes commonly reflect concept hierarchies [13]. A concept hierarchy is an instance of a hierarchy schema. A hierarchy schema can reflect a simple hierarchy (each link between parent and child levels has one-to-many cardinality) or more complex hierarchical structures, such as non-strict hierarchies, in which connections between hierarchical levels can have many-to-many cardinality [31]. Analysts, with knowledge of the embedded concept hierarchies, typically interact with the data cube to perform multidimensional analysis (i.e., choose the most relevant view for induction and hypothesis testing), which mainly consists of the following view operations: drill-down, roll-up, and slice and dice.

Limitations of traditional data warehouses include that their infrastructure is centralized and inflexible; that they generate redundant data; that they rely upon predefined dimensions, dimension levels and hierarchies, and aggregation paths; and that, given the need for renewing data sets, information stored in data warehouses is commonly dated, which undermines the ability of analysts to perform real-time analysis.
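As a concrete, if simplified, illustration of a roll-up along a concept hierarchy, the sketch below aggregates a measure from a lower dimension level (city) to a higher one (country); the data, the hierarchy mapping, and the class name are invented solely for illustration.

```java
import java.util.*;

// Illustrative sketch of a roll-up: aggregate a measure from the "city" level to the "country" level.
class RollUpSketch {
    public static void main(String[] args) {
        // Concept hierarchy link: city -> country (a grouping/classification relationship).
        Map<String, String> cityToCountry = Map.of(
                "Houston", "USA", "Atlanta", "USA", "Toronto", "Canada");

        // Fact cells at the city level: city -> sales measure.
        Map<String, Double> salesByCity = Map.of(
                "Houston", 120.0, "Atlanta", 80.0, "Toronto", 95.0);

        // Roll-up: sum the measure over each group defined by the parent level.
        Map<String, Double> salesByCountry = new HashMap<>();
        salesByCity.forEach((city, sales) ->
                salesByCountry.merge(cityToCountry.get(city), sales, Double::sum));

        System.out.println(salesByCountry);   // {USA=200.0, Canada=95.0}
    }
}
```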

2. Emerging Forms of Knowledge Management Architecture

With the advent of the semantic web, traditional information architectures are expected to evolve into ones based more upon collections of autonomous and semi-autonomous agents that come together to form flexible, dynamic-learning decision analysis cooperatives. The traditional architecture for business data warehousing, which is built upon the limiting concepts of inflexible system infrastructure and the centralization and duplication of data, is therefore poorly equipped to fit into this new vision for information management. Firestone [15] predicts that data warehouses will increasingly move towards distributed data management, to address a reality in organizational information architecture that can best be described as involving "physically distributed knowledge objects across an increasingly far-flung network architecture" [15]. Formally, [15] defines distributed knowledge management (DKM) as a system that manages the integration of distributed objects into a functioning whole to produce, maintain, and enhance a business knowledge base. Firestone [14] also highlights that the trend towards developing local data marts (distributed data marts) is a sign that many organizations' knowledge management architecture is moving towards becoming decentralized and disintegrated. Firestone [14] defines a data mart as a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process, focused on a single business process or single department (e.g., marketing). The epistemological assumptions reflected within the centralization paradigm have also been heavily criticized by ([5], [4], [6]) and colleagues, who argue that, given the subjective nature of knowledge, centralization of knowledge leads to taking knowledge out of its social context, where stakeholders typically hold locally shared interpretation schemas that give knowledge meaning. Such views about the limitations of traditional data architectures have coalesced into an approach to information management called distributed knowledge management (DKM) [5], which focuses upon managing the creation of local knowledge within autonomous groups and exchanging knowledge across them [5]. The two core principles of DKM, as devised by [5], are the principle of autonomy (communities of knowing should be granted the highest possible degree of semantic autonomy to manage their local knowledge) and the principle of coordination (collaboration between autonomous communities must be achieved through a process of semantic coordination, rather than through centrally defined semantics (semantic homogenization)). The principle of coordination (semantic interoperability between domains) has been addressed in numerous ways, for example through the use of shared schemas (e.g., ontologies) (see, e.g., [2], [12]) and local context descriptions (see, e.g., [7]). Commonly, the deployment approach for DKM is through the use of agent architectures, and in particular peer-to-peer (P2P) computing. A DKM system utilizing P2P computing is commonly referred to as peer-to-peer knowledge management (P2PKM). An example of such a system is [3]'s tool for DKM called KEEx.


In a "pure" P2P architecture [1], peers (nodes) have equal responsibility and capability, and may join or leave the network at any time, thus eliminating the need for a centralized server. Each peer member can make information available for distribution and can establish direct connections with any other member node to download information [25]. Moreover, peers dynamically discover other peers on the network and interact with each other by sending and receiving messages [1]. Accordingly, a client seeking information from a P2P network searches across scattered collections stored at numerous member nodes, all of which appear as a single repository with a single index [25]. The two core operations in most peer-to-peer systems are (1) finding the right peers to query (peer discovery and selection) and (2) the efficient routing of messages (network topology) [17]. When considering P2PKM, core operations may also include (3) semantic interoperability (a.k.a. semantic coordination, mediation, and resolution) (peer interpretation and translation) and (4) the organization of knowledge nodes (peer membership, clustering, structure, and aggregation).

3. What is a Peer?

[4] proposes to model an organization as a "constellation" of knowledge nodes, which are autonomous and locally managed knowledge sources. The concept of knowledge nodes is similar to [24]'s concept of knowledge clusters. Within this framework, a knowledge cluster is an instance of an ontology and therefore represents some structured knowledge. A knowledge cluster may be related to the overall knowledge of an agent, to a specific task, or to a given topic [24]. Basic operators on knowledge clusters include addition, filtering, search, is-subpart-of, and comparison [24]. If knowledge nodes/clusters can be viewed as object instances, they then effectively act as technical gatekeepers to a knowledge source. At a broader level of composition, peers may also contain networks. [28] addresses small-world structured networks, which exhibit special properties such as a small average diameter and a high degree of clustering. [28] claims that such a network topology is effective and efficient in terms of spreading and finding information. [28] also suggests a method of letting peers organize themselves into a small world structured around the topics that peers contain knowledge about (organization around semantics). As highlighted by the concept of small-world structures, it is important to appreciate that knowledge nodes in many proposed P2PKM architectures can be coarse-grained, in that they may be a composite of a number of different data types (and thus contain varying data and semantic structures) (see, e.g., [4], [3]). In addition, knowledge nodes may contain their own internal network structures (organization).

4. Organizing Peers into Groups

In P2PKM systems, peers are either forced to group or can group spontaneously [3]. Groups of peers are commonly referred to as communities, and membership of a community may be open [3] and dynamic. The community analogy is commonly used because there is frequently an implied 'social' connection between peers (see, e.g., [3]). Sometimes community membership is determined by designers (i.e., mandatory membership), rather than through peer interactions. The amalgam of all community instances is commonly referred to as a society [35]. A community of peers may not necessarily share physical proximity, but should possess relational proximity (e.g., shared interest, intent, or practice), which gives the community identity. It is therefore likely that peers within a community will have a high degree of shared semantics and functionality. Once communities and the relationships between them are established, this meta-structure becomes the network topology. Schmitz [27] proposes that communities of peers be formed based upon measures of semantic similarity, which can be calculated through inter-peer comparisons of profile information, queries, or knowledge items. A version of this approach is demonstrated in [29], which outlines a semantic clustering approach for the identification of small-world structures. It is of interest to note that other possible methods of defining communities, such as through shared structural or functional properties, have not been emphasized in the




P2PKM literature. This lack of emphasis perhaps reflects the P2P community’s focus on information search and sharing, rather than accommodating functional orchestration (the latter being a more dominant feature of (web) service architectures), or other models of resource sharing. [24] warn that membership of a knowledge community should not replace the intrinsic goal of an agent for which it was introduced into the system. This view implies that community membership should not necessarily lead to homogenization.
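To make the semantic-similarity-based grouping discussed in Section 4 concrete, the sketch below clusters peers whose term-vector profiles exceed a cosine-similarity threshold; the profile representation, the threshold, and the greedy grouping rule are illustrative assumptions and not a method proposed in the literature cited above.

```java
import java.util.*;

// Sketch: group peers whose profile vectors are semantically similar (cosine similarity >= threshold).
class PeerGroupingSketch {
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (var e : a.entrySet()) dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        for (double v : a.values()) na += v * v;
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Greedy community formation: each peer joins the first existing community it is similar enough to. */
    static List<List<String>> group(Map<String, Map<String, Double>> profiles, double threshold) {
        List<List<String>> communities = new ArrayList<>();
        for (String peer : profiles.keySet()) {
            Optional<List<String>> home = communities.stream()
                    .filter(c -> cosine(profiles.get(c.get(0)), profiles.get(peer)) >= threshold)
                    .findFirst();
            if (home.isPresent()) home.get().add(peer);
            else communities.add(new ArrayList<>(List.of(peer)));
        }
        return communities;
    }
}
```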

5. What Does it Mean for Peer Groups to be Self-Organizing?

Gershenson and Heylighen [20] maintain that the term self-organizing has no universally accepted meaning; however, self-organization is often defined as global order emerging from local interactions [19] and from interactions with the environment [16]. It therefore follows that global order is an emergent property of the community. Indeed, [26], in describing this process, refers to the concept of emergent classification: the process of obtaining novel classifications of an environment by a self-organizing system, which can only be achieved through structural changes. In some proposed P2PKM systems, local interactions involve the exchange of semantic information, for membership testing and community definition. However, in other proposed systems, communities may largely be pre-defined: for example, through consideration of the ontological classification of local knowledge stocks, or when the design objective is to mimic established human-based social networks (see, e.g., [28], [29], [17]). An aim of self-organization that is frequently cited is to increase order (excluding imposed order) [20]. The degree of disorder present within a system is, roughly speaking, proportional to its degree of entropy [32], as a system high in entropy will likely lack structure or differentiation between components [20]. In addition to order, [19] stress that organization requires structure with function. Structure means that the components of a system are arranged in a particular order (e.g., connections that integrate the parts into a whole); function means that this structure fulfils a purpose [20] (i.e., has intent). [18] holds the view that self-organization implies that local behavior will be (at least partially) caused internally, i.e., independently of the environment (endocausality); in other words, processes occurring within the boundary are not controlled or 'organized' by an external agent, but (at least partially) by the system itself [18]. This view supports the notion that a degree of autonomy is required to enable self-organization. Autonomy is defined by [27], in the context of DKM, as users maintaining control over their knowledge, but with the willingness to share knowledge with other peers; peers are therefore not subject to central coordination. [3] refers to this as 'semantic autonomy'. Semantic autonomy is a cornerstone principle of DKM [5]. This type of autonomy is perhaps best characterized as autonomy of control (vis-à-vis autonomy of function). However, autonomy has a wider meaning than autonomy of control. [8] states that autonomy means self-governing (a.k.a. self-controlling, self-steering). Moreover, [10] includes cohesion (relating to unity) as a central notion in autonomy, and hence in an entity's identity: the interactions within a system that bind its parts are stronger than its interactions with external systems and its internal fluctuations. [9] claims cohesion is important as it both unifies a dynamical object and distinguishes it from other dynamical objects (i.e., provides order). Therefore, full autonomy implies that local decision making and intent, while possibly influenced by the receipt of external knowledge and environmental interaction, is independent of consideration of any external agent. At the other extreme, a complete lack of autonomy implies that an agent is completely under the control and influence of an external agent, with no local initiative. The degree of autonomy associated with members within a society space will influence the type of interactions a peer has with its environment and fellow community members. For example, interactions between members within a specific community may reflect a degree of mutual dependence (i.e., lower autonomy), given shared intent. In this regard, peers can be considered to be semi-autonomous. Moreover, assuming



that peer membership is dynamic, no community should be able to successfully define itself (and to re-evaluate its identity and intent) without significant interactions with non-member peers, as identity is relational within a dynamic open system. The above discussion underlies four propositions in relation to the self-organization of peers. First, a society of peers cannot be self-organizing when all peer members have full autonomy, as no negotiation between peer members is then possible in terms of establishing identity; in other words, peers need to be semi-autonomous. Second, the level of autonomy between peer communities will be less than the autonomy between peer members within communities (within-community coupling is greater than coupling with the environment). Third, where peer communities have different levels of autonomy within the society, imperfections (inefficiencies) in network topologies may eventuate. Fourth, there is a general inverse relationship between peer grouping granularity and the level of peer autonomy, and full autonomy can exist only at the society meta-level. It is interesting to note that research into self-organization theory has emphasized organization through functional classification, rather than through semantics. For example, the concept of autonomy has primarily been defined in terms of functionality, rather than semantics. In contrast, P2PKM researchers have commonly proposed the latter approach as a method for establishing group boundaries. The emphasis on functionality in self-organization research is not surprising, given that intent and identity are primarily related to what one does (function), rather than to what one knows. It follows that determining boundaries based upon semantics will emphasize semantic-based functions, which in proposed P2PKM systems are commonly information search and retrieval, context sharing, and semantic resolution. Simply put: semantic-based organization will tend to divide information across semantically derived boundaries (i.e., identity without intent). A desirable property of P2PKM systems is to emulate real-world social networks ([4], [3]). Real-world social networks involve knowledge-intensive business processes, including collaboration, coordination, prediction, and reasoning, in which knowledge is not only interpreted but created (e.g., through insights or the identification of patterns). Such a view supports the assertion that knowledge networks should primarily be defined through function, rather than through semantics. Overall, we suggest that it is unlikely that classification (identity) based on functionality and classification based on knowledge will produce identical community boundaries (network topologies). In summary, the current state of methods to enable self-organization within P2PKM is still at an early stage of development. There are three weaknesses in current P2PKM approaches in this area:
- organization methods emphasize semantics over function;
- organization methods have focused upon network (society-level) organization, rather than self-organization at finer levels of granularity; and
- there is an implicit assumption in a number of P2PKM approaches that knowledge nodes are stable, and thus the issue of node evolution has not yet been adequately addressed in the research literature.


6. DMAKS

Recently, Zhang and Zhang [35] introduced the concept of a multiagent data warehouse (MADWH) structure as a novel approach for brain modeling. A MADWH is a dynamically evolving structure that classifies and organizes semiautonomous agents into communities and societies for coordinated learning and decision analysis [35]. Within this field of research, brain functions are viewed as being analogous to communities of fine-grained interacting multiagent systems (MACs). An important challenge emphasized by this research is to identify algorithms that emulate the identification and evolution of multi-dimensional structures (e.g., cuboid-based star schemas), thus providing possible analogs for brain functionality decomposition and granularity. Success in emulating assembly and management of knowledge and functional structures may provide insights into natural brain organization, and provide the basis of new approaches for developing intelligent machines.


Our intention in this paper, however, is not to emulate brain functionality, but to discuss the application of MADWH concepts to the field of DKM. In doing so, we introduce the concept of a Distributed Multi-agent Knowledge Space (DMAKS), which is a distributed logical knowledge repository modeled on the MADWH concept. A DMAKS instance, while being distributed, has clear logical boundaries and a meta- and micro-structure similar to that of a MADWH. Within a MADWH, the distinction between autonomous and semiautonomous agents is important, as the functionality of an autonomous agent (e.g., a human being) is materialized through the coordination of encapsulated single-function semi-autonomous agents. Within this context, a semi-autonomous agent is one that does not possess full autonomy (independent control) over its behavior (functionality/actions), and therefore does not take actions without coordination with other agents. In this way, semi-autonomous agents differ from the more coarse-grained agents typically found in many MACs, whose actions usually exhibit a greater level of autonomy [34]. The proposal for semi-autonomous agents is predicated upon observations that brain functions are not performed in isolation [35]. For example, the left and right ears of a human being cannot be considered truly functionally independent. Zhang and Zhang [35] propose that associated semiautonomous agents can be identified (mined) by examining their degree of functional similarity: similar (but not identical) agents will share one or more corner parameters, which are dimensions of agents' actions that are common across agent instances. The more distinct these actions are, the greater the corner distance between the two (corner) agents. Under this approach, agent similarity can be measured in terms of distance across a multi-dimensional function space, with a theoretical threshold distance logically distinguishing distinct agent communities. [35] and [34] discuss several functional distance measures that can be used to identify corner agents. New cuboid corner agents (or new cuboids) are established in a MADWH through extrapolation, and existing cuboids are fine-tuned through interpolation [35], as new knowledge and functionality are found by existing agents, which may redefine existing functional taxonomies. Once corner agents are identified (agents within the threshold distance), they are organized into base cuboids (corner agent sets), thus representing local (micro) functional structure. In this way, the MADWH emphasizes decomposition based upon functionality rather than, for example, semantics. The implicit assumption in MADWH is that knowledge is tied to (enables) functionality. Within a DMAKS, a knowledge stock will not be considered part of the logical repository unless it is associated with an active function. In this respect, knowledge associated with active functions becomes active knowledge. Agents supporting functions outside the scope of the knowledge repository (inactive functions) will be excluded (i.e., become inactive knowledge). However, the functionality supported by a DMAKS repository is constantly under revision, and agents will join or leave the DMAKS repository according to functional demand (activation). Therefore, membership of the DMAKS society is also determined primarily on a functional basis, rather than through semantics (cf. P2PKM). Primary generic functions supported within a DMAKS instance include prediction, analytical, and reasoning functions, while secondary generic functions include pattern identification functions and alert (arousal) systems. In this way, the functionality of DMAKS is clearly aimed at decision support, rather than at direct support of operational processes. Agents within a MADWH (agent community) belong to dynamic agent cuboids, which organize similar (cooperative or competitive) semiautonomous corner agents into hypercube structures. The value of organizing agents in this way is that the cuboid meta-structure can easily accommodate natural growth and evolution across a number of functional dimensions [35] (i.e., changes in the definition of functions, or in which functions are active, will be reflected within the relevant agent cuboids). MADWH also provides a meta-structure for cuboid organization at the agent community and society levels. Local agent cuboids within the MADWH agent society are organized and interconnected through a


Hasse-type lattice cuboid meta-structure [35]. This structure can also be applied at the level of society abstraction to interlink all functional communities within the society. These organizational structures are also included within a DMAKS instance. The lattice cuboid meta-structure provides the conditions for dynamic hierarchies, which allow for emergent definitions of active functions. [18] define a dynamical hierarchy as a dynamical system with multiple levels of nested subcomponent structures, in which the structures and their properties at one level emerge from the ongoing interactions between the components of the lower level. Adapting the lattice meta-structure to DMAKS allows different operations commonly associated with reasoning and interrogation to be performed at different levels of abstraction (e.g., the level of problem representation). In a drill-down operation, levels of functional abstraction are reduced until a function (and therefore knowledge set) best fitting the active problem is found. An example of roll-up is to summarize functional knowledge and reasoning through the lattice. Roll-up operations can also be performed by either relaxing or tightening the distance threshold that defines the cuboid dimensions. Typical slice/dice operations can be performed by focusing on a sub-part of the lattice structure. Given that decomposition within DMAKS is based on functionality rather than semantics, this approach allows access to knowledge across domains or knowledge nodes. (In other words, multiplicity in knowledge application is directly supported through functional decomposition.) Such access to functionality and knowledge is important for the development of dynamic aspect systems, which are difficult to implement in current forms of P2PKM. An aspect system can be defined as a functional part of a system, limited to some of its properties or aspects ([30], cited in [20]). Aspect approaches present an alternative to system analysis based on subsystem decomposition or object classification. In decision support, an aspect approach is quite often required, as a problem is examined along different dimensions, and therefore support for this approach to viewing knowledge and functionality is attractive. From a systems perspective, examples of aspects are technical, legal and organizational, social, cultural, and economic aspects [33]. Within computer science, there is also growing interest in aspect computing [22], which is based on the concept of functional cross-cutting. Examples of cross-cuts (aspects) within the context of operating systems are: algorithm, data structure, and reference locality [22].


7. Conclusion

Thus, by adapting MADWH concepts to the architecture of a knowledge space in the form of a DMAKS, we have outlined a possible high-level theoretical architecture for self-organizing peers within P2PKM. Advantages of this approach include:
- self-organization is based on functionality, rather than semantics;
- self-organization is addressed at both a macro (society) level and a micro (local) level through dynamical hierarchies;
- evolution in functional states and taxonomies is possible;
- different levels of abstraction of both knowledge and functionality are available;
- the native support of functional decomposition enables the construction of dynamic aspect systems;
- emergent classification and recognition of functionality is supported through the processes of interpolation and extrapolation; and
- the evolution of active functions and knowledge is managed effectively through a coordinated multidimensional structure.

8. References

[1] S. S. R. Abidi and X. Pang, "Knowledge Sharing Over P2P Knowledge Networks: A Peer Ontology and Semantic Overlay Driven Approach", International Conference on Knowledge Management, Singapore, 13-15 December 2004.
[2] M. Arumugam, A. Sheth, and I. B. Arpinar, "Towards Peer-to-Peer Semantic Web: A Distributed Environment for Sharing Semantic Knowledge on the Web", Technical Report, Large Scale Distributed Information Systems Lab, University of Georgia, 2001.
[3] M. Bonifacio, P. Bouquet, P. Busetta, A. Danieli, A. Donà, G. Mameli, and M. Nori, "KEEx: A Peer-to-Peer Tool for Distributed Knowledge Management", Working Paper, 2005.
[4] M. Bonifacio, P. Bouquet, and R. Cuel, "Knowledge Nodes: the Building Blocks of a Distributed Approach to Knowledge Management", Journal of Universal Computer Science, vol. 8, no. 6, 2002, pp. 652-661.
[5] M. Bonifacio, P. Bouquet, and P. Traverso, "Enabling Distributed Knowledge Management: Managerial and Technological Implications", Informatik: Zeitschrift der schweizerischen Informatikorganisationen, vol. III, 2002, pp. 1-7.
[6] M. Bonifacio, P. Bouquet, G. Mameli, and M. Nori, "Peer-mediated Distributed Knowledge Management", Technical Report DIT-03-032, 2003.
[7] P. Bouquet, A. Donà, L. Serafini, and S. Zanobini, "ConTeXtualized Local Ontology Specification via CTXML", AAAI 2002 Workshop on Meaning Negotiation, 28 July 2002.
[8] J. Collier, "What is Autonomy?", International Journal of Computing Anticipatory Systems: CASY 2001 - Fifth International Conference, 2002.
[9] J. Collier, "Interactively Open Autonomy Unifies Two Approaches to Function", in D. M. Dubois (ed.), Computing Anticipatory Systems: CASY'03 - Sixth International Conference, American Institute of Physics, Melville, New York, AIP Conference Proceedings 718, 2004, pp. 228-235.
[10] J. Collier, "Information Theory as a General Language for Functional Systems", in D. M. Dubois (ed.), Anticipatory Systems: CASY'99 - Second International Conference, American Institute of Physics, Woodbury, New York, AIP Conference Proceedings, 2000.
[11] P. Diderichsen, "Selective Attention in the Development of the Passive Construction: A Study of Language Acquisition in Danish Children", in E. Engberg-Pedersen and P. Harder (eds.), Ikonicitet og Struktur, Netværk for Funktionel Lingvistik, Department of English, University of Copenhagen, 2001.
[12] M. Ehrig, P. Haase, N. Stojanovic, and M. Hefke, "Similarity for Ontologies - A Comprehensive Framework", Workshop on Enterprise Modelling and Ontology: Ingredients for Interoperability, PAKM 2004, December 2004.
[13] M. Ester, J. Kohlhammer, and H. Kriegel, "The DC-tree: A Fully Dynamic Index Structure for Data Warehouses", Proc. 16th Int. Conf. on Data Engineering (ICDE 2000), San Diego, CA, 2000, pp. 379-388.
[14] J. M. Firestone, "DKMS Brief No. Six: Data Warehouses, Data Marts, and Data Warehousing: New Definitions and New Conceptions", White Paper, http://www.dkms.com/White_Papers.htm, 1997.
[15] J. M. Firestone, "Distributed Knowledge Management Systems: The Next Wave in DSS", White Paper, http://www.dkms.com/White_Papers.htm, 1997.
[16] E. Gonzalez, M. Broens, and P. Haselager, "Consciousness and Agency: The Importance of Self-Organized Action", Networks, vol. 3-4, 2004.
[17] P. Haase and R. Siebes, "Peer Selection in Peer-to-Peer Networks with Semantic Topologies", Proceedings of the International Conference on Semantics in a Networked World (ICNSW'04), 2004.
[18] F. Heylighen, "Mediator Evolution: A General Scenario for the Origin of Dynamical Hierarchies", presented at Evolvability & Interaction: Evolutionary Substrates of Communication, Signaling, and Perception in the Dynamics of Social Complexity, 2003.
[19] F. Heylighen and C. Gershenson, "The Meaning of Self-organization in Computing", IEEE Intelligent Systems, Trends & Controversies - Self-organization and Information Systems, May/June 2003.
[20] C. Gershenson and F. Heylighen, "When Can We Call a System Self-organizing?", Complexity Digest, http://www.comdig.org/, 2003.
[21] J. Hoffer, M. Prescott, and F. McFadden, Modern Database Management, 7th ed., Prentice Hall, Upper Saddle River, NJ, 2005.
[22] G. Kiczales, J. Irwin, J. Lamping, J. Loingtier, C. Lopes, C. Maeda, and A. Mendhekar, "Aspect-oriented Programming", Proceedings of the European Conference on Object-Oriented Programming (ECOOP), 1997.
[23] E. Malinowski and E. Zimányi, "OLAP Hierarchies: A Conceptual Perspective", CAiSE 2004, 2004, pp. 477-491.
[24] P. Maret, M. Hammond, and J. Calmet, "Virtual Knowledge Communities for Corporate Knowledge Issues", Proceedings of ESAW 04, 2004, pp. 33-44.
[25] M. Parameswaran, A. Susarla, and A. B. Whinston, "P2P Networking: An Information-Sharing Alternative", IEEE Computer, vol. 34, no. 7, 2001, pp. 31-38.
[26] L. M. Rocha, "Syntactic Autonomy", Proceedings of the Joint Conference on the Science and Technology of Intelligent Systems (ISIC/CIRA/ISAS 98), National Institute of Standards and Technology, Gaithersburg, MD, IEEE Press, September 1998, pp. 706-711.
[27] C. Schmitz, "Towards Self-Organizing Communities in Peer-to-Peer Knowledge Management", Workshop on Ontologies in Peer-to-Peer Communities, ESWC 2005, 2005.
[28] C. Schmitz, "Self-Organization of a Small World by Topic", P2PKM 2004, 2004.
[29] C. Schmitz, "Towards Content Aggregation on Knowledge Bases through Graph Clustering", Grundlagen von Datenbanken, 2005, pp. 112-116.
[30] W. ten Haaf, H. Bikker, and D. J. Adriaanse, Fundamentals of Business Engineering and Management: A Systems Approach to People and Organisations, Delft University Press, 2002.
[31] A. Tsois, N. Karayannidis, and T. K. Sellis, "MAC: Conceptual Data Modeling for OLAP", Design and Management of Data Warehouses, 2001.
[32] http://en.wikipedia.org
[33] J. Zevenbergen, "A Systems Approach to Land Registration and Cadastre", Nordic Journal of Surveying and Real Estate Research, vol. 1, 2004.
[34] W. Zhang, "Nesting, Safety, Layering, and Autonomy: A Reorganizable Multiagent Cerebellar Architecture for Intelligent Control - with Application in Legged Locomotion and Gymnastics", IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 28, no. 3, 1998, pp. 357-375.
[35] W. Zhang and L. Zhang, "A Multiagent Data Warehousing (MADWH) and Multiagent Data Mining (MADM) Approach to Brain Modeling and Neurofuzzy Control", Information Sciences, vol. 167, 2003, pp. 109-127.


Applying MultiAgent Technology for Distributed Geospatial Information Services

Naijun Zhou 1 and Lixin Li 2
1 Department of Geography, University of Maryland - College Park
2 Department of Computer Sciences, Georgia Southern University
[email protected], [email protected]

Abstract



This paper describes a framework that uses multiagent technology to facilitate the sharing and querying of distributed geospatial data. In particular, under the architecture of distributed Geospatial Information Services (GIServices), this paper proposes a query agent to process user queries, a metadata agent to represent the metadata of geospatial data sets, a discovery agent to locate candidate data sets related to the user query, and schema and semantics agents to identify the same or similar schema attributes and domain values in the candidate data sets.

1. Introduction

The World Wide Web provides a platform for sharing and serving distributed geospatial data. Indeed, a new architecture, Geospatial Information Services (GIServices), has been proposed to support distributed geospatial data storage, query and delivery over the Web [1]. In parallel, the technologies of multiagent data warehousing and multiagent data mining enable the agent-based discovery and integration of data from distributed sources, which can be applied to the design and implementation of GIServices. Using multiagents to query and retrieve geospatial data has been proposed by, e.g., [2] and [3]. However, multiagent technology has not been examined specifically for Web-based distributed GIServices. This paper introduces a framework for using multiagents to facilitate distributed GIServices. The proposed multiagents allow communications among the user query, the metadata and the geospatial databases, and provide a framework to resolve some critical issues of GIServices, including data discovery, schema matching and semantic integration. Specifically, this paper discusses the following agents:

• query agent: to query data and return query results to the user;
• discovery agent: to identify candidate data sets that are suitable for conducting a full query;
• metadata agent: to present metadata;
• schema agent: to present schema definitions, and to identify the same/similar attributes between a query agent and each candidate data set;
• semantics agent: to present the semantic definitions of database domain values, and to identify the same/similar values between a query agent and each candidate data set.

Following the typology of agents by Nwana [4], the metadata, schema and semantics agents are static agents residing with each data set, while the query and discovery agents are mobile agents that can migrate among different data sources. All agents are deliberative agents, i.e., they have internal information and can negotiate with other agents.

2. A Framework of MultiAgent-based GIServices

2.1 System Architecture

The framework for using multiagent technology in GIServices is depicted in Figure 1. Distributed geospatial data sets are maintained locally by their providers, making data update and maintenance more efficient than storing the data on a centralized site. Together with every geospatial data set, metadata (e.g., date of data production, spatial extent and theme keywords), schema definitions (i.e., the meanings of the database attributes), and semantics definitions (i.e., the meanings of domain values) are also made available through the agents. A query agent accepts and processes user queries, migrates to each local data set, and communicates and negotiates with the metadata, schema and semantics agents. A discovery agent collaborates with the query agent to locate candidate data sets that satisfy the criteria of a given user query.
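As a rough illustration of this division of labour, the sketch below models a data set with the information held by its static agents and a mobile query agent that visits it; the class names, fields and the visit() protocol are hypothetical assumptions, not part of the paper's design.

```python
# Hypothetical sketch of the GIServices agent architecture described above.
from dataclasses import dataclass, field

@dataclass
class DataSetNode:
    """A locally maintained geospatial data set with its static agents' information."""
    name: str
    metadata: dict   # held by the metadata agent (extent, theme keywords, date, ...)
    schema: dict     # held by the schema agent (attribute -> definition)
    semantics: dict  # held by the semantics agent (local domain value -> meaning)

@dataclass
class QueryAgent:
    """A mobile agent carrying the parsed user query between data sets."""
    theme: str
    attribute: str
    value: str
    results: list = field(default_factory=list)

    def visit(self, node: DataSetNode) -> None:
        # Negotiate with the static agents residing at the node.
        if self.theme.lower() in map(str.lower, node.metadata.get("theme_keywords", [])):
            self.results.append(node.name)

pg = DataSetNode("PG County",
                 metadata={"theme_keywords": ["land use"]},
                 schema={"lu": "Land Use Type"},
                 semantics={"cropland/pasture": "cropland"})

agent = QueryAgent(theme="Land Use", attribute="Land Use Type", value="cropland")
agent.visit(pg)
print(agent.results)   # -> ['PG County']
```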

Figure 1. Applying multiagent technology to facilitate GIServices. (The figure shows distributed data sets, Data Set 1 through Data Set n, each accompanied by its own metadata, schema and semantic agents, with mobile query and discovery agents migrating among them.)

2.2 MultiAgents for GIServices

This section examines the agents proposed to support GIServices. These agents work together to accomplish the tasks of finding, querying and delivering distributed geospatial data.

2.2.1. Query Agent. A query agent allows users to pose queries by providing a set of terminology (an ontology) for the query predicate. The user query is parsed and interpreted by the query agent and is represented in a format that can be understood by the other agents. As a mobile agent, the query agent is the main agent that communicates with the other agents in order to return query results to the user.

2.2.2. Discovery Agent. Once a user query has been accepted, the query agent passes part of its information to a discovery agent. The discovery agent is a mobile agent able to communicate with metadata agents and to identify candidate data sets.

2.2.3. Metadata Agent. Every data set has a metadata agent. A metadata agent contains not only the metadata of a geospatial data set, but also the algorithm for communicating with the discovery agent. The communication is a negotiation process between the discovery agent and the metadata agent, in which the agents exchange and compare information such as the spatial extent, the theme, and the date of data production. The result of the negotiation process is a list of candidate data sets, returned to the query agent, on which the full query can be run.

2.2.4. Schema Agent. A schema agent, residing with each data set, contains the definitions of the geospatial database attributes. After the discovery agent identifies the candidate data sets, the query agent compares each of its attribute names with the attribute names defined in a schema agent and finds the same or a similar attribute name in the schema agent. Most likely, however, the attribute names used by the query agent and by the schema agent will differ. Possible solutions to this difference include a pre-defined lookup table, the deployment of a single shared ontology, or reliance on a thesaurus.

2.2.5. Semantics Agent. A semantics agent aims to find the same or similar domain values (semantics) between the query agent and a data set. A semantics agent resides with a data set and maintains semantic information including the ontology, definitions, and thesaurus of the local semantics. Computational algorithms are also provided with each semantics agent, allowing it to compare the local semantics with the semantics carried by a query agent.

3. An Example of MultiAgent-based GIServices

This section explains, with the aid of an example, how the agents are represented and how they communicate with each other. It also briefly discusses the fundamental reasoning methods, leaving more advanced algorithms for future work. Figure 2 illustrates the flow of multiagent communications needed to answer the user query: "find the land use of cropland in Prince George's County".

Figure 2. A work-flow of agent communications for GIServices. (The figure traces the example query: the user's message "find land use cropland in PG County" goes to the query agent; the discovery agent asks the metadata agents to find data with spatial-extent = "PG County" and theme = "Land Use"; a metadata agent compares its local metadata to the message and finds the PG County data set; the schema agent is asked for the attribute name "Land Use Type" and finds the local attribute lu; the semantic agent is asked for the domain value "cropland" and finds cropland/pasture; finally the query agent issues the local query lu = "cropland/pasture" against the PG County geospatial database and the query result is returned to the user in XML.)

The user query is represented in XML in a query agent, including the theme, the spatial extent, the attribute names, the domain values, and the ontology the user applied to pose the query. Figure 3 shows the query agent for the land use example. The query agent sends two tasks to the discovery agent: searching for data sets within the spatial extent of Prince George's County, and searching for data sets with a theme of land use (Figure 4).

Figure 3. The XML representation of a query agent. (The XML lists the theme "Land Use", the attribute name "Land Use Type", the domain value "cropland", the spatial extent "Prince George's County", and the user's ontology "my-own-ontology".)

Figure 4. A message sent by the query agent to the discovery agent. (The XML message asks for data with spatial-extent = "Prince George's County" and theme = "Land Use", under the ontology "my-own-ontology".)

Once the discovery agent receives the spatial-extent task, it performs the following actions:
• obtaining the spatial extent, in coordinates (X, Y), of the location Prince George's County from a geospatial gazetteer;
• communicating with all of the local metadata agents distributed in the network (a metadata agent is shown in Figure 5), each of which holds the spatial extent and theme of its local data set;
• comparing the extent (i.e., a spatial query) with every metadata agent's bounding_coordinates to find candidate data sets.

Similarly, the discovery agent executes the task of searching for data with a theme of land use by comparing the term "land use" to each local metadata agent's theme_keywords, for example the land use keyword in the metadata agent shown in Figure 5. Via the discovery agent, the metadata agents return to the query agent a list of candidate data sets satisfying both the spatial extent and theme criteria.

Figure 5. A metadata agent. (The XML records the data set's bounding coordinates -77.1, -76.7, 39.1, 38.5 and the theme keyword land use.)
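The bounding-box and keyword comparison described above can be pictured with a small sketch. The data structures, the coordinate ordering, and the overlap test are simplifying assumptions, not the paper's algorithm.

```python
# Illustrative sketch (assumed representations): a discovery agent checks each
# metadata agent's bounding box and theme keywords against the query's task.

def boxes_overlap(a, b):
    """Axis-aligned overlap test; boxes are (west, east, south, north)."""
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]

def is_candidate(task, metadata):
    spatial_ok = boxes_overlap(task["extent"], metadata["bounding_coordinates"])
    theme_ok = task["theme"].lower() in (k.lower() for k in metadata["theme_keywords"])
    return spatial_ok and theme_ok

# Extent for "Prince George's County" as it might come back from a gazetteer
# (hypothetical lookup), compared with metadata like that of Figure 5.
task = {"theme": "land use", "extent": (-77.1, -76.7, 38.5, 39.1)}
pg_metadata = {"bounding_coordinates": (-77.1, -76.7, 38.5, 39.1),
               "theme_keywords": ["land use"]}

print(is_candidate(task, pg_metadata))   # -> True
```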

Once one or more candidate data sets have been identified by the discovery agent, the query agent sends another task, the attribute-name task, to the schema agent of each candidate data set.
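How a schema agent might resolve the query's attribute name "Land Use Type" to its local attribute lu is sketched below. The lookup table and the fallback string comparison are assumptions standing in for the pre-defined lookup table, shared ontology, or thesaurus mentioned in Section 2.2.4.

```python
# Hypothetical sketch of schema matching at one candidate data set.
from difflib import SequenceMatcher

# Assumed local schema (attribute -> human-readable definition) and an
# assumed pre-defined lookup table mapping query terms to local attributes.
LOCAL_SCHEMA = {"lu": "Land Use Type", "cnty": "County Name"}
LOOKUP_TABLE = {"land use type": "lu"}

def match_attribute(query_name, schema=LOCAL_SCHEMA, lookup=LOOKUP_TABLE):
    key = query_name.strip().lower()
    if key in lookup:   # exact hit in the pre-defined lookup table
        return lookup[key]
    # Fallback: pick the local attribute whose definition is most similar.
    return max(schema, key=lambda a: SequenceMatcher(None, key, schema[a].lower()).ratio())

print(match_attribute("Land Use Type"))   # -> 'lu'
```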




Figure 6. A message for finding a local attribute name that is the same as or similar to land use.

To find the local domain value(s) that are the same as or similar to cropland, the query agent communicates with the semantics agent and compares their domain values (semantics). A request sent by the query agent to the semantics agent is shown in Figure 7. The algorithm for comparing and finding the same or similar local domain values (land use codes) for cropland can be based on a thesaurus such as WordNet, or on the algorithms introduced by Wiegand and Zhou [5] and Zhou [6]. The semantic relations between domain values can be sameAs, differentFrom, include, etc., and are expressed in a machine-readable format such as the Web Ontology Language (OWL) [7]. Figure 8 shows an ontology of the domain values of the attribute lu (i.e., land use codes) in the query agent, and Figure 9 shows the local land use codes in a semantics agent that have a sameAs semantic relation with the query agent's cropland.
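A minimal sketch of this domain-value matching follows. The sameAs table is a made-up fragment of the kind of mapping Figures 8 and 9 express in OWL, and the thesaurus fallback is only indicated, not implemented.

```python
# Hypothetical sketch of domain-value (semantic) matching at a semantics agent.

# Assumed sameAs relations between the query agent's land use terms and the
# local land use codes of the PG County data set (cf. Figures 8 and 9).
SAME_AS = {
    "cropland": ["cropland/pasture"],
    "forest": ["deciduous forest", "evergreen forest"],
}

def match_domain_value(query_value, same_as=SAME_AS, local_codes=None):
    """Return the local code(s) having a sameAs relation with the query value."""
    if local_codes is None:
        local_codes = ["cropland/pasture", "deciduous forest", "urban"]
    candidates = same_as.get(query_value.lower(), [])
    # In a fuller implementation a thesaurus such as WordNet, or the algorithms
    # of [5] and [6], would supply candidates when the table has no entry.
    return [code for code in candidates if code in local_codes]

print(match_domain_value("cropland"))   # -> ['cropland/pasture']
```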

Figure 7. A request for finding the same or a similar domain value(s) for cropland. (The XML request carries the domain value "cropland", the spatial extent "Prince George's County", the user's ontology "my-own-ontology", and the task type domain-value.)

Figure 9. An OWL representation of the land use codes of the lu attribute in the Prince George's County database having a sameAs semantic relation with cropland.

After the communications among the agents, the query agent can use query re-writing to convert the user query into local queries for all of the candidate data sets, and return the query results in XML to the user. A sample local query is shown in Figure 10.

Figure 10. A local query. (The query asks for the theme Land Use, the attribute lu, the value crop/pasture, within the bounding coordinates x1,y1,x2,y2,…, under the local ontology.)
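The final re-writing step can be sketched as a simple substitution of the matched attribute name and domain value into a local query. The query template and function below are illustrative assumptions, not the paper's query re-write technology.

```python
# Hypothetical sketch of rewriting the user query into a local query.

def rewrite_query(user_query, attribute_match, value_matches, local_extent):
    """Substitute locally matched names and values into the user query."""
    return {
        "theme": user_query["theme"],
        "attribute": attribute_match,    # e.g. 'lu' for 'Land Use Type'
        "values": value_matches,         # e.g. ['cropland/pasture']
        "extent": local_extent,          # local bounding coordinates
        "ontology": "local-ontology",
    }

user_query = {"theme": "Land Use", "attribute": "Land Use Type", "value": "cropland"}
local_query = rewrite_query(user_query, "lu", ["cropland/pasture"],
                            local_extent=(-77.1, -76.7, 38.5, 39.1))
print(local_query)
# -> {'theme': 'Land Use', 'attribute': 'lu', 'values': ['cropland/pasture'], ...}
```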

5. Conclusions and Discussions

This paper proposes a framework for the use of multiagents to represent, discover and query geospatial data. In particular, this paper proposes a query agent to process user queries, a metadata agent to represent the metadata of geospatial data sets, a discovery agent to identify candidate data sets for a full query, and schema and semantics agents to find the same or similar attribute names and domain values in the candidate data sets. Both multiagent technology and GIServices are emerging research areas that require further investigation. Future work includes the design and representation of the agents' capabilities and communication, methodological research on identifying the same/similar metadata, schema attributes and semantics (i.e., GIS interoperability), and an improved GIServices architecture to support the multiagent technology.

References

[1] M. Tsou and B. P. Buttenfield, "A Dynamic Architecture for Distributing Geographic Information Services", Transactions in GIS, vol. 6, no. 4, 2002, pp. 355-381.
[2] Y. Luo, X. Wang, and Z. Xu, "Agent-based Collaborative and Paralleled Distributed GIS", XXth ISPRS Congress, Istanbul, Turkey, July 12-23, 2004.
[3] J. J. Nolan, R. Simon, and A. K. Sood, "An Agent-based Approach to Imagery and Geospatial Computing", AGENT'01, Montreal, Quebec, Canada, May 28-June 1, 2001.
[4] H. S. Nwana, "Software Agents: An Overview", Knowledge Engineering Review, vol. 11, no. 3, 1996, pp. 205-244.
[5] N. Wiegand and N. Zhou, "An Ontology-based Geospatial Web Query System", in P. Agouris et al. (eds.), Next Generation Geospatial Information, Taylor and Francis, 2005, pp. 157-167.
[6] N. Zhou, "A Study on Automatic Ontology Mapping of Categorical Information", Proceedings of the National Conference on Digital Government Research, Boston, May 18-21, 2003, pp. 401-404.
[7] World Wide Web Consortium, "OWL Web Ontology Language", http://www.w3.org/TR/owl-features, 2004.


Published by Department of Mathematics and Computing Science

Technical Report Number: 2005-04
November, 2005
ISBN 0-9738918-0-7