International Journal of Computer Science And Technology (IJCST), Vol. 3, Issue 2, April - June 2012

ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)

Distributed Data Mining and Multi-Agent Technology: An Integrated Approach

1Meha Khera, 2Dr. Mukesh Sharma

1Dept. of CE, The Technological Institute of Textile & Sciences, Bhiwani, Haryana, India
2Dept. of CE/IT, The Technological Institute of Textile & Sciences, Bhiwani, Haryana, India

Abstract
Many researchers are currently working on distributed data mining with multi-agent technology. With the rapid development of network technology and the growing level of IT adoption, distributed databases are now in common use. This paper discusses existing distributed data mining technology and the architecture of a distributed data mining system based on multi-agent technology.

Keywords
ARTA, Association Rule, Centralized, Data Mining, Distributed, DDM, MADARMA, MAS, Multi-Agent

I. Introduction
Data mining is the process of extracting hidden, previously unknown knowledge and rules of potential value for decision making from large volumes of data in databases. With the rapid development of distributed databases, centralized data mining can no longer meet the demands placed on it. Research on distributed data mining systems has therefore become a rapidly growing topic in the field of data mining.

Whether centralized or distributed data mining is used depends on the circumstances. The distribution and heterogeneity of the data are the difficult parts of distributed data mining, and centralized data mining algorithms do not meet its requirements.

Before discussing distributed data mining and multi-agent technology, we describe the structure of a data mining system. Data held in databases or data warehouses is first cleansed, integrated and filtered, and then stored on a database or data warehouse server. A knowledge base stores the knowledge needed during data mining. Data mining itself is the crucial step, in which intelligent techniques are applied to extract potentially useful patterns. The structure of this process is shown in fig. 1.

Fig. 1: The Structure of Data Mining System

II. Distributed Data Mining
Distributed Data Mining (DDM) aims at extracting useful patterns from distributed, heterogeneous databases in order, for example, to combine them into a distributed knowledge base and use them for decision making. DDM may also be useful in environments with multiple compute nodes connected over high-speed networks. Even if the data can be centralized quickly over a relatively fast network, balancing the computational load properly across a cluster of nodes may still require a distributed approach.

Privacy plays an increasingly important role in emerging data mining applications. For example, suppose a consortium of banks collaborates to detect fraud. If a centralized solution were adopted, all the data from every bank would have to be collected in a single location to be processed by a data mining system, exposing each bank's private data outside its own site.

A general Distributed Data Mining framework is presented in fig. 2. In essence, the success of the various DDM algorithms lies in aggregation. Each local model represents locally coherent patterns but lacks details that may be required to induce globally meaningful knowledge. For this reason, various DDM algorithms centralize a subset of the local data to compensate. The ensemble approach has been applied in various domains to increase the accuracy of the predictive model to be learnt: it produces multiple models and combines them to enhance accuracy. Typically, a voting scheme (weighted or un-weighted) is used to aggregate the base models into a global model. Keeping data transfer to a minimum is another key attribute of successful DDM algorithms.

Distributed data mining is a research field that has emerged in recent years. Because of its promising prospects, a considerable number of researchers are now working in this field and have produced some results.

The two basic steps of a typical distributed data mining algorithm are:
• analyzing the partial (local) data and producing a partial data model (partial knowledge);
• combining the partial data models from the different data sites to obtain the overall data model (overall knowledge).
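The two steps above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions, not an algorithm from this paper: each site summarizes its own transactions as item counts (the partial model), and a coordinator sums those counts into global relative supports (the overall model), so that only counts, not raw records, leave a site. The function names `local_model` and `combine_models` and the sample transactions are hypothetical.

```python
from collections import Counter

def local_model(transactions):
    """Step 1: summarize one site's data as item counts plus the record count."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # count each item at most once per transaction
    return counts, len(transactions)

def combine_models(models):
    """Step 2: merge the partial models into global relative supports."""
    total_counts, total_size = Counter(), 0
    for counts, size in models:
        total_counts.update(counts)  # adds the per-site counts together
        total_size += size
    return {item: c / total_size for item, c in total_counts.items()}

# Hypothetical data held at two sites
site_a = [["bread", "milk"], ["bread"], ["milk", "eggs"]]
site_b = [["bread", "eggs"]]

global_support = combine_models([local_model(site_a), local_model(site_b)])
print(global_support["bread"])  # bread occurs in 3 of the 4 transactions
```

Only the `(counts, size)` pairs cross site boundaries here, which is one simple way to keep data transfer small.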

Fig. 2: General Distributed Data Mining Framework


III. Multi-Agent Systems (MAS)
Agent technology emerged from the development of distributed application systems and has proved remarkably effective. Recent research and development work on agents in distributed applications includes the following:
• Agent technology can improve Internet applications, for example an agent that matches information to the people who need it. Based on the information available, the agent can proactively notify the information provider of who currently needs it.
• Agent technology can improve the management of parallel projects, for example through a management agent for development work. Such an agent can make the workflow and schedule known to each workstation, proactively guide each workstation to advance the work according to that workflow and schedule, process and evaluate the work status reports from each workstation, and centrally manage all kinds of data.
• Agent technology can be used to develop distributed interactive simulation systems. For example, a flight training simulator can be connected to several workstations on a computer network, with agents on the workstations imitating airplanes, so that together with the simulator they form an interactive aviation simulation system. With this kind of simulator, trainees can not only practice all kinds of aircraft handling skills but also carry out various air maneuvers by interacting with the intelligent, autonomous simulated airplanes.

A Multi-Agent System (MAS) is a system composed of multiple interacting intelligent agents. Multi-agent systems can be used to solve problems that are difficult or impossible for an individual agent or a monolithic system to solve. Examples of problems suited to multi-agent systems research include online trading, disaster response, and modeling social structures.

Researchers have now begun to apply multi-agent systems to distributed data mining. Applying multi-agent techniques to a distributed data mining system has the following advantages:

A. Agent's Autonomy
An agent's autonomy corresponds to the autonomy of a data source. An agent can access local data in accordance with local access restrictions and security policies, and cooperate with agents on other data sources. In this way, the protection of private information can be strengthened.

B. Agent's Initiative
An agent's initiative reduces the need for users to supervise and intervene in the data mining process. Users set the objectives and methods for the agents at an early stage, and the agents can then adjust how the task is carried out while it runs.

C. Agent's Self-adaptation
Agents can choose data sources and collect data independently. An important problem in distributed or real-world environments is that the environment changes, which in turn changes the data sources. In these circumstances, agents can search for and select data according to criteria set beforehand, such as the expected amount, type and quality of data. Agents can thus serve as basic tools that locate data sources for static data mining methods in a dynamic environment, so that static methods can be used to analyze dynamic data.


D. Agent's Coordination
Coordination among agents extends traditional data mining methods so that they can handle massive data in a distributed environment.

The work in [5] proposes the architecture of a customizable MAS for general-purpose distributed data mining, which exploits P2P concepts to improve performance and scalability. In particular, the MAS architecture is organized as a flat P2P network of nodes, each of which is a MAS. Such an organization supports efficient dynamic load balancing that is particularly suitable for irregular search algorithms. The MAS has been customized for the frequent subgraph mining problem for the discovery of discriminative molecular fragments. Experimental tests in [5] on real molecular compounds confirmed its effectiveness. A multi-agent technology for Distributed Data Mining (DDM) and Distributed Classification (DC) is developed in [11]. The design of both DDM and DC systems poses several new, non-specific tasks and challenges, and that work focuses on some of the key problems.

IV. Distributed Data Mining System based on Multi-Agent
Different strategies can be used in distributed data mining depending on the data themselves, the distribution of the data, the software and hardware resources available, and the required precision. Accordingly, distributed data mining systems differ in the following strategies:

A. Data Strategy
A distributed data mining system can choose to move the data, intermediate results, predictive models, or the data mining algorithm itself. A system based on Local Learning builds models at each distributed site and then moves these models to a central site. A system based on Centralized Learning moves the data to the central site and builds the models there. Some data mining systems use Hybrid Learning, i.e. a strategy that combines local learning and centralized learning.

The data can also be distributed in different ways. For example, different records may be placed at different sites, different attributes of the same records may be distributed across different sites, or different tables may be placed at different sites. When gathering data it is therefore necessary to adopt an appropriate merging strategy.

B. Task Strategy
A distributed data mining system can choose to apply one data mining algorithm in a coordinated way across several data sites, or to apply different data mining algorithms independently at each data site. In Independent Learning mode, each data mining algorithm is applied separately at each distributed data site. In Coordinated Learning mode, one (or more) data sites use one data mining algorithm to coordinate the mining task across several data sites.

C. Model Strategy
There are many methods for combining the predictive models built at different sites. The simplest and most commonly used is voting, which combines the outputs of the individual models by majority vote. The Knowledge Probing method, by contrast, builds a comprehensive model from the inputs and outputs of all the models together with the expected output.

This section also presents a distributed data mining system based on multi-agent technology, as shown in fig. 3. This system can not only mine local data but also perform distributed data mining across different data sites. It is composed of a user interface agent, a user information base, a knowledge management agent, a task management agent, the overall knowledge base, a coordinating agent, and data mining agents.
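As a small illustration of the voting approach described under Model Strategy, the sketch below combines the class labels that several local models predict for one record. It is a hedged example rather than part of the system in fig. 3: the function name `majority_vote`, the optional per-model weights, and the sample labels are all assumptions introduced here.

```python
from collections import Counter

def majority_vote(predictions, weights=None):
    """Combine the labels predicted by several local models for one record."""
    if weights is None:
        weights = [1.0] * len(predictions)  # un-weighted voting
    tally = Counter()
    for label, weight in zip(predictions, weights):
        tally[label] += weight
    # Label with the largest total weight; ties go to the label seen first
    return tally.most_common(1)[0][0]

# Hypothetical predictions from models trained at three sites
site_predictions = ["fraud", "ok", "fraud"]

unweighted = majority_vote(site_predictions)
weighted = majority_vote(site_predictions, weights=[0.2, 0.9, 0.3])
print(unweighted, weighted)
```

With equal weights two of the three models outvote the third; with the assumed weights, the single high-weight model prevails.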

Fig. 3: Distributed Data Mining based on Multi-Agent

V. Distributed Association Rule Mining Algorithm based on Multi-Agent
A distributed association rule mining algorithm based on multi-agent technology is presented in this section. The proposed algorithm is MADARMA (Multi-Agent based Distributed Association Rule Mining Algorithm). It works as follows.

Suppose a distributed database system S comprises n sites S1, S2, S3, ..., Sn. Let DB be the distributed database of S, where DB = DB1 ∪ DB2 ∪ ... ∪ DBn. Let size and sizei be the sizes of DB and DBi respectively, where size = size1 + size2 + ... + sizen. DB is the overall global database and DBi is the local database at site Si.

Steps of the MADARMA algorithm:
(i). Local mining is performed at each site, yielding a local rule set R(i), where i = 1, 2, ..., n. Each local mining step uses the Apriori algorithm [1] to mine association rules.
(ii). The local rule set R(i) obtained at each site is sent to the main controlling site, which builds an overall rule knowledge base.
(iii). Once the rule knowledge base has been constructed, an association rule tree is generated for the association rules. This sub-algorithm is called ARTA (Association Rule Tree Algorithm) and is described as follows:
(a). Create the root node of the tree and mark it as "null".
(b). Scan the rule knowledge base built from the rules collected from each sub-site. Create a branch in the tree for each rule of the form P -> Q mined from site 1, where P is the antecedent and Q is the consequent. The leaf node of the branch is the consequent of the rule, and an appearance factor of 1 is recorded at the leaf node.
(c). Compare the rules mined from site 2 with the rules already present in the tree. If a rule matches a rule already in the tree, no new branch is created and the appearance factor of the matching branch is increased by 1. Otherwise, create a new branch and record its appearance factor as 1. Using this construction method, add all the rules mined from all sites in the rule knowledge base to the association tree.
(iv). Let N be the minimum appearance factor of a rule. Scan the association tree formed from the overall rule base, compare the appearance factor of each rule with N, and delete every branch whose appearance factor is less than N. Delete the corresponding rules from the rule knowledge base.
(v). After deleting the branches with an appearance factor below N, scan the databases of all sub-sites again to collect the support and confidence information for the remaining rules.
(vi). From this scan, obtain the support and confidence of each rule remaining in the association tree.
(vii). Determine the overall rules according to the given minimum support and minimum confidence.

The main focus of this paper is to put forward a novel distributed association rule mining algorithm based on multi-agent technology. The flow chart of the proposed MADARMA algorithm is shown in fig. 4. MADARMA uses the sub-algorithm ARTA (Association Rule Tree Algorithm), whose flow chart is shown in fig. 5.

Fig. 4: MADARMA Algorithm
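Steps (ii) to (vii) can be sketched in Python as follows. This is an illustrative reading of the textual description, not the authors' implementation. It assumes that a rule reported by a further site increments the appearance factor of its existing branch, it flattens the tree into a nested dictionary (antecedent, then consequent, then appearance factor), and it takes the local rule sets as given instead of running Apriori. The names `build_rule_tree`, `prune`, `global_rules` and all sample data are hypothetical.

```python
from collections import defaultdict

def build_rule_tree(local_rule_sets):
    """ARTA sketch: tree[antecedent][consequent] = appearance factor."""
    tree = defaultdict(dict)  # the top-level dict plays the role of the "null" root
    for rules in local_rule_sets:  # one rule set R(i) per site
        for antecedent, consequent in set(rules):
            leaves = tree[antecedent]
            # New branch starts at 1; a matching branch is incremented
            leaves[consequent] = leaves.get(consequent, 0) + 1
    return tree

def prune(tree, min_appearance):
    """Step (iv): keep rules whose appearance factor is at least N."""
    return [(p, q) for p, leaves in tree.items()
            for q, factor in leaves.items() if factor >= min_appearance]

def global_rules(rules, site_dbs, min_support, min_confidence):
    """Steps (v)-(vii): rescan all sites, keep rules meeting both thresholds."""
    size = sum(len(db) for db in site_dbs)  # size = size1 + ... + sizen
    kept = []
    for p, q in rules:
        n_p = sum(1 for db in site_dbs for t in db if p <= t)
        n_pq = sum(1 for db in site_dbs for t in db if (p | q) <= t)
        support = n_pq / size
        confidence = n_pq / n_p if n_p else 0.0
        if support >= min_support and confidence >= min_confidence:
            kept.append((p, q, support, confidence))
    return kept

# Hypothetical example: two sites, transactions as sets of items
A, B, C = frozenset("a"), frozenset("b"), frozenset("c")
site_dbs = [
    [{"a", "b"}, {"a", "b", "c"}, {"c"}],
    [{"a", "b"}, {"a"}, {"b", "c"}],
]
# Local rule sets as if each site had already mined them
local_rule_sets = [[(A, B), (C, A)], [(A, B)]]

tree = build_rule_tree(local_rule_sets)
candidates = prune(tree, min_appearance=2)
result = global_rules(candidates, site_dbs, min_support=0.4, min_confidence=0.7)
print(result)
```

In this example only the rule a -> b is reported by both sites, so it alone survives pruning with N = 2; the second scan then gives it a global support of 3/6 and a confidence of 3/4.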


Fig. 5: ARTA Algorithm

VI. Conclusion and Prospect Work
Owing to rapid advances in information and network technology, there is a strong need for Distributed Data Mining. DDM faces several problems, such as the heterogeneity and diversity of data. Redesigning and restructuring centralized mining algorithms to suit distributed data mining poses a further problem. This paper focuses on some novel approaches to overcoming these problems:
• Multi-agent technology is applied to the distributed data mining system. The self-adaptability and intelligence of multi-agent systems provide a novel approach.
• A distributed association rule mining algorithm based on multi-agent technology (MADARMA) is proposed.
This paper has mainly discussed association rules; future work includes classification for DDM systems.

References
[1] Agrawal R, Shafer J C, "Parallel mining of association rules", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, Issue 6, Dec. 1996, pp. 962-969.
[2] Cheng-Fa Tsai, Yi-Chau, Chi-Pin Chen, "A New Fast Algorithms for Mining Association Rules in Large Databases", IEEE International Conference, Vol. 7, 6-9 Oct. 2002, pp.6.
[3] Cheung D W, Han J W, Ng V T et al., "A fast distributed algorithm for mining association rules", in Proc. of the IEEE 4th International Conference Parallel and Distributed Information Systems, Miami Beach, Florida, Dec. 18-20, 1996, pp. 31-42.
[4] David W. Cheung, Vincent T. Ng, Ada W. Fu, Yongjian Fu, "Efficient Mining of Association Rules in Distributed Databases (DMA)", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.


[5] G. Di Fatta, G. Fortino, "A Customizable Multi-Agent System for Distributed Data Mining", Proceedings of the 22nd ACM Symposium on Applied Computing (SAC 2007), Special Track on Agents, Interactions, Mobility, and Systems (AIMS), March 2007, Seoul, Korea. (in press)
[6] G. Piatetsky-Shapiro, W. J. Frawley, "Knowledge Discovery in Databases", MIT Press, 1991.
[7] H. Kargupta, I. Hamzaoglu, B. Stafford, "Scalable, distributed data mining using an agent based architecture", in Proceedings of the Third International Conference on the Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, California, pp. 211-214, 1997.
[8] J. Dasilva, C. Giannella, R. Bhargava, H. Kargupta, M. Klusch, "Distributed data mining and agents", Engineering Applications of Artificial Intelligence, 18(7), pp. 791-807, October 2005.
[9] R. Agrawal, R. Srikant, "Fast algorithms for mining association rules in large databases", in Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, August 29-September 1, 1994.
[10] R. Agrawal, T. Imielinski, A. Swami, "Mining association rules between sets of items in large databases", in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 207-216, Washington, DC, May 26-28, 1993.
[11] Vladimir Gorodetsky, Oleg Karsaeyv, Vladimir Samoilov, "Multi-agent Technology for Distributed Data Mining and Classification", in Proceedings of the IEEE/WIC International Conference on Intelligent Agent Technology (IAT'03).
[12] Vuda Sreenivasa Rao, "Multi agent-based distributed data mining: An overview", International Journal of Reviews in Computing (IJRIC), 2009-2010, pp. 83-92.
[13] Vuda Sreenivasa Rao, Dr. S Vidyavathi, "Distributed Data Mining and Mining Multi-Agent Data", (IJCSE) International Journal on Computer Science and Engineering, Vol. 02, No. 04, 2010, pp. 1237-1244.
[14] Vuda Sreenivasa Rao, S Vidyavathi, G. Ramaswamy, "Distributed Data Mining And Agent Mining Interaction And Integration: A Novel Approach", International Journal of Research and Reviews in Applied Sciences (IJRRAS), September 2010, pp. 388-398.
[15] Yan Zhao, Yong Yao, Zhijng Liu, "An Efficient Distributed Algorithm for Mining Association Rules", Fourth IEEE International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007).

Meha Khera is currently a final-year M.Tech. (CE) scholar at The Technological Institute of Textile & Sciences, Bhiwani (HR), India. She also works as a teaching assistant in the CE Dept. of the same institute. Her research areas include Data Mining, Image Processing and Soft Computing. She has many publications to her credit in national and international level proceedings.


Dr. Mukesh Kumar is currently working as Associate Professor & HOD in the CE & IT department, The Technological Institute of Textile & Sciences, Bhiwani (HR), India. He received his Ph.D. from M.D. University, Rohtak (HR), India and his M.Tech. from GJUS&T, Hisar (HR), India. He has nine years of teaching experience. His research areas include Mobile Ad hoc Networks, Wireless Communication, Coding Theory, Image Processing, Soft Computing and Data Mining. He has many publications to his credit in international and national level journals and proceedings. He is a Life Member of CSI and other societies.
