Agents and Stream Data Mining: A New Perspective

Kok-Leong Ong and Zili Zhang
Deakin University, Pigdons Road, Waurn Ponds, Victoria 3217, Australia
{leong, zili}@deakin.edu.au

Wee-Keong Ng and Ee-Peng Lim
Nanyang Technological University, Nanyang Avenue, Block N4, Singapore 639798
{awkng, aseplim}@ntu.edu.sg

To appear in IEEE Intelligent Systems
Abstract

Knowing what to do with the massive amount of data collected has always been an ongoing issue for many organizations. While data mining has been touted as the solution, it has failed to deliver the impact despite its successes in many areas. One reason is that data mining algorithms were not designed for the real world, i.e., they usually assume a static view of the data and a stable execution environment where resources are abundant. The reality, however, is that data are constantly changing and the execution environment is dynamic. Hence, it becomes difficult for data mining to truly deliver timely and relevant results. Recently, the processing of stream data has received much attention. What is interesting is that the methodology used to design stream-based algorithms, when combined with agents, may well be the solution to the above problem. In this article, we discuss this issue and present some preliminary results of our work in the Matrix project.
General Terms
Data mining, software agents, cooperative hybrid systems
Introduction

Knowing what to do with the massive amount of data collected has always been an ongoing issue for many organizations. Today, data are not just ingredients for churning out statistical reports, but are the basis of supporting efficient operations in many organizations. And to some extent, they provide the competitive intelligence needed to survive in today's economy. Since data can potentially be so important, it becomes increasingly difficult to refrain from collecting any data available. The effect is that organizations are overloaded with information. Data mining has been touted as the solution (or perhaps, the killer application) to the problem of information overload. It is supposed to replace the human in performing the laborious task of sieving through data and to report only the important results. In one sense, data mining technologies have created many success stories and have, in recent years, gained research and industry interest. Yet, data mining does not seem to have delivered the impact [9] in terms of penetrating every aspect of life or systems (compared to, for example, the Internet).
There are many reasons for this lack of impact, such as human factors, technical issues, or a combination of both. One technical issue that we find interesting (and worthy of discussion) is that data mining development has, for a while, "lost touch" with the needs of the real world.
The Limitations of Data Mining

In our opinion, data mining algorithms (and hence their products) were designed with two wrong assumptions. First, they assume that the data are static in all aspects. Second, they assume that the execution environment where the algorithm runs is stable, with abundant resources (e.g., memory or computing power). The reality, however, is that neither assumption really holds.

First, most data mining algorithms operate on a snapshot of the data. The data may be collected and stored in a data warehouse and, during data mining, the snapshot is assumed to remain static even though new data could have arrived or parts of the snapshot may no longer be valid. Throughout the lifetime of the algorithm's execution, this snapshot is assumed to be a reflection of the real-world situation. In addition, the data are assumed to be constantly available while the algorithm runs, i.e., there is some unbounded or huge storage space for the snapshot that the algorithm can read as many times as it needs. While this may not appear to be an issue, real-world data are often many times larger. And we are not talking about scientific data here. In a single day, Centrelink, Australia's welfare agency, receives more than 11 million page requests in its Web logs; Telstra, Australia's largest telecommunications company, produces 15 million mobile call records; American supermarket chain WalMart records 20 million sales transactions; and Google handles 150 million searches. For such massive amounts of data, it would be difficult, if not impossible, to generate any results in useful time by running off-the-shelf algorithms, even when unbounded memory, storage and CPU time are available. This is because existing algorithms are simply not designed to do so.

The other problem with data mining algorithms is that they all assume an execution environment with all the resources they need, and that all resources remain available until the execution terminates. Although such an assumption is common in other applications, data mining algorithms cannot afford this luxury because of the sheer amount of time they need to complete a task. In some cases, the task may be ongoing as data keep arriving. Hence, data mining algorithms must be aware of the
conditions of the host environment (e.g., unavailability of the host, reduced resources, etc.).
This awareness is becoming very important in today's computing paradigm. In the past, data mining was considered a specialized technology available to a limited number of organizations. It was often operated in a controlled and centralized environment, where the analysis involved an expert user. The process of knowledge discovery followed a strict sequence and algorithms ran in batch mode. Nevertheless, progress in other technologies changed the paradigm.
Advancements in data storage and acquisition technologies, wireless and mobile technologies, the Internet, and smaller computing devices all contributed to this change. Huge amounts of data can now be collected from multiple sources; wireless and mobile technologies created pervasive computing; the Internet gave rise to connectivity; and, as a result, users can work across different devices. Most important of all, these technologies generate enormous amounts of data that demand ongoing and real-time analysis.
When All are Not Right, Innovation Occurs

For a while, data mining researchers overcame the problem by developing faster algorithms and incorporating interactivity into the knowledge discovery process. But the amount of data continues to grow exponentially compared to the efficiency of the algorithms, and the gap between data mining and the real world widens. Of course, when all are not right, innovation occurs. Very recently, the database community has recognized that existing database approaches (including data mining) are no longer suitable for handling this class of data that arrives continuously at a very high rate. Because of its continuity, such data are called data streams [2]. New algorithms for processing and mining streams were proposed [1; 6]. In particular, the mining of stream data does not require a complete snapshot. Results are updated as soon as possible when data arrive, and the algorithms can operate in host environments where resources are scarce. Despite the limited resources, such an algorithm is able to compute very quickly using a small number of CPU cycles. This speedup and lower resource consumption come at the price of lower accuracy in the results, but the error is maintained at a user-acceptable level. The trade-off aside, our interest is to point out that these algorithms are actually designed with real-world data conditions in mind. And it is certainly a better approach, especially when absolute results are not necessary. However, these algorithms continue to suffer from the shortcoming of assuming a stable execution environment. They are not able to handle dynamic situations, such as when a host becomes unavailable before execution completes. Also, if a host with more resources becomes available, they are not able to take advantage of that situation.

To give a clearer picture of what we have discussed so far, Figure 1 shows a timeline of a sequence of events occurring during data mining. The same data are analyzed by both conventional algorithms (in the upper half of the figure) and stream-based algorithms (lower half). We first illustrate the example for the case when the conventional algorithm is used. Our hypothetical problem is that we are looking for interesting patterns; an interesting pattern in this case is defined as the occurrence of a symbol, which we denote as T in the figure.

In the conventional case, a snapshot of the data is taken when the user begins data mining. In our example, the snapshot contains the symbols {H, D, J, Y, U, E, R, P, ...}. Once the snapshot is read, the algorithm begins processing using some sophisticated search technique which, at some point in time, will usually consume a large amount of memory and CPU cycles, as the graph at the top of the figure shows. During execution, data continue to arrive (usually at a rapid rate). In the snapshot approach, this newly arrived data may be stored somewhere until the next analysis is commissioned. As illustrated in our example, the interesting element was missed and the algorithm reported an outdated or invalid result.
To capture the interesting element, the next run of the algorithm can only start after the element has arrived and is captured in the snapshot. In real life, this would be impossible: it is impractical to start instances of the algorithm at regular intervals due to resource availability, and neither does it make sense to have a human observe the incoming data. Even if the element is captured in the snapshot, conventional algorithms may take too long to produce the results (recall that they operate in batch mode), rendering the discovery useless or leaving insufficient time to react. On the other hand, the lower half of the figure illustrates what happens when stream-based algorithms are used. In this case, in place of the snapshot is a summary structure, which holds only a subset of what a snapshot would contain. Once started, the algorithm constantly updates the summary structure whenever data arrive (so that it has the best picture of all data seen so far), but the size of the structure will always be bounded, as shown in the second graph. Also, the CPU cycles consumed are much lower, albeit with some possibility of error; the error probability is generally outweighed by the overall benefits. We see that once the interesting element is detected, the algorithm is able to output this result, usually before its value expires. This happens most of the time and is thus more useful. In the conventional case, the chance of missing the interesting pattern, or finding it too late, is very much higher due to the batch mode of operation. Given the improved results and lower resource requirements, the use of summary structures is now very popular in many data mining applications.
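To make the contrast concrete, the following toy sketch (our own illustration, not code from any of the systems discussed) compares a batch pass over a stored snapshot with a bounded summary that is updated as each symbol arrives; the alphabet, the window size and the "surprise" test are illustrative assumptions only.

```python
from collections import Counter, deque

def batch_mine(snapshot, watched="T"):
    """Conventional style: mine a stored snapshot; results appear only after the full pass."""
    counts = Counter(snapshot)
    return f"{watched} occurs {counts[watched]} time(s) in the snapshot"

class BoundedSummary:
    """Stream style: a fixed-size summary updated per arrival; results can be read at any time."""

    def __init__(self, window=8):
        self.recent = deque(maxlen=window)    # bounded memory: old items fall off the window
        self.counts = Counter()

    def update(self, symbol):
        if len(self.recent) == self.recent.maxlen:
            self.counts[self.recent[0]] -= 1  # account for the item about to be evicted
        self.recent.append(symbol)
        self.counts[symbol] += 1

    def seen(self, symbol):
        return self.counts[symbol] > 0

snapshot = "HDJYUERP"                         # the data available when mining starts
stream = list(snapshot) + ["T"]               # the "surprise" element T arrives later
print(batch_mine(snapshot))                   # reports 0 occurrences of T: the surprise is missed

summary = BoundedSummary()
for sym in stream:
    summary.update(sym)                       # bounded work per arriving symbol
    if sym == "T" and summary.seen("T"):
        print("surprise element T detected while the stream is still running")
```

The point of the sketch is only the shape of the computation: bounded memory, a small amount of work per arrival, and results that can be emitted before the stream ends.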
Mobile Data Mining Agents

Perhaps it is not within the scope of data mining research to consider the dynamics of the execution environment. That is where mobile agent technology comes into the picture. We believe mobile agents may well be the other half of the puzzle needed to close the gap between data mining and the current computing paradigm. Generally, one view of agent technology is that it provides a proxy or wrapper for another technology to function in an autonomous manner. Therefore, a data mining agent fitted with the ability to process data streams will be able to handle the dynamics of both the data and the execution environment. Above and beyond that, the fusion of agents with stream processing techniques for data mining opens up new possibilities.
Figure 1: A comparison between conventional data mining algorithms and the new stream-based algorithms, shown as the sequence of events happening in the data. Also shown are two graphs depicting their resource consumption patterns.
For example, stream mining algorithms often have parameters that provide a bound on the computing resources available. In other words, a data mining agent can be aware of the resources required (e.g., the amount of disk and memory space) to perform a task beforehand. This information allows agents to autonomously find appropriate hosts to best carry out their tasks (hence, mobile data mining agents). The mobility of the agent is not compromised in any way because no data is actually carried. Instead, the agents carry a summary structure of the data they have seen so far. This summary structure is usually small enough to fit into limited disk space or memory, as discussed earlier. The more exciting aspect of mobile data mining agents is their ability to reconfigure autonomously according to the host environment. For example, when multiple execution hosts are available, an agent can replicate itself onto different hosts to perform mining in a distributed but cooperative manner. This can be done quickly because there is no need to move data between hosts, only the summary structure and the agent itself. Previously, moving to a different host meant moving a massive amount of data, which negates the speedup from concurrent execution. Another example is when a host is about to become unavailable. Prior to shutdown,
the agent can save its execution state and move to another available host. This level of improved robustness, without the involvement of the user, ensures that data mining can happen continuously in the background. Before we move on to discuss our project, it may be interesting to note that the concept of marrying agents with data mining is not new. What makes mobile data mining agents really exciting now is the possibility of getting them right: we have the appropriate algorithms that integrate better with what agent technology promises. Hence, the next step is to really put them together. Previous work used conventional algorithms that limit the agents' potential to operate within a dynamic environment. For example, some data mining agents are stationary and operate in a well-controlled environment. They have access to huge computing resources that enable conventional algorithms to perform at their best. The drawback in this case is the need to upload huge amounts of data, thus consuming precious bandwidth and increasing the likelihood of compromising the privacy of user details. On the other hand, agents that were mobile tended to consume a huge amount of resources on the target host. This is not an issue if the host is dedicated; if not, other applications on the host may be starved of resources, or the agent may not have enough resources to complete its task on time.
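As a rough sketch of why migration is cheap when only a summary is carried (our own simplification, not the Matrix implementation), an agent's transportable state can be reduced to its summary structure and its position in the assigned problem space:

```python
import pickle

class MiningAgentState:
    """What a mobile mining agent would carry when it migrates: no raw data, only its
    summary structure, its share of the problem space, and its progress so far."""

    def __init__(self, summary, task_range, position=0):
        self.summary = summary          # e.g., an OSSM-like list of per-segment counts
        self.task_range = task_range    # the slice of the problem space assigned to this agent
        self.position = position        # how far the agent has progressed in that slice

    def checkpoint(self):
        """Serialize the state before the current host shuts down."""
        return pickle.dumps(self)

    @staticmethod
    def resume(payload):
        """Re-create the state on the new host and continue from where the agent stopped."""
        return pickle.loads(payload)

# Summaries are tiny compared to the raw data, so the payload shipped between hosts is small.
state = MiningAgentState(summary=[{"a": 4, "b": 1}, {"b": 2}], task_range=("a", "m"), position=17)
payload = state.checkpoint()
restored = MiningAgentState.resume(payload)
print(len(payload), restored.task_range, restored.position)
```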
Figure 2: The Matrix system, where a number of agents were developed to cooperatively perform data mining using summary structures (in this case, the OSSM structure for frequent pattern mining). As the agents are truly lightweight and mobile, they respond very well in achieving autonomous data mining in the dynamic environment we created.
The Matrix Project

Motivated by the above, we recently started the Matrix project to develop an ecology of agents that perform knowledge discovery on very large data sets (i.e., streams) within dynamic environments. We borrow concepts from data stream techniques (e.g., summary structures) to create agents that are truly lightweight and mobile, so as to achieve a high degree of concurrency and robustness. Our initial implementation is a group of frequent pattern mining agents that report patterns (i.e., frequent itemsets) occurring above some frequency threshold. Through this initial implementation, we report results that form the evidence for the feasibility of our proposal. Our ultimate goal in this project is to develop algorithms that truly exploit the benefits of fusing these two technologies.

Currently, we have a number of agents in the system, as shown in Figure 2. When a user initiates a data mining session on a very large snapshot or a data stream, the summary structure agent is dispatched. By design, this will be
the only agent that arrives at the host where the large snapshot or data stream is located. It creates the summary structure (called the OSSM) within the host's security sandbox to avoid moving data over the network. The benefits of this approach are that privacy is ensured, since it is usually not possible to identify individuals from summary structures, and that very little bandwidth is needed to move agents around different execution environments. Once construction completes, the summary structure agent sends the OSSM to the plan and coordinate agent before processing the next portion of the stream. Upon receiving the OSSM, the plan and coordinate agent identifies suitable hosts and generates a plan that breaks down the data mining task. Instances of data mining agents are then created and sent to the respective hosts, together with the summary structure and their share of the problem space. Since there is no further access to the raw data, the agents are dispatched quickly over the network and data mining begins almost instantly. This is different from previous agent-based data mining systems, where data need to be moved to the target host or there are continuous I/O requests over the network. In our implementation, the OSSM eliminates any
network activity once the agents are dispatched. This gives a high degree of concurrency and no single point of failure. Also illustrated in the figure are two examples of why data mining can benefit from agent technology. Suppose the Macintosh host (the machine with a thick curved outline) is to be shut down; the agent on that host can then move to another available host by packing only its execution state and the summary structure. The other case is when the plan and coordinate agent detects the presence of newly available hosts that have more computing power. By design, the agent operating with the least resources gets to migrate, as shown in our figure, where the agent moves from a smaller host to a larger one. These dynamic adjustments are all possible because of a different data mining approach that ensures discovery can be maintained continuously to yield results that are timely and relevant.
Summary Structure

The implementation to demonstrate our proposal utilizes the OSSM [5] as the summary structure. The OSSM is a lightweight structure that holds the frequencies of all 1-itemsets in each segment of the database D. A segment of D is a partition containing a set of transactions such that D = S_1 ∪ ... ∪ S_n and S_p ∩ S_q = ∅ for p ≠ q. In each segment, the frequency of each 1-itemset is registered; thus, the frequency of a 1-itemset {c} in D can be obtained as $\sum_{i=1}^{n} \sigma_i(\{c\})$. While the OSSM only contains the frequency distribution of 1-itemsets across multiple segments, it can be used to give an upper bound on the frequency $\hat{\sigma}$ of any itemset C in D using the formula below, where O_n is the OSSM constructed with n segments and $\sigma_i(\{c\})$ is the support of the itemset {c} in segment S_i:

$$\hat{\sigma}(C, O_n) = \sum_{i=1}^{n} \min\{\sigma_i(\{c\}) \mid c \in C\}$$
Let us consider the example in Figure 3. Assume that in this setup each segment has exactly two transactions. Then we have the OSSM (right table), where the frequency of each item in each segment is registered. By the equation above, the estimated frequency of a pattern C = {a, b} would be $\hat{\sigma}(C, O_n) = \min(2, 1) + \min(2, 0) + \min(0, 2) = 1$. Although this estimate is only an upper bound on the frequency of C, it turns out to be the actual frequency of C in D for this particular configuration of segments. Suppose we now swap T_1 and T_5 in the OSSM, i.e., S_1 = {T_2, T_5} and S_3 = {T_1, T_6}; then $\hat{\sigma}(C, O_n) = 2$! This observation suggests that the way transactions are assigned to segments can affect the quality of the estimate. Clearly, if each segment contains only one transaction, then the estimate is optimal and equals the actual frequency. However, this number of segments is practically infeasible, as it is equivalent to making I/O scans over D itself. The ideal alternative is to use a minimum number of segments n_m to summarize D while maintaining the optimality of the estimate. In our example, the database can be viewed as a data stream of 6 transactions. By the construction property discussed in [7], we can produce a summary structure (i.e., the OSSM)
by merging S_1 and S_2 to create a 2-segment OSSM to represent the 6 transactions. More importantly, the OSSM is effectively 1/3 the size of the original database, but every pattern frequency can still be obtained. Clearly, an agent carrying the OSSM is more mobile than one carrying the entire database. Thus, the goal is to find the best configuration of segments such that the quality of the estimate, for a given pattern, is optimal. This segment minimization problem was first addressed in [5], where a number of algorithms were proposed. Generally, these algorithms have a trade-off between the quality of the OSSM and the speed of construction. For example, the proposed Random-Greedy and Random-RC algorithms are fast, but they produce an inaccurate summary of the data. On the other hand, the Greedy algorithm constructs an optimal OSSM but with a high runtime that is inappropriate for data streams. Given its potential, we pursued the problem further. Recently, we developed a novel technique to construct a summary of the database where every pattern's exact frequency with respect to the database is maintained. Our algorithm, called FSSM, is therefore ideal for creating summary structures for data streams, and for demonstrating the concepts in our proposal.
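To make the bound concrete, here is a small Python sketch (our own illustration, not the implementation from [5]) of an OSSM represented as a list of per-segment 1-itemset counts, with the upper-bound estimate applied to the Figure 3 example and to the swapped configuration mentioned above.

```python
from collections import Counter

def build_ossm(segments):
    """OSSM: for each segment, record the frequency of every 1-itemset (single item)."""
    return [Counter(item for txn in seg for item in txn) for seg in segments]

def upper_bound(ossm, itemset):
    """sigma_hat(C, O_n): sum over segments of the minimum 1-itemset count among C's items."""
    return sum(min(seg[c] for c in itemset) for seg in ossm)

# Figure 3 example: S1 = {T1, T2}, S2 = {T3, T4}, S3 = {T5, T6}.
segments = [
    [{"a"}, {"a", "b"}],    # S1
    [{"a"}, {"a"}],         # S2
    [{"b"}, {"b"}],         # S3
]
print(upper_bound(build_ossm(segments), {"a", "b"}))   # min(2,1) + min(2,0) + min(0,2) = 1

# Swapping T1 and T5 between S1 and S3 loosens the bound for the same data.
segments_swapped = [
    [{"a", "b"}, {"b"}],    # S1 = {T2, T5}
    [{"a"}, {"a"}],         # S2
    [{"a"}, {"b"}],         # S3 = {T1, T6}
]
print(upper_bound(build_ossm(segments_swapped), {"a", "b"}))   # now 2
```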
Algorithm FSSM

To give the reader a better idea of how the OSSM structure is obtained, we digress for a moment to discuss the FSSM algorithm. Due to space limitations and for ease of discussion, we assume the reader is familiar with the FP-Tree and the OSSM; if not, a proper treatment can be found in [5; 7]. First, we need to understand the relationship between the FP-Tree and the OSSM.

Lemma 1. Let S_i and S_j be two segments of the same configuration from a collection of transactions. If we merge S_i and S_j into one segment S_m, then S_m has the same configuration, and $\hat{\sigma}(C, S_m) = \hat{\sigma}(C, S_i) + \hat{\sigma}(C, S_j)$.

The term configuration refers to the characteristic of a segment described by the descending frequency order of its 1-itemsets. As an example, suppose the database has three unique items and two segments, i.e., S_1 = {b(4), a(1), c(0)} and S_2 = {b(3), a(2), c(2)}, where the number in parentheses is the frequency of each item in the segment. In this case, both segments are described by the same configuration ⟨σ({b}) > σ({a}) > σ({c})⟩, and can therefore be merged (by Lemma 1) without losing accuracy.

In the more general case, the lemma solves the segment minimization problem. Suppose each segment begins with a single transaction, i.e., the 1-itemset frequency registered in each segment is either 1 or 0. We begin by merging two single-transaction segments of the same configuration. From this merged segment, we continue merging other single-transaction segments as long as the configuration is not altered. When no other single-transaction segments can be merged without losing accuracy, we repeat the process on another configuration. The number of segments found after processing all distinct configurations is the minimum number of segments required to summarize D without losing accuracy.
Transaction ID   Contents   Segment
T1               {a}        S1
T2               {a, b}     S1
T3               {a}        S2
T4               {a}        S2
T5               {b}        S3
T6               {b}        S3

        S1   S2   S3   D = S1 ∪ S2 ∪ S3
{a}      2    2    0    4
{b}      1    0    2    3
Figure 3: A collection of transactions (left) and its corresponding OSSM (right). The OSSM is constructed with a user-defined segment size of n = 3. In the optimal case, the minimum number of segments required is n = 2, i.e., after merging S_1 and S_2.

Theorem 1. The minimum number of segments required for the upper bound on σ(C) to be exact for all C is the number of segments with distinct configurations.

Proof. As shown in [5].

Notice that the process of merging two segments is very similar to the process of FP-Tree construction. First, the criterion used to order items in a transaction is the same as that used to determine the configuration of a segment (specifically, a single-transaction segment). Second, the merging criterion of two segments is implicitly carried out by overlaying a transaction on an existing unique path in the FP-Tree (a unique path in the FP-Tree is a distinct path that starts from the root node and ends at one of the leaf nodes). An example will illustrate this observation. Let T_1 = {f, a, m, p}, T_2 = {f, a, m} and T_3 = {f, b, m}, such that the transactions are already ordered and σ({b}) > σ({a}). Based on FP-Tree characteristics, T_1 and T_2 will share the same path in the FP-Tree, while T_3 will have a path of its own. The two transactions overlaid on the same path in the FP-Tree actually have the same configuration: ⟨σ({f}) > σ({a}) > σ({m}) > σ({p}) > σ({b}) > ...⟩. This is because σ({b}) = 0 in both T_1 and T_2, and σ({p}) = 0 for T_2. For T_3, the configuration is ⟨σ({f}) > σ({b}) > σ({m}) > σ({a}) > σ({p}) > ...⟩, where σ({a}) = σ({p}) = 0. Clearly, this is a different configuration from that of T_1 and T_2 and hence a different path in the FP-Tree. This leads us to the following theorem and its corollary.

Theorem 2. Given an FP-Tree constructed from some collection, the number of unique paths (or leaf nodes) in the FP-Tree is the minimum number of segments achievable without compromising the accuracy of the OSSM.

Proof. See [7].

Corollary 1. The transactions that are fully contained in each unique path of the FP-Tree are the set of transactions that constitutes a distinct segment in the optimal OSSM.

Proof. See [7].

From Theorem 2 and Corollary 1, the algorithm to construct the optimal OSSM is given in Algorithm 1. Notice that the process is very much based on FP-Tree construction. In fact, the entire FP-Tree is constructed along with the optimal OSSM. Therefore, the efficiency of the algorithm is bounded by the time needed to construct the FP-Tree, i.e., within two scans of the database. Briefly, as each transaction is inserted into the FP-Tree, its 1-itemsets are first ordered by their descending frequencies.
While the rationale for this in the context of FP-Tree construction is to ensure maximum overlapping, it also ensures that every parent node has a frequency that is no lower than that of its children. This ordering in turn translates to the configuration of a segment. More importantly, the insertion of a transaction during FP-Tree construction is equivalent to finding the segment with the same configuration, but with the combination analysis (i.e., checking the configuration of segments to merge) avoided. This observation is the rationale behind FSSM's performance and accuracy, and is what data stream applications require. The purpose of inserting transactions into the FP-Tree is to guide the construction of the OSSM. If a transaction lies completely on an existing path, then we simply increment the counters in the corresponding segment. If, however, the transaction requires a new path, additional nodes are created in the FP-Tree, and a new segment representing the new path is added to the OSSM.
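The following is a minimal Python sketch of the construction just described (listed more formally as Algorithm 1 below). It is our own simplification rather than the actual FSSM implementation from [7]: a flat list of frequency-ordered paths stands in for the FP-Tree, and each path owns one segment.

```python
from collections import Counter

def build_optimal_ossm(transactions):
    """Simplified FSSM idea: transactions whose frequency-ordered form lies along the same
    unique FP-Tree path are merged into one segment, so every 1-itemset count stays exact
    while the number of segments is minimized."""
    # First scan: global 1-itemset frequencies, used to order items inside each transaction.
    freq = Counter(item for txn in transactions for item in txn)

    def ordered(txn):
        # Descending global frequency (ties broken alphabetically); this tuple plays the
        # role of the path the transaction would follow in the FP-Tree.
        return tuple(sorted(txn, key=lambda i: (-freq[i], i)))

    segments = []                                # list of [path, Counter] pairs, one per unique path
    for txn in transactions:
        t = ordered(txn)
        for seg in segments:
            path = seg[0]
            if t[:len(path)] == path or path[:len(t)] == t:
                seg[0] = max(path, t, key=len)   # transaction lies along this path (possibly extending it)
                seg[1].update(t)                 # just increment the segment's counters
                break
        else:
            segments.append([t, Counter(t)])     # a genuinely new path, hence a new segment
    return [seg[1] for seg in segments]

# The six transactions T1..T6 of Figure 3 collapse to the optimal two segments.
transactions = [{"a"}, {"a", "b"}, {"a"}, {"a"}, {"b"}, {"b"}]
for seg in build_optimal_ossm(transactions):
    print(dict(seg))                             # {'a': 4, 'b': 1} and {'b': 2}
```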
Initial Results

The objective of our experiments is to demonstrate the feasibility and potential benefits of our proposal. The initial implementation used in the experiments was created by extending our earlier implementation of the FSSM algorithm, which was originally developed for a snapshot, non-agent-based setting. The port involved extensions to support stream-based operations, and wrappers for network communication. The data mining agents were created to demonstrate the possibilities of agents cooperating in a shared-nothing architecture, which is made possible by the summary statistics.

The physical setup of our experiments is as follows. We first preprocessed every data set into a binary vector format to simulate the scenario of a data stream. The data sets were then placed on a 1 GHz notebook with 512 MB of RAM, where the summary structure agent performs its task. All data sets used are real-life data sets, and their characteristics are given in Table 1. We also created four instances of our data mining agents on a remote server with 4 CPUs and 3 GB of RAM. The two machines ran Windows XP Professional and Windows XP Server, respectively. Figures 4 and 5 contain the results of our experiments.

Our first experiment records the performance of our summary structure agent operating on the four data sets. We tested four scenarios: three stream-processing modes, each with a different window size, and a fourth that treats the data set as a snapshot. The time to create a summary of the entire data set is given in Figure 4(a). In all cases, it is interesting to note that the stream-based approach leads to better performance than processing the entire data set at once.
Algorithm 1 (BuildOptimalOSSM): Builds the optimal OSSM through FP-Tree construction
Input: D, a set of transactions
Output: O_nm, the optimal OSSM

    find the frequency of each item in D;
    initialize an empty FP-Tree;
    for all transactions T ∈ D do
        if T can be inserted completely along an existing path P_i in the FP-Tree then
            increment the counter in segment S_i for each item in T;
        else
            create the new path P_j in the FP-Tree, and the new segment S_j;
            initialize the counter in segment S_j for each item in T to 1 and all others to 0;
        end if
    end for
    return the optimal OSSM O_nm;
Table 1: Details of real-life data sets.

Data Set     # Records   # Items   File size (MB)
Accidents      340,183       468   45
POS            515,597     1,657   14.8
Kosarak        990,002    41,270   34
Retail          88,162    16,470   3.8

Recall that the size of these data sets is very large; therefore, processing the entire snapshot at once requires a huge amount of physical memory. As memory gets depleted, our algorithm falls back onto the disk and writes out some of the segments (the algorithm manages this itself because the virtual memory manager does a terrible job in this case). Hence, more time is needed to process the entire snapshot at once.

Once the summary structures are created, they are transferred over the network to the remote server, where the data mining agents are hosted. Notice that in the current setup we have omitted the plan and coordinate agent; instead, we simplify our setup by assigning jobs in a round-robin fashion. To see the potential of our summary structure, we plotted the transfer time from the notebook to the remote server in a LAN setting. For the stream-based case, the transfer time is the time required to move every summary structure of the data. This is then compared to the case where the entire data set is first compressed before being transferred to the remote server. Even in such a scenario, we see that the database is still more expensive to move from one host to another. While the difference may not appear serious, recall that the compressed data has to be decompressed before it can be used. If the data is instead stored on a network drive, the conventional approach will require continuous access and therefore increases the amount of network communication. In our case, the timing recorded in Figure 4(b) is the entire amount of network utilization (except for "Entire DB"). This is also why we are arguing for the use of summary structures with agents: the mobility allows a more robust setup against a changing execution environment.

Our last experiment demonstrates another potential benefit of using agents and summary structures. Strictly speaking, we are not really exploiting agent-based characteristics in this
particular experiment. Nevertheless, we believe the reader will be able to foresee the potential of using agents to solve a variety of data mining problems after this discussion. Figure 5 shows the results of running the same data mining problem using the summary structures, but with (i.e., Agent-OSSM) and without (i.e., OSSM-Discovery) the division of labor among agents. In all cases, we can see a near-linear speedup in the discovery. This near-linear speedup is possible because there is no further network access after the summary structures and problem scope are made available to the agents. More importantly, and where appropriate, our approach is more cost-effective, because a high degree of parallelism is achieved on general hardware rather than through approaches such as grid-based data mining.
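As a rough illustration of this division of labor (a toy split of our own, echoing the round-robin assignment used in our setup but not the Matrix scheduler itself), each agent can be handed the same OSSM plus a disjoint share of the candidate itemsets and then compute supports independently, with no further access to the raw data.

```python
from itertools import combinations

def support_bounds(ossm, itemsets):
    """Each agent evaluates its own share of candidate itemsets against the shared OSSM."""
    return {c: sum(min(seg.get(i, 0) for i in c) for seg in ossm) for c in itemsets}

def split_problem_space(items, n_agents, size=2):
    """Toy partition: candidate itemsets of a given size are dealt out round-robin to agents."""
    candidates = list(combinations(sorted(items), size))
    return [candidates[k::n_agents] for k in range(n_agents)]

# A tiny OSSM (per-segment 1-itemset counts) shipped once to every agent.
ossm = [{"a": 4, "b": 1}, {"b": 2, "c": 1}]
shares = split_problem_space({"a", "b", "c"}, n_agents=2)
for k, share in enumerate(shares):
    # Each "agent" works on its share with no further access to the raw data or the network.
    print(f"agent {k}:", support_bounds(ossm, share))
```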
Conclusions

The concept of data mining agents, or a system of data mining agents, is not new; many agent-based frameworks for data mining have been proposed previously [4; 8; 10]. However, we felt that marrying agents with the earlier data mining technologies does not fully realize what users expect the combination to deliver. Hence, our discussion here has been motivated by this persistent problem. As illustrated by our examples and the discussion of our recently initiated project, we believe we have a case for pursuing this further. In particular, we hope to present novel uses of agent-based properties (with data stream techniques) to deliver a new perspective on knowledge discovery. For example, we can assign a complex data mining task to one particular agent. Behind the scenes, this agent can seek "helpers" in other parts of the agent network to work on different parts of the problem. This can happen without the user's involvement. And because summary techniques are used, data privacy can be ensured. In any case, the opportunity has arrived with the introduction of novel data mining techniques, which we believe are finally a good fit with agent technologies. On their own, both technologies are not novel: the techniques for processing stream data have roots in sampling, approximation, and so on. But their combination presents a promising solution that may well realize how data mining technology should be, i.e., to become invisible [3] (as in the example above), as many data mining experts agree.
Figure 4: (a) The cost of creating summary structures from real-life data sets under different construction scenarios; (b) the cost of moving summary structures from one host to another, compared to moving the entire compressed database, which later requires decompression and further network access if conventional data mining techniques are used.
Figure 5: Another result showing the benefits of agents operating in a shared-nothing architecture. Notice that the speedup is almost linear for each data set tested. This is because the agents run independently of one another once they receive the summary structure and the scope of their problem space.
References

[1] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A Framework for Clustering Evolving Data Streams. In Proc. of Int. Conf. on Very Large Databases, Berlin, Germany, September 2003.
[2] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring Streams: A New Class of Data Management Applications. In Proc. of Int. Conf. on Very Large Databases, Hong Kong, China, August 2002.
[3] G. H. John. Behind-the-Scenes Data Mining: A Report on the KDD-98 Panel. ACM SIGKDD Explorations, 1(1), June 1999.
[4] H. Kargupta, I. Hamzaoglu, and B. Stafford. Scalable, Distributed Data Mining Using an Agent-Based Architecture. In Proc. of Int. Conf. on Knowledge Discovery and Data Mining, Newport Beach, CA, AAAI Press, 1997.
[5] C. K.-S. Leung, R. T. Ng, and H. Mannila. OSSM: A Segmentation Approach to Optimize Frequency Counting. In Proc. of IEEE Int. Conf. on Data Engineering, San Jose, USA, February 2002.
[6] G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In Proc. of Int. Conf. on Very Large Databases, Hong Kong, China, August 2002.
[7] K.-L. Ong, W.-K. Ng, and E.-P. Lim. FSSM: Fast Construction of the Optimized Segment Support Map. In Proc. of the 5th Int. Conf. on Data Warehousing and Knowledge Discovery, Prague, Czech Republic, September 2003.
[8] S. Stolfo, A. Prodromidis, S. Tselepis, W. Lee, W. Fan, and P. Chan. JAM: Java Agents for Meta-Learning over Distributed Databases. In Proc. of Int. Conf. on Knowledge Discovery and Data Mining, Newport Beach, CA, AAAI Press, 1997.
[9] X. Wu, P. S. Yu, G. Piatetsky-Shapiro, N. Cercone, T. Y. Lin, R. Kotagiri, and B. W. Wah. Data Mining: How Research Meets Practical Development? Knowledge and Information Systems, 5(2), April 2003.
[10] Z. Zhang, C. Zhang, and S. Zhang. An Agent-Based Hybrid Framework for Database Mining. Applied Artificial Intelligence, 17(5-6), May 2003.