Distributed Data-Intensive Computation and the Datacentric Grid

D.B. Skillicorn
School of Computing
Queen's University
[email protected]

Abstract

The analysis of large datasets has become an important tool in understanding complex systems in areas such as economics and business, science, and engineering. Such datasets are often collected in a geographically distributed way, and cannot, in practice, be gathered into a single repository. Applications that work with such datasets, for example data-mining algorithms, cannot control most aspects of the data’s partitioning or arrangement. As a consequence, both new architectures and new algorithms are needed. This paper presents the design of the DCGrid, a datacentric grid, and describes some parallel and distributed algorithms for data mining that are particularly suited to large-scale distributed datasets. Some of these algorithms permit genuine superlinear speedup, so they are effective even for modest levels of parallelism.

1 Motivation

The amount of data stored online doubles every nine months; in other words, it grows twice as fast as the speed of the fastest processor available to analyze it. There is useful information in much of this data – for example, most organizations could improve their interactions with their customers if they understood the customers’ needs better; much planning would improve with better analysis of satellite data on land cover and climate; many products and systems would work more efficiently if the physical and chemical processes underlying them were understood more clearly; and there are many other examples. Mainstream data mining aims to find ways to make this kind of analysis possible. The problem is harder for large datasets because of two properties that limit how they can be effectively processed, properties that arise both from the technology by which the data is collected and from human and social factors. These properties are:

• Datasets are distributed geographically, so pieces of the same logical dataset are physically located far away from each other; and

• Datasets are (in practice) immovable.

Because of these properties:

• Algorithms to compute with large datasets cannot assume or control the partitioned structure, the sizes, and the locations of the pieces of the dataset, and must take account of the latencies and bandwidths required to move data among the pieces.

These properties do not quite hold today – it is still possible (though expensive) to move large datasets routinely. But today’s large datasets are tomorrow’s medium-sized ones, and the rate of growth of data far exceeds the probable growth of affordable storage and interconnect. We now explain how these properties arise.

1.1 Why datasets are geographically distributed

There are two reasons why datasets tend to be stored in pieces that are separated by substantial distances. The first is that many kinds of data are naturally collected in a distributed way. For example, multinational corporations such as hotel chains or airlines collect information about their customers in different countries. Astrophysicists collect information about stars and galaxies via different telescopes (in the same night). Even organizations that do not operate in multiple countries collect data about their customers via different touchpoints, such as stores, web sites, and customer call centers. It is natural, or at least convenient, to store data gathered via multiple channels in separate repositories. This is all the more the case when the purpose of data collection is primarily transaction processing and monitoring, since the benefits of local storage translate into faster response times.

Second, at any given stage of technological development, there is an upper limit on the amount of storage it is reasonable to keep in one place. As storage volume grows, the cost per gigabyte drops – but an increasing amount of infrastructure is needed to maintain the whole system, and it becomes harder and harder to use off-the-shelf solutions. At some point, the costs of the infrastructure and the specialized solutions start to increase the cost per gigabyte. Furthermore, increasing the size of storage necessarily increases the average latency to fetch data. This is important in data-intensive applications since the entire dataset must typically be fetched at least once, so that the latency of the bottom of the storage hierarchy cannot be amortized by caching.

1.2 Why datasets are immovable

If datasets are collected in a distributed way, why can’t they be collected into a single location, and then processed using a standard sequential data-mining algorithm? This is possible in theory (perhaps) but unattractive in practice:

• There is not enough bandwidth between networks and storage devices. Although there is a large amount of long-haul bandwidth available (at least in the developed world), the problem is the last few meters: computer architectures are not able to sustain such bandwidth all the way to the storage devices. Data possesses a property analogous to inertia – it is relatively easy either to hold it in storage or to move it over long distances, but the transition between these two states is comparatively slow and requires more complex hardware.

• Over geographical distances, the latency of fetching data is dominated by time of flight, which cannot be improved by technological solutions. As datasets get larger, the time required to fetch the data overwhelms the time to compute with it in many situations.

• It is not cost-effective for systems to have enough temporary storage to handle large datasets that are fetched temporarily. Even today, most systems would be challenged to find space for a 100GB dataset at short notice, and the problem will only get worse.

There are also several reasons why data cannot be gathered into one place for processing even if it were technologically feasible:

• Data of certain kinds, particularly about individuals, is not allowed to cross some political boundaries. For example, the European Union forbids export of data about individuals to jurisdictions that have privacy regimes that the E.U. regards as weaker than its own.

• Ownership of data is a complex business. The basic principle of copyright prevents copying of data across organizational boundaries, although much copyright data exists only to be copied; for example, web pages exist only to be copied to browsers. Some organizations permit copying, but not recopying (e.g. mirroring). In other settings, copying is permitted, but with an understanding that the copy will only exist temporarily. The whole area of rights to use and storage of data is still in a state of flux, but it is certainly clear that unencumbered copying of data across organizational boundaries is not the default.

There is also a social dimension. Consumers may feel that it is appropriate to provide personal data to an organization they deal with, but may not want that data shared with other organizations. From the organizational perspective, two organizations may have common ownership and feel free to share data between themselves. In other words, consumers may perceive a boundary that data should not cross (between differently named organizations) whereas the organizations themselves see the boundary as an administrative convenience (and hence artificial).

There are some specialized settings in which data movement is limited by physical constraints. One is where the natural locus for the computation is a portable device with limited power (in all senses) such as a PDA. It would not be practical to move a large dataset to the portable device and compute a model for the data there – it is a much better solution to compute elsewhere and move only the resulting model to the PDA. Another setting where data movement is limited is where the dataset is collected at a location where the network bandwidth is limited, for example when a deep-space probe collects image data. Once again, it is much more sensible to send code to the probe to compute and return some properties of the images than to transmit the images in their raw form. Similar arguments apply to remote sensors.

If these assumptions are valid, then large data-intensive applications such as data mining must take the partitioning and placement of data into account. This requires both architectures for storing and computing with pieces of datasets, which might themselves still be quite large, and algorithms that are able to compute partial results locally in such a way that they can be combined into global results that (a) agree with those that would have been produced by a single sequential computation, and (b) do about the same amount of work overall as the sequential computation would have done. This latter requirement could be relaxed slightly, but for many datasets complexity is a limiting issue. A distributed implementation that was substantially more expensive than the sequential one might limit the sophistication of algorithms that could be applied. In the long run, these requirements may become less important – if the data is necessarily distributed, there is little point in considering the complexity of an unachievable sequential implementation.

In Section 2, we consider the implications of datasets that are distributed and immovable for architectures and middleware for data mining, and present some aspects of the design of the Datacentric Grid. In Section 3, we discuss the implications for algorithms, and present some new parallel and distributed versions of data-mining algorithms. Several of these algorithms achieve superlinear speedup. In some cases this has been used to design new sequential data-mining algorithms; in others the speedup is fundamental, so that modest parallelism leads to greatly increased effectiveness.
There is also much to be learned about the best ways to share local information in pursuit of global information.

2 Architectural implications

If the data cannot be moved to the software, it is clear that the software must be moved to the data. However, the assumption of processor-centricity is deeply wired into the design of computing systems, at all levels; and so this change of approach requires a wide-ranging change in the way we approach distributed computation [41]. We now explore how this changes the way in which architectures and middleware for data mining are designed.

2.1 Code is mobile

The idea of mobile code is not new: applets represent a form of mobility since the code comes from a server but is executed by a browser. There are a number of mobile agent systems, although most provide an abstraction of mobility rather than true mobility, since agents are able to communicate globally and so have no need to physically move. Datacentric grids require genuine mobility: user code must move to data repositories and compute with the data located there. This process is, in principle, very efficient. Data-mining code and the resulting models are typically quite small, certainly in comparison to the data on which they operate. Hence moving code and its resulting models requires orders of magnitude less bandwidth than moving the data.

It is useful to distinguish two strategies for executing a computation that uses data from multiple repositories. There are two main differences between them:

• They have slightly different resource requirements; and

• They impose different requirements on the process of combining local models into global ones.

2.1.1 Scatter-gather style

In the scatter-gather style, code is sent to each data repository containing part of the dataset being modelled. Each fragment of code computes a local model, representative of the local data but able to be combined into a global model. These local models are then gathered at a single site, where the global model is computed. The communication requirement is p transfers of code (where p is the number of pieces into which the dataset is divided), and the collection of p local models. The computation of local models is fully overlapped in time. The combining algorithm for building the global model from the local models has the structure of a reduction. Since the sizes of the partitions are probably outside the control of the data-mining system, the time that each local computation takes to complete is also uncontrolled, so it is useful if the implicit operator of the reduction is both associative and commutative (so that local models can be combined in any order as they arrive). Almost all existing parallel and distributed data-mining algorithms are based on this scatter-gather structure.
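As an illustration of the reduction structure, the following minimal sketch (not DCGrid code; the partitions and the counting ‘model’ are invented for the example) merges local models with a commutative, associative operator, so the order in which they arrive at the gathering site does not matter.

from collections import Counter

def build_local_model(partition):
    # Runs at a data repository: summarize the local piece as per-class counts.
    return Counter(label for _, label in partition)

def combine(global_model, local_model):
    # The reduction operator: associative and commutative, so local models can be
    # folded in whatever order they arrive back at the gathering site.
    return global_model + local_model

# Hypothetical partitions of a labelled dataset, one per repository.
partitions = [
    [(x, "spam") for x in range(40)] + [(x, "ham") for x in range(10)],
    [(x, "ham") for x in range(70)],
    [(x, "spam") for x in range(25)] + [(x, "ham") for x in range(5)],
]

local_models = [build_local_model(part) for part in partitions]   # overlapped in time in practice
global_model = Counter()
for model in reversed(local_models):      # arrival order does not matter
    global_model = combine(global_model, model)

print(global_model)                       # Counter({'ham': 85, 'spam': 65})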

2.1.2 Round robin style

In the round robin style, code is sent to one data repository, builds a local model based on the data it finds there, then moves on to the second repository, alters the model to reflect the data found there, and continues in this fashion to visit all of the data repositories containing part of the dataset being modelled. Finally, it returns to the user with a complete global model. This approach is attractive when it is known that some parts of the dataset contain more useful information than others, since visiting these parts first can result in pruning the model search space. For example, if the goal is to compute the most popular video rental in a particular city, the mobile code’s strategy is to maintain a list of potentially most popular videos. Getting the list from the largest video rental store first initializes this list in the best possible way – indeed, if it is large enough, it may be possible to truncate this list immediately. The round robin style is also more natural when the repositories to be visited depend on what is found in the data itself.

The communication requirement of this style is smaller than that of scatter-gather because only p + 1 transfers are required. However, the processing to build local models is not overlapped in time. The combining algorithm used by the round-robin style has the structure of a fold. This approach has advantages when local models are much larger than the resulting global model, which tends to happen as models become more sophisticated and it is correspondingly harder to know which information is safe to discard locally.
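A minimal sketch of the fold structure, using the video-rental example above (the repositories, rental counts, and the simple top-k truncation rule are all invented for the illustration):

from collections import Counter

# Hypothetical local rental logs at three repositories, largest first.
repositories = [
    ["VidA"] * 500 + ["VidB"] * 300 + ["VidC"] * 50,
    ["VidB"] * 120 + ["VidC"] * 80,
    ["VidA"] * 60 + ["VidD"] * 10,
]

def visit(model, local_data, keep=2):
    # One fold step: add the local counts to the travelling model, then prune to the
    # top `keep` candidates (a heuristic truncation, as in the video-rental example).
    model.update(Counter(local_data))
    return Counter(dict(model.most_common(keep)))

model = Counter()                       # the model carried by the mobile code
for repo in repositories:               # round robin: one repository after another
    model = visit(model, repo)

print(model.most_common(1))             # [('VidA', 560)]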

2.2 Data repositories must also be compute servers

If code is moved to data repositories, it follows that such repositories must be able to execute the code; in other words, they must combine the ability to store large volumes of data with the ability to compute with it. Practical datasets are becoming extremely large, terabytes being the upper end at present, so that even pieces of datasets can be large. Data-mining algorithms are compute-intensive, so it is natural to consider the use of parallelism, even for the part of a computation that executes using a local piece of a dataset. Hence it is reasonable to assume that the compute server associated with a data repository is itself parallel: a cluster or conceivably a custom parallel system.

The obvious way to build such a data repository + compute server is to use a cluster for the computation and network-attached storage (NAS) for the data. However, connecting these two parts using standard network technology is difficult. Since data-mining algorithms need to see all of the available data (see the discussion of the role of sampling in Section 3.2), the entire local piece of the dataset must be transmitted from NAS to cluster. It is hard to provide enough bandwidth, and the latency is performance-limiting. It is better if the nodes of the cluster are connected directly to a part of the available storage, so that the computation devices and the storage devices are tightly coupled to each other. Such systems exist, for example the Goldrush database engine [45], but they are architecturally unusual.

2.3 Security, privacy, and confidentiality must be respected

Having code from one site execute on another creates a number of important issues related to information flow. At present, these issues arise in two different contexts, and the solutions are quite different in the two cases. First, web browsing often uses applets, which are downloaded from one site and executed at another. In this setting, control of information flow is managed using the language features of Java and sandboxing technology. Second, compute servers exist in computational grids and execute code from multiple remote users. In this setting, control of information flow is managed using authentication – an implicit promise that misuse will be dealt with offline. The weaknesses of current systems for controlling information flow are amply illustrated by the problems of viruses, worms, spyware, trojan horses, and unwanted automatic updating of software.

Much of the discussion and research on information flow issues focuses on protection of systems from malicious users. However, this is a symmetric problem – users need to be equally protected from malicious hosts. The only serious example of which I am aware is the SETI@home program, which executes each task on a number of different systems and compares the results to prevent system owners from falsifying the results. This solution is clumsy and relatively expensive, but the problem is a real one.

It is useful to distinguish three forms of information flow abuse that must be protected against:

1. Security – each party does exactly what it has agreed to do.

2. Privacy – neither party shares information about the other gratuitously.

3. Confidentiality – neither party discovers what the other is doing.

It is also useful to distinguish who needs to be protected from whom. There are three boundaries across which information flow needs to be controlled in both directions:

1. The boundary between user and host.

2. The (implicit) boundary between users, because they share common system infrastructure.

3. The boundary of private grids.

The requirements for information flow protection and the possible techniques for controlling it depend to some extent on the overarching structure of the data repositories. For example, if all of the data repositories belong to a single organization, then the datacentric grid is a form of virtual private grid, and many information flow issues are less critical. Similarly, when the grid is entirely public, some issues may be less important, although it is important to note that, even when the data itself is public, the resulting models may need protecting; as may the code (which might contain proprietary algorithms) and perhaps even the pattern of public data accessed. The strongest requirement for protection of information flow is for computations that originate within a virtual private grid but use resources from public datasets as well, for here it may be possible to infer what is being computed (and how) within the private grid from the data accessed and code deployed in the public part.

We consider first the issue of security. Protecting hosts from users can be done by: authentication, enforcing the connection between actions and the user who instigated them so that consequences can follow misuse; sandboxing and playgrounding [22, 25, 27, 28], so that user programs execute in a resource-limited environment in which they can do as they please but cannot affect anything outside this environment; solutions such as proof-carrying code [31], where the aim is to allow the host to verify what the user is going to do; and, recently, fully-typed programs [18], in which occurrences of actions can be identified and, if necessary, replaced by host-based ones that can be trusted.

Protecting users from hosts is a much less studied problem, and no really satisfactory solutions exist. The situation is fundamentally asymmetric since user code executes (necessarily) in a virtual environment controlled by the host. Hence the host has unlimited opportunities to see what the user code is doing and modify the code’s actions to change its environment. Fortunately, the host can see only the surface actions of the code, so the possibility of concealment exists. Some of the proposed solutions are: obfuscation [36], in which the user code actions are deliberately made obscure so that it is hard to know what it is attempting to do; executing the code multiple times and comparing results; adding redundant instrumentation to the code so that tampering is more likely to show up [10]; and executing a transformed version of the real computation [7, 35]. Obfuscation is a very weak technique. Multiple execution is expensive. Redundant instrumentation, such as adding computation on dummy variables whose expected result is known, can detect some tampering but does nothing to prevent it. Executing a transformed program does not actually protect the user against random corruption of results by a host, although it does provide protection against deliberate corruption (since the host does not know what to do to produce an apparently-correct but wrong result).

Protecting users from other users is superficially easy since each user’s code can execute in a different virtual machine. However, this requires a different kind of trust between user and host. Furthermore, it is not clear that the use of different virtual machines is sufficient to separate users’ actions from each other, because of potentially-visible side-effects on the physical infrastructure that they share. For example, it is well known that the previous contents of a disk track can be read even if the track has been subsequently overwritten with different data.

It is hard to assess exactly how to provide privacy because the requirements are often ill-formed – most people have difficulty defining their privacy needs, but recognize when these needs are violated. A good starting point might be to require that neither hosts nor users gratuitously reveal anything about their common actions. Of course, if security is well-implemented then users do not have anything to reveal – but hosts necessarily know what users did. One important issue here is the existence of logs and the use to which they are put. Many privacy issues would be much less difficult if no logs were kept; but this is in direct conflict with hosts’ need to discover what happened when an anomaly occurs. Privacy may not be important in virtual private grids since organizations typically have a shared purpose and openness with information internally towards that purpose (there are important exceptions). Similarly, in a public grid there may not be an assumption of privacy. The world wide web is an interesting analogue – there is little privacy for web browsing, but it is not clear whether public acceptance of this is a considered one, or the result of ignorance about how much information about them is collected.

Confidentiality is a stronger form of privacy in which hosts or users do not reveal the actions (computations but also, importantly, the traffic patterns) of other users. Once again this is a difficult problem because hosts necessarily know what user code is doing. The only solution seems to be to use transformed code. At present, techniques based on cryptography that hide both the functionality of what is being computed and what data is being accessed are known, but their practicality is doubtful. Techniques described by Sander and Tschudin [35] allow the private computation of polynomials, which is a reasonable start since this allows conditionals and loops to be implemented. Confidentiality issues may not be important within virtual private grids, where both the information and algorithms are owned by the same organization. Within public grids it is tempting to think that confidentiality does not matter either. This may not be true – many computations that use public data nevertheless use proprietary algorithms to analyze it, and produce results that (presumably) have value. In essence, all data mining produces something analogous to intellectual property which ought to, in the first instance, be protected.

We do not have effective solutions to the problems of controlling information flows. These issues may limit the deployment of grids for data-intensive problems. On the other hand, information grazing, the speculative processing of data without direct user intervention, may provide a partial solution by concealing whether a given data-intensive computation is the result of a real-time need to learn something or a background search for potentially useful knowledge.

2.4 A new programming environment is required

The programs executed by computational grids are of many different kinds and are algorithmically varied. In contrast, most data-mining applications are relatively simple harnesses around standard building blocks for data manipulation, model building, and model validation.

The forms of mobile code exist along a spectrum. At one extreme, the mobile code consists only of selecting and ordering static library code that is stored at data repositories. This approach has proven quite effective in domains where most computations use the same building blocks. For example, the Netsolve system for large-scale linear algebra is of this kind, and the whole web services domain also assumes this underlying model of computation. In this case, many data-mining computations could be expressed using simple graphical structures such as those already used within many data-mining packages, or query languages in the style of SQL. At the other extreme, the mobile code consists of programs written in a fully-fledged programming language, although presumably somewhat restricted to make the security problems more tractable.

Describing a computation for a datacentric grid requires describing how to divide the computation across the sites that will be used (the distributed part) and how to divide the computation across the processors of each cluster local to a single repository. Work in high-performance computing is relevant here, although few systems have adequately handled such two-level concurrency.

2.5 New techniques for execution planning are required

In contrast to computational grids, many data-mining algorithms are executed repeatedly with small changes. There are two reasons for this:

1. The same algorithm is executed on data that changes with time, for example, stock market data, satellite images, daily sales reports, and so on.

2. The algorithm is being executed by a data analyst who is building a model of the data as a sequence of models, each refined from the previous one.

In both cases, the distributed computations executed are very similar, so that there is an obvious benefit to caching the execution plan used the first time and then tweaking it as necessary for subsequent executions.

A further complication is the resource discovery problem for datacentric grids. In computational grids, a resource such as a compute server can be described by a few parameters: number of processors, speed of processors, amount of memory, amount of available disk storage, and perhaps the availability of particular software packages. In contrast, a node of a datacentric grid is described by a similar set of parameters for the computational part, but also by some form of description of its data contents. The Web Services Description Language (WSDL) is a partial solution, but it was designed for an environment in which resources are relatively static. In contrast, the contents of a data repository may change rapidly as results of data-mining algorithm executions become objects within the repository.

2.6 Users can be mobile too

It is part of the nature of grid usage that the user is decoupled from the computation, since the computation is (presumably) large and time-consuming and the nature of the grid makes predicting how long a task will take difficult. However, there are factors that make decoupling from computations more typical in datacentric grids:

1. The description of a datacentric computation is less like a program and more like a query, and so may be created from a lightweight platform such as a PDA or other handheld device that does not necessarily remain network-connected;

2. The execution of a datacentric computation may often not be part of the thinking process of whoever creates it, that is, the creator may not be waiting for the result in real time.

Making datacentric grids accessible from lightweight and mobile access points requires that both the computation itself and the execution planning for it are executed on a more substantial device; and that results are stored near, but not at, the user’s access point in case the access point is not connected at the time the results are produced.

2.7 The design of the DCGrid

The DCGrid [40] is a four-level architecture that addresses the requirements of the previous sections. Starting from the bottom, the entities at these four levels are:

1. Data/Compute Servers (DCS), the nodes at which data-intensive computation takes place. Each DCS has three parts:

   (a) A compute engine, which may be internally parallel, and is able to execute mobile code using the local datasets as data;

   (b) A data repository, containing both ‘raw’ datasets, and (local) models that have been cached and are now regarded as datasets in their own right;

   (c) A metadata tree, which describes the datasets available at this node.

   There are potentially many DCSs making up a DCGrid.

2. Grid Support Nodes (GSN), the nodes that maintain information about the grid as a whole. GSNs hold two kinds of information:

   • A directory of Data/Compute Servers describing:

     (a) The static properties of each DCS, that is, properties such as number of processors, internal network bandwidth, and so on;

     (b) Dynamic properties of each DCS, such as projected load into the future from existing committed tasks (obviously this information is always slightly stale);

     (c) Digests of each DCS’s metadata tree.

   • An execution plan cache, containing recent execution plans parameterized by their properties as tasks and their achieved performance.

   GSNs are replicated in different regions of the grid so that there is no single point of failure. In addition, some GSNs may decide to maintain information regionally, rather than for the entire grid.

3. User Support Nodes (USN), which provide computational support for execution planning and storage for user results that are not being passed to users immediately. USNs are surrogates for users. USNs exist near the boundary of networks so that they are close to user devices.

4. User Access Points (UAP), which are the devices via which users create datacentric computations and view results. These may be ordinary network-connected computers, but they are not required to remain connected and so may equally well be mobile devices. USNs and UAPs may exist in the same piece of hardware if it is not a power-limited mobile device.

A typical process that occurs when a user generates a data-intensive computation is as follows:

1. The user creates a description of the computation at the UAP.

2. This description is passed to the appropriate (e.g. nearest) USN, which is responsible for determining what repositories and algorithms are needed. It then formulates a request for service that is sent to a GSN.

3. The GSN is regularly updated on the state of the DCGrid and therefore has an estimate of the global state, including both the locations of data and the computational resources available into the future for computation with the data. For example, the same dataset may be replicated in more than one place, so an estimate of which site can provide the earliest completion is needed. The GSN computes an execution plan for the computation, including partitioning the computation to match the partitioning of the data, and producing the code that will be sent to each DCS. This latter requires translating the user’s computation and inserting parallelism to match the abilities of the DCS. This execution plan is then cached for possible reuse if it is deemed typical enough that reuse is likely.

4. The code is sent to the appropriate set of DCSs where it is executed. Models that are produced at each DCS may be cached if they are interesting enough.

5. The results of the computation at each DCS are sent back to the originating GSN, which carries out any residual computation needed to construct a global model (this will become clearer in the subsequent section). The execution plan is annotated with information about its actual performance so that future estimates can be improved.

6. The global model resulting from the computation is sent back to the originating USN.

7. If the user is connected, the result is passed directly to the user. Otherwise, the result is held until the user connects to the network again.

Another datacentric grid design is the Knowledge Grid [8], which has similar goals but is built on top of conventional Globus grid technology.
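To make the request flow above concrete, the following toy sketch mimics it in a few dozen lines of Python. All class and method names are invented for the illustration (they are not DCGrid interfaces), the ‘execution plan’ is just a cached list of servers, and real concerns such as code translation, parallelization, metadata digests, and failure handling are omitted.

class DataComputeServer:
    def __init__(self, name, dataset):
        self.name, self.dataset = name, dataset

    def execute(self, code):
        # Step 4: mobile code runs next to the data it needs.
        return code(self.dataset)


class GridSupportNode:
    def __init__(self, servers):
        self.servers = servers
        self.plan_cache = {}          # recent execution plans, keyed by task description

    def run(self, task_name, local_code, combine):
        plan = self.plan_cache.setdefault(task_name, list(self.servers))   # step 3 (much simplified)
        local_models = [dcs.execute(local_code) for dcs in plan]           # step 4
        return combine(local_models)                                       # step 5: residual combining


class UserSupportNode:
    def __init__(self, gsn):
        self.gsn, self.mailbox = gsn, {}

    def submit(self, task_name, local_code, combine):
        # Steps 2 and 6: forward the request, then hold the result on the user's behalf.
        self.mailbox[task_name] = self.gsn.run(task_name, local_code, combine)

    def fetch(self, task_name):
        return self.mailbox.get(task_name)    # step 7: delivered when the user reconnects


servers = [DataComputeServer(f"dcs{i}", list(range(i, 100, 3))) for i in range(3)]
usn = UserSupportNode(GridSupportNode(servers))

# Step 1: the user describes a computation (here: a global mean) at the UAP.
usn.submit("global-mean",
           local_code=lambda d: (sum(d), len(d)),
           combine=lambda ms: sum(s for s, _ in ms) / sum(c for _, c in ms))
print(usn.fetch("global-mean"))               # 49.5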

3 Algorithmic implications

Data-mining applications are unusual in that they are both compute-bound and data-access-bound; in other words, such applications access large amounts of data, but also spend significant numbers of cycles per datum. There is a significant body of work on parallel data mining that grows out of conventional parallel algorithms, and assumes that the arrangement of the data is under program control. For example, parallel decision tree algorithms exploit the hierarchical nature of decision tree construction to divide datasets recursively and build unconnected decision nodes independently [19, 37]. In a similar way, parallel association rule discovery systems partition the tree of itemsets and the data in the same way so that only local accesses are needed [1, 16]. Such algorithms obviously work best in a shared-memory environment but, with some care, can sometimes be extended to work in a distributed-memory environment. However, ensuring scalability is always difficult.

The assumption that the arrangement of data can somehow be controlled by the algorithm is increasingly suspect for a number of reasons. For example, the ownership of datasets and the right to compute with them, once synonymous, are tending to move apart. Here, we present data-mining algorithms that do not make this assumption about data – rather, these algorithms use data as they find it, making no assumptions about the possibility of rearranging data to suit the algorithm, or of the data being divided into partitions of equal size. Even without the assumption of control over data partitioning and location, the algorithms presented here are scalable and, in some cases, are able to achieve superlinear speedup.

Distributed and immovable datasets obviously require that the algorithms that build models from them should be reasonably efficient, in particular that they should be aware of and exploit the available concurrency. The fundamental requirement for such algorithms is that they be able to use the information in each piece of the dataset to build local models independently, and that these local models can be combined into a single, global model without contradictions or redundancy. No general way of doing this is known (other than metalearning), but specific techniques have been discovered for many mainstream data-mining algorithms. Some of these parallel and distributed algorithm variants turn out to be of considerable interest in their own right. The structure of distributed and parallel data-mining algorithms depends on how the dataset itself is divided into pieces, and it is this that we consider next.

3.1 How is the data partitioned?

A typical dataset used for data mining consists of a large number of rows, each describing one object; and a more moderate number of columns, each corresponding to an attribute that objects may possess. Other representations are possible but we will not consider them here. There are two ways in which such a dataset can be divided:

1. By rows, that is, each part of the dataset contains all of the information about some of the objects;

2. By columns, that is, each part of the dataset contains all of the information about some of the attributes.

A partition by rows arises naturally when the data about different objects is collected via different channels. A partition by columns arises naturally when the same objects have different properties in different contexts; for example, a hotel customer may acquire different attributes from hotel stays in different countries. It is conceivable that some datasets might be partitioned by both rows and columns, but this does not seem typical.

Because the allocation of data to partitions depends on how that data is collected, the sizes of the different partitions may not be similar, and this needs to be taken into account in the design of algorithms. Under the assumption that data is placed depending on how it was collected, it is also important to remember that each piece is a biased sample. This means that the local model is likely to inherit this bias, further complicating the problem of combining models. It is also conceivable that the partition of the dataset into pieces does not arise because of how it was collected but for some other reason. In this case, it is possible that the data in each piece represents not a partition of the data, but a more carefully constructed piece such as a sample with replacement. Such properties, if they could be guaranteed, would again have implications for the design of algorithms.

3.2 Why not use sampling?

An obvious question is this: why not use one piece of the dataset, if it is an unbiased sample, or a sample collected by sampling each of the pieces and collecting the results in one place? Such a dataset would be much smaller than the entire dataset, perhaps small enough to be kept in one place and analyzed using a sequential algorithm.

The reason that this is not effective has to do with the structure of typical datasets used for data mining. Such datasets seem to contain a great deal of redundancy. A model-building algorithm explores a space of possible models which can be very large. Each example from the dataset essentially forces the model builder to consider new parts of this space. Because of the redundancy in the data, the quality of the resulting model tends to improve quickly as the early examples from the dataset are processed. However, after a while, the examples being processed are similar to those seen already and do not add much to the model quality. Hence a graph of model quality as a function of the amount of data seen has the shape shown in Figure 1.

Figure 1: Quality versus examples processed curve for a typical data-mining algorithm

At first glance, this seems to support a methodology based on sampling. However, if a larger initial dataset is used, the effect on the curve of model quality versus number of examples processed is shown in Figure 2. Having more examples in the initial data produces a curve with the same characteristic shape, but a higher asymptote, that is, a better quality model.

Figure 2: Shape of the curve for a larger dataset size

To see why this should be so, consider the following: suppose that the dataset is partitioned into classes such that the examples in each class force the model builder to consider a new region of the space of possible models. In typical data-mining datasets, the number of classes is far smaller than the number of examples; in other words, such datasets are quite repetitive and many objects are quite similar to other objects. Now consider a sample from this partitioned view of the dataset. As the model builder begins to look at examples, it is likely that the first few examples will be from different classes, so that the resulting model will quickly improve as new regions of the space of models are visited. After a while, each new example is likely to come from a class that has already been used, so the improvement in model quality slows. However, if a larger sample is used, there are likely to be representatives from more classes, so the final model learned from the larger sample will be absolutely better than that learned from the smaller sample.

This view is an oversimplification for two reasons: first, it is not really possible to partition the data into classes; rather, the examples form loose clusters of similar novelty. Second, most data-mining algorithms are weak learners, so the novelty of new examples and partitions depends, to some extent, on the choices that the model builder has already made. Nevertheless, it provides some insight into this unusual structure of the quality versus examples seen curve. The net result is that sampling is useful as a route to approximate models, but has serious limitations if the goal is to build the best possible model from the given data. (See [46] for graphs based on real data.)

3.3 Parallel data-mining algorithms

Parallel data-mining algorithms apply to datasets partitioned by rows; they build local models that have access to the full set of attributes, but not to all of the objects. This is the easier case.

3.3.1 Basic strategy

It is tempting to parallelize data-mining algorithms directly, in the style in which many other algorithms are parallelized. Some data-mining algorithms can be parallelized without large amounts of communication because different parts of the model can be constructed independently using different parts of the dataset. For example, once a split on a particular attribute has been chosen for a node of a decision tree, the two subtrees can be independently constructed using partitions of the dataset. However, such parallelization requires the algorithm to control where parts of the computation take place and how the data is arranged, and so applies only within a single parallel system.

A model, or any component of a model, is a compact representation of the examples from which it has been built. When communication is required, it is always better to move (partial) models than to move the data from which they are built. The basic parallelization strategy for data-mining algorithms is therefore [38]:

• Build a local model on each partition;

• Collect these local models in one place;

• Combine the local models to build a global model.

If the algorithm used to build each local model is the sequential algorithm, then the speedup in this local computation phase, relative to the sequential algorithm, is p, the number of processors. (This assumes that the size of each partition is the same; if not, the compute time is dominated by the time to model the largest partition.) The communication cost is low, since all that is required is to gather p local models and these are small compared to the size of the data. The remaining overhead is the computation time required to build a global model from the local models. This is algorithm-dependent, but tends to be small because the local models themselves are small. If the complexity of the underlying sequential data-mining algorithm is c(n), where n is the size of the dataset, then the cost of the parallel algorithm becomes:

Parallel computation = c(n/p)
Communication = O(p)
Sequential combination = O(p)

Figure 3: Shape of the model quality versus examples processed curve for a parallel implementation
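The three-phase pattern can be seen in a minimal sketch with a deliberately simple ‘model’ (per-column sums and counts, i.e. sufficient statistics for a global mean). The partitions, pool size, and model are invented for the example; in a datacentric grid the first phase would run at the data/compute servers rather than in a local process pool.

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def build_local_model(partition):
    # Phase 1 (runs at a repository): a tiny 'model' consisting of sufficient
    # statistics -- per-column sums and a row count.
    return partition.sum(axis=0), len(partition)

def combine(local_models):
    # Phase 3: merge the small local models into the global model (column means).
    total = sum(sums for sums, _ in local_models)
    count = sum(cnt for _, cnt in local_models)
    return total / count

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((10_000, 5))
    partitions = np.array_split(data, 4)        # stand-in for 4 immovable pieces

    with ProcessPoolExecutor(max_workers=4) as pool:      # phase 1, overlapped in time
        local_models = list(pool.map(build_local_model, partitions))

    # Phase 2 is the gather; the combined result agrees with the sequential computation.
    print(np.allclose(combine(local_models), data.mean(axis=0)))    # True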

Suppose that we modify this parallel execution scheme slightly so that, instead of collecting the partial models in one place, we perform a total exchange and combine the local models into a global model at each processor. The model quality versus number of examples processed curve now looks as shown in Figure 3. From the point of view of each processor, two things have changed: first each processor spends its time more productively (in a steeper region of the curve); second, the combining step produces an improvement in quality without the processor having to process more examples (and essentially for free except for the small amount of work to combine). These improvements introduce a small amount of slackness that can be used to compute the same quality of model as the sequential algorithm with fewer total cycles (a superlinear speedup) or to produce a better model in the same number of cycles. This superlinear speedup is not an artifact of the memory hierarchy, and so indicates the existence of an improved sequential algorithm that uses a ‘bitewise’ strategy: starting with an empty model, build a model from some number of examples, combine it with the existing model, and repeat until all of the data has been examined. This has led to new sequential versions of several standard data-mining algorithms. In some cases, further speedup can be achieved if partial models are exchanged regularly. These speedups occur because the information learned by one processor has the effect of pruning the search space that others must explore. We illustrate in the following sections for particular algorithms. This kind of speedup does not translate into improved sequential algorithms.
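The ‘bitewise’ idea can be seen in miniature with a model that is just sufficient statistics for a mean and variance; the chunk size and the merging rule (the standard pairwise update for count, mean, and sum of squared deviations) are choices made for this sketch rather than part of any of the algorithms discussed here.

def build(chunk):
    # Model a 'bite' of the data as (count, mean, M2) sufficient statistics.
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2

def merge(a, b):
    # Combine two partial models exactly (standard pairwise mean/variance update).
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    delta = mb - ma
    return n, ma + delta * nb / n, m2a + m2b + delta * delta * na * nb / n

data = [float(i % 97) for i in range(10_000)]
model = (0, 0.0, 0.0)                              # start with an empty model
for start in range(0, len(data), 500):             # one 'bite' at a time
    model = merge(model, build(data[start:start + 500]))

n, mean, m2 = model
print(mean, m2 / n)                                # mean and variance of the full dataset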


3.3.2 Neural networks

There are many different forms of neural network. For concreteness we will assume a supervised neural network using backpropagation, but the ideas in this section apply to almost all other forms of neural network, including unsupervised forms such as SOMs. Neural networks are usually trained stochastically, feeding each example to the network, comparing its outputs with the desired result, and then propagating the difference back through the network, adjusting the weights as appropriate.

Neural networks make it clear why direct parallelization strategies are unattractive. A value must be propagated across every connection, both forwards and backwards, for every example. Hence any partition of the network among processors requires communication volumes proportional to n (the size of the dataset and the number of examples). The direct parallelization strategy is to create a copy of the neural network in each processor, apply the sequential learning algorithm locally, and then collect and average the weights centrally. Notice that, in this case, the size of the neural network (the model) is related to the number of attributes, m. Hence the cost of the communication and combination phases, while independent of n, is a function of m.

In batch learning, the error terms that result from the difference between the produced and desired outputs of the network are accumulated and applied once to update the network weights. This property is the key to improving the parallel algorithm. In the improved version, each of the p pieces of the dataset is further divided into q pieces. Each processor then builds a model (in this case a set of neural network weights) from data of size n/pq and exchanges the resulting error vectors with all of the other processors. Each processor then applies the p error vectors to its own weights, and repeats the whole process with the next piece of data of size n/pq. This produces much more rapid convergence than even the ordinary parallel algorithm.

Choosing the size of the pieces to be processed before an exchange is important. The goal is to keep the processor working in the steepest part of the quality versus examples processed curve. If the pieces are too small, then the resulting error vectors are dominated by the idiosyncrasies of the data in these pieces; if the pieces are too large, then computation is wasted improving error vectors that are already acceptably accurate.

This parallel algorithm is highly efficient and can achieve substantial speedups even on modest numbers of processors [33, 34]. The key to its performance is that the messages exchanged, consisting of error vectors, are a highly compressed form of information about the part of the dataset held by other processors. Hence each processor is able to learn from all of the data, not just the part that it holds. Notice also that the frequent exchange of partial models acts to reduce the effects of bias in any one of the pieces, because whatever unusual examples it holds influence the other processors almost immediately.
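A toy sketch of the exchange pattern follows, with a single linear layer and squared-error loss standing in for a full backpropagation network; the sizes, learning rate, and synthetic data are invented for the example. The point is that every replica applies all p error vectors, so the replicas stay identical while each effectively learns from the whole dataset.

import numpy as np

rng = np.random.default_rng(1)
n, m, p, q = 4000, 8, 4, 10                  # examples, attributes, processors, exchanges
X = rng.standard_normal((n, m))
y = X @ rng.standard_normal(m)               # synthetic regression targets
parts = np.array_split(np.arange(n), p)      # row partition: one piece per processor

w = [np.zeros(m) for _ in range(p)]          # each processor holds a replica of the weights
lr = 0.01

for step in range(q):
    # Each processor computes an 'error vector' (here a batch gradient) on its
    # next piece of size roughly n/(p*q).
    grads = []
    for i in range(p):
        idx = np.array_split(parts[i], q)[step]
        residual = X[idx] @ w[i] - y[idx]
        grads.append(X[idx].T @ residual / len(idx))
    # Total exchange: every processor receives all p error vectors and applies them.
    for i in range(p):
        for g in grads:
            w[i] -= lr * g

print("replicas identical:", all(np.allclose(w[0], wi) for wi in w))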

3.3.3 Inductive logic programming

Inductive logic programming (ILP) [30] is an example of a data-mining technique in which the model covers the examples. The dataset is a set of positive and negative examples, in some simple clausal form, and the goal is to construct a description that models all of the positive examples and none of the negative ones. There are a number of sequential algorithms (e.g. [29]) but the following abstraction gives the essence of the technique.

• Select a positive example at random;

• Generalize it (techniques vary) to a description in the proper form, maintaining the property that it does not model any (sometimes ‘only a few’) negative examples (note that this requires a pass through the dataset to check);

• Remove the positive examples that the description models (note that this requires another pass through the dataset), and repeat.

The final model is some kind of union, perhaps the conjunction, of all of the descriptions learned during each round.

The parallel algorithm replicates this sequential algorithm at each processor [42]. Each processor selects and generalizes its own examples. When each processor has accounted for all of its examples, the resulting descriptions are gathered and combined to create a global description. Since each processor is working with a set of examples of size n/p and the computation time is dominated by the need for two sequential passes through the local data, this produces almost perfect linear speedup. A small amount of overhead arises from the communication and computation required for the combining. There are two complicating issues:

• Two processors might produce the same, or almost the same, description. This can happen, but in practice is highly unlikely until the last few rounds, when the number of remaining examples is small.

• The description produced in one processor, which is guaranteed not to model any of the negative examples in that processor, might model a negative example in another processor. There is no complete solution to this problem, which, in the worst case, might require a description to be discarded. However, when the number of negative examples is small, as it often is, they can be replicated across the processors. If the number of negative examples is large, they tend, in practice, to be repetitious, so that the chance of one processor choosing a description that is untenable elsewhere is much reduced.

This parallel algorithm can also be improved to give superlinear speedup by exchanging local descriptions frequently. When a processor receives descriptions from other processors, it can use them to remove not only the positive examples that its own description covers but also those that the descriptions of other processors cover. Hence the pool of positive examples shrinks p times faster than it would otherwise, speeding up the passes through the local data in subsequent rounds. This same algorithm structure applies to all data-mining models that account for examples, for example kDNF [44].
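A toy sketch of the covering loop at a single processor, with examples represented as sets of conditions and a deliberately naive drop-one-condition generalizer (the representation and the generalization rule are inventions for the illustration; in the parallel version, descriptions received from other processors would also be used to shrink the pool of positives):

def models(description, example):
    # A description (set of conditions) models an example if all its conditions hold.
    return description <= example

def generalize(seed, negatives):
    # Start from a maximally specific description (the example itself) and greedily
    # drop conditions while no local negative example becomes covered.
    desc = set(seed)
    for cond in list(desc):
        trial = desc - {cond}
        if not any(models(trial, neg) for neg in negatives):   # pass over the negatives
            desc = trial
    return frozenset(desc)

def local_rounds(positives, negatives):
    # Sequential covering at one processor; the parallel version would also remove
    # positives covered by descriptions exchanged from other processors.
    positives, learned = set(positives), set()
    while positives:
        desc = generalize(next(iter(positives)), negatives)    # 'select a positive example'
        learned.add(desc)
        positives = {p for p in positives if not models(desc, p)}   # second pass
    return learned

pos = [frozenset({"wings", "feathers", "flies"}), frozenset({"wings", "feathers"})]
neg = [frozenset({"wings", "scales"})]
print(local_rounds(pos, neg))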

3.3.4 Frequent sets

Computing frequent sets is the first step in discovering association rules. The difficulty with parallelizing this algorithm is that objects can be frequent overall without being frequent in many individual partitions of the data; and conversely, objects can fail to be frequent even though they are frequent in many partitions. Hence it is hard to avoid keeping the entire ranked list of objects and their frequencies for each partition, which is unworkably large. A probabilistic algorithm is known for this problem. It relies on the fact that frequent objects are actually unlikely to seem infrequent, and vice versa [39].

Two probabilistic sequential algorithms [13] are the building blocks for the parallel algorithm. The first maintains a concise sample. A list of a certain size holds the frequent objects. As each object in the dataset is examined, it is added to the list with a certain probability. Each occurrence of each object is treated separately, but they are recorded on the list in the obvious compact way using pairs of object identifiers and multiplicities. When the list runs out of room, the projected frequencies of the objects currently in the list are computed; if these all exceed (a constant multiple of) the desired support then the list is lengthened by a small amount. If not, then the entry probability is decreased slightly, and the list is reprocessed to discard those entries that would not have been there given the new entry probability. The second, called a counting sample, maintains a similar list but with three differences: (a) objects that are already present in the list always have their counts updated (whereas before this would only happen with a certain probability); (b) as a result, the procedure for reprocessing the list is different; and (c) when the end of the dataset is reached, the counts of objects in the list are adjusted to account for some occurrences that may have been missed before each object’s first successful insertion. A counting sample is a slightly more precise estimate of frequency and so produces a better set of sufficiently frequent objects.

The concise sample algorithm is straightforward to parallelize. Each processor computes a concise sample; these concise samples and their associated entry probabilities are collected; and each sample is reprocessed with the largest entry probability so that each reflects the same frequency boundary. The p lists can then be merged and any objects whose frequency is below the required support discarded. Parallelizing the computation of a counting sample is a bit harder because an object with low frequency in one partition may be discarded and its count lost. The easiest way to get a counting sample in parallel is to maintain both frequencies in the lists at each processor and use the concise sample values for program logic decisions, while computing the counting sample values at the end. It can be shown that both these parallel algorithms produce the same list as a sequential algorithm. The parallel performance is good (assuming that the desired support is reasonable) since the size of the lists is modest and little work is required to merge them at the end. In the spirit of the previous algorithms, parallel frequent set counting is probably improved by exchanging the current entry probabilities among processors at intervals. This would have the effect of maintaining the boundary between frequent and infrequent at about the same place in all processors. However, this has not been studied experimentally.

Frequent set counting is usually a component of itemset analysis. The conventional sequential algorithm relies on the levelwise property that all of an itemset’s subsets must be frequent if the itemset itself is to be frequent. Hence many large itemsets are never constructed or their frequencies considered. The probabilistic algorithm clearly cannot afford to consider all of these itemsets explicitly. The solution is to estimate when an itemset is too large to be frequent based on the presence of its subsets in the current frequent object list. With some care, the probability of failing to consider a large itemset that is actually frequent can be made comparable to the similar failure to insert any object in the frequent objects list.
The levelwise property is exploited indirectly because the failure of an object to be included in the list of potentially frequent objects reduces the inherent probability of its supersets being potentially frequent. The net effect is that the algorithm only considers a few more itemsets than a levelwise algorithm would.
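A rough sketch of the concise-sample idea and of the parallel merge follows. This is not the exact algorithm described above: the list-lengthening rule based on projected support is omitted, a fixed capacity and an arbitrary 0.9 reduction factor are used instead, and the merge simply re-tosses every sample down to the most restrictive entry probability before adding multiplicities.

import random

random.seed(0)

def concise_sample(stream, capacity=8, tau=1.0):
    # Maintain (object, multiplicity) pairs; each occurrence enters with probability tau.
    counts = {}
    for obj in stream:
        if random.random() < tau:
            counts[obj] = counts.get(obj, 0) + 1
        while len(counts) > capacity:
            # Overflow: lower the entry probability and re-toss recorded occurrences
            # so the list looks as if the new probability had been used all along.
            new_tau = 0.9 * tau
            for o in list(counts):
                kept = sum(random.random() < new_tau / tau for _ in range(counts[o]))
                if kept:
                    counts[o] = kept
                else:
                    del counts[o]
            tau = new_tau
    return counts, tau

def merge(samples):
    # Parallel combination: re-toss every sample down to the most restrictive entry
    # probability, then add the multiplicities together.
    tau_min = min(t for _, t in samples)
    merged = {}
    for counts, t in samples:
        for o, c in counts.items():
            kept = sum(random.random() < tau_min / t for _ in range(c))
            if kept:
                merged[o] = merged.get(o, 0) + kept
    return merged, tau_min

population, weights = list(range(30)), [1.0 / (i + 1) for i in range(30)]   # skewed frequencies
streams = [random.choices(population, weights, k=5000) for _ in range(3)]
print(merge([concise_sample(s) for s in streams]))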

3.3.5 Singular value decomposition

Singular value decomposition (SVD) [14] allows a dataset, considered as an n × m matrix A, to be expressed relative to a different basis. Formally, the SVD of A is

A = U Σ V′

where U is n × m, Σ is m × m, and V is m × m. U and V are orthonormal, and Σ is a diagonal matrix whose elements are a set of decreasing non-negative singular values, σ1 , σ2 , . . ., σr (where r is the rank of A, r ≤ m). The rows of U represent coordinates of the corresponding rows of A in a space whose axes are spanned by the columns of V . Symmetrically, the rows of V represent the coordinates of the corresponding columns of A in a space whose axes are given by the columns of U . The magnitudes of the corresponding singular values describe the amount of variation captured in each dimension. One of the powerful properties of SVD is that the decomposition can be truncated at k so that Uk is n×k, S is k×k, and V is k×m. The natural representation of the rows of U as points or vectors in a k-dimensional space is the most faithful representation of the objects of A in k dimensions. The proximity structure in this space captures similarity and correlation in the original dataset; this can be quite revealing to direct analysis, but also makes SVD an ideal preprocessing step for clustering. SVD has been used extensively for information retrieval [2, 11]. SVD has sequential complexity O(nm2 + m3 ) so it is expensive for typical datasets which have large n. An approximate parallel algorithm is known [43] (as well as several exact parallel algorithms designed for custom architectures with point-to-point networks [6, 26, 47, 48]). Suppose that the matrix A, representing the dataset, has been partitioned across processors. Call each of these submatrices, whose size is n/p × m, A1 , A2 , . . . , Ap . The algorithm implicitly assumes that each of these partitions is a not too biased sample, and this assumption needs to be considered for each dataset. The idea of this algorithm is to compute the SVD for a subset of the rows, and use this basis as an approximation to the true basis of the transformed space [43]. The algorithm is: 1. Select s × n/p2 elements from each processor’s partition and send them to processor 1. 2. Call the matrix formed by concatenating these rows, As . Compute the SVD of As = Us Σs Vs0 (only Σs and Vs are required). √ 3. Compute Σ = p Σs . Compute Σ−1 and the product S = Vs Σ−1 . 4. Broadcast S to all of the processors. 5. Compute Ui = Ai S The local result in each processor is Ai ≈ Ui ΣVs0 and the entire decomposition is     

\[
\begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_p \end{pmatrix}
\approx
\begin{pmatrix} U_1 \\ U_2 \\ \vdots \\ U_p \end{pmatrix}
\Sigma \, V_s'
\]

The computation cost of this algorithm is O(s(n/p)m² + m³), giving the expected speedup of a factor of p provided that m ≪ n. The communication cost is O(s(n/p)m + m²). The choice of s is, of course, influenced by how biased the partitions of the dataset are expected to be – the more bias, the larger the value of s that should be chosen.
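A minimal NumPy sketch of this scheme follows, simulating the p processors as a list of local row blocks. The sample-size arithmetic, the optional truncation parameter k, and the function name are assumptions made for illustration; a real deployment would perform steps 1, 4, and 5 at the processors rather than looping over blocks in one process, and the sample is assumed to have full rank.

```python
import numpy as np

def approximate_distributed_svd(partitions, s, k=None, seed=0):
    """Approximate SVD of a row-partitioned matrix following steps 1-5 above.
    partitions is a list of p blocks A_i, each roughly (n/p) x m; s controls
    the sample size (about s*n/p^2 rows are taken from each block).
    Returns (U_blocks, sigma, Vt) with A_i ~= U_blocks[i] @ np.diag(sigma) @ Vt."""
    rng = np.random.default_rng(seed)
    p = len(partitions)

    # Step 1: each block contributes a small random sample of its rows.
    samples = []
    for A_i in partitions:
        rows = A_i.shape[0]
        take = min(rows, max(1, round(s * rows / p)))
        samples.append(A_i[rng.choice(rows, size=take, replace=False)])

    # Step 2: SVD of the gathered sample (only the singular values and V are used).
    A_s = np.vstack(samples)
    _, sigma_s, Vt_s = np.linalg.svd(A_s, full_matrices=False)

    # Step 3: rescale the singular values and form S = V_s Sigma^{-1}.
    sigma = np.sqrt(p) * sigma_s
    if k is not None:                    # optional truncation at rank k
        sigma, Vt_s = sigma[:k], Vt_s[:k]
    S = Vt_s.T / sigma                   # divides column j by sigma[j]

    # Steps 4-5: "broadcast" S and compute the local coordinates U_i = A_i S.
    U_blocks = [A_i @ S for A_i in partitions]
    return U_blocks, sigma, Vt_s
```

Only the sampled rows travel to processor 1 and only S travels back, which is what keeps the communication cost at O(s(n/p)m + m²).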


3.3.6 Bagging

Bagging is an ensemble technique that builds multiple predictors by selecting, with replacement, subsets of the training data and using these to train each predictor [3]. The resulting bag of predictors is deployed by taking the plurality of the votes of the component predictors when the targets are categorical, or the average of their outputs when the targets are numerical. The idea of bagging is easily extended to a parallel setting, with the model built by each processor considered as one member of the ensemble. However, the part of a dataset at a processor is not, without careful preprocessing, a sample with replacement, so there is no guarantee that bagging's most attractive property, variance reduction, will occur. Nevertheless, small-scale studies using parallel versions of bagging suggest that it can behave quite well in practice [46].
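A minimal sketch of bagging read in this parallel way: one model per partition, combined by plurality vote at deployment (averaging would replace the vote for numerical targets). The scikit-learn-style fit/predict interface is an assumption for illustration, not part of the algorithm.

```python
import numpy as np

def train_local_models(partitions, make_model):
    """One model per data partition; make_model returns a fresh learner that is
    assumed to expose fit(X, y) and predict(X)."""
    models = []
    for X_i, y_i in partitions:
        m = make_model()
        m.fit(X_i, y_i)
        models.append(m)
    return models

def bagged_predict(models, X):
    """Deploy the bag by plurality vote over the local models (categorical targets)."""
    votes = np.stack([m.predict(X) for m in models])   # shape (p, n_objects)
    predictions = []
    for column in votes.T:                             # one column per object
        values, counts = np.unique(column, return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)
```

Deploying the bag this way moves only the p local models, never the data.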

3.3.7 Boosting

Boosting is an ensemble technique that improves on bagging by forcing some of the predictors to learn from the more difficult objects, for example those that are close to decision boundaries [4, 12]. Each object in the dataset is given a probability, initially 1/n, so the first learning sample is drawn from the dataset with uniform probability. After the first model is built, the objects in the dataset are reweighted according to whether the model predicts them correctly. Subsequent samples are drawn according to these weighted probabilities, so that they contain disproportionately many difficult examples. After each new model is constructed, the dataset is reweighted using all of the models built so far.

This algorithm can be parallelized in a style that, by now, should be familiar [23, 24, 46]. Each processor executes the sequential algorithm on its own piece of the dataset. However, once every processor has built its model, the models are exchanged with all of the other processors. The reweighting step is then carried out based on how hard each object is to classify according to all the models from all of the processors. Hence the opinion about how difficult an object is to classify (captured by the probability of subsequent selection) depends on information from all of the existing models. Unlike the previous parallelizations of sequential algorithms, the information exchanged in parallelized boosting does not reduce the work to be done in subsequent rounds, but it should improve the quality of the models built in those rounds and hence provide an overall improvement in convergence. This expectation is borne out in practise – the speedup of parallel boosting is slightly sublinear because of the overheads of communication and more complex voting, but the accuracy of the boosted predictor is improved. Parallel boosting is often faster in practise than parallel bagging because the sample sets used for learning in the later rounds of boosting become highly repetitious, and many model builders are faster for such data.
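The following sketch shows one round of this distributed scheme: each site trains on a weighted sample of its own partition, the models are exchanged, and every site reweights its objects according to how the combined ensemble treats them. The exponential update and the majority vote are illustrative stand-ins for the exact rules of the cited algorithms [23, 24, 46]; integer class labels and the fit/predict interface are assumed.

```python
import numpy as np

def distributed_boosting_round(local_data, local_weights, make_model, seed=0):
    """One schematic round: local_data is a list of (X, y) partitions and
    local_weights the matching per-object selection probabilities. Returns the
    round's models and the reweighted (renormalized) probabilities at every site."""
    rng = np.random.default_rng(seed)

    # Each site draws a weighted sample from its own partition and trains on it.
    models = []
    for (X, y), w in zip(local_data, local_weights):
        idx = rng.choice(len(y), size=len(y), replace=True, p=w / w.sum())
        m = make_model()
        m.fit(X[idx], y[idx])
        models.append(m)

    # All models are exchanged; every site reweights using the whole ensemble.
    new_weights = []
    for (X, y), w in zip(local_data, local_weights):
        votes = np.stack([m.predict(X) for m in models]).astype(int)
        ensemble = np.array([np.bincount(col).argmax() for col in votes.T])
        hard = ensemble != y                 # objects the ensemble still misses
        w = w * np.exp(np.where(hard, 1.0, -1.0))
        new_weights.append(w / w.sum())
    return models, new_weights
```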

3.3.8 Metalearning

Metalearning is similar to the parallel algorithms described above, except that, instead of combining or merging local models into a global model, fresh learning takes place using the local models as data. There are several advantages to metalearning:

1. The local models can be of different kinds, so metalearning systems can be applied to more heterogeneous forms of data;

2. Metalearning is more general in the sense that no special mechanism for combining local models needs to be discovered.

On the other hand, metalearning has some disadvantages as well:

1. The learning required to produce the global model is expensive compared to the model combination described above;

2. There are no opportunities for superlinear speedup arising from shared information during the building of local models;

3. The learning used to build the global model is based on biased data (the local models) whose biases are hard to compare.

Chan and Stolfo [9] designed a metalearning system in which a model called a combiner was trained either on the outputs of the local predictors and the correct prediction, or on the attributes themselves, the outputs of the local predictors, and the correct prediction. (Of course, the latter scheme is unattractive as a parallel algorithm because it requires moving each object in the dataset.) They showed that either of these schemes outperformed voting and other simple statistical aggregations (although we might now expect that weighting predictions by what Breiman calls out-of-bag estimators would produce better results).

Guo et al. designed a system for metalearning that tries to avoid the opaque nature of Chan and Stolfo's approach – their combiner is a way of weighting local models, and so it is hard to see why it does what it does (and indeed there is usually no deep reason, since it depends on the individual biases of the local models). Knowledge probing [15] collects the local models and then uses a test set and some means of combining the predictions of the local models to produce a new training set – the test set with a target attribute whose labelling is determined from the predictions of the local models. Any suitable data-mining algorithm can then be used to build a predictor from this new training data. Provided a transparent algorithm (that is, one whose reasons for prediction are visible) is used, the overall result is more understandable. The big drawback of knowledge probing is that the final step is sequential and requires a test set large enough to represent the entire data space adequately. Of course, this step could also be parallelized, so that a tree of metalearners is used, but in the end simpler schemes may perform just as well.
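A minimal sketch of knowledge probing in this spirit: the local models label a probe set, and a fresh (ideally transparent) learner is trained on that labelled set. The plurality-vote combiner, the fit/predict interface, and integer labels are illustrative assumptions.

```python
import numpy as np

def knowledge_probe(local_models, X_probe, make_global_model, combine=None):
    """Label X_probe by combining the local models' predictions, then train a
    new global model on the resulting (X_probe, y_probe) training set."""
    if combine is None:
        def combine(preds):                  # preds has shape (p, n_probe)
            return np.array([np.bincount(col).argmax() for col in preds.T])
    preds = np.stack([m.predict(X_probe) for m in local_models]).astype(int)
    y_probe = combine(preds)
    global_model = make_global_model()       # e.g. a decision-tree learner
    global_model.fit(X_probe, y_probe)
    return global_model
```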

3.3.9 Summary

Parallel data mining has two roles to play in large-scale distributed data-intensive computation. First, when datasets are divided by rows, as when an organization collects data about different customers at different sites, then these parallelized algorithms can be used to compute global models of the entire dataset. Second, even when a dataset is distributed in other ways, these algorithms make it possible to use parallelism locally in each data repository to improve performance for the construction of local models.

3.4 Distributed data-mining algorithms

Some datasets are naturally partitioned so that attributes of the same object are collected at different locations. This is particularly the case when the objects are people because their mobility tends naturally to bring them into contact with organizations at different places and through different channels. However, there are other settings for which distributed data collection is the normal case, for example astronomical observations because the same star is visible from different places at different times, and satellite observations because the same location on earth is observed by different satellites at different times. In other words, distributed data collection is a side-effect of mobility, either of the objects or of the data collection channels.


3.4.1 Basic strategy

Distributed data mining is much more difficult than parallel data mining because the collection of attributes in different locations or via different channels is much more likely to be biased with respect to the properties of each object. This means that it is harder to build local models that can be properly assembled into a global model – the chances of contradictions between local models are inherently greater. For example, a traveller may be an exemplary hotel guest in North America but prone to destructiveness and pilfering in Europe. The revenue model for this customer, derived from each part of the dataset independently, will be very different.

The fundamental approach used by the successful distributed data-mining algorithms is to choose a basis such that the properties of each piece of the dataset can be expressed in terms of this basis, and the coefficients for each object relative to the basis are almost all zero (or close enough to zero to be treated as such) [20, 21]. Hence the amount of communication required to assemble local models is small. The hard part is choosing the basis to make this happen, without any prior knowledge of the distribution of the examples in the space that the basis defines. Fortunately, several workable bases have been discovered: for example, wavelets [17] and Fourier bases [32]. Matrix decompositions have the useful property that they are symmetric with respect to objects and their attributes. In the case of singular value decomposition, if A = U Σ V′ then A′ = V Σ U′ (recall that Σ is diagonal). The distributed case corresponds to partitioning the rows of A′, so the parallel algorithm described in Section 3.3.5 can be used. Here the orthogonal axes of the transformed space play the same role as other kinds of bases.

The difficulty with these approaches is that some exchange of information is required to include the information about cross correlation of attributes in the model. Exchanging examples, particularly examples that are known to be interesting, hard, or anomalous, is one way to do this. It is, however, expensive, sometimes requiring a significant fraction of the dataset to be exchanged. For some datasets, it may happen that only a few of the coefficients with respect to the chosen basis are significantly different from zero. Exchanging just these coefficients is clearly much less expensive than exchanging examples. However, it is not known, in the general case, how to choose the 'best' basis, that is, the one where this is most likely to happen.
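As a toy illustration of the coefficient-exchange idea (loosely in the spirit of the collective data-mining work [17, 20, 21, 32]), the sketch below expresses a locally computed summary, sampled on a grid, in a Fourier basis, keeps only its largest coefficients for communication, and reconstructs an approximation at the receiving end. The choice of basis, the 5% threshold, and the function names are illustrative assumptions.

```python
import numpy as np

def significant_coefficients(signal, keep_fraction=0.05):
    """Transform a locally computed summary (a real-valued function sampled on
    a grid) into a Fourier basis and keep only the largest coefficients -- the
    only part that would actually be communicated."""
    coeffs = np.fft.rfft(signal)
    keep = max(1, int(len(coeffs) * keep_fraction))
    largest = np.argsort(np.abs(coeffs))[::-1][:keep]
    return {int(i): coeffs[i] for i in largest}, len(signal)

def reconstruct(sparse_coeffs, n):
    """Rebuild an approximation of the summary from the few coefficients sent."""
    full = np.zeros(n // 2 + 1, dtype=complex)
    for i, c in sparse_coeffs.items():
        full[i] = c
    return np.fft.irfft(full, n)
```

Whether so few coefficients suffice depends entirely on how well the chosen basis matches the data, which is exactly the open problem noted above.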

3.4.2 Random forests

Random forests shed new light on the issue of sharing cross-correlated information between different partitions of a dataset. In random forests [5] an ensemble of predictors is built from the dataset. For each predictor, a set of k attributes is selected at random. A random sample of size n is drawn, with replacement, from the dataset, creating a training dataset of size n × k and a set of objects that were not selected even once; this latter set is kept as a form of test set. A predictor is then learned on the training set, and its accuracy estimated using the test set (this estimate is called an out-of-bag estimator). When the predictor is deployed on new data, its vote can be weighted by the accuracy determined from the out-of-bag estimator, which is also a useful guide to the rate of convergence of the overall ensemble.
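A minimal sketch of building one member of such a forest as just described (random attribute subset, bootstrap sample, out-of-bag accuracy kept as the vote weight); the fit/predict interface is an assumed convention, as in the earlier sketches.

```python
import numpy as np

def random_forest_member(X, y, k, make_model, seed=0):
    """Train one ensemble member on k randomly chosen attributes and a bootstrap
    sample of size n; return the model, its attribute subset, and its
    out-of-bag accuracy (used to weight its vote at deployment)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    attrs = rng.choice(m, size=k, replace=False)   # random attribute subset
    boot = rng.choice(n, size=n, replace=True)     # bootstrap sample, size n
    oob = np.setdiff1d(np.arange(n), boot)         # objects never selected

    model = make_model()
    model.fit(X[np.ix_(boot, attrs)], y[boot])
    oob_acc = float((model.predict(X[np.ix_(oob, attrs)]) == y[oob]).mean())
    return model, attrs, oob_acc
```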


From our point of view, the interesting property of random forests is that they work quite well even when k = 1. In other words, a predictor based only on a single attribute can still improve the accuracy of an ensemble in which it participates. How can such an ensemble take account of cross correlations between partitions of the dataset? The answer is that it does so indirectly, because of the way that weights influence the overall vote of the ensemble. For example, consider the data below:

a1   a2   a3   target t
 0    0    0      0
 0    0    1      0
 0    1    0      0
 0    1    1      0
 1    0    0      0
 1    0    1      0
 1    1    0      1
 1    1    1      1

The predictor p1, learned from attribute a1 and the target, learns the model a1 → t; the predictor p2 learns a2 → t; and predictor p3 (perhaps) learns a3 → t. The correct global model is a1 ∧ a2 → t. The prediction accuracies of the three models are: p1 0.75, p2 0.75, p3 0.5. For large enough data we would expect the out-of-bag estimates for such predictors to be close to these actual accuracies. If the three predictors are learned independently, gathered in one place, and deployed as an ensemble with weighted voting – with weights that discount a predictor performing at chance, such as the log-odds of its estimated accuracy – then the ensemble achieves a prediction accuracy of 1. The cross-correlation needed for good prediction, that both a1 and a2 must have the value 1 for the target value to be 1, is implicit in the local predictors and their weights. Hence random forests provide another mechanism for extracting local information in a compact way (the predictors are typically much smaller than even a single column of the dataset) and combining it to form an accurate global predictor.
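This can be checked directly. The sketch below reproduces the toy table and the three single-attribute predictors, weights their votes by the log-odds of their accuracies (an illustrative weighting that gives the chance-level predictor p3 zero weight; the exact weighting used in [5] may differ), and confirms that the weighted vote recovers a1 ∧ a2 exactly.

```python
import numpy as np

# The toy dataset from the table above: the target is t = a1 AND a2.
X = np.array([[a1, a2, a3]
              for a1 in (0, 1) for a2 in (0, 1) for a3 in (0, 1)])
y = X[:, 0] & X[:, 1]

preds = [X[:, i] for i in range(3)]          # p_i simply predicts a_i
acc = np.array([(p == y).mean() for p in preds])   # accuracies 0.75, 0.75, 0.5

# Log-odds weights: a chance-level predictor (accuracy 0.5) gets zero weight.
w = np.log(acc / (1 - acc))

score_for_1 = sum(wi * p for wi, p in zip(w, preds))
score_for_0 = w.sum() - score_for_1
ensemble = (score_for_1 > score_for_0).astype(int)  # ties resolve to class 0

print(acc)                       # accuracies of the three local predictors
print((ensemble == y).mean())    # 1.0: the weighted vote recovers a1 AND a2
```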

3.4.3 Privacy across datasets

One of the attractive properties of parallel and distributed data-mining algorithms is that they make it possible for different organizations to gain the benefits of combining their information without having to reveal it. For example, individual companies in an industry sector can build global models of customer demand or revenue without any requirement to share specific individual information. This works whether the individual company datasets are about different objects or about different attributes of the same objects. In this context, opaque data-mining models such as neural networks make it hard to reverse engineer anything much about the data from which they were constructed, so revealing local models reveals little about each organization's data. Other local models such as association rules and decision trees reveal broad-brush properties of each organization's data, but do not reveal information about individual objects or attributes.

3.4.4 Summary

Data-mining algorithms need to be simultaneously parallelized at two scales: one to handle the geographical distribution of the pieces of a dataset, and the other to handle the distribution of data within a single data repository. Direct parallelization is not attractive because it requires large amounts of communication. Fortunately, most data-mining algorithms can be parallelized by replication, building a local model associated with each piece of the data and then combining these local models to produce a global model. Surprisingly, this strategy works even when the data is partitioned by columns, so that attribute values belonging to the same object are separated.

Furthermore, parallelized data-mining algorithms have attractive speedup properties. Almost linear speedup is usually achievable directly because the amount of communication required to gather the local models is small (the models themselves are small relative to the size of the raw data), and the amount of computation required to merge local models into a global model is also usually small. Even better speedups can often be achieved because the information learned by one processor can be used to prune the model search space being explored by other processors; exchanging models regularly is one way in which this can be made to happen. A parallel perspective on data-mining algorithms not only produces superlinear speedup, but also suggests better, bite-wise ways of implementing sequential versions of the same algorithms.

4 Summary and conclusions

We have argued that the size and rate of growth of datasets, as well as an increasing number of barriers to moving them freely, suggest that code mobility is more attractive than data mobility. This change has implications for the design of large-scale distributed systems and middleware, making a number of issues suddenly more important: combining computing power and data storage in single nodes, finding workable solutions to information flow and security issues, designing new programming environments, and finding new ways to plan execution and cache the resulting plans. We have illustrated some of these issues using the particular design decisions of the DCGrid.

When data is distributed, software cannot assume anything about its structure and arrangement, and must be prepared to work with it as it is. We have suggested that this requires parallel and distributed algorithms that can be efficient no matter what data arrangement they find. This, in turn, requires a particular algorithm structure in which local information is extracted and communicated only when it is in as compact a form as possible. Fortunately, algorithms with this structure are of interest in their own right, exhibiting a number of novel forms of speedup, including some fundamental ones related to branch and bound.

References

[1] R. Agrawal and J. Shafer. Parallel mining of association rules: Design, implementation and experience. Technical Report RJ10004, IBM Research Report, February 1996.

[2] M.W. Berry, S.T. Dumais, and G.W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573–595, 1995.

[3] L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.

[4] L. Breiman. Arcing classifiers. Annals of Statistics, 26(3):801–849, 1998.

[5] L. Breiman. Random forests – random features. Technical Report 567, Department of Statistics, University of California, Berkeley, September 1999.


[6] R. Brent. Parallel algorithms in linear algebra, August 1991.

[7] R. Canetti, Y. Ishai, R. Kumar, M.K. Reiter, R. Rubinfeld, and R.N. Wright. Selective private function evaluation with applications to private statistics. In Proc. 20th ACM Symposium on Principles of Distributed Computing (PODC 2001), Newport, Rhode Island, August 26–29, 2001.

[8] M. Cannataro and D. Talia. The knowledge grid. CACM, 46(1):89–93, 2003.

[9] P.K. Chan and S.J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In U. Fayyad and R. Uthuruswamy, editors, Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1995.

[10] A. Chander, J. Mitchell, and I. Shin. Mobile code security by Java bytecode instrumentation. In 2nd DARPA Information Survivability Conference and Exposition (DISCEX II), pages 1027–1040, 2001.

[11] S.C. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, and R.A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

[12] Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, pages 148–156, 1996.

[13] P.B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the ACM Conference on the Management of Data, pages 331–342, 1998.

[14] G.H. Golub and C.F. van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.

[15] Y. Guo, S. Rüger, J. Sutiwaraphun, and J. Forbes-Millott. Meta-learning for parallel data mining. In Proceedings of the Seventh Parallel Computing Workshop, 1997.

[16] E.H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In Proceedings of the ACM Conference on the Management of Data, pages 277–288, 1997.

[17] D. Hershberger and H. Kargupta. Distributed multivariate regression using wavelet-based collective data mining. Journal of Parallel and Distributed Computing, 61(3):372–400, 2001.

[18] C.B. Jay. Distinguishing data structures and functions: The constructor calculus and functorial types. In TLCA, pages 217–239, 2001.

[19] M.V. Joshi, G. Karypis, and V. Kumar. ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In Proceedings of IPPS/SPDP'98, 1998 (to appear).

[20] H. Kargupta, B. Park, D. Hershberger, and E. Johnson. Collective data mining: A new perspective toward distributed data mining. In H. Kargupta and P. Chan, editors, Advances in Distributed Data Mining. AAAI/MIT Press, 1999.


[21] H. Kargupta, B. Park, E. Johnson, E. Sanseverino, L. Silvestre, and D. Hershberger. Collective data mining from distributed vertically partitioned feature space. In Workshop on Distributed Data Mining, International Conference on Knowledge Discovery and Data Mining, 1998.

[22] L. Gong, M. Mueller, H. Prafullchandra, and R. Schemers. Going beyond the sandbox: An overview of the new security architecture in the Java(TM) Development Kit 1.2. In Proc. of the USENIX Symposium on Internet Technologies and Systems, December 1997.

[23] A. Lazarevic and Z. Obradovic. The distributed boosting algorithm. In KDD2001, pages 311–316, August 2001.

[24] A. Lazarevic and Z. Obradovic. Boosting algorithms for parallel and distributed learning. Distributed and Parallel Databases, Special Issue on Parallel and Distributed Data Mining, 11(2):203–229, 2002.

[25] X. Leroy. Java bytecode verification: An overview. In Proc. CAV'01, LNCS 2102:265–285, 2001.

[26] F.T. Luk. Computing the singular value decomposition on the ILLIAC IV. ACM Transactions on Mathematical Software, 6:524–539, 1980.

[27] D. Malkhi and M.K. Reiter. Secure execution of Java applets using a remote playground. Software Engineering, 26(12):1197–1209, 2000.

[28] G. McGraw and E. Felten. Securing Java: Getting Down to Business with Mobile Code. John Wiley and Sons, 1999.

[29] S. Muggleton. Inverse entailment and Progol. New Generation Computing Systems, 13:245–286, 1995.

[30] S. Muggleton. Scientific knowledge discovery using inductive logic programming. Communications of the ACM, 1999.

[31] G.C. Necula and P. Lee. Research on proof-carrying code on mobile-code security. In Proceedings of the Workshop on Foundations of Mobile Code Security, 1997.

[32] B. Park and H. Kargupta. Constructing simpler decision trees from ensemble models using Fourier analysis. In Proceedings of the 7th Workshop on Research Issues in Data Mining and Knowledge Discovery, ACM SIGMOD, pages 18–23, 2002.

[33] R.O. Rogers and D.B. Skillicorn. Using the BSP cost model to optimize parallel neural network training. In Workshop on Biologically Inspired Solutions to Parallel Processing Problems (BioSP3), in conjunction with IPPS/SPDP'98, March 1998.

[34] R.O. Rogers and D.B. Skillicorn. Using the BSP cost model to optimize parallel neural network training. Future Generation Computer Systems, 14:409–424, 1998.

[35] T. Sander and C.F. Tschudin. Protecting mobile agents against malicious hosts. Lecture Notes in Computer Science, 1419:44–??, 1998.

[36] L.F.G. Sarmenta. Sabotage-tolerance mechanisms for volunteer computing systems. In ACM/IEEE International Symposium on Cluster Computing and the Grid (CCGrid'01), Brisbane, Australia, May 2001.

[37] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proceedings of VLDB 22, Mumbai, India, 1996.

[38] D. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, 7(4):26–35, October–December 1999.

[39] D.B. Skillicorn. Parallel frequent set counting. Parallel Computing, Special Issue on Parallel Data Intensive Algorithms and Applications, 28(5):815–825, May 2002.

[40] D.B. Skillicorn. The case for datacentric grids. In Workshop on Massively Parallel Programming, IPDPS2002, to appear.

[41] D.B. Skillicorn and D. Talia. Mining large data sets on grids: Issues and prospects. Computing and Informatics, Special Issue on Grid Computing, 21(4):1–16, 2002.

[42] D.B. Skillicorn and Y. Wang. Parallel and sequential algorithms for data mining using inductive logic. Knowledge and Information Systems, Special Issue on Distributed and Parallel Knowledge Discovery, 3(4), 2001.

[43] D.B. Skillicorn and X. Yang. High-performance singular value decomposition. In Grossman, Kamath, Kumar, Kegelmeyer, and Namburu, editors, Data Mining for Scientific and Engineering Applications, pages 401–424. Kluwer, 2001.

[44] L.G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, November 1984.

[45] P. Watson and G.W. Catlow. The architecture of the ICL Goldrush megaserver. In C. Goble and J.A. Keane, editors, Advances in Databases, Proceedings of the 13th British National Conference on Databases, Springer Lecture Notes in Computer Science 940, 1995.

[46] C. Yu and D.B. Skillicorn. Parallelizing boosting and bagging. Technical Report 2001–442, Queen's University Department of Computing and Information Science, February 2001.

[47] B. Zhou and R. Brent. The parallel implementation of the one-sided Jacobi algorithm for singular value decompositions. In Proc. of the 3rd Euromicro Workshop on Parallel and Distributed Processing, pages 401–408, January 1995.

[48] B.B. Zhou, R.P. Brent, and M. Kahn. A one-sided Jacobi algorithm for the symmetric eigenvalue problem. In Proc. of the 3rd Parallel Computing Workshop, November 1994.

