An Evaluation of Preference Clustering in Large-Scale Multicast Applications

Tina Wong, Randy Katz, Steven McCanne
Department of Electrical Engineering and Computer Science, University of California, Berkeley
{twong, randy, mccanne}@cs.berkeley.edu

Abstract— The efficiency of using multicast in multi-party applications is constrained by preference heterogeneity, where receivers range in their preferences for application data. We examine an approach in which approximately similar sources and receivers are clustered into multicast groups. The goal is to maximize preference overlap within each group while satisfying the constraint of limited network resources. This allows an application to control the number of multicast groups it uses and thus the number of connections it maintains. We present a clustering framework with a two-phase algorithm: a bootstrapping phase that groups new sources and receivers together, and an adaptation phase that re-groups them in reaction to changes. The framework is generic in that an application can customize the algorithm according to its requirements and data characteristics. We conducted detailed simulation experiments to study various issues and tradeoffs in applying clustering to different preference patterns and application classes. We found that clustering successfully exploits preference similarity and utilizes network resources more efficiently than when it is not used. Also, application-level hints can be incorporated in our algorithm, and are instrumental in the creation of an effective grouping of sources and receivers. Our algorithm handles changes dynamically, and also limits multicast “join” and “leave” disruption to the application.
I. INTRODUCTION

The deployment of the Multicast Backbone (MBone) has enabled a variety of large-scale applications in the Internet, such as video conferencing, electronic whiteboards, information dissemination and network games. These applications would otherwise bombard the network and content servers if unicast were used. IP Multicast is an efficient point-to-multipoint delivery mechanism because data disseminated to a large group travels only once through the common parts of the network. However, this efficiency is often constrained by network and end-host heterogeneity, where receiving rates and processing speeds vary at the receivers. We also observe preference heterogeneity in multicast applications with rich data types and configurable user interfaces, where receivers range in their preferences for application data. For example, in news dissemination, subscribers customize the service to receive only the desired news categories. Players in a network game require detailed and frequent state updates only from others they are closely interacting with. In Internet TV broadcast, viewers channel-surf a few channels among the many available at any time. In the limit of complete preference similarity, multicast is the optimal communication paradigm; in the limit of complete heterogeneity, unicast should be used instead. Between these two extremes is a spectrum where we need to group receivers with the same preferences together. However, in the limit of many small groups, the control overhead associated with multicast forwarding state becomes unacceptable. The tradeoff lies herein: we cluster receivers and sources within an application into approximately similar groups to maximize overlap in preferences and satisfy the constraint of limited network resources.

This work was supported by DARPA contract N66001-96-C-8505, by the State of California under the MICRO program, and by NSF Contract CDA 9401156.
In this way, the application controls the number of multicast addresses it uses and the number of connections to be maintained by its data sources. Clustering assumes simple network primitives provided by the current IP multicast service model: packets sent to a multicast address are delivered to all end-hosts subscribed to that address. An alternate solution to accommodate preference heterogeneity is to have sources transmit all their data, and receivers filter out the undesired data. However, this solution is inappropriate, because it wastes both network resources and CPU processing cycles in handling the unnecessary data. Another approach, similar to layered multicast for congestion control [14], [20] and multicast filtering in DIS [13], is to send different versions of data on separate multicast addresses. Though the benefit of this scheme is that preferences are well-matched, the main drawback is that the number of addresses used scales linearly with the number of sources and/or the granularity of preferences in an application. While the introduction of IPv6 provides ample distinct multicast addresses, the more severe problem of multicast routing state still remains [3], [7]. The detrimental cost arises from the memory needed at the routers to store this state, and more so from the processing of periodic control messages to maintain it. To combat this linear growth in the number of multicast addresses used, we can send a single data stream that models average preferences to all the receivers. This is analogous to the SCUBA protocol for Internet video conferencing [1], in which votes are collected from receivers to determine the popularity of video sources, and then to decide which sources should be allocated most of the total session bandwidth. This approach works well if receivers exhibit a consensus among their preferences, e.g., in a lecture broadcast, the audience is usually interested in only the people currently holding the floor.
However, applications do not always show such consensus, e.g., in news dissemination and network games as explained earlier. Assuming consensus in these applications leads to poor preference matching at the receivers. To deal with applications where receivers exhibit multiple modes of preference, we can create a separate multicast session for each group of sources and receivers with the same preferences. This is analogous to proxy-based schemes to accommodate network and end-host heterogeneity [2], in which a proxy is instantiated to service clients’ requests in a fine-grained manner. Although this approach transmits and processes only the data matching receiver preferences, it is impractical if the number of these groups is large. This is because the total data rate injected into the network is unregulated, and the control overhead from using a large number of multicast addresses is not considered. In this paper, we present a clustering framework consisting of a two-phase algorithm: a bootstrapping phase that groups new sources and receivers together, and an adaptation phase that re-
groups them to dynamically handle changes in preferences and the departures of old sources and receivers. The algorithm also governs the total data rate injected into the network across all the sources according to the preferences, which helps to avoid and accommodate network congestion. Clustering also improves scalability of the application with respect to the number of users that can be supported, because it reduces the application’s network and processing requirements. Our framework is generic in that an application can customize the algorithm according to its requirements and data characteristics. The protocol to coordinate sources and receivers to perform clustering is outside the scope of this paper, and we describe it elsewhere [21]. We conducted detailed simulation experiments to study various issues and tradeoffs in applying clustering to different preference patterns, data types, and application classes. We found that:
[Fig. 1 graphic: (a) a 4x4 preference matrix over receivers R1–R4 and sources S1–S4 (rows R1, R2: H H L L; rows R3, R4: L L H H); (b) the resulting clusters C1 and C2 with the corresponding transmissions and subscriptions.]
Fig. 1. The grouping receivers (GR) scheme. A preference matrix is shown in (a), with each element representing the preference vector the corresponding receiver assigns to the source. We use a scalar to simplify the figure: H denotes high-quality and L low-quality. Receivers R1 and R2 are grouped because they show the same preferences towards all four sources; likewise for R3 and R4. The data transmissions and subscriptions are shown in (b). R1 and R2 get data from cluster C1, to which sources S1 and S2 send high-quality data, and S3 and S4 low-quality. Each cluster uses one multicast address.
Based on our results, we designed and evaluated an adaptation process consisting of a simple control loop that slowly backs off in the absence of opportunity for beneficial re-grouping and quickly converges otherwise.

The remainder of this paper is organized as follows. In Section II, we describe the problem setting of applying clustering in multicast applications, and detail the proposed clustering framework. Section III discusses the simulation experiments and results. Section IV compares our work to related research. Section V describes future work, and we conclude the paper in Section VI.

II. PROPOSED SOLUTION

We propose a generic clustering framework that groups approximately similar receivers and sources in a multicast application under the constraint of limited network resources. We use the following terminology:
- A source represents a logical stream of data. There are multiple data sources in an application, which can originate from a single end-host or different end-hosts.
- A receiver is interested in certain sources.
- A cluster represents a group of similar sources and receivers. One or more multicast addresses are associated with a cluster.
- A partition is the set of clusters that contains all the sources and receivers in an application.
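The terminology above maps onto a small data model. The following sketch is ours, not the paper's; class and field names, as well as the example multicast addresses, are illustrative:

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Source:
    name: str                 # a logical stream of data

@dataclass
class Receiver:
    name: str
    wants: Set[str]           # names of the sources this receiver is interested in

@dataclass
class Cluster:
    addresses: List[str]      # one or more multicast addresses
    sources: List[Source] = field(default_factory=list)
    receivers: List[Receiver] = field(default_factory=list)

# A partition is the set of clusters covering every source and receiver.
partition = [Cluster(addresses=["224.2.0.1"]),
             Cluster(addresses=["224.2.0.2"])]
partition[0].sources.append(Source("S1"))
partition[0].receivers.append(Receiver("R1", wants={"S1"}))
assert sum(len(c.addresses) for c in partition) == 2
```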
- Clustering successfully exploits preference similarity in multicast applications and utilizes network resources more efficiently than when it is not used.
- Application-level hints can be incorporated in clustering, and are often instrumental in the creation of an effective grouping of sources and receivers.
- Sampling a small subset of sources and receivers during clustering significantly improves scalability of the algorithm, while introducing only minimal impact on the resulting grouping.
- The adaptation phase of the algorithm successfully minimizes multicast “join” and “leave” disruption to the application by incrementally refining the grouping.
[Fig. 2 graphic: (a) the same 4x4 preference matrix (rows R1, R2: H H L L; rows R3, R4: L L H H); (b) clusters C1 = {S1, S2} and C2 = {S3, S4}, each carrying layered data, with the receivers' subscriptions.]
Fig. 2. The grouping sources (GS) scheme. S1 and S2 are grouped together because R1 and R2 want high-quality data from both, while R3 and R4 want low-quality. Each source layers its data into one base layer and one enhancement layer when sending to a cluster. R1 and R2 subscribe to two layers from C1 to get high-quality data from both S1 and S2 but only one from C2 to get low-quality data from both S3 and S4. Here, each cluster uses two multicast addresses.
A. Grouping Schemes

Network resources can be shared by grouping receivers with similar preferences into clusters—a scheme we call GR. Figure 1 is an illustration. For example, in a network game, we can use GR to group players that are close together in virtual space because they are focused on the same objects. A source sends data to all clusters containing receivers interested in it, and a receiver gets data from the one cluster that best matches its preferences. An alternative is to share resources by grouping sources that receivers find similarly “interesting” or “uninteresting” (GS), as illustrated in Figure 2. For example, in news dissemination, we can use GS to group categories that users find collectively useful. A source sends data to just one cluster, and a receiver gets data from all clusters containing its desired sources. GS is also amenable to data layering. In Section III, we examine the tradeoffs between the two schemes through simulation experiments.

B. Objective Function

The objective in clustering is to maximize the overlap of preferences in each cluster of a partition, while satisfying the constraints of network state and bandwidth consumption. The number of possible ways to arrange N objects into K clusters is approximately K^N. An exhaustive search algorithm that finds the optimal solution is impractical for applications with real-time constraints. We instead formulate clustering as an approximation algorithm. Our algorithm is divided into two phases:
The unsynchronized membership model in IP multicast means that members can join and leave a multicast group at any time. Because of this, the bootstrapping phase of the algorithm must be on-line to group new sources or receivers as they come into existence. We use a greedy approach which adds a new source S_n to the cluster G_k containing the most “similar” sources to S_n, if GS is used, and analogously for GR. We define “similar” later. Note that the algorithm only groups sources in GS and receivers in GR; the subscription and transmission of data is implicit, as explained above. We assume the number of clusters available, K, is fixed. We describe its derivation later. While the algorithm needs to be on-line during the bootstrapping phase, it can be off-line in the adaptation phase, working on a snapshot of the current configuration of sources, receivers, preferences, and partition. We use the k-means method, also known as “switching”, which has been proven to converge to a locally optimal solution [8]. It works as follows. For each source S_n in some cluster G_i, the algorithm switches S_n to another cluster G_j if it is more similar to the set of sources belonging to G_j, when GS is used. The process is analogous in GR. One benefit of k-means is that it incrementally refines the current partition to arrive at the adapted partition. This minimizes the disruption to the application during adaptation, because it limits the number of multicast join and leave operations that result. A potential disadvantage of k-means is its unbounded convergence time. The algorithm stops only when no source can be moved from its current cluster to another. In Section III, we present a heuristic that bounds the running time of the adaptation phase with minimal impact on the resulting partition. We now describe what constitutes similar sources. Abstractly speaking, a set of sources are similar if the receivers’ preferences towards them contain large overlaps.
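The two phases can be sketched as follows. This is our illustrative Python, not the paper's implementation; it assumes a distance function d(s, g) that returns a smaller value the more similar s is to the members of cluster g, and a toy numeric distance is used only to exercise the loops:

```python
def bootstrap(s, clusters, d):
    """On-line greedy phase: add the new source s to the cluster
    containing the most similar sources (smallest distance)."""
    min(clusters, key=lambda g: d(s, g)).append(s)

def adapt(clusters, d, max_rounds=10):
    """Off-line k-means 'switching' phase: move each source to a more
    similar cluster until a locally optimal partition is reached.
    max_rounds bounds the otherwise unbounded convergence time."""
    for _ in range(max_rounds):
        moved = False
        for gi in clusters:
            for s in list(gi):
                if len(gi) == 1:          # never empty a cluster
                    continue
                rest = [x for x in gi if x is not s]
                best = min(clusters,
                           key=lambda g: d(s, rest if g is gi else g))
                if best is not gi:
                    gi.remove(s)
                    best.append(s)
                    moved = True
        if not moved:                     # local optimum reached
            break
    return clusters

# Toy distance: sources are numbers; distance to a cluster is the
# absolute difference from the cluster's mean.
def d(s, g):
    return abs(s - sum(g) / len(g)) if g else float("inf")

clusters = adapt([[1, 10], [2, 11]], d)
assert sorted(sorted(g) for g in clusters) == [[1, 2], [10, 11]]
bootstrap(3, clusters, d)
assert sorted(sorted(g) for g in clusters) == [[1, 2, 3], [10, 11]]
```

The `max_rounds` cap plays the role of the running-time heuristic discussed in Section III; the guard against emptying a cluster is our simplification, since K is fixed.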
Mathematically, we use a distance function d(S_n, G_k) as the measure of similarity between a source S_n and the set of sources in a cluster G_k. The cluster mean of G_k represents the overlap of preferences among the receivers towards the sources in G_k; it is a vector with one element per receiver R_m, denoting the percentage of data R_m desires of the total from G_k. The distance between S_n and G_k is the average decrease in this percentage if S_n is added to G_k. Thus, the smaller the distance, the more similar S_n is to the sources in G_k. The idea is analogous in GR and we omit the details here. The definitions of the distance function and cluster mean are amenable to different application classes and data types. To illustrate our framework, consider the case where preferences are “binary” in that each receiver either wants all data from a given source, or none. In other words, given a preference matrix P, each element P̃_{m,n} ∈ {0, 1} is the binary preference receiver R_m assigns to source S_n. The sources send at equal data rates. Each element b_{R_m,G_k} of the cluster mean b_{G_k} is the fraction of sources in G_k that R_m finds interesting:
b_{R_m,G_k} = ( Σ_{S_n ∈ G_k} P̃_{m,n} ) / |G_k|
- A bootstrapping phase to handle the joining of new sources and receivers to the application.
- An adaptation phase to deal with changes in preferences and the departures of old sources and receivers.
[Fig. 3 graphic: (a) preference matrix of requested layer counts (R1: 3, 3, 1, 1 and R2: 1, 1, 2, 2 across S1–S4); (b) clusters {S1, S2} and {S3, S4}, using 4 addresses in total.]
Fig. 3. Derivation of K. In (a), each element of the preference matrix represents the number of layers the corresponding receiver wants from the source. Given A = 6 and L = 3, K = 2. In (b), when S1 and S2 are grouped into one cluster, and S3 and S4 another, the number of addresses used is only 4. This is because receivers only demand 2 distinct layers from each source. Layers 2 and 3 from S1 and S2 can be combined since layer 2 is not explicitly requested.
In the distance function d_b, we penalize adding S_n to G_k if doing so causes a receiver who is currently not subscribed to G_k to start subscribing only for S_n. Conversely, we reward a well-formed cluster, in which a receiver is either interested in both S_n and the sources in G_k, or in neither. That is,
d_b(S_n, G_k) = Σ_{m=1}^{M} δ(R_m, G_k, G_k ∪ S_n)    (1)

where δ is defined as

δ(R_m, G_k, G_k ∪ S_n) =
  −1,  if b_{R_m,G_k} = b_{R_m,G_k ∪ S_n}
  1,   if b_{R_m,G_k} = 0
  b_{R_m,G_k} − b_{R_m,G_k ∪ S_n},  otherwise
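For the binary case, the cluster mean b and distance d_b above translate directly into code. A minimal sketch (the naming is ours; preferences are given as a mapping from each receiver to the set of sources it wants):

```python
def b(wants, cluster):
    """b_{R_m,G_k}: fraction of sources in cluster G_k that the
    receiver finds interesting."""
    return len(wants & cluster) / len(cluster) if cluster else 0.0

def d_b(source, cluster, prefs):
    """Equation (1): d_b(S_n, G_k) = sum over receivers of
    delta(R_m, G_k, G_k ∪ S_n)."""
    grown = cluster | {source}
    total = 0.0
    for wants in prefs.values():
        before, after = b(wants, cluster), b(wants, grown)
        if before == after:
            total += -1.0             # well-formed: R_m wants both or neither
        elif before == 0.0:
            total += 1.0              # R_m would subscribe to G_k only for S_n
        else:
            total += before - after   # decrease in R_m's useful fraction
    return total

prefs = {"R1": {"S1", "S2"}, "R2": {"S3"}}
# S2 is similar to the sources already in {S1}; S3 is not.
assert d_b("S2", {"S1"}, prefs) < d_b("S3", {"S1"}, prefs)
```

In this toy example d_b("S2", {"S1"}, prefs) evaluates to −2.0 (both receivers keep the same fraction), while d_b("S3", {"S1"}, prefs) evaluates to 1.5 (R1's useful fraction is diluted and R2 would join only for S3).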
Since approximately similar sources or receivers are grouped together to satisfy the constraints, either superfluous or deficient data can result at the receivers. While the former inefficiently utilizes the resources available, the latter is unacceptable to some applications. An example of an application with a layered data organization that tolerates deficient data is video broadcasting. The distance function should calculate both the amount of superfluous and of deficient data, and weigh them accordingly. In the case when sources are not sending at equal data rates, the estimated source data rates should be used in the formulation instead of the number of sources.

C. Constraints

We heuristically satisfy the constraints in the algorithm. The first constraint of network state arises from the cost of using multicast addresses, which takes the form of forwarding state at the routers and the control overhead to maintain this state. This becomes detrimental to the network when a large number of addresses are in use concurrently. This constraint is satisfied before the partition is formed. The number of clusters K is simply the number of addresses A available to the application, if data is not layered. Otherwise, K is defined as ⌊A/L⌋, where L is the maximum number of layers deemed useful by the application. As illustrated in Figure 3, this is a worst-case estimate of K because receivers might not need all layers from all sources. A heuristic to alleviate this problem is to re-run the algorithm with a larger K if the previous K results in unused addresses. The second constraint of bandwidth consumption accounts for the total session bandwidth available to an application. This
is to avoid and accommodate network congestion. We leave the problem of resource allocation to others. For example, ISPs can be responsible for allocating blocks of multicast addresses and available data bandwidth to an application. This constraint is considered after the partition is formed. In GS, the total session bandwidth available to an application is divided among the sources according to the average weights assigned by the receivers. In GR, each source further allocates its assigned bandwidth among the clusters based on the data sent to each.

III. EXPERIMENTAL RESULTS

We conducted five simulation experiments to study various issues and tradeoffs in applying clustering to accommodate preference heterogeneity in multicast applications. In the first experiment, we varied preference distributions and application classes to look at the feasibility of clustering in solving this problem. We also examined the tradeoffs between using GR and GS in our algorithm. We then compared the algorithm to a simple round-robin scheme and to a locally optimal algorithm to validate its performance. To improve scalability of the algorithm to large population sizes, we explored heuristics that reduce its running time without impacting its outcome. We also applied the algorithm to dynamically react to changes in preferences and studied the associated overheads and tradeoffs. In the last experiment, we designed and evaluated an adaptation process consisting of a simple control loop. Note that applying an analytical model from the classical clustering theory literature is inappropriate here because:
- In other clustering domains, complete data sets are known in advance, but the unsynchronized IP multicast membership model means our algorithm must be on-line.
- To handle the above, and to improve scalability of the algorithm, we augment the algorithm with sampling heuristics.
- Our generic framework allows the application to customize the algorithm according to its properties and requirements.

We used a binary function as described in Section II-B to represent preferences, with sources sending at equal data rates. Results for other functions are expected to be analogous. Performance is measured in terms of average receiver “goodput”—the amount of useful data divided by the total received—to indicate the efficiency in the utilization of resources. A cluster here represents one multicast address. We modeled three preference patterns:
- Zipf. Preferences collectively follow a perfect Zipf distribution. This means the expected number of receivers interested in the ith most popular source is inversely proportional to i. We modeled this because several studies have shown that Web access follows Zipf’s law [4].
- Multi-modal. Preferences fall into modes. We partitioned sources evenly into five modes, and organized receivers so that each selects from sources in only one mode. This maps to applications with geographical correlations, e.g., in weather report dissemination, a user is only interested in certain regions.
- Uniform. Preferences are random. This is the worst-case scenario as there is no correlation among receivers’ interests. This serves as a baseline pattern.
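The Zipf pattern can be generated by weighting the ith most popular source by 1/i when sampling each receiver's interests. A sketch (function and parameter names are ours, not from the paper):

```python
import random

def zipf_preferences(n_sources, n_receivers, picks, seed=1):
    """Each receiver picks `picks` distinct sources, with the ith most
    popular source drawn with probability proportional to 1/i."""
    rng = random.Random(seed)
    weights = [1.0 / i for i in range(1, n_sources + 1)]
    prefs = {}
    for r in range(n_receivers):
        chosen = set()
        while len(chosen) < picks:      # reject duplicates until `picks` distinct
            chosen.add(rng.choices(range(n_sources), weights=weights)[0])
        prefs[f"R{r}"] = chosen
    return prefs

prefs = zipf_preferences(n_sources=100, n_receivers=1000, picks=10)
popularity = [sum(s in p for p in prefs.values()) for s in range(100)]
# The most popular source is requested far more often than a mid-ranked one.
assert popularity[0] > popularity[50]
assert all(len(p) == 10 for p in prefs.values())
```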
The order in which receivers’ interests are presented to the algorithm is random for all the experiments. We also modeled two application classes:
- Categorical. Each user is interested in 5% to 10% of the 100 categories available. The sources are the categories, and the receivers the users. This resembles “live data feeds” such as stock quote services. Since in this application class there is usually a limited number of categories but a much larger user population, the default number of receivers is 1000.
- Spatial. We use a 32x32 grid to represent a virtual space where the distribution of participants on the grid follows one of the previously described patterns. This models collaborative applications like network games. Each participant is interested in others located within a radius of 5.65 units from itself, chosen because this covers about 10% of the positions of the grid. The sources and the receivers are the participants, and at each position there is an avatar which serves only as a source. The default number of participants is 100.

A. Experiment 1: Clustering Feasibility

We found that clustering achieves higher average receiver goodput than when it is not used. Figures 4 and 5 illustrate the performance of our algorithm as we varied the number of addresses available, given the categorical and spatial workloads, respectively. The lines are jagged because we generated a different preference pattern for each value of the x-axis. Without clustering, i.e., when there is only one multicast group available, the goodput is about 10%—the percentage of sources that a receiver is interested in. Depending on the preference patterns and grouping schemes used, clustering improves goodput by a factor of about 2 to 4 when only a few addresses are available. Figure 4 also shows that GS is better than GR for the categorical workload. Except when few addresses are available—in which case the goodputs are indistinguishable—GS outperforms GR by a factor of 1.5 to 2 given Zipf interest. Similar results are observed with uniform interest.
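The 5.65-unit interest radius in the spatial workload above can be sanity-checked: a disc of that radius covers roughly 10% of the 1024 positions of a 32x32 grid.

```python
import math

GRID, RADIUS = 32, 5.65

# Area argument: a disc of radius r contains about pi * r^2 positions.
assert 0.09 < math.pi * RADIUS ** 2 / GRID ** 2 < 0.11

# Counting actual lattice positions within the radius of a central
# participant gives a similar fraction (edge effects ignored).
inside = sum(1 for dx in range(-6, 7) for dy in range(-6, 7)
             if dx * dx + dy * dy <= RADIUS ** 2)
assert 0.09 < inside / GRID ** 2 < 0.11
```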
GS groups sources that receivers find similarly interesting or uninteresting, which is apparent in both patterns: each receiver deems around 10% of the sources useful and the rest useless. GR locates receivers that are alike, which is not found in either pattern; there is not much correlation of interest among the receivers, except for the few popular sources in the Zipf pattern. With multi-modal interest, GR yields slightly better goodput than GS when there are few addresses, after which GS begins to outperform GR marginally. GR can find and group receivers belonging to the same mode into one cluster. However, because GR allows sources to send the same data to multiple clusters, the achievable data rates are constrained by the total session bandwidth. Thus, GS remains the better scheme here. In general, given the choice to group sources or receivers, it is more effective to group the smaller of the two. One question Figure 4 poses is: why does goodput flatten out when there are almost as many addresses as sources? This is because the algorithm does not always map new sources onto the unused addresses. It groups the first sources onto the same address if they are requested by the same receiver, unless the number of sources sending to that address is above a speci-
[Fig. 4 graphic: average receiver goodput vs. number of addresses available; Zipf interest, 100 data sources, 1000 receivers.]