Multicast Architecture with Adaptive Dual-state - Semantic Scholar

1 downloads 0 Views 375KB Size Report
ship from forwarding information, and (ii) apply an adaptive dual-state approach to optimize for the different objectives of active and inactive groups. MAD works ...
Multicast Architecture with Adaptive Dual-state Tae Won Cho1 Michael Rabinovich2 K.K. Ramakrishnan3 Divesh Srivastava3 Yin Zhang1 1 2 3 University of Texas at Austin Case Western Reserve University AT&T Labs–Research {khatz, yzhang}@cs.utexas.edu [email protected] {kkrama, divesh}@research.att.com Paper ID: #1569059771 Number of Pages: 12

ABSTRACT

• First, because of the increasing amount of electronic content produced and consumed by multicast-friendly applications, multicast will need to manage an ever increasing number of groups, with a distinct group for each piece of distributable content.

Multicast is an approach that uses network and server resources efficiently to distribute information to groups. As networks evolve to become information-centric, users will increasingly demand publish-subscribe based access to finegrained information, and multicast will need to evolve to (i) manage an increasing number of groups, with a distinct group for each piece of distributable content; (ii) support persistent group membership, as group activity can vary over time, with intense activity at some times, and infrequent (but still critical) activity at others. These requirements raise scalability challenges that are not met by today’s multicast techniques. In this paper, we propose the MAD (Multicast with Adaptive Dual-state) architecture that can scalably support a vast number of multicast groups, with varying activity over time, based on two key novel ideas: (i) decouple group membership from forwarding information, and (ii) apply an adaptive dual-state approach to optimize for the different objectives of active and inactive groups. MAD works across administrative boundaries by partitioning routers into “MAD domains,” enabling autonomous decisions in local domains. Using a prototype implementation in the Emulab testbed as well as large-scale simulations, we show that MAD promises to meet the challenges of large-scale information-centric networks.

1.

• Second, group activity naturally varies significantly over time, with intense activity at some times (e.g., during periods of natural disasters), and infrequent activity at others (e.g., when monitoring for potential natural disasters). Since the criticality of the information disseminated may be independent of the level of group activity, and group membership may be long-lived, the membership of the multicast group needs to be maintained persistently to support timely information dissemination. Large scale dissemination of information is of course enabled by information aggregators. However, there are a variety of drawbacks from depending purely on information aggregators, mainly in terms of timeliness and coverage. Timeliness depends on how often the information from the producers gets updated at the aggregator; given the dynamic nature of information producers, it is almost impossible for all relevant information to be available at the aggregator in a timely manner. Coverage depends on the set of information producers chosen by the aggregator; given the vast number of information producers, it is infeasible for aggregators to provide complete coverage. These drawbacks of information aggregators make it desirable to enable the information producers themselves to make their information available over the information-centric network to any and all consumers, on an as-needed basis. Information-centric networking is thus the key next step for a communication network to become a large scale utility for information access and dissemination.

INTRODUCTION

With network connectivity becoming nearly ubiquitous, the opportunity and challenge in front of us is to enable sharing, dissemination and providing access to the vast amount of electronic information that is available and is continually being created. Diverse applications like financial markets, access to multimedia content and large-scale software update have hastened the migration from the simple methods of point-to-point communication of information between a single producer and consumer to multicast based methods, which use network and server resources efficiently to disseminate information from producers to groups of interested consumers. As the needs of these applications continue to grow, and newer information-intensive applications like online role-playing computer games and cooperative health-care become pervasive, users will increasingly demand continuous, publish-subscribe based access to information from a multitude of distributed information producers. An important issue, of course, is how to identify “information”. It is critical to enable sharing of information at a fine enough granularity to ensure that only relevant and nonredundant information is accessed and disseminated. Producers and consumers of a specific piece of fine granularity information can be viewed as members of an information-centric multicast group. A consequence of this model is that multicast will itself need to evolve.

1.1

Related Work and Limitations

Supporting such fine granularity information-centric multicast raises many challenges that are not met by today’s IP and overlay multicast technologies. IP multicast has focused on efficient forwarding of information to a large active group of recipients, with the goal of efficient lookup for forwarding. IP multicast-style approaches (at the network layer [1] or at the application layer with “overlay multicast” [3, 4]) try to keep a relatively small amount of state (limited number of groups and the associated interfaces downstream with recipients for the group). However, this state is maintained at every node in the multicast tree of the group for efficient forwarding. Thus, maintaining state is expensive, and these existing models for multicast use considerable control overhead (periodic refresh and pruning) to try to minimize the amount of state retained. IP multicaststyle approaches are inappropriate for our problem, for several reasons. 1

• First, IP multicast-style approaches are appropriate for a relatively small number of groups, but are not feasible for the type of scale we consider here — billions of groups — with reasonable amounts of memory at individual network nodes.

multicast) when information is frequently generated, while also enabling the membership of a group to scale to large numbers where the group membership may be long-lived. To this end, in this paper, we describe Multicast with Adaptive Dual-state (MAD), our novel architecture that can scalably support a vast number of multicast groups with varying activity over time, on today’s commercial hardware in an efficient and transparent manner.

• Second, when groups are long-lived, but have little or no activity over long periods of time, maintaining the membership state in IP multicast-style approaches requires a lot of control overhead (relative to the activity) to keep it from being aged-out. Recently, Ratnasamy et al. [10] proposed Free Riding Multicast (FRM), which speeds up forwarding and reduces routing state by caching all source-based tree edges of some groups at the sources and including a Bloom filter with all the edges into the message itself. This allows other nodes to forward the message without storing any routing state, using only the information in the message header itself, and to quickly determine the next hops just by checking which of its interfaces are included in the Bloom filter. However, FRM does not effectively meet the needs of groups with large membership sets because this can result in an excessively large packet header to encode the identity of all the members that are recipients of the message (even with the help of the Bloom filter). Moreover, FRM can also impact forwarding efficiency negatively. There have been several efforts to reduce multicast forwarding states in routers. REUNITE [13] reduces forwarding states at the data plane by keeping forwarding entries only at branching routers, while non-branching routers use unicast routing to forward traffic and keep the multicast control table in the control plane. Hop-by-hop Multicast [6] starts with similar ideas of [13], but considers the asymmetric unicast routing and adopts source-specific channel abstractions. However, the number of branching routers can grow as the size of group increases in above schemes, increasing the multicast state. Also, multicast forwarding tables are stored by soft-state, which entails high control overhead for groups with little data activity. Explicit multicast (Xcast) [2] focuses on the large number of small multicast groups. Xcast encodes the list of destination nodes into every packet, thus it does not require any address allocation or forwarding states at routers. Instead, multicast routers have to parse the protocol header of every packets at every hops, imposing limits on supporting large-size groups. At the other extreme, to maintain state on a more permanent basis (and address the second limitation, above, of IP multicast-style approaches), “hard state” protocols such as ATM have been designed. State is retained unless explicitly torn down by the end-points. Control overhead is minimal, but the nodes in the network have to potentially keep a lot of state around. These approaches also face severe limitations for scaling in the number of groups and the state associated with each group; hence are inappropriate for our problem. We seek improvements over these traditional approaches. In contrast to IP multicast-style approaches, we wish to minimize the amount of control overhead associated with keeping state up over a long time, especially when groups are inactive. However, for active groups, we wish to take advantage of the structures that IP multicast designs have adopted. Thus, we seek the best of both worlds – forwarding efficiently (a la IP

1.2

Contributions and Outline

This paper makes the following key contributions. • First, we propose the MAD architecture that separates the maintenance of group membership state from the state needed for efficient forwarding of information. We show that group membership state can be maintained scalably in a distributed fashion using a hierarchical membership tree (MT). • Second, the MAD architecture treats active groups and inactive groups differently, based on a recognition that a predominant number of the vast number of groups MAD seeks to support are expected to be inactive at any given time. Active groups use IP multicast-style dissemination trees (DT) for efficient data forwarding while inactive groups use membership trees for this purpose, without adversely affecting the overall forwarding efficiency. • Third, the MAD protocol ensures that there is a seamless transition between the use of the dissemination tree and the membership tree for forwarding of information, with no end-system (application or user) participation in the determination of when a group transitions from being active to inactive, or vice versa. • Fourth, MAD works across administrative boundaries by partitioning MAD routers into “MAD domains” enabling local domains to autonomously decide whether to use a DT or an MT for data forwarding within the domain, and when to transition between them. • Fifth, we did a detailed analytical and experimental evaluation of the MAD architecture. ◦ Our analysis demonstrates the scalability of MAD in maintaining group membership state at very large scale. ◦ We also perform large-scale simulations to show that MAD simultaneously achieves information forwarding efficiency and scalability in terms of group membership state maintenance. ◦ We built a prototype implementation of MAD using FreePastry [11]. We evaluated this prototype implementation in the Emulab testbed [14] to demonstrate the feasibility of scaling the MAD architecture. The rest of this paper is organized as follows. Section 2 describes an overview of the MAD architecture, motivated by two key design tradeoffs. The details of the MAD protocols are provided in Section 3. An analysis of the scalability of the MAD architecture is discussed in Section 4. Experimental results based on our prototype system are presented in Section 5. We conclude in Section 6. 2

Percent of Updates

100 Publishing Rate

1000 100 10 1

80 60 40 20 0

0

20 40 60 80 100 Percent of Publishers

(a) Publishing Rate (# of updates/month)

0

20 40 60 80 100 Percent of Publishers

(b) CDF of Feed Updates

Figure 2: Publishing Characteristics of RSS Feeds Figure 1: MAD environment.

2.

Twenty years of networking research has produced efficient mechanisms for multicast communication. In order to deliver data to a multicast group, two pieces of information are required: (i) group membership, i.e., which end hosts are members of the group, and (ii) forwarding information, i.e., how to reach these group members. Most existing multicast solutions combine group membership maintenance and forwarding route discovery into a single protocol. Unfortunately, these have well-known scalability problems with regard to the number of groups and group members. To achieve both scalability and efficiency, we observe that the amount of data traffic generated by different multicast groups often exhibits a skewed distribution in practice. Specifically, despite the large total number of groups, we expect most (e.g., 90%) groups to generate relatively infrequent and/or small amount of data traffic. Meanwhile, we expect the small fraction (e.g., 10%) of active groups to account for the vast majority of data traffic. Using the measurement data from [8], we extract the publishing behavior of 429 RSS feeds. Figure 2(a) illustrates that only 5% of the groups publish more than 100 updates/month, and only 10% of the groups contribute 75% of the entire feed updates as shown in Figure 2(b). Consequently, the overall data forwarding efficiency is likely to be dominated by the forwarding efficiency of active groups, whereas the overall state requirement and control overhead is largely determined by the state requirement and control overhead of the inactive groups. MAD is based on a recognition that there is a continuum in terms of activity for groups. Hence, the MAD architecture treats active groups and inactive groups differently, achieving both efficiency and scalability. We leverage two key ideas: (i) decoupling group membership and forwarding information, and (ii) applying a novel adaptive dual-state approach to optimize for different objectives for active and inactive groups.

MAD ARCHITECTURE & TRADEOFFS

MAD seeks to provide efficient multicast service to a billion users and a trillion multicast groups (with potentially long-lived membership information) using today’s commercial software and hardware platforms. In this section, we provide an overview of our architecture to realize this ambitious goal, along with the core tradeoffs that we exploit in the design of MAD. While we envisage MAD to begin as an overlay multicast service, we believe that the scalability and efficiency it offers for large-scale information dissemination will eventually result in the architecture migrating to the underlay. The primary barrier to such an evolution, (e.g., to IP multicast) is the current address space constraints in network layer protocols such as IPv4. While we believe that these issues will be overcome in the future, it is beyond the scope of this paper to address them. However, we believe such a constraint should not preclude the creation of a more scalable architecture that is applicable for the future Internet. Publish-subscribe multicast services that MAD is targeting must support user profile management, authentication and fine-grained access control, among other functions. In this paper, to focus on the routing aspect of the problem (which is challenging in itself), we explicitly decouple user management from the multicast service. Specifically, we assume the environment illustrated in Figure 1. The MAD multicast service overlay consists of logical overlay routers, which reside on (or are owned by a provider of) the physical overlay routers. When a physical overlay router fails, all the logical overlay routers that it owns can be taken over by other physical overlay routers (see Section 3.6). User management functionality is concentrated in the external subscription manager, which maintains subscriptions for all users connecting at a given MAD site. The subscription manager partitions all its subscriptions among the logical overlay routers owned by the site and initiates or cancels group subscriptions with corresponding logical overlay routers on behalf of the end users. Thus, every logical overlay router serves a single aggregated local subscriber representing all users assigned to it by the subscription manager. From the perspective of an MAD overlay router, no knowledge of end users is required and the only entities an overlay router communicates with are other overlay routers and its single aggregated local subscriber.

• MAD uses a hierarchical, distributed state management approach for efficiently storing group membership state at very large scale, resulting in one to two orders of magnitude state reduction compared to CBT. A small number of logical overlay routers are organized in a membership tree rooted at a core router for each group, and maintain state for that group. Except for the core, a node at each level in the membership tree (i.e., a root of a subtree) is selected so that it contains at least a minimum number of subscribers for that group in the subtree. Messages to an inactive group are forwarded along the membership tree.

2.1 MAD Architecture Overview

• MAD can use any existing multicast protocol for efficient 3

forwarding — it just needs a small hook to interface with such a protocol (see Section 3.4.4). Our specific instantiation of MAD uses the Core Based Tree (CBT) [1] protocol for efficient forwarding for active groups. We refer to the CBT of an active group as the dissemination tree, in contrast to the membership tree that MAD maintains for every group. The MAD membership tree is created on a per group basis (as opposed to applying a centralized scheme or having a single shared tree across all the groups). The root of each tree is selected by computing a hash function of the group ID, thus avoiding a separate resolution process to map a group to its associated root. The group’s root then becomes responsible for monitoring the group activity and changing its state between active and inactive. To work across administrative boundaries, MAD partitions routers into “MAD domains” and allows individual MAD domains to make autonomous state transition decisions for the same group. These mechanisms allow MAD to both better balance the load and reduce traffic concentration. To minimize control overhead, the membership tree is constructed on top of a static hierarchy (i.e., the base tree; see Section 3.2). As a result, the membership tree requires no modification when there are changes in overlay topology, network performance or overlay unicast routes, as long as there is no change in the group membership or tree node failure. Infrequent updates may be performed to provide reasonable forwarding efficiency while delivering data to inactive groups. In addition, to minimize the state requirement, the membership tree is designed to be highly structured so that only a bounded small number of overlay nodes maintain membership information for each group. In contrast, a traditional multicast dissemination tree can potentially involve a large fraction of nodes, that would then have to maintain state.

of supporting well over a trillion multicast groups with 64K overlay routers that use about 3GB of memory per router. Meanwhile, MAD achieves the low forwarding latency of the dissemination tree for active groups. Therefore, by combining the membership tree and the dissemination tree, MAD is able to achieve the “best of both worlds”.

3.

DESIGN AND IMPLEMENTATION

The MAD protocol consists of five major components: (i) the membership tree protocol, which specifies how to create, maintain, and send messages over the membership tree, (ii) the dissemination tree protocol, which specifies how to create, maintain, destroy, and send messages over the dissemination tree, (iii) state transition mechanisms for switching between the membership tree and the dissemination tree, (iv) mechanisms for operating across domain boundaries, and (v) failure recovery. Below, we first introduce notation and then provide details of each component.

3.1

Preliminaries Symbol 2ℓ F H(s) gid(g) M T (g) DT (g) core(g) BT (g)

Meaning number of logical overlay routers, with IDs ∈ [0, 2ℓ ) first-hop router for a subscriber s 128-bit group ID for group g membership tree for group g dissemination tree for group g (when g is active) common core of M T (g) and DT (g) base tree for constructing M T (g)

Table 1: Notations. We first introduce some notations (summarized in Table 1). The MAD multicast service overlay consists of 2ℓ logical overlay routers, identified by unique ℓ-bit logical overlay router IDs. The 2ℓ logical overlay routers reside on ≤ 2ℓ physical overlay routers. Each logical overlay router r ∈ [0, 2ℓ ) is owned by a single physical overlay router, whereas each physical overlay router may own one or more logical overlay routers. We assume the existence of a resolution procedure for mapping a logical router r to its current physical owner. With only 2ℓ logical overlay routers to resolve, it suffices to use a centralized name server (with replication to recover from failures). We use F H(s) to denote the first-hop logical overlay router for a subscriber s. Each multicast group g is identified by a unique 128-bit group ID gid(g). As stated above, each group g maintains a membership tree M T (g) to record its set of members. If g is active, it also maintains a separate dissemination tree DT (g). In the current MAD protocol, M T (g) and DT (g) share a common core (i.e., root) logical overlay router core(g). To balance the load and reduce traffic concentration, we apply a hash function H(·) to map the 128-bit group ID into a random ℓ-bit core ID: core(g) = H(gid(g)). A benefit of this scheme is that it eliminates the need for a separate resolution procedure for mapping group ID’s to core ID’s, which can be expensive with a trillion groups. Finally, we associate a base tree BT (g) with each group g as the basis for constructing M T (g) (see Section 3.2).

2.2 Design Tradeoffs MAD exposes and exploits the following core tradeoffs: Control efficiency vs. forwarding efficiency: A key design decision in MAD is to achieve a tradeoff by moderately reducing data forwarding efficiency for inactive groups while obtaining a significant reduction in the overall state requirement and control overhead. We believe such a tradeoff is essential in order to provide multicast service to trillions of groups (with potentially long-lived membership information) on today’s commercial hardware and software platforms. CPU processing vs. state consumption: To minimize state consumption and control overhead, MAD combines the membership tree with unicast routes to forward data to inactive groups. As a result, it requires more CPU processing than traditional multicast for forwarding data to inactive groups. However, we transition to the the more efficient dissemination tree for forwarding data once an inactive group becomes more active. The additional overhead related to the membership tree seems acceptable given the importance of state reduction in order to support a vast number of groups. We conducted a detailed analytical and experimental evaluation of the MAD architecture. Our results show that the membership tree can achieve one to two orders of magnitude state reduction over CBT. As a result, MAD is capable

3.2

Membership Tree Sub-protocol

3.2.1 Overview 4

BT(g): A, B, ..., O Subscribers: S1, ..., S5 S_min: 2 (2) MT(g): A, B, C, E C

(5) A (3) B

memory per logical overlay router. For example, with ℓ = 16 and k = 16, each logical overlay router requires only around 2MB memory for all the base trees, which is rather modest. If the base trees are computed using the bit-correction scheme, then even this modest amount of memory can be saved because the parent and children can be efficiently computed on the fly as a function of core(g). Bound on the size of M T (g): Let D be the maximum depth of the BT (g). Given a group g with S(g) subscribers, our membership tree construction ensures that at each level of nodes are included in M T (g). ThereBT (g), at most SS(g) min fore, the total number of nodes on M T (g) is bounded by in the worst case. With sufficiently small D (e.g., D × SS(g) min D = 4) and sufficiently large Smin (e.g., Smin = 16), the size of M T (g) is bounded by a small fraction of S(g). In contrast, a dissemination tree could involve all 2ℓ logical routers in the worst case, and all such routers would have to maintain forwarding state for the group. Later in Section 4, we use extensive simulation and detailed analysis to show that M T (g) achieves significant state reduction in the average case. Membership tree state: Each level-i logical overlay router r ∈ M T (g) maintains the following state for group g: (i) r.g.mtChild: a k-bit bit-vector that indicates whether each of its k children at level i − 1 also belongs to M T (g). For example, in Figure 3 the value of B.g.mtChild is 01, indicating that child D is not in M T (g) but child E is. (ii) r.g.mtFHs: a list of first-hop logical overlay routers n satisfying (1) n’s local subscriber is a member of g, (2) r is an ancestor of n in BT (g), and (3) there is no other r′ ∈ M T (g) that is an ancestor of n and a descendant of r in BT (g). For example, in Figure 3, field B.g.mtFHs contains first-hop logical router I (for S1 ) but neither J (for S2 ) nor K (for S3 ). Note that J and K are recorded in E.g.mtFHs instead.

(2) D H

I S1

E J S2

K S3

F L

M S4

G N

O S5

Figure 3: Example of membership tree construction. For a given group g with core c = core(g), our design of the membership tree M T (g) minimizes state requirement by limiting the number of on-tree nodes. Specifically, our protocol first imposes a static k-ary base tree BT (g) (with root c) that spans all the logical overlay routers. We then construct M T (g) from BT (g) by including only the core c and those nodes that have at least Smin subscribers in their subtrees. More formally, for any logical router r, let S(g, r) be the number of subscribers in the subtree of BT (g) that is rooted at r. We have M T (g) = {r | S(g, r) ≥ Smin } ∪ {c}

(1)

Figure 3 shows an example of membership tree construction. In this example, BT (g) is a k-ary tree (with k = 2) that spans all the logical routers A, B, · · · , O. There are 5 subscribers to the group: S1 , S2 , S3 , S4 , S5 (whose first-hop logical routers are I, J, K, M, O, respectively). M T (g) is constructed to include the core A together with all those nodes r with S(g, r) ≥ Smin . Therefore, with Smin = 2, the resulted M T (g) consists of nodes A, B, C, and E. The values of S(g, r) for r ∈ M T (g) are shown in parentheses in Figure 3. Base tree construction: The base tree BT (g) serves multiple important goals in MAD: (i) By constructing the membership tree on top of the constant-degree base tree, we can easily bound the maximum depth of the membership tree. (ii) The static nature of the base tree also allows the membership tree to stay largely unmodified when there are changes in overlay topology, network performance or overlay unicast routes (as long as there is no change in the group membership or tree node failure). As a result, the control overhead for membership tree maintenance is reduced. (iii) The base tree also serves as the basis for membership tree based multicast forwarding (see Section 3.4.4). In our current MAD protocol, all groups with the same core share a common base tree. So there are 2ℓ different base trees in the system, one rooted at each logical overlay router. To balance the load, we generate BT (g) randomly with core(g) as the random seed. Specifically, we start by placing all 2ℓ logical overlay routers at the leaf nodes (i.e., level-0 nodes) of the k-ary tree. A level-i node is then randomly selected from its k level-(i-1) children. This can be efficiently implemented by the bit-correction scheme commonly used in DHT literature (e.g., [11]). If desired, we can further take proximity into account and refine these random base trees to avoid links with poor quality. Finally, maintaining these base trees is straightforward – all it requires is for each logical overlay router to store its parent and k children in each of the 2ℓ base trees, which costs (1 + k) × 2ℓ × ℓ-bit

3.2.2 Membership Tree Maintenance Our protocol for the creation and maintenance of M T (g) maintains the following invariant: any node r ∈ M T (g) is either the core c = core(g) or the root of a subtree of BT (g) that contains at least Smin subscribers of g. Joining a membership tree: Messages of type MtJoinMsg are sent by subscribers to join a membership tree. Such messages are sent towards the core c = core(g) along the base tree BT (g) in a bottom-up fashion. Meanwhile, the decision on node creation is made in a top-down fashion. Specifically, when a subscriber s wants to join group g, it first sends a MtJoinMsg to its first-hop logical overlay router f = F H(s). f determines the core c as a hash function of the group ID gid(g), and then forward the MtJoinMsg towards c (which includes the logical router ID of f ). Note that unlike protocols like CBT, when an overlay router r not on the tree M T (g) receives a MtJoinMsg for group g, r does not instantiate any state. Instead r simply forwards the MtJoinMsg towards the core c. Eventually, this message will reach either the core c or a node that is already part of M T (g). Let p be this first on-tree node and suppose the MtJoinMsg is forwarded to p by p’s child n in BT (g). After receiving this MtJoinMsg, p first adds f to its first-hop router list p.g.mtF Hs. Node p then checks to see if it has accu5

1 2 3 4 5 6 7 8 9 10 11

f listed in c.g.mtF Hs and each child n of M T (g) in the bit-vector c.g.mtChild. c forwards M to f and to all children n through overlay unicast. If the local subscriber of c also belongs to g, M is forwarded to the local subscriber through underlay unicast.

⊲ c is the core of group g ⊲ node p receives MtJoinMsg(f ,g) from its child n in BT (g) if (p 6= c and p 6∈ M T (g)) p forwards the message towards c along BT (g) else p.g.mtF Hs = p.g.mtF Hs ∪ {f } Sn (g) = the subset of first-hop routers in p.g.mtF Hs that are descendents of n in BT (g) including n itself if (|Sn (g)| ≥ Smin ) p informs n to create state for g and members in Sn (g) by sending a MtNodeCreateMsg to n // n may in turn inform its own child to create state p.g.mtChild[n] = 1 p.g.mtF Hs = p.g.mtF Hs − Sn (g) end end

• After a child n receives M , it repeats the same procedure above to forward M to first-hop routers listed in n.g.mtF Hs and its own children in M T (g) (as specified in n.g.mtChild). Once f receives M , it then sends it directly to its local subscriber through underlay unicast.

Figure 4: Handling the MtJoinMsg. mulated Smin members at its child n. If so, p sends a MtNodeCreateMsg to inform n to create state for group g and the Smin members. After n creates this state, p then removes the Smin members from p.g.mtF Hs, and marks the bit corresponding to n in the bit-vector p.g.mtChild to 1, indicating that there are on-tree node(s) down stream. Note that n may find that all its Smin members belong to one of its k children. In this case, n further informs this child to create state for g and those members by sending a MtNodeCreateMsg. Figure 4 shows the detailed behavior when a node p receives a MtJoinMsg from child n to subscribe to group g on behalf of first-hop router f . The protocol ensures reliable delivery and processing of the MtJoinMsg with an suitable acknowledgement and retransmission mechanism. Leaving a membership tree: Messages of type MtLeaveMsg are used to leave a membership tree. Specifically, the firsthop router f = F H(s) for a subscriber s sends a MtLeaveMsg towards the core c along the base tree BT (g) until the first ontree node n is reached. After n receives the MtLeaveMsg, n first remove f from n.g.mtF Hs. n then checks if n.g.mtChild is empty and n.g.mtF Hs contains fewer than Smin subscribers. If so, n concludes that the membership tree invariant is violated (i.e., |Sn (g)| < Smin ) and that it should no longer stay as a node in M T (g). n then sends a MtNodeDeleteMsg to its parent in M T (g) with the list of first-hop routers in n.g.mtF Hs (which will be maintained by this parent henceforth) before n purges the state associated with group g. Note that the parent of n may also decide to delete itself from M T (g) and sends a MtNodeDeleteMsg to his own parent (i.e., the grand-parent of n in M T (g)).

3.2.3 Membership Tree Based Forwarding The membership tree structure can be combined with the up-to-date unicast routes to deliver both control and data messages to the group as follows. • When a node wishes to multicast a message M to group g, it first forwards M to the core c = core(g) through overlay unicast. This allows us to easily implement critical functionality such as per group access control, authentication, authorization, accounting and traffic monitoring at the core. Existing multicast protocols such as Scribe [4] take a similar approach. • After c receives M , c first extracts all the first-hop routers 6

Avoiding redundant overlay unicast messages: In our protocol for multicast forwarding, overlay unicast is used by a node n to forward the messages to its children in M T (g) as well as the list of first-hop routers n maintains. A naive implementation of the above procedure for multicasting the message could lead to a situation where the same message is forwarded many times to the same next hop overlay router if this next hop is shared among several overlay unicast routes. MAD supports the option to trade processing overhead for better bandwidth efficiency by packing all these “redundant” messages into a single message and defining a variable-length header to specify all the recipients (as in Xcast [2]).

3.3

Dissemination Tree Sub-protocol

MAD uses a separate dissemination tree for each active group to achieve better forwarding efficiency. The dissemination tree can be maintained using any existing (or future) multicast protocol. Our current instantiation of MAD uses the standard Core Based Tree (CBT) [1] protocol for constructing and maintaining the dissemination tree. For convenience, the dissemination tree and the membership tree share a common core. Dissemination tree state: We assume that the number of overlay unicast neighbors for each overlay router is upper bounded by M axN odeDegree. Since the overlay topology is under our control, it is possible to enforce such constraint. We currently set M axN odeDegree = 31 in MAD, which means each overlay router can have no more than 31 neighbors. Under this assumption, each node n in a dissemination tree simply maintains a bit-vector n.g.dtChild (with M axN odeDegree+1 bits) to indicate whether there are any subscribers downstream for each neighboring overlay router. An extra bit in n.g.dtChild is used to indicate whether the local subsciber of n is a member of the group. Finally, a hash table indexed by the 128-bit group ID is used to store all the dissemination tree state for different groups. Dissemination tree maintenance: The current group status (e.g., whether it is active or inactive) is decided by the group’s core in the membership tree. Once the core c for a given group g determines that a dissemination tree needs to be created, it multicasts a DtBuildMsg along the membership tree M T (g) in a top-down fashion to all first-hop overlay routers of M T (g). Every first-hop router that receives this message joins the dissemination tree for g by sending a DtJoinMsg towards the core c = core(g) to join the dissemination tree DT (g). The DtJoinMsg is handled according to the standard CBT protocol. To change an active group g back to inactive state, core c multicasts a DtDestroyMsg(g) along

M T (g) (using membership tree based forwarding) to inform all the first-hop routers to leave the dissemination tree for the group. Every first-hop router f receiving this message sends a DtLeaveMsg towards the core c to leave the dissemination tree DT (g), which is processed as usual for CBT. Dissemination tree based forwarding: Similar to membership tree based forwarding, in order to send a message M to a group g over the dissemination tree DT (g), M is first sent to the core c through overlay unicast. From there, data is forwarded down DT (g) according to the CBT protocol. In addition, when the data reaches a node n whose local subscriber bit in n.g.dtChild is set to 1, it is then forwarded to the local subscriber of n through underlay unicast.

ily cause group members to miss that message, which may be unacceptable. Our solution is to deliver data messages over both M T (g) and DT (g) during the transition (i.e., in t-MAD mode) and stop delivering data to a subtree of M T (g) only after the root of the subtree is certain that all existing group members residing in the subtree can successfully receive data from the dissemination tree. We determine when a first-hop router can successfully receive messages from the dissemination tree by requiring the first-hop router to send a DtJoinCompleteMsg towards the core along the base tree after receiving its first DT -delivered message. The complete process of switching from i-MAD to a-MAD mode is described in an extended version of the paper [5].

3.4 Mode Transition

3.4.3 Switching from a-MAD to i-MAD Mode

A major challenge in the design of MAD is how to ensure the smooth transition between membership tree based forwarding (i.e., the inactive mode) and dissemination tree based forwarding (i.e., the active mode). In particular, it is essential to avoid disruption of the multicast data delivery service during mode transition. Note that we are assuming in this section that the core is tracking activity and makes global mode transition decisions. In Section 3.5, we will extend MAD to allow individual “MAD domains” to make autonomous mode transition decisions for the same group.

The transition from the active mode to the inactive mode is much simpler and can be performed almost instantaneously. Specifically, once the core c decides that the dissemination tree should be destroyed, it immediately changes the group state from a-MAD to i-MAD and all subsequent messages sent to the group g are delivered over the membership tree M T (g). The tree bit of the message header is set to 1 to indicate that M T (g) is used for message delivery. Meanwhile, the core c initiates the destruction process of the dissemination tree by multicasting a DtDestroyMsg down the membership tree M T (g) as described in Section 3.3. Any node on M T (g) that receives the DtDestroyMsg changes the mode for g from a-MAD to i-MAD. Note that the core need not wait for the destruction process of the dissemination tree to complete. To cope with potential loss of DtDestroyMsg (e.g., due to transient network disconnects), the core also sets the flush bit to 1 in all messages sent to the group in the i-MAD mode, signaling the destruction of the dissemination tree.

3.4.1 Overview Our basic strategy for achieving smooth mode transition is to require every group in the system to always maintain the membership tree even when the group is considered active and the dissemination tree has been constructed. Having an always up-to-date membership tree ensures that during the transition period we can use membership tree based forwarding to deliver messages reliably to all the group members, with very little additional control overhead. To facilitate smooth mode transition, each node n in a membership tree maintains the following additional information for a given group g: (i) n.g.mode, the current mode of the group, which can have value i-MAD (i.e., inactive mode), aMAD (i.e., active mode), or t-MAD (i.e., transient from inactive to active mode); and (ii) n.g.tmFHs and n.g.tmChild, the transient mode state that records the list of first-hop routers and children in M T (g) that have not yet successfully joined the dissemination tree (i.e., are not yet reachable from the core in DT (g)) in the t-MAD mode. We also piggyback the following control information in every data message sent by the core – a tree bit and a flush bit. The tree bit indicates which tree is used to deliver the message (1 for M T , 0 for DT ). When the flush bit is set, a node purges the DT state associated with the group ID of the message, if any. See below for details on the role of such information.

3.4.4 Multicast Forwarding in Different Modes As stated above, a message sent to group g is first sent to the core c through unicast (unless the sender is the core itself). The core then multicasts the message down either M T (g) or DT (g) or both trees based on the current mode of the group. An on-tree node forwards message in the following way: • i-MAD : Forward data message to all children in M T (g) (as specified by the bit-vector mtChild) and all first-hop routers listed in mtF Hs. If the local subscriber bit in mtChild is 1, deliver the message to the local subscription manager. • a-MAD : Forward data message to all children in DT (g) (as specified by dtChild). Also deliver data message to the local subscription manager when the local subscriber bit in dtChild is 1. • t-MAD : Forward data message over both M T (g) and DT (g). When forwarding messages along M T (g), use tmChild and tmF Hs instead of mtChild and mtF Hs to avoid forwarding messages to those users that can already successfully receive data from DT (g). As mentioned, MAD can utilize any existing multicast mechanism for content dissemination to active groups. The only requirement is that the mechanism supports the following hook: Upon receiving the first message over the dissemina-

3.4.2 Switching from i-MAD to a-MAD Mode Every new group g initially stays in the i-MAD mode. When the group becomes active, the core c may decide to improve forwarding efficiency by creating a separate dissemination tree DT (g). A key challenge here is how to determine whether the construction of DT (g) has completed because prematurely sending a message over an incomplete DT (g) can eas7

tion tree (i.e., using the existing multicast mechanism) after joining the group, the router must send a DtJoinCompleteMsg as described in Section 3.4.2.

To forward multicast traffic, the core sends traffic to leaders in all participating MAD domains; the leader in each MAD domain then forwards the traffic to the leaf nodes.

3.4.5 Handling Joins and Leaves in Different Modes

3.5.2 Autonomy

When a new subscriber s wishes to join a group g, s first contacts its first-hop router f = F H(s). If f is already on M T (g), it simply sets the local subscriber bit in f.g.mtChild to 1. Otherwise, f sends a MtJoinMsg towards the core along BT (g). The MtJoinMsg keeps propagating until it reaches a node p already on M T (g). The MtJoinMsg is handled by p based on its mode as follows. • i-MAD : In this mode, p first updates its membership tree state (i.e., p.g.mtChild and p.g.mtF Hs). p then responds with either a MtJoinAckMsg to indicate that the MtJoinMsg is successful but f does not need to create any state associated with group g, or a MtNodeCreateMsg to inform f to create membership tree state with f.g.mtF Hs initialized to the list of subscribers in the MtNodeCreateMsg. For the latter case, after f creates the membership tree state, it is possible for f to further inform one of its own children to create membership tree state (if that child has ≥ Smin subscribers).

MAD supports autonomous decisions in MAD domains. A local MAD domain can make its own decision for specific groups to communicate using either DT for efficient forwarding or a resource efficient MT. This enables us to exploit: • The spatial locality of group activity • The resource efficiency in local administrative domains Specifically, a group can be in DT mode to efficiently forward frequent updates to large number of receivers in a local domain, where popular local events are being held. For example, to use resource efficiently, MAD domains in a resourcestarved region (e.g., with low-end routers) may not be able to afford the use of the more state-intensive DT communication for all the globally popular groups that are of less interest within the region.

3.5.3 Locating Leader ID and Core ID Since all group communications have to go through the leader in the domain, it can experience a heavy load in terms of traffic and state. We alleviate this problem by (a) distributing leader roles to multiple routers, and (b) actively reducing the forwarding states for each DT at the leader node. The list of leader node IDs are maintained in all the routers within the domain. We prepend the domain ID to the node ID to support multiple MAD domains. MAD routers in Super-domain has a special domain ID; the core of the group is selected by picking one from Super-domain using a hash value of the group ID. Also, the leader of the given group in each domain is selected from the leader list in a similar manner.

• a-MAD : In this mode, p first processes the MtJoinMsg just as in the i-MAD mode. p then informs f to join the dissemination tree by sending a DtBuildMsg. Upon the receipt of this DtBuildMsg, f sends a DtJoinMsg to join the dissemination tree. • t-MAD : In this mode, to avoid potential race conditions, p temporarily refuses to take on a new child. Therefore, p responds with a MtJoinNackMsg to inform f that f needs to resend the MtJoinMsg after a timeout. When s decides to leave group g, it contacts its first-hop router f = F H(s). f then tries to remove s from both M T (g) and DT (g). Specifically, to remove s from M T (g), f acts as if it has received a MtLeaveMsg and handles it as described in Section 3.2. Meanwhile, f checks if it is on DT (g). If so, f follows usual CBT mechanisms by sending a DtLeaveMsg towards the core as described in Section 3.3.

3.5.4 Additional Details Group management: To enable MAD to operate across administrative boundaries, leaders forward multicast traffic to outside a domain in a manner similar to MBR (Multicast Border Router) [7]. When building MT state, leaders do not export subscriber information to the core even if the current enrollment size is below threshold. MAD domains achieve local privacy by containing sensitive data - such as number of subscribers and subscriber IP addresses - to be within the local network. Mode transition: Instead of having a group change modes from MT to DT across the entire network in an all-or-nothing mode change, each domain can decide to alter the forwarding mode for a group and inform that decision to the core node. Depending on the global activity and resource availability, the core can then decide to use MT or DT to reach leader nodes. Note that core-to-leader communication can use the different mode from leader-to-leaf communication even for the same group.

3.5 MAD across Administrative Domains MAD groups a set of routers in the same region or network domain (e.g., university network, corporate network, and AS) to form a “MAD domain”. MAD domains serve two goals: (i) enable MAD to operate across multiple administrative domains, and (ii) handle heterogeneity and load imbalance by promoting autonomous decisions in local networks.

3.5.1 Leaders and Super-domains A subset of routers in a MAD domain are eligible to be selected as the leader for a group. Each group has a corresponding leader router in a domain, and all communications for this group (both in and out of the domain) pass through its leader. The set of possible leaders are exposed outside the MAD domain. The union of leaders in all of the MAD domains form the Super-domain. The Super-domain is responsible for forwarding multicast traffic between domains. The group Core is selected from participating leaders in the Super-domain.

3.6

Failure Recovery

MAD handles failures through replication. Specifically, for each physical overlay router in our system, we designate a set of k physical overlay routers as its shadow routers. To 8

gid

mode

0011 1100 0110 0111

a-MAD a-MAD i-MAD t-MAD

MT mtChild mtFHs 0101 f1 , f2 0101 f3 0000 f2 1001 f4 , f5

DT dtChild 1001 1101 – –

transient mode tmChild tmFHs – – – – – – 1000 f5

tion of packet-level events) to compute the state requirement and forwarding cost of dissemination tree, membership tree, and MAD for any given group. Note that our latency results do not include any queueing delays. Topologies: Our simulator constructs overlay network topologies as follows: (i) generate an underlay network topology that comprises 16,000 routers in all; (ii) use shortest hopcount routing to obtain the underlay distance (i.e., hop count) between each pair of routers; and (iii) construct the overlay network topology as the union of m edge-disjoint Minimum Spanning Trees (MSTs) for the logical full-mesh (i.e., clique) over the 16,000 routers. The underlay topology we consider is either a power-law topology (pow-16k), or a transit-stub topology with stub (access) nodes and transit nodes (ts-16k). In addition to computing hop count between overlay routers, we also use the DS 2 delay space synthesizer [15] to directly synthesize a realistic underlay distance matrix (ds2-16k). Simulation setup: In our experiments, we randomly form 100 multicast groups each with a fixed group size. We then vary this fixed size of the multicast group with respect to the number of first-hop routers. For each group, we compute the required state and forwarding cost of both DT and M T . The forwarding cost is measured by the sum of overlay link latency weighted by the number of messages traversing each link. We then synthesize the state requirement and forwarding cost of the MAD protocol by applying a 10/90 rule. Specifically, we assume 10% groups are active, in a-MAD mode, and that they contribute to 90% of the data traffic. Simulation results: The results are summarized in Figure 6, where every data point is the average over 100 random groups (of the same size). Figure 6(a)-(c) show the maximum number of groups 216 overlay routers (each with 3GB memory) can hold for the three different topologies simulated. By combining M T with DT , MAD is able to achieve nearly an order of magnitude state reduction over DT . Note that the results for MAD are synthesized using the 10/90 rule. ThereDT +0.9×stateMT fore, we have: 0.1×statestate ≥ 0.1. That is, even DT if the M T requires zero state, the overall state reduction is limited by a factor of 10. So MAD achieves close to the absolute optimal. In addition, Figure 6(d)-(e) show the forwarding efficiency of DT , M T , and MAD. We see that the total forwarding cost for MAD is virtually identical to DT (both in delay and number of hops traversed) and can significantly outperform M T , especially as the size of the group increases. MAD thus achieves both the state reduction of M T and the improved forwarding efficiency of DT .

Table 2: Data Structure. 1 2 3 4 5 6 7 8 9 10 11 12

MtJoinMsg (group, src, FH router) MtJoinAckMsg (group, src, FH router) MtJoinNackMsg (group, src, FH router) MtLeaveMsg (group, src, FH router) MtNodeCreateMsg (group, src, FH router list) MtNodeDeleteMsg (group, src, FH router list) DtBuildMsg (group, src, FH router list) DtDestroyMsg (group, src) DtJoinMsg (group, src) DtJoinCompleteMsg (group, src) DtLeaveMsg (group, src ) MadDataMsg (group, src, tree (MT/DT), flush, FH router)

Figure 5: List of MAD Protocol Messages minimize replicated state, we only store state related to leaf nodes of the membership tree. Once the membership tree is repaired, it can perform normal multicast data delivery. For each logical overlay router ℓ owned by a physical overlay router p, the shadow routers of p only need to replicate the user subscription state (i.e., mtF Hs) for all enrolled groups. The replicated state is saved in the HDDs of all the shadow routers of p. There is a keep-alive message exchange between a physical overlay router p and its shadow routers p′ for fast failure detection and recovery. For each logical overlay router ℓ that is previously owned by p, p′ needs to recover the role of ℓ in every group that ℓ is involved in. Leaf : ℓ may be a leaf node in the membership tree M T (g) for group g. To take over such a role, p′ sends MtJoinMsg towards the core of group g along base tree BT (g). We save bandwidth by aggregating multiple MtJoinMsg with the same core into a single message with a list of group IDs. On-tree node : ℓ may be an internal node of either a membership tree or a dissemination tree. MAD takes advantage of on-tree nodes in both M T and DT sending heartbeat messages down the tree. Upon failure detection, the child repairs the tree by sending a join message (i.e., either MtJoinMsg or DtJoinMsg) to its parent. Core : ℓ may be the core for a group. After a failure, p′ starts receiving join messages from children of ℓ in group g. p′ infers the mode information from the received DtJoinMsg. Table 2 summarizes the data structures for the MAD protocol and Figure 5 lists the messages used in the protocol.

4.2 4.

Formal Analysis of State Requirement

Consider an overlay with M overlay routers. Given a group g with core c and S distinct first-hop overlay routers, below we analytically derive the expected state requirement for both DT (g) and M T (g) with respect to S. Our analysis for M T (g) is quite general. For DT (g), for simplicity, we only consider the case where the union of all unicast routes destined to the core c forms a k-ary tree (which is similar to BT (g)). Let D = ⌈logk M ⌉. In this k-ary tree, all M overlay routers are first placed at the leaf (i.e., level-0) nodes of the tree. Each level-(i + 1) node is then selected as one of k children at level i (0 ≤ i < D).

SCALING OF MAD TREES

In this section, we conduct extensive simulations on realistic network topologies to examine the state requirement of MAD trees and the tradeoff between state reduction and forwarding cost. To gain additional insights into the scaling of MAD trees, we also analytically derive the state requirement.

4.1 Simulation Evaluation Simulator: We developed a simulator (with 7,656 lines of C/C++ code, that achieves scalability by avoiding the simula9

100

10 1 0.1 0.01 0.001 1e-04

100

MAD MT DT

10

Number of Groups (Trillions)

MAD MT DT

Number of Groups (Trillions)

Number of Groups (Trillions)

100

1 0.1 0.01 0.001

1

1

35000 30000 25000 20000 15000 10000

20000

0.1 0.01 0.001

4 16 64 256 1024 4096 16384 Number of First-Hop Routers in a Group

1

4 16 64 256 1024 4096 16384 Number of First-Hop Routers in a Group

(c) Groups Per 216 Nodes (topo: ts-16k) 16000

MAD MT DT

14000 Total Underlay Hops

Total Delay (msec)

40000

MAD MT DT Total Underlay Hops

45000

25000

1

1e-04

4 16 64 256 1024 4096 16384 Number of First-Hop Routers in a Group

(a) Groups Per 216 Nodes (topo: ds2-16k) (b) Groups Per 216 Nodes (topo: pow-16k) 50000

MAD MT DT

10

15000 10000 5000

MAD MT DT

12000 10000 8000 6000 4000 2000

5000 0

0 0

1000 2000 3000 4000 5000 6000 7000 8000 9000 Number of First-Hop Routers in a Group

(d) Total Delay (topo: ds2-16k)

0 0

1000 2000 3000 4000 5000 6000 7000 8000 9000 Number of First-Hop Routers in a Group

0

(e) Total Underlay Hops (topo: pow-16k)

1000 2000 3000 4000 5000 6000 7000 8000 9000 Number of First-Hop Routers in a Group

(f) Total Underlay Hops (topo: ts-16k)

Figure 6: Scaling of MAD on different overlay topologies. Let Mi = M × (k −i − k −(i+1) ) for any 0 ≤ i < D and MD = 1. Then there are Mi overlay routers that appear at levels 0, · · · , i of the k-ary tree but fail to enter level i + 1 or above (0 ≤ i ≤ D). Let n be such a node (with maximum level i). Then n is the ancestor for exactly k i distinct leaf nodes. Let random variable Xi denote the number of the S first-hop routers that are descendents of n in the k-ary tree. Assuming that the S distinct first-hop routers are selected at random, then Xi has a hypergeometric distribution: Pr(Xi = −S (S)(M ki −n) n) = n M . The mean µi and variance σi2 of Xi are (ki ) i i ) S M−S S given by µi = k i M and σi2 = k (M−k M−1 M M .

Number of Groups (Trillions)

100

1 0.1 0.01 0.001 1e-04 1

4 16 64 256 1024 4096 16384 Number of First-hop Routers in a Group

Figure 7: Analytical results on the scaling of MAD trees. The expected number of nodes on M T (g) with maximum level i is therefore Mi PiMT , and the expected total number PD−1 of nodes on M T (g) is 1 + d=1 Mi PiMT . Each node n on M T (g) stores a 128-bit group ID (as the key for the forwarding table), a 16-bit bit-vector n.mtChild, and a list of 16-bit first-hop router IDs (n.mtF Hs). Since each 16-bit first-hop router ID is stored only once, the total amount of state devoted to mtFHs is 16 × S bits. Therefore, the expected total number of bits for the membership tree state is:   D−1 P stateMT ≤ 16×S + (128+16)× 1 + Mi PiMT

Dissemination tree state requirement: For any given leveli node, the probability for it to be on DT (g) is given by PiDT = 1 − Pr(Xi = 0) = 1 −

MT (k=16) DT (k=16) DT (k=4) DT (k=2)

10

−S (S0 )(M (M −S) ki −0 ) = 1 − kMi M ( ki ) (ki )

The expected number of nodes on DT (g) with maximum level i is therefore Mi PiDT , and the expected total number of PD nodes on DT (g) is i=0 Mi PiDT . Each node n on DT (g) stores a 128-bit group ID (as the key for the forwarding table) plus a 32-bit bit-vector n.dtChild. Therefore, the expected number of bits for the dissemination tree state is: P DT stateDT = (128 + 32) × D i=0 Mi Pi

d=1

Numerical results: We numerically compute the state requirement for membership trees (with k = 16) and dissemination trees (with k = 2, 4, 16) for different sizes of the group and the overlay network. Figure 7 shows the number of groups that can be handled by an overlay with 216 routers (each with 3GB memory). Note that for this analysis, we ignore the overhead for state replication and assume that the load is perfectly balanced. We see that with 3GB per router, our membership tree protocol can support 4.7 trillion groups with group size of 12 and 1.3 trillion groups of size 64. Meanwhile, the membership tree provides between 1–2 orders of magnitude state reduction over the dissemination

Membership tree state requirement: For a level-i node n (0 < i < D) in BT (g) to become part of the membership tree M T (g), n needs to be the ancestor for at least Smin out of S first-hop routers. The probability for this to happen can be bounded by applying Chebyshev’s Inequality:  2 σi PiMT = Pr(Xi ≥ Smin ) ≤ max(σi ,S −µ ) min i

where µi and σi2 are the mean and variance of Xi (see above). 10

topology no. node no. Emulab node no. backbone link avg. link latency

76 95 36 7.36ms

distribution no. group 7600 no. first-hops 76 avg. group size 10

Forwarding state efficiency: We performed experiments to show the benefit of the membership tree(M T ) over the dissemination tree(DT ) from the perspective of total state stored at multicast routers. Figure 9(a) shows the number of nodes that maintain state for a group, as a function of the number of subscribed first hop routers(F H). With our design of the membership tree, only a small number of nodes need to maintain state for the group, and this number grows slowly. With DT (as with other multicast frameworks), as more of the routers become part of the dissemination tree, correspondingly a larger number of nodes maintain state for the group. M T has a significant benefit over the DT especially when the group size is small, which would be the dominant case if the group population followed a Zipf’s law. Forwarding path efficiency: We compare forwarding path efficiency of M T with DT ’s by measuring the number of hops from core to the first hop routers (i.e., avg. depth of tree). Figure 9(b) illustrates the increased average hop count going from the core to the first-hop routers with M T . M T has a penalty of resulting in an indirect delivery path (i.e., the M T is built based on the base tree BT , which may not be optimal for this set of subscribers), whereas DT follows the shortest path. Processing load: We measure processing load in terms of the average message processing load of an on-tree node (i.e., avg. number of outgoing links per node). We observe that as the number of first hop routers grows, an on-tree node on M T handles more messages compared to one on DT , as seen in Figure 9(c). This reflects the tradeoff of additional processing (forwarding on a more shallow tree, and hence a larger number of interfaces per node) with M T compared to DT , to get the benefit of more efficient state maintenance. We bound this processing overhead in our implementation by limiting the number of children for a leader in M T (k = 16). Note that we cannot have such a control in a DT . Thus, in a largescale environment, DT may not always have this processing load improvement compared to M T . Switching overhead: We measure the cost of switching modes, going from M T for inactive groups to DT for active groups, in terms of both time and bandwidth as the number of F Hs increase. Figure 10(a) shows the switching time in milliseconds to completely make a transition from a M T to the DT . In the implementation, we piggyback the control message (DtBuildMessage) with data messages and also limit the rate of generation of multicast data at one every 400 milliseconds. The time it takes for the entire tree to switch to a DT takes about two to four data message transmission intervals (between 800ms and 1600ms in the figure). This latency would improve if the control message were sent out separately without delay. Figure 10(b) shows the number of extra messages produced for switching between modes by counting the number of DtJoinCompleteMsgs sent. This overhead is the same as the number of edges in the M T .

Table 3: Test Setup

Figure 8: Topology of sprintlink-us tree. These results are consistent with our simulation results on more realistic topologies (i.e., Figure 6(a)-(c)) and clearly demonstrate the MAD’s ability to significantly reduce state requirement by decoupling membership and dissemination.

5.

EXPERIMENTAL EVALUATION

In this section, we compare the forwarding efficiency and processing load for the membership tree sub-protocol (M T ) and the dissemination tree sub-protocol (DT ), and cost of mode switching on an actual system implementation. We have implemented a prototype system for MAD with Java 1.4.2-12, using FreePastry 1.4.4 as the base infrastructure. We chose Pastry to provide us a convenient abstraction primarily for node maintenance and unicast routing, without any tie-in to the DHT architecture. Details of the MAD protocol including all the headers and the other details are in an extended version of this paper [5], and have been implemented in the prototype evaluated in this section.

5.1 Evaluation Setup Topology: Our experiments were conducted on the Emulab testbed [14]. Our system emulation used 95 Emulab nodes out of the maximum 364 nodes available. The physical nodes ranged from systems with an Intel Pentium III with 256MB RAM to Xeon with 2GB RAM. For a realistic topology, we used information available from Rocketfuel [12] for the sprintlink-us topology. We note two constraints from the Emulab environment: a physical machine is limited to 4 Ethernet interfaces, and each physical link is a 100 Mbits/sec Ethernet. The link latency for each edge is based on [9], and we assume the links are loss-free. To control the total number of nodes in the topology, we vary number of routers in a city (PoP). Routers in the same city are connected via a LAN with 0 latency. Figure 8 illustrates the node assignment of Emulab (left), and the geographical topology of sprintlink-us (right). User/group distribution: We use a Zipf’s distribution (α = 1.7) for group’s population in the range (gmin : 4 and gmax : 7600). We assume no locality in the user population, and randomly pick users from the entire set of nodes in the emulation experiment and assign them to an individual group to meet the pre-calculated group population. We use the default settings of pastry-1.4.4. Table 3 summarizes the setup.

6.

CONCLUSIONS

Information dissemination will be a key functionality of the evolving network infrastructure. Multicast, which uses

5.2 Experimental Results 11

10

DT MT

4 3 2 1

1

0 0 10 20 30 40 50 60 70 80 Number of First-Hops

30 Number of Message

5

DT MT Hop Counts

Number of Nodes

100

25

DT MT

20 15 10 5 0

0

10 20 30 40 50 60 70 80 Number of First-Hops

0 10 20 30 40 50 60 70 80 Number of First-Hops

(a) Total number of nodes keeping tree state (b) Average hop count from core to first-hop (c) Average number of outgoing links per routers on-tree node

Figure 9: Forwarding Efficiency formed simulations with realistic network topologies to show that MAD achieves the “best of both worlds”, obtaining the information forwarding efficiency of DT, and the scalability of state-maintenance with MT for supporting a vast number of groups. Finally, we developed a prototype implementation of MAD using the FreePastry infrastructure. Using Emulab, we emulated a reasonable network topology (representing the “sprintlink-us” topology from Rocketfuel), and demonstrated the feasibility of scaling the MAD architecture.

100 Number of Message

Switching Time (msec)

2000 1500 1000 500 0

80 60 40 20 0

0 10 20 30 40 50 60 70 80 Number of First-Hops

(a) Average Latency (msec)

0 10 20 30 40 50 60 70 80 Number of First-Hops

(b) Number of Messages

Figure 10: Overhead for Switching from M T to DT

7.

network and server resources efficiently for this functionality, will be a fundamental building block for information-centric networks. Given the ever-increasing amount of information being produced and consumed, multicast will need to evolve to manage a vast number of groups, with a distinct group for each piece of distributable content. Since group membership may be long-lived, and the criticality of the disseminated information is typically independent of the level of group activity, multicast will need to evolve to support persistent group membership. Supporting these needs raises challenges that are not met by today’s IP and overlay multicast technologies. In this paper, we described our architecture, Multicast with Adaptive Dual-state (MAD), that can scalably support a very large number of multicast groups, with persistent group membership, based on two key novel ideas. First, we decouple group membership state from the state needed for efficient forwarding of information. Group membership state is maintained scalably in a distributed fashion using a hierarchical membership tree (MT). Second, we apply an adaptive dualstate approach to optimize for the different objectives of active and inactive groups. Active groups use IP multicast-style dissemination trees (DT) for efficient data forwarding while inactive groups use their MTs for this purpose, without adversely affecting the overall forwarding efficiency. We avoid traffic concentration for the DTs by having a root that is selected on a per-multicast group basis, and share the same root for the MT as well. We described the MAD protocol in detail, with particular emphases on the seamless transition of a group between the use of the DT and the MT for forwarding of information, failure recovery, and supporting MAD across multiple administrative domains. We examined the characteristics of the MAD protocol through analysis, simulation and a prototype implementation. Our analysis demonstrates the scalability of MAD in maintaining group membership state at very large scale. We also per12

REFERENCES

[1] A. J. Ballardie, P. Francis, and J. Crowcroft Core based trees. In Proc. SIGCOMM, 1993. [2] R. Boivie, N. Feldman, Y. Imai, W. livens, D. Oooms, and O. Paridaens. Explicit multicast (Xcast) basic specification. IETF Internet draft, 2000. [3] A. Bozdog, R. van Renesse, and D. Dumitriu. Selectcast: a scalable and self-repairing multicast overlay routing facility. In Proc. of ACM workshop on survivable and self-regenerative systems, 2003. [4] M. Castro, P. Druschel, A.-M. Kermarrec, and A. Rowstron. SCRIBE: A large-scale and decentralized application-level multicast infrastructure. IEEE J. on Selected Areas in Comm., 2002. [5] T. Cho, M. Rabinovich, K.K. Ramakrishnan, D. Srivastava, and Y. Zhang. Multicast architecture with adaptive dual-state (extended version). http://www.research.att.com/˜divesh/ papers/mad07.pdf [6] L. HMK Costa, S. Fdida, and O. CMB Duarte. Hop-by-hop multicast routing protocol. In Proc. of SIGCOMM, 2001. [7] D. Estrin, D. Farinacci, A. Helmy, D. Thaler, S. Deering, M. Handley, V. Jacobson, C. Liu, P. Sharma, and L. Wei. Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification. RFC2362, 1998. [8] H. Liu, V. Ramasubramanian, and E. G. Sirer. A Measurement Study of RSS, a Publish-Subscribe System for Web Micronews. In Proc. of IMC, 2005. [9] R. Mahajan, N. Spring, D. Wetherall, and T. Anderson. Inferring link weights using end-to-end measurements. In Proc. of IMW, 2002. [10] S. Ratnasamy, A. Ermolinskiy, and S. Shenker. Revisiting ip multicast. In Proc. of SIGCOMM, 2006. [11] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), 2001. [12] N. T. Spring, R. Mahajan, D. Wetherall, and T. E. Anderson. Measuring ISP topologies with Rocketfuel. IEEE/ACM Trans. Netw., 2004. [13] I. Stoica, T. S. E. Ng, and H. Zhang. REUNITE: A recursive unicast approach to multicast. In Proc. of INFOCOM, 2000. [14] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar. An integrated experimental environment for distributed systems and networks. In Proc. of OSDI, 2002. [15] B. Zhang, T. S. E. Ng, A. Nandi, R. Riedi, P. Druschel, and G. Wang. Measurement-based analysis, modeling, and synthesis of the Internet delay space. In Proc. SIGCOMM IMC, 2006.