SATI: Scalable And Traffic efficient data dissemination Infrastructure for sensor-based distributed information services Sungjae Jo, Younghyun Ju, Kyungmin Cho, Seungwoo Kang, Mungyung Ryu, and Junehwa Song CS/TR-2006-265 December 8, 2006
KAIST Department of Computer Science
Sungjae Jo, Younghyun Ju, Kyungmin Cho, Seungwoo Kang, Mungyung Ryu, and Junehwa Song Dept. of EECS., Korea Advanced Institute of Science and Technology, 371-1 Guseong-Dong, Yuseong-Gu, Daejon, 305-701, Korea {sjjo, yhju, kmcho, swkang, mgryu, junesong }@nclab.kaist.ac.kr
Abstract When highly distributed data streaming services are fully realized, they will raise new issues in delivering data streams to data consumers. First, the data delivery scheme should be traffic efficient because the Internet can easily be inundated by a massive number of data streams. Second, the probing messages generated by many delivery schemes waste network resources. These problems are not easily solved by existing data delivery schemes because none of them considers the economical usage and measurement of network resources. This paper proposes SATI, an infrastructure that provides not only shared management and measurement of network information but also scalable and traffic-efficient data delivery. Experimental results show that SATI is scalable and greatly reduces network traffic.
I. INTRODUCTION
This paper considers an infrastructure to support data dissemination for Sensor-based Distributed information Services (SDS). Examples of such services include weather forecasting, large-scale scientific workflows, and intelligent traffic services[14][15][16]. These services are characterized by three attributes: a high volume of transformed output streams, heterogeneous QoS sensitivity, and a large number of globally distributed consumers. An SDS usually transforms its inputs from multiple data sources into value-added information, such as an estimated typhoon route, which can be very large. For example, the CORIE system[14], a typical environmental observation and forecasting system, generates 5 to 20 GB of daily forecasts about the conditions of the Columbia River. The information produced by SDSes is often fed into mission-critical applications such as autonomous infrastructure maintenance systems or warning dispatchers. Thus, SDSes commonly have service-specific QoS requirements such as low end-to-end latency or high bandwidth. With the increasing demand for 24*7 uptime monitoring applications, we envision that the number of SDSes and their users will grow explosively. Despite the popularity and usefulness of SDSes, however, little work has been done on efficient data delivery support.

Our infrastructure provides efficient data distribution services to a massive number of globally distributed end-hosts. SDS providers can rapidly develop their services without knowing the details of data delivery mechanisms. Also, the redundant effort of developing data delivery mechanisms can be eliminated.

Considering the explosive growth rate of SDSes and millions of consumers, such a wide-area data delivery infrastructure should address a number of issues. First and foremost, SDSes would consume a significant portion of wide-area network resources with their heavy and persistent traffic. Because wide-area network resources are scarce and shared by other services, the data dissemination scheme should consume network resources economically. Second, the performance of data delivery should not be severely degraded even in the presence of a large number of globally dispersed consumers. SDS consumers usually want to receive their query results within their QoS thresholds, such as a bound on end-to-end latency. Lastly, the data delivery schemes for SDSes could flood the whole network with probing packets. In the context of SDSes, the delivery paths are generally long and numerous because the number of service consumers easily exceeds 10K and they reside in diverse locations. If all participating nodes periodically perform aggressive network probing, the amount of probing packets could be huge. It is estimated in [10] that the total amount of probing packets in an overlay of one hundred nodes reaches 1 GB per day. Furthermore, since many service consumers commonly subscribe to multiple streams, many delivery paths in different data delivery schemes overlap, causing redundant probing packets that measure the same links. Thus, the infrastructural support for SDSes should carefully consider a cost-effective way of measuring the network.
In response to these requirements, our proposed SATI architecture addresses the aforementioned concerns as follows: 1) shared measurement and management of network information for reducing probing overhead, 2) an in-network data processing and dynamic tree management mechanism for maximizing traffic efficiency and delivery performance, and 3) a two-tier hierarchy to minimize the state complexity and support scalable wide-area data delivery.

In order to save the probing cost, SATI creates an abstract view of the physical network, called the Virtual Network Map (VNM). The VNM is composed of participating nodes’ coordinates, which are obtained by taking inter-node latencies and embedding them in a relative coordinate space[12]. The VNM enables cost-effective latency prediction between nodes that have never communicated. VNM information is efficiently updated by our selective gossip algorithm. Further, SATI shares the VNM with participating SATI nodes, which eliminates the need for data delivery schemes to perform network measurements.

SATI also significantly reduces network resource usage by downsizing the amount of in-transit data streams in two ways. First, SATI takes consumers’ interests in the form of simple query operators such as filters or aggregators and places them on data forwarding nodes to eliminate data uninteresting to service consumers. We term such operators Stream Data Attenuators (SDA). Because users’ interests usually have a degree of spatiotemporal locality, SDAs can substantially reduce the bandwidth consumed by unwanted data traffic. Second, SATI reduces in-network traffic by dynamically replacing costly paths with more economical paths. The proposed cost function and path switching mechanism incrementally reconfigure the tree structure in three directions: decreasing the bandwidth consumption by aggregating the effect of SDAs, reducing the path latency, or both. Unlike the traditional approaches[7][9], the SATI data delivery scheme achieves globally efficient path replacements by using the VNM as an approximated global network map. Due to the huge probing and state maintenance overhead, it is virtually impossible for the existing data delivery schemes to collect and utilize global network knowledge.

Lastly, coupled with the VNM, SATI nodes are organized into a hierarchical structure for better scalability. This hierarchy allows SATI to confine state and communication overhead locally. Besides, it keeps the performance of data delivery schemes scalable even in the presence of a massive number of service consumers.
II. APPLICATION SCENARIO
As an exemplary application scenario, consider the Meso-scale Weather Forecast Service (MWFS). Each year, meso-scale weather (floods, tornadoes, hail, strong winds, lightning, and winter storms) causes hundreds of deaths, routinely disrupts transportation and commerce, and results in annual economic losses of more than $13B. Aggregating continuous data feeds from various weather data sources, such as radar images and wind speed, MWFS runs numerical weather prediction models for predicting and tracking these meso-scale weather events and assessing the possible damage. Since the raw prediction outputs are hard to understand, they are usually converted to a series of 2D or 3D models and transmitted to a large number of consumers. The range of possible consumers is very wide, from municipal organizations to weather-critical companies. Also, some people install weather prediction gadgets on their desktops. Consumers may specify an area of interest to selectively receive forecast information. The weather forecasts must be delivered in a timely manner because they are a crucial factor in operational decisions made by weather-critical companies such as insurance companies.
[Figure 1 (labels recovered from the diagram): Ground Radar SN, Infrastructure Surveillance, Satellites, Weather SN, MWPS, Fire Stations, Gov. Organizations, Companies, Individuals; areas of interest shown: AoI = Flushing, AoI = NY]
Fig. 1. The Meso-Scale Weather Forecast Service
III. PROBLEM STATEMENT
In this section, we describe the system and network model and formally state our solution objectives. We also outline the requirements that our solution should satisfy. Table I shows the notations used throughout the paper.
A. System and Network Model
1) System Model: In SATI, there are SDSes that generate data streams, a set of SATI nodes N = {n1, n2,…, n|N|}, and a large number of consumers. Each SDS sends its data stream to a designated SATI node, and the node disseminates the data to the consumers of the SDS. Each consumer submits a specific range of interest for a certain SDS and receives data from the closest SATI node. The interest of a consumer is described in the form of a simple query such as a filter or an aggregator. We term these query operators Stream Data Attenuators (SDA). Since we focus on data delivery among SATI nodes, we simply call the node that receives data from an SDS a source, and a node that receives data from the source and delivers it to one or more consumers the source’s consumer.
2) Network Model: The SATI nodes, N, compose an overlay network that can be modeled as a complete directed graph G = (V, E), where V = N is the set of vertices and E = V × V is the set of edges corresponding to the unicast paths between pairs of nodes. To build an efficient dissemination scheme, the nodes in V are organized into a data delivery tree whose root is a source. Each tree consists of a set of nodes V’ ⊆ V containing a source and the source’s consumers, and a set of edges E’ ⊆ E.
B. Problem Statement
Our aim is to minimize the usage of network resources without degrading the consumer-specific QoS. In this paper, we assume that the primary consumer-specific QoS metric is the end-to-end delay. This assumption is reasonable because most data stream services have soft real-time requirements. Each consumer specifies an upper bound on the delay as his QoS condition.

TABLE I
NOTATIONS USED IN THIS PAPER
Notation    Description
$n_i$    the node whose id is i
$e_{i,j}$    an on-tree edge connecting $n_i$ and $n_j$ ($n_j$ = $n_i$’s direct parent)
$L_i^R$    the set of on-tree edges on the direct path from $n_i$ to the root
$\lambda_i$    the sum of BW consumed for input streams at $n_i$’s direct parent
$C_i^R$    the sum of traffic on the direct path from $n_i$ to the root
${}^{-}\varphi^{R}_{e_{i,j}}$    the sum of traffic from $n_j$ to the root after removing $e_{i,j}$
${}^{+}\varphi^{R}_{e_{i,j}}$    the sum of traffic from $n_i$ to the root after adding $e_{i,j}$
$Q_i$    the set of SDA registered at $n_i$
$\theta_i$    the required traffic of sending output stream at $n_i$ after applying $Q_i$
${}^{-}\theta_j(Q'_i)$    the required traffic of sending output stream at node j after removing $Q_i$
${}^{+}\theta_j(Q'_i)$    the required traffic of sending output stream at node j after adding $Q_i$
$\delta(Q_i)$    the aggregated selectivity of $Q_i$
$\Delta^{\mp}(e_{i,j})$    the cost of the direct path from $n_j$ to the root after removing/adding $e_{i,j}$
As a metric of the usage of network resources, we propose the following cost function, which represents the amount of data traversing the data delivery paths.
$$\mathrm{cost}(T) = \sum_{e_{i,j} \in L} BW(e_{i,j}) \cdot Lat(e_{i,j}) \qquad (1)$$
where L is the set of edges used by a data delivery tree T, BW(ei,j) is the bandwidth consumed in the edge for delivering the stream and Lat(ei,j) is the end-to-end delay from node i to node j.
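As a minimal sketch of Eq. (1), the snippet below sums bandwidth times latency over the on-tree edges. The edge keys, the dictionary layout, and the sample numbers are illustrative assumptions, not part of SATI.

```python
from typing import Dict, Tuple

# Assumption: an edge is keyed by a (child, parent) pair of node ids.
Edge = Tuple[str, str]


def tree_cost(bandwidth: Dict[Edge, float], latency: Dict[Edge, float]) -> float:
    """Sum BW(e) * Lat(e) over all on-tree edges, as in Eq. (1)."""
    return sum(bw * latency[edge] for edge, bw in bandwidth.items())


# Hypothetical three-edge tree (bandwidth and latency units are arbitrary).
bw = {("A", "E"): 2.0, ("B", "E"): 1.0, ("E", "R"): 3.0}
lat = {("A", "E"): 10.0, ("B", "E"): 20.0, ("E", "R"): 50.0}
print(tree_cost(bw, lat))
```

In this toy tree the long edge from E to the root dominates the cost, which is the kind of edge the path switching described later would try to relieve.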
This cost function models the amount of data streaming over a link as the product of the bandwidth consumed and the link latency. The rationale is that the longer the latency, the more likely a data stream is to traverse many routers and physical links. Given the cost function, our objective can be formally stated as follows: given a source s, a set of consumers D, and a set of QoS conditions C, find a tree T spanning D ∪ {s} such that cost(T) is minimized while C is satisfied.

Our problem is basically in line with the delay-bounded minimum cost tree problem[17], which is known to be NP-complete. In the context of SATI, however, the problem is further complicated by SDAs. SATI places each consumer’s SDAs on the nodes on the path from a source to the consumer and uses them to filter out unwanted data. Thus, the bandwidth of a node’s outgoing link depends heavily on the aggregate selectivity of all registered SDAs. Figure 2 shows how the migration of SDAs caused by a change of tree path affects the bandwidth consumption. Node R is the root of the tree, and nodes A, B, C, and D submit SDAs with different value ranges, which are noted next to each node. In Figure 2-(a), nodes E and R receive and register two SDAs, from nodes A and B. As a result, node E has to receive all data whose values fall in [10-15] or [20-25]. However, if node B switches its parent from node E to node F, the SDA from node B is deregistered from node E and the root and migrated to node F, so that the total data traffic from the root to node E can be reduced to half. This means that deregistering the SDA of node B affects not only the cost of the link between B and E but also the cost of the link between E and the root. The same analysis applies when registering an SDA. Taking this effect into consideration, we extend our cost function as follows:
$$\mathrm{cost}(T) = \sum_{e_{i,j} \in L} \theta_i \cdot Lat(e_{i,j}), \quad \text{where } \theta_i = \delta(Q_i) \cdot \lambda_i \qquad (2)$$
Also, note that moving node B to a more cost-effective location requires the network information of all foreign sub-tree paths and the corresponding SDA information.
[Figure 2: two tree diagrams rooted at R over nodes A, B, C, D, E, and F, with nodes and links annotated by SDA value ranges such as [10-15], [20-25], [10-16], [10-14], [12-16], and [10-16, 20-25]. Panels: (a) Before the path change; (b) After the path change.]
Fig. 2. The effect of SDA in the bandwidth consumption
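The halving effect described for node E can be reproduced with a small sketch of θ = δ(Q) · λ. It assumes, purely for illustration, that SDAs are closed value ranges, that data values are uniformly distributed over a known domain, and that the aggregated selectivity δ(Q) is the fraction of the domain covered by the union of the ranges; the paper does not fix these details.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def union_length(intervals: List[Interval]) -> float:
    """Total length covered by the intervals, counting overlaps once."""
    total = 0.0
    cur_lo, cur_hi = None, None
    for lo, hi in sorted(intervals):
        if cur_hi is None or lo > cur_hi:
            if cur_hi is not None:
                total += cur_hi - cur_lo  # close the previous merged run
            cur_lo, cur_hi = lo, hi
        else:
            cur_hi = max(cur_hi, hi)  # extend the current merged run
    if cur_hi is not None:
        total += cur_hi - cur_lo
    return total


def output_traffic(input_rate: float, domain: Interval, sdas: List[Interval]) -> float:
    """theta = delta(Q) * lambda under the uniform-value assumption."""
    selectivity = union_length(sdas) / (domain[1] - domain[0])
    return selectivity * input_rate


# Node E before and after node B moves away (hypothetical rate and domain).
before = output_traffic(80.0, (0.0, 40.0), [(10, 15), (20, 25)])
after = output_traffic(80.0, (0.0, 40.0), [(10, 15)])
print(before, after)
```

Overlapping ranges such as [10-16] and [12-16] are merged before measuring, so registering a range that is already covered adds no traffic.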
A desirable solution should satisfy the following requirements, which make it practical.
1) Scalability: The solution must scale to a large number of consumers as well as a large number of concurrent services. In particular, since the consumers are assumed to be globally distributed, the performance of data delivery should not be severely degraded for distant consumers. When resources are insufficient to provide complete and timely stream delivery to every consumer, the solution is required to degrade gracefully.
2) Adaptability: The performance of the SATI data delivery tree should not be compromised by changes in network conditions such as latency and stream data rate. To ensure persistent data delivery, the tree must adapt by making incremental changes to the existing tree structure.
3) Light-weight state and maintenance overhead: The state complexity of our solution should be low even in the presence of a large number of participating nodes. Also, the solution should not aggressively send probing packets among overlay nodes.
C. The Solution Space and Its Construction Cost
Finding the optimal tree structure, and hence the locations of the corresponding SDAs, requires complete global network and SDA information. Ideally, a node would have complete knowledge of the current network conditions and all SDA information, obtained by measuring and collecting the latencies between all node pairs. However, the maintenance overhead of this approach becomes unacceptable in the presence of a large number of nodes. The existing decentralized approaches[7][9] collect only localized information, such as the latency from a node’s children or its direct ancestors. This localized information often eliminates a large number of opportunities to make a better choice and yields only locally optimal solutions. This motivates us to propose near-optimal but practical heuristics with a reasonable maintenance cost.
IV. SATI OVERVIEW
The design of SATI is motivated by the trade-off between the completeness of information about the global network topology and the probing cost needed to collect this information. To address this problem, SATI places a Virtual Network Layer (VNL) between the Physical Network Layer and the Data Delivery Layer (Fig. 3). The VNL exploits two mechanisms: 1) the Virtual Network Map (VNM), which provides global network information by transforming the physical network into an abstract form, and 2) the Virtual Network Structure (VNS), which makes the system more scalable by organizing SATI nodes into a two-tier hierarchy. With the support of the VNL, the Data Delivery Layer can construct scalable and effective data delivery paths and maintain them more easily.
[Figure 3: three stacked layers: Data Delivery Layer, Virtual Network Layer, Physical Network Layer]
Fig. 3. Three layers of SATI
In the VNL, the physical network is transformed into a VNM by assigning each node a coordinate in a Euclidean space. In the map, the distance between two nodes approximates the network latency between them. This mapping is done with the Vivaldi algorithm[12]. To compute its own coordinate, each node measures the latency to and from a small number of nodes. Then, it builds its VNM by gathering the computed coordinate values of other nodes.

The VNM-based approach facilitates the construction and maintenance of data delivery paths in several ways. First, SATI can select globally optimized data delivery paths because it has an approximated global view of the entire network. Second, sharing node coordinates to construct and maintain paths eliminates redundant network probing among trees. Once a node computes its coordinate, all trees containing the node can use it. Also, SATI provides the VNM at low cost. There are two kinds of costs in using the VNM: the network probing cost of updating nodes’ coordinates and the communication cost of exchanging coordinate information between nodes. The probing cost is small because a coordinate is updated with only a small constant number of measurements with other nodes. We also minimize the communication cost by adopting an effective information sharing mechanism called selective gossip, which is described in Section V.

The VNS is the mechanism that organizes SATI nodes in the coordinate space of the VNM into a two-tier structure. The VNS groups nodes into clusters based on proximity in the VNM, and the head nodes of the clusters in turn form the upper-level cluster. Using the VNS, the VNM and data delivery trees are maintained as a two-tier structure. This makes SATI scalable by significantly reducing the maintenance cost of the VNM and delivery trees.

On top of the VNL, the Data Delivery Layer constructs and maintains a data delivery tree for each data source. This procedure is conducted in two steps. First, a data delivery tree is initially constructed to meet consumers’ QoS conditions. Then, the data delivery tree dynamically replaces existing paths with more cost-effective paths. This approach makes data delivery trees adaptive to dynamic network changes and membership changes. In the rest of this section, we describe the detailed structure of SATI and of data delivery trees, as well as the state information maintained in each node.
A. SATI Structure
SATI nodes are clustered based on proximity in the VNM. Based on the clustering result, the VNM is maintained in two tiers: a global VNM and local VNMs. They are shown in Fig. 4-(b). Each VNM consists of the coordinate information of nodes. The coordinate information of a node is composed of the tuple . Coordinate error indicates the confidence of the coordinate and is used in coordinate updates.
Fig. 4. Logical structure of SATI nodes adapting VNM and VNS
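To make the coordinate mechanism concrete, the following is a simplified sketch of latency prediction and of one spring-style update in the spirit of Vivaldi[12]. The fixed step size and the function names are assumptions for illustration; the actual Vivaldi algorithm adapts its step using per-node error estimates, which this sketch omits.

```python
import math
from typing import List

Coord = List[float]


def predicted_latency(a: Coord, b: Coord) -> float:
    """Euclidean distance in the map, used as the latency estimate."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def vivaldi_step(own: Coord, peer: Coord, measured: float, step: float = 0.25) -> Coord:
    """Move `own` along the line through `peer` to shrink the prediction error."""
    dist = predicted_latency(own, peer)
    if dist == 0.0:
        # Coincident points have no direction; pick an arbitrary unit vector.
        direction = [1.0] + [0.0] * (len(own) - 1)
    else:
        direction = [(o - p) / dist for o, p in zip(own, peer)]
    error = measured - dist  # positive means the map places the nodes too close
    return [o + step * error * d for o, d in zip(own, direction)]


# Hypothetical measurement: the map predicts 5 ms but 10 ms was observed.
own, peer = [0.0, 0.0], [3.0, 4.0]
moved = vivaldi_step(own, peer, measured=10.0)
print(predicted_latency(moved, peer))
```

Each update needs only one measurement against one peer, which is why a small constant number of measurements per round suffices to keep a coordinate current.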
The global VNM and the local VNMs consist of the coordinate information of head nodes and of cluster member nodes, respectively. Head nodes maintain the global VNM and their local VNM, and the other nodes maintain only their local VNM. Deciding the cluster size is crucial because it directly affects the sizes of the global VNM and the local VNMs. For example, if the clusters are too small, head nodes have to maintain a large global VNM. Conversely, if they are too large, cluster member nodes have to maintain a large local VNM. Thus, in order to balance the sizes of the local and global VNMs, SATI chooses the desirable cluster size as $\sqrt{N}$ (N: the number of nodes). If the cluster size is smaller than $k_1$ ($< \sqrt{N}$) or larger than $k_2$ ($> \sqrt{N}$), the corresponding clusters are merged or split. We set $k_2$ to $3k_1$ in order to prevent clusters from being merged again soon after they split. From these two conditions, 1) $k_2 = 3k_1$ and 2) the average cluster size $= \sqrt{N}$, we set $k_1$ and $k_2$ to $\sqrt{N}/2$ and $3\sqrt{N}/2$, respectively. Then, the maximum size of a local VNM becomes $3\sqrt{N}/2$ and the maximum size of the global VNM becomes $2\sqrt{N}$.

Fig. 5 shows the two-tier structure of the data delivery tree. When the first consumer in a cluster requests a service subscription, the head node of the cluster elects a node as the Service ProXy (SPX) of the service. The SPX is responsible for data dissemination to consumers in the same cluster. Therefore, the source only needs to disseminate its data to its SPXes. Based on the coordinate information of receiver nodes, sources and SPXes centrally compute optimal paths for data delivery trees. Thus, a source only needs to maintain the coordinate information of its SPXes, while an SPX uses the coordinate information of its local VNM.
Fig. 5. Two tier structure of data delivery tree
Because of the two-tier structure, the tree construction and maintenance performed by sources and SPXes remain scalable even with a massive number of consumers. Also, compared with the traditional approach[5][6] of using head nodes as the upper-layer data delivery paths, the use of SPXes improves load distribution among participating nodes.
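The cluster-size bounds in this subsection follow from two conditions, k2 = 3·k1 and (k1 + k2)/2 = √N, which give k1 = √N/2 and k2 = 3√N/2. The sketch below works through that arithmetic; the function names and the "merge"/"split"/"keep" labels are illustrative assumptions.

```python
import math
from typing import Tuple


def cluster_bounds(num_nodes: int) -> Tuple[float, float]:
    """Return (k1, k2) with k2 = 3 * k1 and (k1 + k2) / 2 = sqrt(N)."""
    root = math.sqrt(num_nodes)
    return root / 2, 3 * root / 2


def size_action(cluster_size: int, num_nodes: int) -> str:
    """Decide what a head node would do for a cluster of the given size."""
    k1, k2 = cluster_bounds(num_nodes)
    if cluster_size < k1:
        return "merge"
    if cluster_size > k2:
        return "split"
    return "keep"


# With N = 10,000 nodes: k1 = 50, k2 = 150, and at most N / k1 = 200 clusters,
# so the global VNM holds at most 2 * sqrt(N) head-node entries.
k1, k2 = cluster_bounds(10_000)
print(k1, k2, 10_000 / k1)
```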
V. SATI VIRTUAL NETWORK LAYER
The SATI protocol in the Virtual Network Layer has three main components: (1) VNM construction and maintenance, (2) cluster management, and (3) node join and leave. In this section, we describe each protocol in detail.
A. VNM Construction and Maintenance
A VNM is constructed in a two-step procedure. First, nodes compute their coordinates using the Vivaldi algorithm[12] through a small constant number of measurements with other nodes. Then, nodes construct the VNM by sharing their coordinate information using a selective gossip algorithm. By conducting this procedure iteratively, nodes keep their VNMs up to date.

Usually, a node performs network measurements with the nodes to or from which it sends or receives data, by piggybacking on in-transit data. This method greatly reduces the measurement overhead and works well in SATI, where data streams persistently traverse numerous nodes. If a node is not actively participating in data delivery, measurements are done with randomly chosen nodes in its VNM. In this case, the measurement overhead is still not significant because the number of measurements in one turn is a small constant. When a node updates its coordinate, measuring only against nearby nodes can result in an inaccurate coordinate relative to distant nodes. Therefore, a non-head node occasionally has to perform measurements with 3-5 head nodes of other clusters, which it learns by receiving the global VNM from its head node.

Since most SATI mechanisms are based on the VNM, it is important that the VNM holds the latest coordinate information of other nodes. Thus, when coordinate information is updated, it should be promptly propagated to the other nodes that maintain the same VNM. To address this problem, we propose a fast information exchange algorithm named selective gossip. In this algorithm, each node periodically exchanges its VNM with several other nodes and then updates its VNM with the other nodes’ VNMs by comparing the recent update time attribute. Let X0,…,Xn-1 be the nodes in a VNM, sorted by node ID, and let c be some constant. Then, node Xi exchanges its VNM with the following set of nodes E:
$$E = \bigcup_{0 \le k \le c-1} X_{\lceil i + n^{k/c} \rceil}, \quad \text{where } X_j = X_{j \bmod n} \qquad (3)$$
This selection algorithm confines the communication cost to c at a time because the size of the set E is c, and it guarantees fast propagation by the following theorem.
Theorem 1: Using the selective gossip algorithm, Xi’s coordinate information is propagated to all other nodes belonging to [X0,…,Xn-1] in $O(n^{1/c} \log(\log n))$ time.
Proof: Let T(i, j) (i < j) be the propagation time of Xi’s information to all nodes in [Xi,…,Xj]. Then, the propagation time of Xi’s information to all nodes in [X0,…,Xn-1] can be represented as T(i, i+n) because $X_j = X_{j \bmod n}$. The following lemmas are trivial to prove.
Lemma 1. $T(i, j) \le T(m, n)$, if $j - i < n - m$
Lemma 2. $T(i, j) = T(m, n)$, if $j - i = n - m$
Lemma 3. $T(0, H) < H$, if H is some constant
When one exchange is finished, Xi’s information has been propagated to $X_{\lceil i + n^{1/c} \rceil}, \ldots, X_{\lceil i + n^{(c-1)/c} \rceil}$. Thus, the following inequality holds.
$$T(i, j) \le 1 + \max\left(T(i, i + \lfloor n^{1/c} \rfloor),\; T(i + \lceil n^{1/c} \rceil, i + \lfloor n^{2/c} \rfloor),\; \ldots,\; T(i + \lceil n^{(c-1)/c} \rceil, n)\right)$$
Let m be 2n. Then, T(i, i+n) can be computed as follows.
$$\begin{aligned}
T(i, i+n) &= T(0, n) && (\because \text{Lemma 2}) \\
&\le T(0, m) && (\because \text{Lemma 1}) \\
&\le 1 + \max\left(T(0, \lfloor n^{1/c} \rfloor), \ldots, T(\lceil n^{(c-1)/c} \rceil, m)\right) \\
&\le 1 + T(\lceil n^{(c-1)/c} \rceil, m) \\
&\le \left\lfloor \frac{m}{\lceil n^{(c-1)/c} \rceil} \right\rfloor + T\left(\left\lfloor \frac{m}{\lceil n^{(c-1)/c} \rceil} \right\rfloor \times \lceil n^{(c-1)/c} \rceil,\; m\right) \\
&\le 2 \lceil n^{1/c} \rceil + T(0, \lceil n^{(c-1)/c} \rceil) \\
&\le 2k\, n^{1/c} + T\left(0, n^{((c-1)/c)^{k}}\right) \\
&\le 2 \log_{c/(c-1)}(\log_H n) \times n^{1/c} + H
\end{aligned}$$
($\because n^{((c-1)/c)^{k}} < H$ if $k > 2 \log_{c/(c-1)}(\log_H n)$, and Lemma 3)
$\therefore T(i, i+n) = O(n^{1/c} \log(\log n))$
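The peer selection of Eq. (3) can be sketched as follows. `gossip_peers` follows the equation directly, while `rounds_to_spread` is a deliberately simplified push-only simulation (an assumption made here; the protocol described above exchanges VNMs in both directions) that counts rounds until every node has heard an update.

```python
import math
from typing import List, Set


def gossip_peers(i: int, n: int, c: int) -> List[int]:
    """Indices of the c exchange partners of node X_i among n ID-sorted nodes."""
    return [math.ceil(i + n ** (k / c)) % n for k in range(c)]


def rounds_to_spread(origin: int, n: int, c: int) -> int:
    """Count push-only rounds until all n nodes hold the origin's update."""
    informed: Set[int] = {origin}
    rounds = 0
    while len(informed) < n:
        # Every informed node forwards the update to its c selected peers.
        informed |= {p for node in informed for p in gossip_peers(node, n, c)}
        rounds += 1
    return rounds


# With n = 16 and c = 2 the offsets are 16**0 = 1 and 16**0.5 = 4.
print(gossip_peers(14, 16, 2))
print(rounds_to_spread(0, 16, 2))
```

Because the k = 0 term always selects the next node in ID order, the simulation is guaranteed to finish within n - 1 rounds; the larger offsets are what shorten propagation well below that. Floating-point powers may round slightly for values of n that are not exact powers, so a production version would likely compute the offsets with integer arithmetic.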
B. Cluster Management
To keep the membership of its cluster up to date, a head node receives heartbeat messages from all other cluster member nodes. It is also responsible for regulating the cluster size and managing the cluster membership as nodes join and leave.
1) Head selection: To minimize the communication overhead, a head node should be placed at the center of its cluster. For this reason, the head node of a cluster is periodically reelected by the current head node. Let X be a cluster consisting of X0,…,Xn-1, let Xh be the current head node, and let Xi - Xj denote the distance between the coordinates of Xi and Xj. Then, the next head node Xh’ is selected as in (4).
$$X_{h'} = \begin{cases} X_p, & \text{if } \sum_{k=1}^{n} (X_p - X_k)^2 + \sigma < \sum_{k=1}^{n} (X_h - X_k)^2 \\ X_h, & \text{otherwise} \end{cases} \qquad (4)$$
where $X_p = \arg\min_{X_i \in X} \left\{ \sum_{k=1}^{n} (X_i - X_k)^2 \right\}$ and σ is some constant. In this algorithm, σ is used as a threshold to prevent frequent changes of head nodes.
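A minimal sketch of the reelection rule in (4) is given below. The dictionary of node coordinates and the function names are illustrative assumptions; the rule itself (switch only when the best candidate beats the current head by more than σ) is taken from the text.

```python
from typing import Dict, List

Coord = List[float]


def sum_sq_dist(node: str, coords: Dict[str, Coord]) -> float:
    """Sum of squared coordinate distances from `node` to every cluster member."""
    own = coords[node]
    return sum(
        sum((a - b) ** 2 for a, b in zip(own, other)) for other in coords.values()
    )


def next_head(current: str, coords: Dict[str, Coord], sigma: float) -> str:
    """Apply (4): replace the head only if the improvement exceeds sigma."""
    candidate = min(coords, key=lambda node: sum_sq_dist(node, coords))
    if sum_sq_dist(candidate, coords) + sigma < sum_sq_dist(current, coords):
        return candidate
    return current


# Hypothetical one-dimensional cluster laid out along a line.
cluster = {"a": [0.0], "b": [1.0], "c": [2.0], "d": [10.0]}
print(next_head("a", cluster, sigma=10.0))
print(next_head("b", cluster, sigma=20.0))
```

In the second call the candidate is only slightly more central than the current head, so the threshold keeps the head unchanged, which is the hysteresis that σ is meant to provide.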
2) Cluster Refinement: In SATI, the cluster diameter is important because data delivery trees are constructed in two tiers based on the assumption that nodes in one cluster are close to each other. However, the cluster diameter may increase gradually because nodes’ coordinates can spread out through repeated updates. Accordingly, each node periodically selects the closest cluster X’ using the global VNM received from its head node. If the head node of X’ is close enough, the node joins X’.
3) Cluster Split and Merge: As described in Section IV.A, the cluster size is limited to [k1, k2]. To maintain the cluster size, a head node periodically checks the size of its cluster and, if the size is not in [k1, k2], splits or merges the cluster. Suppose that a cluster X consists of nodes X0,…,Xn-1 and the head node of X is Xh. If the size of X exceeds k2, Xh splits the cluster into two clusters whose sizes are larger than k1. The cluster split is performed as follows:
1. Partition [X0,…, Xn-1] into two sets C1 and C2 whose sizes are larger than k1, using its local VNM.
2. In each of C1 and C2, select the node whose sum of distances to the other nodes is the smallest as the new head node, Xh1 and Xh2 respectively.
3. Inform Xh1 and Xh2 that they have become the head nodes of the new clusters composed of the nodes in C1 and C2, respectively.
In step 1, nodes are partitioned so as to minimize the area of the two resulting sets, using the linear split algorithm of the R-tree[18]. Minimizing the area of clusters is important because the two-tier data delivery tree works well only when nodes in the same cluster are close enough. With the help of the VNM, the cluster head node can use a more elaborate clustering algorithm, which yields better clustering results.
If the size of cluster X falls below k1, Xh merges X with one of the closest other clusters. Xh conducts the merging procedure as follows.
1. Select a cluster to merge with. First, Xh selects the m closest clusters using its global VNM, and then selects the smallest of them as the cluster to merge with. Let the selected cluster be Y.
2. Receive Y’s local VNM from the head node of Y.
3. Select the node whose sum of distances to the other nodes in the two local VNMs of X and Y is the smallest and elect it as the new head node Xh’.
4. Inform Xh’ that it has become the head node of the new cluster composed of the nodes in X ∪ Y.
C. Node Join and Leave
A joining node is assumed to know several well-known portal nodes that hold the global VNM. A joining node Xi receives the global VNM from a portal node and computes its coordinate through measurements with 10 to 20 head nodes. With the computed coordinate and the global VNM, Xi selects and joins the closest cluster. When Xi joins the cluster, the head node gives its local VNM to Xi and updates its local VNM with Xi’s coordinate information. Then, Xi’s coordinate information is propagated from Xi or the head node to the other cluster member nodes through selective gossip.
A similar procedure is performed when a node leaves or fails. When a node Xj leaves or fails, the head node learns of this from Xj’s notification or from the absence of Xj’s heartbeat message. Then, the head node invalidates Xj’s coordinate information in its local VNM by changing the Validity attribute. The other member nodes recognize the leave or failure of Xj when the head node’s local VNM is propagated to them through the SATI selective gossip protocol.
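Step 1 of the merging procedure in Section V.B can be sketched as below: take the m closest clusters by head-node coordinate, then pick the smallest one. The data layout (head coordinates and cluster sizes keyed by a cluster name) is an assumption made for illustration, since the text does not specify how this information is stored.

```python
import math
from typing import Dict, List

Coord = List[float]


def distance(a: Coord, b: Coord) -> float:
    """Euclidean distance between two VNM coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_merge_partner(
    own: str, heads: Dict[str, Coord], sizes: Dict[str, int], m: int
) -> str:
    """Among the m clusters whose heads are closest to `own`, return the smallest."""
    others = [name for name in heads if name != own]
    closest = sorted(others, key=lambda name: distance(heads[own], heads[name]))[:m]
    return min(closest, key=lambda name: sizes[name])


# Hypothetical global VNM with four cluster heads on a line.
heads = {"X": [0.0, 0.0], "A": [1.0, 0.0], "B": [2.0, 0.0], "C": [9.0, 0.0]}
sizes = {"A": 30, "B": 12, "C": 5}
print(pick_merge_partner("X", heads, sizes, m=2))
```

Restricting the choice to the m closest clusters keeps the merged cluster compact, while preferring the smallest one reduces the chance that the merged cluster immediately exceeds k2.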
VI. SATI DATA DELIVERY LAYER (DDL)
The tree roots, sources or SPXes, construct and maintain trees based on the set of consumer-defined QoS conditions and SDA information. They receive this information when a consumer requests his subscription. Tree construction and maintenance consists of the two phases; initial tree set-up with consumer-defined QoS conditions and Dynamic Path Switching (DPS). The overall description of the two phases is shown in Figure 6. In InitialQoSBoundedTree(), the initial tree is built up as the degree-constrained minimum latency tree[13].
Then, it searches for the consumer nodes whose latency conditions are not satisfied and promotes these nodes toward their tree root until their latency conditions are met. Latency conditions imposed by service consumers might be so restrictive that even the minimum latency tree cannot satisfy the latency bounds. In that case, an arbitration process must be invoked to relax the latency requirements, which is beyond the scope of this paper. Throughout this paper, we assume that the resulting initial tree meets all QoS conditions. The second phase periodically finds the costly path, Ph, and replaces it with a more cost-effective one.
INPUT
    N = a set of tree member nodes, s = source node
    D = a set of destination nodes
    QoS = a set of QoS bounds for destination nodes
    F = a set of SDAs for on-tree nodes
OUTPUT
    QoS guaranteed multicast tree spanning D ∪ {s}
PROCEDURE DataDeliveryTree(N, s, D, QoS, F) {
    Tj ← InitialQoSBoundedTree(N, s, D, QoS);
    while (Ph = NULL) {
        Ph ← the highest cost edge in Tj among all unmarked tree edges;
        Remove Ph from tree Tj, getting two sub-trees T1 and T2;
        Pl ← QoSBoundedLeastCostPath(N, s, QoS, F, T1, T2);
        if (Ph = Pl) { mark Ph; }
        else { Tnew ← Pl ∪ T1 ∪ T2; return; }
    }
}
Fig. 6. General description of tree construction and maintenance algorithm
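The promotion step of the first phase can be illustrated with a short sketch. It is a simplification under stated assumptions rather than the authors' procedure: the tree is a hypothetical `parent` map from each non-root node to its parent, `lat(u, v)` is an assumed pairwise latency estimate (for instance, taken from VNM), a violating node is reattached to its grandparent one level at a time, and the degree constraint is ignored.

```python
def path_latency(node, parent, lat, root):
    """Sum the latencies of the tree edges from `node` up to `root`."""
    total = 0.0
    while node != root:
        total += lat(node, parent[node])
        node = parent[node]
    return total


def promote_violators(parent, lat, bounds, root):
    """Move consumers whose latency bound is violated closer to the root.

    parent: dict mapping each non-root node to its current parent
    bounds: dict mapping consumer nodes to their latency bounds
    Returns a new parent map; the input map is left untouched.
    """
    new_parent = dict(parent)
    for node, bound in bounds.items():
        # Reattach to the grandparent until the bound holds or the node
        # hangs directly off the root (the arbitration case is not handled).
        while (path_latency(node, new_parent, lat, root) > bound
               and new_parent[node] != root):
            new_parent[node] = new_parent[new_parent[node]]
    return new_parent
```

Because a node is only ever reattached to one of its own ancestors, the sketch cannot introduce a cycle.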
A. The Computation of Path Cost
A change of path entails the migration of SDAs: deregistering SDAs on the old path and registering them on the new path. Accordingly, the path cost must take this effect into account as follows:
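$$C_i^R = \sum_{e_{\alpha,\beta} \in L_i^R} \theta_\alpha \cdot Lat(e_{\alpha,\beta}), \quad \text{where } \theta_\alpha = \delta(Q_\alpha) \cdot \lambda_\alpha \tag{5}$$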
This cost function calculates the network traffic traversing the direct path from ni to the tree root. Each node periodically calculates θi. The tree roots collect this information from their on-tree member nodes and use it to compute the path cost. The next two functions, in turn, calculate the total network traffic on the path after removing ei,j from the SDA sets of all ancestor nodes, or after adding ei,j to all ancestor nodes of the new path.
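$${}^{-}\varphi^{R}_{e_{i,j}} = \sum_{e_{\alpha,\beta} \in L_j^R} \theta_\beta(Q_i^{-\prime}) \cdot Lat(e_{\alpha,\beta}), \tag{6}$$
$${}^{+}\varphi^{R}_{e_{i,j}} = \sum_{e_{\alpha,\beta} \in L_i^R} \theta_\beta(Q_i^{+\prime}) \cdot Lat(e_{\alpha,\beta}), \tag{7}$$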
Thus, we can compute the cost reduction and the cost increment of an edge as follows:
$$\Delta^{-}(e_{i,j}) = C_i^R - {}^{-}\varphi^{R}_{e_{i,j}} \tag{8}$$
$$\Delta^{+}(e_{i,k}) = {}^{+}\varphi^{R}_{e_{i,k}} - C_k^R \tag{9}$$
where node j is node i's direct parent and node k is node i's newly chosen parent.
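To make Equations (5), (8) and (9) concrete, the following sketch evaluates them on a parent-pointer tree. Several points are assumptions and not taken from the paper: each θ value is attached to the child endpoint of an edge, the post-migration θ values (`theta_removed`, `theta_added`) are supplied by the caller because the SDA update rule is not modeled, the removal-side cost is summed over node j's path to the root and the addition-side cost over node i's new path, and all function names are hypothetical.

```python
def path_cost(node, parent, theta, lat, root):
    """Equation (5): sum of theta * Lat over the edges from `node` to `root`."""
    cost = 0.0
    while node != root:
        cost += theta[node] * lat(node, parent[node])
        node = parent[node]
    return cost


def cost_reduction(i, parent, theta, theta_removed, lat, root):
    """Equation (8): current cost of i's path minus the cost of parent j's
    path once i's SDAs are deregistered (theta_removed)."""
    j = parent[i]
    return (path_cost(i, parent, theta, lat, root)
            - path_cost(j, parent, theta_removed, lat, root))


def cost_increment(i, k, parent, theta, theta_added, lat, root):
    """Equation (9): cost of i's new path through k once i's SDAs are
    registered (theta_added) minus the current cost of k's path."""
    new_parent = dict(parent)
    new_parent[i] = k  # k must not be a descendant of i
    return (path_cost(i, new_parent, theta_added, lat, root)
            - path_cost(k, parent, theta, lat, root))
```

A switch from parent j to parent k reduces traffic in this model when `cost_reduction` exceeds `cost_increment`.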
B. Dynamic Path Switching
We propose two heuristics that differ in how DPS is performed.
1) Optimal switching: This approach calculates the cost differentials of all node pairs, $\Delta^{-}(e_{i,j}) - \Delta^{+}(e_{i,k})$, and chooses the pair whose cost differential is the largest. This approach can build the lowest-cost tree, but its computational complexity is $O(n^2 \log n)$.
2) Path-based switching: This approach searches for the victim edge whose contribution to the cost reduction is the highest, and then finds the replacement edge for the removed edge. This heuristic requires $O(n \log n)$ operations, but its traffic reduction is smaller than that of optimal switching.
Based on the path cost function explained above, path-based switching is performed in two steps, as shown in Table II.
TABLE II
STEP-WISE DESCRIPTION OF PATH-BASED SWITCHING
1st step: Select and remove the victim edge
    For all tree nodes, calculate the cost reduction of the path. Select the edge whose cost reduction effect is the highest.
    Select $e_{i,j}$ from $\max\{\Delta^{-}(e_{i,j})\}$
2nd step: Select and add the replacement edge
    For all tree nodes, calculate the cost increment of the new path starting from the victim node. The edge with the smallest cost becomes the replacement path.
    Select $e_{i,k}$ from $\min\{\Delta^{+}(e_{i,k})\}$
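The two steps of path-based switching can be sketched as one switching round. The sketch rests on assumptions that are not in the paper: the cost reductions are precomputed in a hypothetical `reduction` map, `increment(i, k)` is a caller-supplied function for the cost increment, candidate parents exclude the victim's own subtree to avoid cycles, QoS feasibility of the new path is not checked, and the switch is applied only when the net gain is positive.

```python
def subtree(node, parent):
    """Return `node` together with all of its descendants."""
    members = {node}
    grew = True
    while grew:
        grew = False
        for child, par in parent.items():
            if par in members and child not in members:
                members.add(child)
                grew = True
    return members


def path_based_switch(parent, root, reduction, increment):
    """Run one round of path-based switching and return the new parent map.

    parent:    dict mapping each non-root node to its parent
    reduction: dict mapping node i to the cost reduction of its parent edge
    increment: function (i, k) -> cost increment of attaching i under k
    """
    # 1st step: the victim edge is the one with the largest cost reduction.
    victim = max(parent, key=lambda n: reduction[n])
    # 2nd step: the replacement edge is the one with the smallest increment.
    blocked = subtree(victim, parent) | {parent[victim]}
    options = [k for k in list(parent) + [root] if k not in blocked]
    if not options:
        return dict(parent)
    best = min(options, key=lambda k: increment(victim, k))
    if reduction[victim] - increment(victim, best) <= 0:
        return dict(parent)  # no net traffic saving, keep the tree as is
    switched = dict(parent)
    switched[victim] = best
    return switched
```

Optimal switching would instead evaluate the differential for every (victim, replacement) pair before committing to a switch, which is where its higher complexity comes from.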
C. Advantages of Our Approach
At the core of all DDT operations lie VNS and VNM. VNM enables the centralized search for optimal paths by providing the latency information of all node pairs. In addition, the dynamic switching scheme keeps SATI DDTs traffic-efficient under time-varying network conditions. Based on VNS, the two-tiered tree structure enables scalable data dissemination to a large number of widely distributed consumers. VNS also bounds the amount of state information required for tree maintenance. Since the cluster size is between N 2 and 3 N 2 , a tree root has to maintain coordinate information for at most 2 N tree members, which is fairly reasonable in large-scale systems.
VII. EVALUATION
We evaluate SATI by testing an implementation of the SATI protocol within a simulated environment. The primary purpose of our experiments is to validate the performance of SATI with respect to scalability and the efficient usage of network resources.
A. Experimental Setup
We simulated an overlay network of SATI nodes over a network topology generated with GT-ITM [11] using the transit-stub topology model. By default, we use 10,000 nodes and 10 SDSes. We set the number of consumers equal to the number of nodes, since we assume that each consumer joins SATI with his own machine. Due to the lack of real workloads and benchmarking environments, we generated synthetic workloads. Each consumer decides its target service, QoS conditions and SDAs. An SDA takes the form of a range predicate, which is determined by its leftmost value and the predicate range. The predicate range is statically set to 15% of the total range of data. We take the delivery QoS to be the maximum tolerable network stretch. Also, we simplify data generation by periodically producing integer values between 0 and 1000 with different distributions.
[Figure 7: seven plots, panels (a) to (g). Axis labels: Stretch, Cumulative Distribution of Stretch (%), Degree, Cumulative Distribution of Degree (%), QoS Condition (Stretch), Avg. Stretch, Relative Traffic, Number of Path Switching, Dimension of Coordinate Space, Number of Nodes. Legend entries: (no filtering, no path switching), (no filtering, path switching), (filtering, no path switching), (filtering, path switching); Initial, Link-based, Path-based, MST, Optimal.]
Fig. 7. Experiment results: (a) Stretch distribution. (b) Tree degree distribution. (c) Consumer-specific QoS condition effect. (d) Filtering and path switching effect. (e) Path switching effect. (f) The effect of coordinate space dimension. (g) Comparison of path switching algorithms.
For service popularity, predicate and data distributions, we use the following distributions: uniform, Zipf, and multivariate Gaussian. We conducted simulations with all possible combinations of distributions. However, due to the space limit, we present only the results generated by setting the distributions of all three parameters to uniform. As for the other major default parameter settings, the dimension of the VNM coordinate space is set to 5, the required QoS stretch condition is set to 4, and the default path switching scheme is path-based switching.
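The synthetic workload of the experimental setup can be approximated by the sketch below, shown for the uniform case only. The function names are hypothetical, and two details are assumptions: the left end of each range predicate is drawn so that the whole range stays inside the data range, and both range ends are treated as inclusive.

```python
import random

DATA_MIN, DATA_MAX = 0, 1000
# The predicate range is 15% of the total data range (150 for 0..1000).
PREDICATE_WIDTH = (DATA_MAX - DATA_MIN) * 15 // 100


def make_sda(rng):
    """Draw one range predicate as a (left, right) pair, uniformly at random."""
    left = rng.randint(DATA_MIN, DATA_MAX - PREDICATE_WIDTH)
    return (left, left + PREDICATE_WIDTH)


def matches(sda, value):
    """Return True if `value` satisfies the range predicate."""
    left, right = sda
    return left <= value <= right


def generate_values(rng, count):
    """Produce `count` integer readings between 0 and 1000, uniformly."""
    return [rng.randint(DATA_MIN, DATA_MAX) for _ in range(count)]
```

Passing an explicit `random.Random` instance keeps a simulation run reproducible; Zipf or Gaussian variants would replace the two `randint` calls.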
B. Simulation Results
1) The stretch and node degree of DDT
In order to show the scalability of the SATI DDT, we measure the stretch and node degree with 20,000 consumers. Figure 7-(a) shows the cumulative distribution of the stretches. The stretch of 99.9% of nodes is less than 4; only 19 nodes experience a stretch of more than 4. Figure 7-(b) shows the cumulative distribution of the node degree. The degree of 99.19% of nodes is less than 10 and the maximum node degree is 30. This result indicates that the load of delivering data streams is well distributed among SATI nodes.
2) The effect of consumer-specific QoS on the actual QoS experienced by consumers
Figure 7-(c) shows the actual QoS experienced by consumers under varying QoS stretch conditions. Since the SATI cost function models the network traffic as the product of latency and bandwidth, the average actual stretch is not severely compromised. As shown in Figure 7-(c), the actual QoS measures increase sub-linearly.
3) The effect of SDA and path switching on the network traffic reduction
This experiment shows the effect of SDA and path switching on the amount of traffic reduction. Figure 7-(d) shows that the combination of the two mechanisms reduces the network traffic by about 80%. SDA and path switching alone reduce the traffic by 75% and 40%, respectively. Next, we investigate the amount of traffic reduction as the number of path switches increases. Figure 7-(e) shows that the traffic reduction converges as the number of path switches grows.
4) The effect of the VNM accuracy on the traffic reduction
Since SATI estimates the actual path latency by the path distance in VNM, the accuracy of VNM heavily influences the performance of the DDT. In this experiment, by varying the dimension of the coordinate space, we investigate how the coordinate accuracy affects the traffic reduction. It is known that a higher coordinate dimension increases the accuracy of coordinates. As shown in Figure 7-(f), the amount of traffic reduction increases as the coordinate dimension is raised. Note that a higher dimension also incurs a higher computation overhead.
5) The performance comparisons of path switching algorithms
In this experiment, we compare the amounts of traffic reduction achieved by different path switching algorithms. We use three references: the initial tree without path switching, Link-based Switching (LS), and a Minimum Spanning Tree (MST) that uses the network latency as its cost. LS computes the link cost without considering the cost changes in the upper links of the path. As seen in Figure 7-(g), the initial tree shows the worst performance, which is expected since it does not perform any path switching. Note that path-based and optimal switching always outperform LS, by 20% and 46% respectively, which validates the effectiveness of our approach.
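The latency estimate discussed in experiment 4) and the stretch metric used throughout the evaluation can be illustrated as follows. This sketch assumes plain Euclidean coordinates of arbitrary dimension and the common definition of stretch as overlay path latency divided by direct latency. Neither the distance function actually used by VNM nor the exact stretch definition is given in this section, and the function names are hypothetical.

```python
import math


def estimated_latency(coord_a, coord_b):
    """Estimate the latency between two nodes as their coordinate distance."""
    return math.dist(coord_a, coord_b)


def path_stretch(path, coords):
    """Ratio of the overlay path latency to the direct source-consumer latency.

    path:   list of node ids from the source to the consumer
    coords: dict mapping node ids to coordinate tuples
    """
    direct = estimated_latency(coords[path[0]], coords[path[-1]])
    if direct == 0:
        raise ValueError("source and consumer share the same coordinate")
    overlay = sum(
        estimated_latency(coords[u], coords[v])
        for u, v in zip(path, path[1:])
    )
    return overlay / direct
```

Under this definition, a consumer attached directly to the source has a stretch of 1, and the default QoS condition of 4 bounds the overlay path at four times the direct latency.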
VIII. RELATED WORK
In this section, we provide a brief overview of previous work directly related to SATI. To the best of our knowledge, no previous work considers low-overhead infrastructural support for scalable and traffic-efficient data stream dissemination.
In the area of Distributed Stream Processing Systems (DSPS), two works address the efficient construction of distributed delivery paths: Borealis[2] and Hourglass[1]. They mainly address the operator placement problem on distributed participating nodes and do not consider scalable data delivery to a massive number of consumers. Borealis adopts DHT routing paths for disseminating data streams. Since the DHT scheme is designed for load balancing and routing resilience, the constructed paths show high network latency. On the other hand, SBON[1] calculates an optimal place for each query operator in a virtual latency space[12] with a spring relaxation technique, which is shown to have low end-to-end delivery latency and probing overhead. However, the spring relaxation technique works well only when there are few operators and consumers. Distributed Stream Management Infrastructure (DSMI)[4] also addresses the query operator placement problem within a large-scale distributed enterprise system by using a hierarchical data-flow graph. Since it focuses on aggregating data from multiple sources and sending the result to one or a few sinks, the constructed data delivery paths would experience long network latency in the face of multiple, globally distributed consumers.
In the area of ALM, a plethora of schemes have been proposed to address data dissemination to a large number of consumers. However, they are also not suitable for stream data delivery, because they are designed to meet service-specific network performance goals such as low latency while neglecting network traffic efficiency.
Among many works, three kinds of schemes are closely related to SATI: hierarchical overlays[5][6], path switching trees[7][9], and NC-aided overlays[8].
Hierarchical overlays focus on enhancing scalability and robustness in large-scale data dissemination. NICE[5] constructs a hierarchical clustering of the application-layer multicast peers. Because data are disseminated through the hierarchy, the nodes on the higher layers are easily overloaded. In contrast, SATI provides an individual data delivery tree per source, eliminating this overload problem. ZIGZAG[6] focuses on enhancing robustness by using the ZIGZAG rule; a peer can only link to other cluster peers. Since the tree is formed without considering latency minimization, the overall tree stretch can be severely degraded. SATI minimizes the overall stretch of trees by choosing proxy nodes based on their location.
A number of ALMs use decentralized path swapping techniques. BTP[9] is a representative tree switching protocol. Lacking global network knowledge, however, these approaches yield only locally optimal solutions by reconfiguring links between a parent and its descendants. SATI can achieve better tree performance by computing globally optimal paths with the aid of VNM.
Our VNM falls in the same category as Network Coordinates (NC)[12]. NC-aided overlays are a new research area and only a few approaches have been proposed. However, they are not designed for scalable and efficient data stream dissemination. Grido[8], which is closest to our approach, is a mesh overlay infrastructure for Grid-based computing environments. It applies NC in two contexts: reducing the overhead of mesh maintenance such as link addition and deletion, and selecting the ingress/egress overlay node closest to the customer and data source. SATI differs from Grido in that SATI exploits NCs for building multicast trees and does not bear the burden of maintaining a mesh network.
IX. CONCLUSION
This paper proposes SATI, a scalable and traffic-efficient data dissemination infrastructure that supports a large number of SDSes. SATI provides economical, shared management of network measurement as well as traffic-efficient data delivery trees with guaranteed QoS.
X. ACKNOWLEDGEMENT
This research has been supported in part by the Defense Software Research Center of the Defense Acquisition Program Administration and the Agency for Defense Development.
XI. REFERENCES
[1] Peter Pietzuch, et al., Network-Aware Operator Placement for Stream-Processing System, ICDE '06.
[2] Daniel J. Abadi, et al., The Design of the Borealis Stream Processing Engine, CIDR '05.
[3] Mitch Cherniack, et al., Scalable Distributed Stream Processing, CIDR '03.
[4] Vibhore Kumar, et al., Resource-Aware Distributed Stream Management using Dynamic Overlays, ICDCS '05.
[5] S. Banerjee, et al., Scalable Application Layer Multicast, SIGCOMM '02.
[6] Duc A. Tran, et al., ZIGZAG: An Efficient Peer-to-Peer Scheme for Media Streaming, INFOCOM '03.
[7] Suman Banerjee, et al., Construction of an Efficient Overlay Multicast Infrastructure for Real-Time Applications, INFOCOM '03.
[8] Shirshanka D, et al., Grido- An Architecture for a Grid-based Overlay Network, QShine '05.
[9] D. A. Helder, et al., End-host Multicast Communication using Switching-Trees Protocols, GP2PC '02.
[10] A. Nakao, et al., A routing underlay for overlay networks, SIGCOMM '03.
[11] http://www.cc.gatech.edu/projects/gtitm/
[12] Frank Dabek, et al., Vivaldi: A Decentralized Network Coordinate System, SIGCOMM '04.
[13] E. W. Dijkstra, A note on two problems in connexion with graphs, Numer. Math., 1:269--271, 1959.
[14] http://www.ccalmr.ogi.edu/CORIE/
[15] http://www.its.dot.gov/index.htm
[16] http://www.fs.fed.us/rmc/
[17] R. Sriram, et al., Algorithms for delay-constrained low-cost multicast tree construction, Computer Communications, 21(18):1693--1706, 1998.
[18] A. Guttman, R-Trees: A Dynamic Index Structure for Spatial Searching, SIGMOD 1984: 47-5