Journal of Network and Computer Applications 57 (2015) 21–32
Scalable and efficient file sharing in information-centric networking
Younghoon Kim (a), Ikjun Yeom (b), Jun Bi (c), Yusung Kim (b,*)
(a) Department of Computer Science, KAIST, Daejeon, South Korea
(b) Department of Computer Engineering, Sungkyunkwan University, Suwon, South Korea
(c) Institute for Network Sciences and Cyberspace, Tsinghua University, Beijing, China
Article info
Article history: Received 21 November 2014; Received in revised form 20 May 2015; Accepted 6 July 2015; Available online 26 July 2015.
Keywords: Information-centric networking; Peer-to-peer file sharing; Peer-aware routing

Abstract
Internet usage has been increasingly shifting toward massive content retrieval and distribution. To satisfy this demand, information-centric networking (ICN) approaches have emerged. These approaches share common features such as in-network caching and the decoupling of content and location to enable efficient content delivery. In this paper, we investigate the performance of file sharing in ICN. Compared to web and user-generated content (UGC), file sharing entails relatively low request rates but much larger content. These characteristics make it difficult for file sharing traffic to take advantage of in-network caching while competing with web and UGC traffic. For scalable and efficient file sharing, we design a peer-aware routing (PAR) protocol to utilize peers' storage in ICN. PAR uses a name-based shared tree without an additional name resolution service. The shared tree is built at the network layer to connect peers for the same purpose, allowing a peer to discover the nearest peer holding the requested content in the tree. To evaluate the proposed PAR protocol, we conducted a large-scale simulation with a BitTorrent trace over an Internet-scale topology. We show that our protocol can substantially reduce inter-domain traffic while keeping network overhead low.
© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
ICN is an emerging communication paradigm for efficient and scalable content distribution. Its rationale originates from two notable Internet usage patterns: (a) most Internet usage involves accessing content rather than host-to-host communication, and (b) popular content is delivered repeatedly to many users. Thus, although several distinct ICN architectures have been proposed, they all share the same main features: location-independent communication and in-network content caching (Ahlgren et al., 2012). To realize these features, most ICN architectures employ content-name-based communication models instead of the host-addressing-based communication model of the current Internet. Each architecture proposes a different mechanism for content name resolution and associated routing to achieve its own goal, but all obtain most of their performance gain through in-network caching that avoids redundant content delivery. Some preliminary studies on ICN have shown that the use of in-network caching can reduce bandwidth consumption, but these studies used a single traffic type, such as web or UGC, in their evaluations (Rossi and Rossini, 2000; Muscariello et al., 2011). In
this paper, we investigate file sharing traffic1 as it competes with other web and UGC traffic in ICN. File sharing is a major application in the current Internet, accounting for 23% of total consumer traffic in 2012 (Cisco, 2013). Compared to web and UGC traffic, file sharing traffic has two distinctive characteristics: (a) the average size of individual files is relatively very large (one GByte on average, and up to several dozen GBytes), and (b) the frequency of requests is relatively low. Assuming reasonably sized in-network cache storage, it would be difficult to achieve performance benefits through on-path caching, in which requests are only forwarded to a content provider and data returns along the reverse path of the requests. Even considering off-path caching, in which a request is routed to any node that holds a copy of the content, the number of cached copies of a file would be limited due to its large size and low request rate, which would make it difficult to find a copy of the content within a short network distance. One possible way to improve the network performance of file sharing would be to provide a peer-to-peer (P2P) overlay (application-level) service, similar to BitTorrent in the current Internet. However, current overlay P2P services do not fit the paradigm of ICN in the sense that retrieving a file first requires the exact location of that file. For example, in BitTorrent, a requester
* Corresponding author.
E-mail addresses: [email protected] (Y. Kim), [email protected] (I. Yeom), [email protected] (J. Bi), [email protected] (Y. Kim).
http://dx.doi.org/10.1016/j.jnca.2015.07.010
1084-8045/© 2015 Elsevier Ltd. All rights reserved.
1 In the current Internet, most file sharing traffic is transferred by peer-to-peer communication such as BitTorrent.
discovers the IP addresses of peers holding the requested file, and sends a request to the discovered peers. Moreover, data retrieval may be inefficient when the located peer is a long distance away, since most P2P overlay networks are organized in a topology-blind fashion to provide scalability and robustness. The use of topology-aware peer selection to improve performance remains a challenge in P2P networking (Xie et al., 2008; Le Blond et al., 2011).
For efficient file sharing in ICN, we propose a peer-aware routing (PAR) protocol. In PAR, users (peers) are connected via a tree, and a file is delivered from the nearest source on the tree; the source can be either another user or an in-network cache. Since information for tree traversal is maintained by ICN routers, PAR retains the location-independent communication principle of ICN. PAR can be applied to any existing ICN architecture, and we apply it to named-data networking (NDN) herein. To evaluate the performance and overhead of PAR, we perform a large-scale simulation using a real BitTorrent trace (Han et al., 2012). The simulation results show that file sharing can be efficiently realized in ICN using PAR with minimal overhead.
The novelty of our work can be summarized as follows. First, to the best of our knowledge, this is the first work that addresses the necessity of efficient and scalable file sharing in ICN. There has been some research on the performance gains of in-network caching in peer-assisted file sharing (Lee and Nakao, 2013; Tyson et al., 2012; Rautio et al., 2010), but these studies did not consider the effect of coexisting traffic such as web and UGC. Our work indicates that it is difficult for file sharing to take advantage of in-network caching when competing with other traffic, due to its traffic characteristics, and suggests the necessity of utilizing peers' storage and bandwidth for file sharing. Second, in previous studies, the ways of discovering content stored at peers were neither sophisticated nor scalable: content requests are simply assumed to reach peers (Tyson et al., 2012), a central directory server is used (Rautio et al., 2010), or discovery is limited to flooding requests within a single AS network (Lee and Nakao, 2013). We designed the first decentralized scheme, called PAR, to discover content stored at peers over multi-AS networks, and have evaluated its performance using real BitTorrent traces on an Internet-scale topology. Third, we integrate the concepts of P2P schemes into the network layer of ICN, unlike prior P2P approaches, which are mostly implemented at the application layer. With the support of the network layer, packet routing can be aware of the underlying network topology, which is hard to achieve in overlay approaches.
The rest of this paper is organized as follows. In Section 2, we review the existing ICN architectures from the viewpoint of file sharing. We clarify our problem statement in Section 3. In Section 4, we introduce related work. In Section 5, we describe our proposed PAR for P2P file sharing in NDN. In Section 6, we explain our simulation methodology; in Section 7, we evaluate the performance and overhead of PAR; and we conclude this paper in Section 8.
2. Background and motivation
In this section, we summarize representative ICN architectures and address the challenges of accommodating file sharing traffic in each one.
2.1. Overview of ICN architectures
2.1.1. Named data networking (NDN)
In NDN (initially proposed as content-centric networking) (Jacobson et al., 2009), the name of a content item is constructed hierarchically to include enough information for routing, and thus additional name resolution is not needed. To obtain content, a user sends a request packet with the content name, called an interest packet, and this interest packet is delivered to the content provider by longest prefix matching in the forwarding information base (FIB), which is analogous to a routing table in the current Internet. Upon forwarding an interest packet, an NDN router adds an entry to a pending interest table (PIT) to record the interface that the interest packet arrived from, thereby providing a reverse path for the corresponding content delivery. Upon receiving data, an NDN router looks up the corresponding PIT entry and forwards the data to the incoming interface of the interest packet. To avoid multiple deliveries of the same content, data forwarded by an NDN router is also stored in a cache called the Content Store.
Hierarchical name-based routing makes NDN simple and scalable, since it does not require name resolution services for locating content or complex efforts for routing. However, a limitation does arise from name-based routing: because an interest packet is delivered directly to the source, only caches on the path are available, which limits cache utilization. Moreover, considering line-speed forwarding requirements, the cache size is limited by memory size (Perino and Varvello, 2011).
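To make the NDN forwarding behavior just described concrete, the following minimal Python sketch shows how a router might consult its Content Store, PIT, and FIB when handling interest and data packets. It is an illustration under our own assumptions, not code from the paper or from any NDN implementation; all class, method, and variable names are ours.

    class NdnRouter:
        """Toy model of NDN forwarding state: Content Store, PIT, and FIB."""

        def __init__(self):
            self.content_store = {}  # content name -> data (in-network cache)
            self.pit = {}            # content name -> set of faces awaiting the data
            self.fib = {}            # name prefix -> outgoing face

        def longest_prefix_match(self, name):
            # "/app.com/movie1/chunk3" matches "/app.com/movie1" before "/app.com"
            components = name.strip('/').split('/')
            for i in range(len(components), 0, -1):
                prefix = '/' + '/'.join(components[:i])
                if prefix in self.fib:
                    return self.fib[prefix]
            return None

        def on_interest(self, name, in_face):
            """Return the (packet type, name, face) action triggered by an interest, if any."""
            if name in self.content_store:           # cache hit: answer from the Content Store
                return ('data', name, in_face)
            if name in self.pit:                      # same content already requested: aggregate
                self.pit[name].add(in_face)
                return None
            out_face = self.longest_prefix_match(name)
            if out_face is None:
                return None                           # no route: drop the interest
            self.pit[name] = {in_face}                # remember the reverse path for the data
            return ('interest', name, out_face)

        def on_data(self, name, data):
            """Cache the data on-path and satisfy every pending requester recorded in the PIT."""
            self.content_store[name] = data
            return [('data', name, face) for face in self.pit.pop(name, set())]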
2.1.2. Publish–subscribe internet routing paradigm (PSIRP)
In PSIRP (Lagutin et al., 2010), flat names are assigned to content, and thus name resolution is required to locate a content source. To provide name resolution, a content source publishes its content to a rendezvous system by the content name. A user who wants the content subscribes to it, and this publication and subscription are matched by the rendezvous system. After the matching process, the rendezvous system creates a forwarding identifier (FI) and sends it to the content source. The content source sends the content with the FI to the requester. Because the FI contains path information from the source to the user, routers do not need to maintain additional routing information.
Through the use of the rendezvous system, PSIRP may provide flexible ways to utilize in-network caches without imposing limitations on the path between the original content source and the requesting users. However, the rendezvous system must manage and provide locators for available copies of content, and scalability needs to be considered as the number of content names and locators increases.
2.1.3. Network of information (NetInf)
NetInf (Dannewitz et al., 2013) has a hybrid architecture that supports both name resolution and name-based routing. Content sources can publish their content through a name resolution service (NRS) or by announcing the content in a routing protocol. A NetInf node that holds a copy of content can optionally register the content information with an NRS. To locate content, a user can contact an available NRS and retrieve the content from the nearest content holder. Alternatively, a user can send a direct request with the content name, and it will be forwarded to an available copy of the content using name-based routing. As soon as the request reaches a copy, the data is returned to the user.
NetInf has a flat name structure and uses hybrid NRS and name-based routing. Hence, similar to PSIRP, NetInf can suffer from the scalability problems of an NRS. It is also challenging to perform name-based routing with a flat name structure.
2.2. File sharing traffic in ICN
To investigate the performance of file sharing in ICN, we first consider traffic in the current Internet; Table 1 presents the traffic profile for major applications such as web, UGC, and file sharing. The catalog size and average content size are reproduced from Pentikousis et al. (2013), and the traffic size is reproduced from Cisco
Table 1. Content catalog.
Type           Catalog size   Content size (mean)   Bytes (PB)/Month   Requests/Month
Web            10^12          10 KB                 5173               5 × 10^11
UGC            10^8           10 MB                 7409               7 × 10^8
VoD            10^4           100 MB                7409               7 × 10^7
File Sharing   5 × 10^6       800 MB                6201               8 × 10^6
(2013).2 The number of requests is calculated by dividing the traffic size by the average content size. From the traffic profile, it is clearly observed that (a) file sharing is a major application, accounting for 23% of total consumer traffic, (b) compared to other applications, there are considerably fewer requests for file sharing than for web and UGC, but with much larger file sizes, and (c) the request rate for each file is relatively low, but each request tends to lead to a large consumption of bandwidth. To confirm this observation, we performed a simulation with the traffic patterns described in Table 1; the hit ratios of each traffic type are shown in Fig. 1. The cache size is set to be capable of holding 10 s of link traffic.3 From the figure, we can see that web and UGC traffic obtain cache hits due to their high request rates. File sharing and VoD traffic, however, hardly obtain any cache hits.
In NDN with hierarchical name-based routing, when file sharing traffic coexists with other traffic, the more frequently requested web and UGC traffic quickly refills in-network caches. As a result, file sharing data is repeatedly retrieved through expensive inter-domain networks. Particularly in NDN, data is supposed to be cached in all routers on the data path from a source to a requester while providing line-speed forwarding. Considering this requirement, the size of the in-network cache is limited by current memory technology (Perino and Varvello, 2011). This makes it even more difficult for large files to be stored in in-network caches.
PSIRP and NetInf use similar name resolution models. Such an NRS approach moves DNS-like functionality inside the network. The NRS provides locators for the available copies of the requested content. Hence, it is possible to utilize in-network caches more flexibly with an NRS than with hierarchical name-based routing, in which content can be cached only on the path between a source and a requester. Using an NRS, it is also possible to utilize end peers' storage for content copies. Considering the large size and the low request rate of files, however, the use of in-network caches for file sharing would still be limited. Even if we count peers' storage resources, an NRS would be severely overloaded by an attempt to accommodate the present scale of P2P file sharing, which is estimated to number more than 150 million users (http://www.bittorrent.com/intl/es/company/about/ces_2012_150m_users).
To summarize, a name resolution service can utilize peers' storage for content caches, but a large number of participating peers entails scaling problems. On the other hand, hierarchical name-based routing limits the benefit of in-network caching to the path between a source and a requester, and providing off-path caching is still an open issue there. Alternatively, an application-level server such as the Tracker (in BitTorrent) can take the role of a lookup service, but such an application-level approach has the challenge of topology-aware seed selection, which requires additional effort such as cooperation with ISPs. To enable scalable and efficient file sharing, we attempt herein to utilize peers' storage in NDN without an NRS or an (application-level) tracker. To overcome the limitation of on-path caching, we propose the PAR protocol to provide location-independent, topology-aware, and scalable routing.

3. Problem statement
2 In Cisco (2013), the figure of 14,818 PBytes of Internet video is given, which includes not only UGC but also video on demand (VoD) such as Netflix. We estimate that UGC and VoD each comprise half of the Internet video traffic; this is a rough estimate rather than an exact measurement.
3 It has been shown that bandwidth savings of up to 60% can be obtained by eliminating redundant traffic within the first 10 s (Anand et al., 2009).
File sharing traffic in ICN takes less advantage of in-network caching than other types of traffic (e.g., web and UGC) due to its traffic characteristics, as mentioned in the previous section. In this paper, we propose a routing protocol that satisfies the following requirements to support file sharing services in ICN:
(a) The protocol should be capable of utilizing users' storage and bandwidth. For that, request packets should be able to reach appropriate content-holding users, and a requester should be able to become a provider of the content after downloading it.
(b) The protocol should comply with the principle of location-independent communication in ICN architectures. A requester should be able to send requests without knowing content-holding users' names or locations.
(c) The protocol should be integrated into the information-centric network layer rather than an overlay. This enables the protocol to take advantage of ICN features such as request aggregation, and allows packet routing to be network-topology-aware without unnecessary packet traversals.
(d) The protocol should be scalable in the number of users and the number of content items. In file sharing services, any user can be a content holder, so centralized directories are not scalable. System overhead for control messages and the amount of information stored in intermediate routers also need to remain low.
4. Related work
Since several information-centric networking (ICN) architectures emerged, many research topics including cache management, naming, routing, and mobility have been studied. An analytical study has been carried out to evaluate storage and bandwidth trade-offs (Muscariello et al., 2011). Strategically selective caching has been shown to be potentially more effective than unconditional caching (Chai et al., 2013). The optimal performance of in-network caching has been analyzed for both on-path and off-path caching scenarios (Kim and Yeom, 2013).
Although many previous studies made evaluations based on a single traffic type, some have considered the coexistence of different types in ICN. In a study investigating the impact of traffic mix on the caching performance of a two-level cache hierarchy, the authors concluded that caching performance increases if VoD content is stored in the first-level cache while other types of content, such as web, UGC, and file sharing, are stored in a large second-level cache (Fricker et al., 2012). Another study investigated storage partitioning schemes for different applications, one of which was a static per-application storage allocation; the authors found that the static scheme performed poorly when content had a finite lifetime, and provided dynamic storage management schemes to mitigate this problem (Carofiglio et al., 2013). Both of these studies focus on in-network caching management rather than users' storage, but they can complement our work herein. For example, an IPTV service provider may need to both manage in-network caching and utilize users' storage to deliver high-quality (high-definition or Blu-ray quality) videos to users.
Fig. 1. Hit ratios for different types of traffic (web, file sharing, UGC, and VoD). Traffic generated based on Table 1.
Recently, user-assisted caching strategies were examined using both centralized and distributed approaches (Lee and Nakao, 2013). This work showed that a random distributed caching strategy provides performance comparable to other, more complex strategies. This prior work is similar to our present work in that both utilize users' storage, but the prior work assumed that all users are connected to an NDN router in a single domain, allowing users to discover each other through name-based anycast routing. In contrast, our PAR protocol is not limited to a single domain (or a single router), and has been evaluated on a large-scale global network topology.
Research has also been conducted specifically on P2P file sharing in ICN. In initial work, an overlay network approach was introduced to realize the ICN model (Katsaros et al., 2011) by employing Scribe (Castro et al., 2002), an overlay multicast scheme. Simulation results demonstrated considerable benefits in terms of traffic load and download time compared to BitTorrent. The following two studies are similar to ours in that they consider P2P file sharing in pure ICN networks. The first showed the benefits of in-network caching when deploying P2P file sharing over NDN (Tyson et al., 2012). In another study, a prototype was implemented to improve the performance of BitTorrent over NetInf, with experimental results showing that the prototype can achieve high performance gains in both static and mobile scenarios (Rautio et al., 2010). Since both of these studies conducted their evaluations without other background traffic, P2P file sharing could fully utilize in-network caching alone. In reality, however, other types of traffic such as web and UGC are requested more frequently, and the data of P2P file sharing is easily evicted from in-network caches. These studies also assumed that a peer can know the location of other peers or that there is a name resolution service. In this paper, we point out that it is difficult for P2P file sharing to take advantage of in-network caching, and accordingly we propose a scalable and efficient P2P file sharing scheme that does not require any name resolution services.
5. Peer-aware routing protocol
Herein we describe the PAR protocol that we designed to provide file sharing in ICN. We built the PAR protocol on the NDN architecture, but it is applicable to any existing ICN architecture. In PAR, a shared tree is constructed to connect participating peers, and each interest packet traverses this tree. To allow scalable content discovery, incremental flooding within the tree is employed. Because the tree is constructed to be topology-aware, efficient delivery from the nearest source4 in the tree is made possible. Once the nearest source is discovered, the content is delivered to the requester via the reverse single path, the same as in NDN.
4 In PAR, the discovered source can be a peer or a cache located on the path toward peers in the tree.
5.1. Joining procedure
A peer joins a PAR tree by sending a join message, which is an interest packet with an application name prefix (e.g., /app.com). When routers receive the join message, they look up the name of a rendezvous point (RP) mapped to the application name prefix, either by a static configuration or by a dynamic approach similar to the bootstrap router (BSR) or Cisco's Auto-RP mechanisms in IP multicast. After forwarding the message toward the RP, the router maintains a record of the interfaces that the join messages have been received from and sent to. This information can be stored in an entry in a modified FIB; we call this entry an application-level entry. If a router already has such an entry, it simply updates that entry with the information on the incoming interface and then drops the join message. Since a peer is supposed to periodically send join messages, routers can prune an interface off the PAR tree if no new join message is received on it during a specified timeout period. If all interfaces are pruned, the router removes the relevant FIB entry. An example flow of join messages and the corresponding PAR tree are shown in Fig. 2(a).
5.2. Two different forwarding phases
To retrieve content, a requester needs to obtain information on that content by searching a web site, in a similar way as with BitTorrent. The content information is uploaded to the web site by another user (or peer) who published that content. The content information usually includes a series of names for the content chunks, such as "/app.com/fileName/chunkNum". After joining a PAR tree, the requester starts to send an interest packet with one of the chunk-level names. When a router receives the interest packet, the router handles it according to the basic NDN procedure mentioned in Section 2. When forwarding the interest packet, PAR has two different forwarding phases: a discovery phase and a data retrieval phase.
5.2.1. Interest forwarding in the discovery phase
If a router is connected to a PAR tree, the router has an application-level entry such as "/app.com/*". If the router receives a request for a specific content item, the router finds the application-level entry by longest prefix matching. Since the router does not know which interface is appropriate to discover the requested content, the router sends the interest packet to all interfaces connected to the PAR tree except the incoming interface. The interest packet is propagated along the PAR tree until it reaches a source. In Fig. 2(b), a peer P1 sends an interest packet with the content name "/app.com/movie1/chunkNum=i". The interest packet is flooded through the PAR tree; the packet includes a time-to-live (TTL) to limit the flooding scope. The TTL can be expressed as a number of hops at the granularity of routers (or AS networks). If the interest packet reaches any source, the data packets return along the reverse paths of the interest packets. In Fig. 2(c), two peers P2 and P3 respond to the request; because the data packet from P2 arrives first at R1 and consumes the interest packet in the PIT, the data packet arriving later from P3 is simply dropped at R1.
Fig. 2. Introduction to the peer-aware routing protocol. (a) The flow of join messages in PAR. (b) Interest forwarding in a discovery phase. (c) Data responses. (d) Interest forwarding in a data retrieval phase.
Fig. 3. Router 2 (R2) has two types of entries. One is an application-level entry, for example "/app.com", which is created by building a PAR tree. The other is a content-level entry, for example "/app.com/movie1", which is added after receiving data for the content.
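The application-level entry of Fig. 3, built and refreshed by the join messages of Fig. 2(a), can be modeled as soft state: each tree interface is time-stamped by the last join seen on it and pruned when the refreshes stop. The Python sketch below is only an illustration of this bookkeeping; the timeout value and all names are assumptions of ours, not part of the protocol specification.

    import time

    JOIN_TIMEOUT = 60.0  # seconds without a refresh before an interface is pruned (assumed value)

    class ApplicationLevelEntry:
        """Application-level FIB entry for one prefix (e.g. '/app.com'): the local piece of the PAR tree."""

        def __init__(self, prefix):
            self.prefix = prefix
            self.tree_faces = {}  # face -> time of the last join message seen on (or sent to) it

        def on_join(self, in_face, rp_face, now=None):
            """Refresh the tree state; return the face to forward the join to, or None to drop it."""
            now = time.time() if now is None else now
            entry_existed = bool(self.tree_faces)
            self.tree_faces[in_face] = now   # face the join arrived from
            self.tree_faces[rp_face] = now   # face toward the rendezvous point
            # Only the first join creates the entry and travels on toward the RP;
            # later joins merely refresh the entry and are dropped here.
            return None if entry_existed else rp_face

        def prune_expired(self, now=None):
            """Drop interfaces whose peers stopped sending joins; True means the entry itself can go."""
            now = time.time() if now is None else now
            self.tree_faces = {f: t for f, t in self.tree_faces.items() if now - t < JOIN_TIMEOUT}
            return not self.tree_faces

        def discovery_faces(self, in_face):
            """Faces an interest is flooded to in the discovery phase: every tree face but the incoming one."""
            return [f for f in self.tree_faces if f != in_face]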
In traditional flooding-based P2P systems such as Gnutella, peers form overlay networks, and their request flooding can generate a large amount of network traffic (Chawathe et al., 2003). In overlay networks, a path between a source and a destination is not congruent with the underlying network topology, which can lead to inefficient routing in which a peer contacts a nearby peer via peers spread throughout the wide-area network. In contrast, we construct a topology-aware PAR tree at the network layer (similar to IP multicast). When a packet is flooded from a requester, there are no redundant packets traversing the same network link. In traditional P2P systems, when many users concurrently request the same content, the request packets are independently flooded through the networks. In contrast, our proposed scheme is aided by the interest aggregation feature of NDN: for a certain time period, interest packets for the same content are efficiently aggregated in the PIT.
Even if interest packets are aggregated well, flooding them widely throughout the PAR tree can result
in redundant transmission of data packets from multiple sources, which can cause a significant scaling problem. To mitigate this problem, we use an incremental flooding method in which a user sends interest packets with TTLs that increase incrementally (or exponentially) in response to successive timeouts. If the TTL exceeds a threshold, the requester is then permitted to send the interest packet directly to the original content provider.
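On the requester side, incremental flooding amounts to retransmitting an unanswered interest with a progressively larger TTL and falling back to the original content provider once a threshold is exceeded. The sketch below illustrates this loop; the initial TTL, growth factor, threshold, timeout, and helper functions are assumptions for illustration, not values or interfaces defined by PAR.

    INITIAL_TTL = 1      # start by searching the local neighborhood (assumed value)
    TTL_THRESHOLD = 8    # beyond this, stop searching the tree (assumed value)
    WAIT_SECONDS = 2.0   # time to wait for data before widening the search (assumed value)

    def fetch_chunk(name, flood_on_tree, send_to_provider, wait_for_data):
        """Incremental flooding: widen the TTL after each timeout, then fall back to the provider.

        The three callables stand in for the host's NDN interfaces:
          flood_on_tree(name, ttl)     -- flood an interest over the PAR tree, limited by ttl
          send_to_provider(name)       -- send an interest directly toward the original provider
          wait_for_data(name, timeout) -- return the data packet, or None on timeout
        """
        ttl = INITIAL_TTL
        while ttl <= TTL_THRESHOLD:
            flood_on_tree(name, ttl)
            data = wait_for_data(name, WAIT_SECONDS)
            if data is not None:
                return data          # a source was discovered within ttl hops
            ttl *= 2                 # exponential widening of the flooding scope
        send_to_provider(name)       # no source found in the tree: ask the original provider
        return wait_for_data(name, WAIT_SECONDS)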
5.2.2. Interest forwarding in the data retrieval phase
When data packets are returned along the reverse paths of interest packets, a router that receives a data packet related to the content for the first time creates a new temporary entry in the FIB. The entry has a content-level name prefix such as "/app.com/fileName/*"; we call this a content-level entry. The content-level entry records the interface from which the data packet was first received.
Fig. 4. Modified interest aggregation scheme. (a) Original interest aggregation. (b) Modified interest aggregation.
As long as the entry stays in the FIB, interest packets for that content can be forwarded using the entry in the data retrieval phase. For the following interest packets corresponding to the successive chunks of the content, the router can find the content-level entry and renew its timeout, thereby enabling the forwarding of interest packets along the single path to the source that was found during the discovery phase, and avoiding the need for repeated flooding. This procedure is illustrated schematically in Fig. 2(d), and the corresponding status of the FIB is shown in Fig. 3.
Content-level entries are removed after timeouts. Timeouts occur in two cases: one is when, after interest packets are forwarded, no data packets are returned due to the sources' departures or cache misses; the other is when the content-level entries are no longer accessed after the content download is finished, for example when there are no further requests. Choosing the timeout duration entails a trade-off between efficiency and scalability: if the duration is too short, many interest packets will be flooded repeatedly, increasing network overhead drastically. If it is too long, the discovered path may become meaningless because the corresponding sources may have already left the PAR tree (due to peers leaving or caches dropping); moreover, the number of content-level entries at a router increases with the timeout value, worsening routing performance. The timeout duration therefore needs to be carefully configured according to the application characteristics or the network provider's policy.
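The soft-state behavior of content-level entries can be summarized in a few lines: an entry is created by the first data packet for a content item, its timer is renewed by every later interest that uses it, and it disappears once the timer expires, forcing a new discovery phase. The following sketch assumes a single recorded interface per entry and an illustrative timeout; it is not the implementation used in our simulator.

    import time

    CONTENT_ENTRY_TIMEOUT = 30.0  # illustrative value; in practice a policy decision (see above)

    class ContentLevelFib:
        """Soft-state content-level entries, e.g. '/app.com/movie1' -> interface."""

        def __init__(self):
            self.entries = {}  # content prefix -> (face, expiry time)

        def on_first_data(self, content_prefix, in_face, now=None):
            """Create the entry when the first data packet for this content arrives."""
            now = time.time() if now is None else now
            if content_prefix not in self.entries:
                self.entries[content_prefix] = (in_face, now + CONTENT_ENTRY_TIMEOUT)

        def face_for_interest(self, content_prefix, now=None):
            """Data retrieval phase lookup: return the recorded face and renew the timer, or None."""
            now = time.time() if now is None else now
            entry = self.entries.get(content_prefix)
            if entry is None:
                return None  # no entry: the interest goes through the discovery phase instead
            face, expiry = entry
            if now >= expiry:
                del self.entries[content_prefix]  # expired: flooding will be needed again
                return None
            self.entries[content_prefix] = (face, now + CONTENT_ENTRY_TIMEOUT)
            return face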
5.2.3. Modified interest aggregation
In NDN, interest packets for the same content are aggregated to avoid redundant forwarding. Once a router receives and forwards an interest packet for content, the interest packet is kept in the PIT. If the router receives another interest packet for the same content, the router adds the incoming interface to the PIT without forwarding the packet. This process is efficient when the content is delivered from a source to multiple requesters. In PAR, however, an interest packet needs to be flooded over the PAR tree to discover a source. If many receivers concurrently send interest packets for the same content, the interest aggregation mechanism may block the necessary interest propagation.
Fig. 4(a) describes the interest aggregation problem in PAR for an example in which there are six routers (R1 to R6), two peers (Peer1 and Peer2), and one seed (Seed1) in a single domain. Here, all peers including Seed1 have joined the PAR tree. Peer1 first requests a file, and the interest packet is propagated along the PAR tree. When R1 receives the interest packet, it forwards the packet to R3 and R4. R1 creates an entry in the PIT to store information about the interest packet. At the same time, Peer2 also requests the same file, and this interest packet is forwarded from R3 to R1. When R1 receives the interest packet, it checks its PIT and
finds the entry that has been created for the same content name. R1 adds the incoming interface to the entry and does not forward the packet any further. When Seed1 responds, the requested data packet follows the reverse path of the interest packet originated by Peer1. However, Peer2 cannot receive the data packet, because its interest packet has been blocked at R1, and its reverse path is inadequate for retrieving the data packet. Peer2 then must retransmit the interest packet after a timeout period. At worst, this problem is repeated due to concurrent requests by numerous peers.
To solve this problem, we propose a new interest aggregation scheme. In the new scheme, a list is managed of outgoing interfaces to which interest packets have been forwarded. Whenever a router receives an interest packet, it checks whether there are any interfaces to which the packet can still be forwarded. Among all the interfaces connected to the PAR tree (except the incoming interface), if there is any interface that is not yet included in the outgoing list, then the interest packet is forwarded to that interface (or interfaces). Otherwise, the interest packet is aggregated. Fig. 4(b) illustrates our proposed interest aggregation scheme, including the modified PIT entry. In this example, the router R1 has three interfaces, if_0 to if_2. When R1 receives an interest packet from if_0, it tests the condition mentioned above. R1 decides to forward the interest packet to if_1 and if_2. After forwarding, R1 creates a PIT entry for the content, adding if_0 to the incoming interface list and both if_1 and if_2 to the outgoing interface list in this entry. In the next step, R1 receives an interest packet with the same content name from if_1. Following the new aggregation scheme, R1 allows the new interest packet to be forwarded to if_0. R1 also updates both the incoming and outgoing interface lists. Following this scenario, Peer2 in Fig. 4(a) would be able to receive the data packet retrieved from Seed1 without the need for retransmission.
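The modified aggregation rule of Fig. 4(b) reduces to a simple test over a PIT entry that keeps both an incoming and an outgoing interface list: an interest is forwarded to every tree interface that is neither its incoming interface nor already in the outgoing list, and is aggregated only when no such interface remains. The sketch below, with illustrative names and the example of Fig. 4(b), is one way this could be written.

    class ParPitEntry:
        """PIT entry extended with an outgoing-interface list, as in Fig. 4(b)."""

        def __init__(self):
            self.incoming = set()  # faces that requested this content
            self.outgoing = set()  # faces this interest has already been forwarded to

    def forward_or_aggregate(pit, name, in_face, tree_faces):
        """Return the faces to forward the interest to; an empty list means 'aggregate only'."""
        entry = pit.setdefault(name, ParPitEntry())
        entry.incoming.add(in_face)
        # Candidates: all PAR-tree faces except the incoming one and those already used.
        targets = [f for f in tree_faces if f != in_face and f not in entry.outgoing]
        entry.outgoing.update(targets)
        return targets

    # Example corresponding to Fig. 4(b): R1 has faces if_0, if_1, if_2 on the PAR tree.
    pit = {}
    print(forward_or_aggregate(pit, '/app.com/file1', 'if_0', ['if_0', 'if_1', 'if_2']))
    # -> ['if_1', 'if_2']
    print(forward_or_aggregate(pit, '/app.com/file1', 'if_1', ['if_0', 'if_1', 'if_2']))
    # -> ['if_0'] (under the original aggregation rule this interest would have been blocked)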
5.3. Downloading content from multiple peers
In this paper, we focus on a peer-aware routing protocol rather than an application implementation of file sharing. Although detailed implementation issues are outside this scope, we briefly discuss the parallel downloading of content from multiple peers. The parallel downloading of data is important for the following reasons. Since peers may have limited upload bandwidth, receiving data from a single peer can perform poorly. Even when a single peer has enough capacity to upload the data, network congestion can occur at a bottleneck link between a requester and the peer. In these cases, the requester needs to find other peers who do not share that bottleneck link. Besides, not every peer may have all the pieces of the content, in which case a requester will need to download different pieces from different peers.
To enable parallel downloading of data from multiple peers, we suggest two approaches. One approach is to utilize an adaptive forwarding scheme at NDN routers. In this approach, if a router receives data from multiple interfaces during a discovery phase, the router creates a content-level entry in the FIB and stores all the incoming interfaces in that entry. Upon receiving the successive interest packets, the router selects one interface from the interface list in the FIB entry to forward the interest packets. The interface can be selected in a round-robin fashion or based on measurements such as the round-trip time or the data receiving rate (Yi et al., 2013); an illustrative sketch of such interface selection is given at the end of this section. The alternative approach is to use a previously proposed exclusion filter scheme (Zhu et al., 2011). A requester explicitly expresses the content names to distinguish different sources using the exclusion filter. In this approach, in each discovery phase the requester repeatedly collects candidate peers and receives the chunk list information from the peers. During a data retrieval phase, the requester can concurrently download different pieces of the content from different peers using the exclusion filter. If the need arises, the requester may look for additional peers to improve performance or to replace peers that have left.
5.4. Consideration of peer churn
The dynamic arrival and departure of peers, called churn, is an inherent part of P2P systems. Since churn significantly affects the performance of P2P systems, understanding churn is important in their design and evaluation. Traditional P2P systems can be classified into two main categories: unstructured and structured systems. Unstructured P2P systems use a flooding-based approach or a centralized component for lookup and management purposes. In structured P2P systems, the overlay is organized into a specific topology, and the most common approach is to use a distributed hash table (DHT). These overlay P2P systems (both unstructured and structured) require each peer to act as a forwarder as part of the overlay topology. If many neighboring peers leave such a system simultaneously, it becomes difficult to search for content due to the loss of packet forwarders. Although unstructured P2P systems are known to be more robust than structured P2P systems under high churn rates (Chawathe et al., 2003), unstructured P2P systems are also impacted considerably when peers are lost that were located in important positions within the overlay network and were supposed to forward packets.
Our PAR protocol is less sensitive to churn than overlay P2P systems because peers are not responsible for forwarding packets. Peers are located at the edge of the PAR tree, and NDN routers serve as the forwarders. As long as a single peer connects to the PAR tree, its access router (or AS network) will be linked to the PAR tree, thereby allowing a content request to propagate well through the PAR tree. In our evaluation, we used a real BitTorrent trace in which more than 14 million peers dynamically joined and left networks over 30 days. Since the trace was captured every two hours, peer churn is not perfectly measured; we consider peers that remain in the next snapshot (two hours later) to be seeds in the evaluation. The evaluation results showed that PAR works well under the churn roughly estimated from the trace. In future work, we plan to model BitTorrent peer churn more precisely and improve the evaluation of PAR under churn.
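As mentioned in Section 5.3, a router applying the adaptive forwarding approach keeps several interfaces in one content-level entry and picks one per interest, either round-robin or according to measurements such as round-trip time. The sketch below illustrates both policies; the data structures and names are ours, not part of PAR or of any NDN forwarding strategy implementation.

    import itertools

    class MultiSourceEntry:
        """Content-level entry that remembers every interface data has arrived from."""

        def __init__(self, faces):
            self.faces = list(faces)
            self.rtt = {f: None for f in self.faces}  # last measured RTT per face, if any
            self._round_robin = itertools.cycle(self.faces)

        def record_rtt(self, face, rtt_seconds):
            self.rtt[face] = rtt_seconds

        def select_round_robin(self):
            return next(self._round_robin)

        def select_lowest_rtt(self):
            measured = [f for f in self.faces if self.rtt[f] is not None]
            if not measured:
                return self.select_round_robin()  # no measurements yet: fall back to round-robin
            return min(measured, key=lambda f: self.rtt[f])

    entry = MultiSourceEntry(['if_1', 'if_2'])
    entry.record_rtt('if_1', 0.050)
    entry.record_rtt('if_2', 0.012)
    print(entry.select_lowest_rtt())  # -> 'if_2', the face with the shortest measured RTT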
6. Simulation methodology
To evaluate the performance of the proposed protocol, we used a large-scale real BitTorrent trace. Packet-level simulation is not feasible for a trace of this size due to the limitation of computational resources, which led us to implement our own content-level event-driven simulator, also used in previous work (Kim and Yeom, 2013). In the simulator, we implemented the essential features of the NDN protocol, such as the FIB, PIT, and Content Store, which are the same as in ndnSIM (http://ndnsim.net/), an NDN module for NS-3 (http://www.nsnam.org/). We used Least Recently Used (LRU) as the cache replacement policy, following its recommendation in NDN (Jacobson et al., 2009). Details of our simulation methodology are explained in the following sections.
6.1. BitTorrent trace
We used information from a real BitTorrent trace to simulate file sharing traffic. In this trace, BitTorrent swarm snapshots were captured periodically (once every two hours) for 30 days (Han et al., 2012). The information extracted from the trace included snapshot times, torrent IDs, and user IP addresses. In total, 14,822,261 IP addresses and 43,837 torrents appeared in the trace. Torrent IDs and IP addresses were directly mapped to content items and users, respectively, for the simulations. From 10 GB of trace data, we determined which peers would be seeds or requesters and when each would request a content item or leave the system, as illustrated in Fig. 5. For the initial seeds of each content item, the peers that appeared in the first snapshot of each torrent were considered to be seeds for that torrent. Since two consecutive snapshots are two hours apart, the peers that stayed in two consecutive snapshots were considered to have become seeds after downloading the content. Peers that had not appeared in the previous snapshot were considered to be requesters; we made them request the content at times uniformly distributed over the two-hour interval. We considered peers to have left the system when they no longer appeared in subsequent snapshots. Using these simulation events, which consisted of three event types (becoming a seed, requesting content, and leaving the system), we replayed P2P file sharing traffic for the first 20 days of the BitTorrent trace. To avoid the impact of the warm-up period, in which in-network caches are not yet filled, we evaluated the performance of P2P file sharing after the first 10 days. Note that becoming a seed does not mean serving permanently: seeds occasionally leave the system, and new requesters become seeds. As a result, swarms retain similar sizes (including both seeds and requesters), as shown in Fig. 6(a). Fig. 6(b) shows the content popularity distribution generated over the evaluated 10 days; we can verify that content popularity in the BitTorrent trace follows Zipf's law.
Fig. 5. All peers at the first snapshot become seeds. Newly appeared peers become requesters and send requests between two consecutive snapshots. Seeds or peers who do not appear at the following snapshot are considered to leave the system during that interval.
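The event derivation of Fig. 5 can be reconstructed as a small amount of bookkeeping over consecutive swarm snapshots. The sketch below is our illustrative reading of the rules described above, not the processing code actually used for the evaluation; the data layout and the uniform request-time draw are assumptions.

    import random

    SNAPSHOT_INTERVAL = 2 * 3600  # two hours between snapshots, in seconds

    def snapshots_to_events(snapshots):
        """Turn per-snapshot peer sets of one torrent into (time, peer, event) simulation events.

        snapshots: time-ordered list of (timestamp, set_of_peer_ids).
        Events are 'seed', 'request', or 'leave', matching Fig. 5.
        """
        events = []
        first_time, first_peers = snapshots[0]
        seeds = set(first_peers)
        for peer in first_peers:                            # peers in the first snapshot are initial seeds
            events.append((first_time, peer, 'seed'))
        for (t_prev, prev_peers), (t_next, next_peers) in zip(snapshots, snapshots[1:]):
            for peer in next_peers - prev_peers:            # newcomers issue a request within the interval
                events.append((t_prev + random.uniform(0, SNAPSHOT_INTERVAL), peer, 'request'))
            for peer in (prev_peers & next_peers) - seeds:  # stayed over an interval: now a seed
                seeds.add(peer)
                events.append((t_next, peer, 'seed'))
            for peer in prev_peers - next_peers:            # disappeared: leaves the system
                seeds.discard(peer)
                events.append((t_next, peer, 'leave'))
        events.sort(key=lambda e: e[0])
        return events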
Fig. 6. Trace analysis. (a) Average size of peer swarms per day. (b) Cumulative fraction of the number of requests of content items. Content IDs are sorted by popularity in descending order.
6.2. Topology
For the simulation topology, the AS-level topology of CAIDA's Internet Topology Data Kit (ITDK) traces was used. The ITDK's raw topology data was obtained by performing traceroutes to randomly chosen destinations in each routed /24 BGP prefix located in each of 29 countries. During the simulations, 21,292 ASes were used. Each AS node was regarded as an NDN router providing the functions of the PIT, FIB, and Content Store in the simulation. We assigned peers to ASes by performing longest prefix matching of all IP addresses in the BitTorrent trace against the ASes in the ITDK trace. For routing, we computed all-pair shortest paths between all ASes in the ITDK traces.
6.3. Schemes for performance evaluation
As a basis of comparison for the proposed PAR protocol, we developed other schemes, named server-client (SC), random seed selection (RSS), and optimal solution (OS). SC is a system in which a server handles all responsibilities for providing content. RSS is a simplified version of BitTorrent over CCN. In RSS, we assume that a tracker keeps track of seeds and selects a seed to serve each request. Once a peer joins the system, it sends an interest packet for a seed name to the tracker, and the tracker responds with information on a randomly chosen seed. After receiving the seed information, the peer sends an interest packet to, and receives a data packet from, the seed. The packets are forwarded along the shortest paths between seeds and peers in RSS. OS, on the other hand, is an ideal solution. In OS, we assume that all peers know where the source (a peer or a cache) is. OS can be considered a system with an ideal anycast scheme. When a peer or a cache becomes a source of content, the routing information immediately spreads through the whole network, and a requester can obtain the content from the closest source using the shortest path. The overhead of this ideal anycast is so high that OS cannot be realized in practice, and this overhead is excluded from our evaluation.
6.4. Metrics
We use the following metrics to evaluate the simulation results. The main metric is the average hop count, defined herein as the average number of inter-AS links between the requesters' ASes and the providers' ASes. For instance, if the requested data is stored in an in-network cache of the requester's AS, it is considered to be retrieved in zero hops. A small average hop count means that users are able to retrieve the content data without generating much inter-AS traffic. The amount of inter-AS traffic is another important metric: by observing both interest and data packets transferred between adjacent ASes, we can compare the overall performance of each scheme. We particularly analyze the flooding overhead of our scheme by tracking how many interest packets are propagated through the PAR tree. As the third metric, we measure the average number of FIB entries. In our scheme, content-level FIB entries dynamically increase and decrease, so we evaluate how much the FIB grows.
6.5. Simulations and post-processing
Using the BitTorrent and ITDK traces, we replay a large number of content requests over an Internet-scale topology. Since it is impractical to perform chunk-level simulation with large-scale BitTorrent traces and an Internet-scale topology due to the long simulation time, content-level (macroscopic) simulation is performed: a whole content item is retrieved by only one interest packet in the simulation. In the content-level simulation, some aspects such as downloading multiple chunks concurrently cannot be represented. The purpose of our study, however, is to propose an efficient and scalable routing protocol for file sharing services rather than to design a specific implementation-level protocol such as BitTorrent. Users typically retrieve whole content items in file-sharing services, so if we assume that users download all content chunks sequentially, the difference between chunk-level and content-level simulations would be small.
Additionally, we performed post-processing on the content-level simulation results to analyze chunk-level overheads. The content size was set to one GB, which was the average file size calculated from the BitTorrent trace. Each chunk was 4096 bytes, the default value in the NDN prototype (https://www.ccnx.org/), and thus one GB of content consisted of 262,144 chunks. With this post-processing, we can measure the number of interest packets generated and the amount of bandwidth consumed by them.
6.6. Cache size
File sharing cannot take much advantage of in-network caches due to its relatively infrequent requests and large file sizes. In particular, when file sharing traffic competes with other traffic such as web and UGC, the fraction of the cache that would be utilized for file sharing traffic is likely to be very small.
Fig. 7. Average hop count with different cache sizes.
Fig. 8. The amount of inter-AS traffic.
To understand the impact of cache size, we simulate cache sizes ranging from zero to 128 GB in each AS. Note that the cache sizes in the evaluation represent the portion of the in-network caches occupied by file sharing traffic. With extremely small cache sizes, we can evaluate the performance of PAR under circumstances in which other traffic (web or UGC) dominates the network and file sharing traffic can hardly utilize the in-network caches.
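As a supplement to the post-processing described in Section 6.5, the chunk-level analysis is essentially arithmetic over the content-level results. The short sketch below reproduces the figures quoted there (1 GB items split into 4096-byte chunks, i.e., 262,144 chunks per item) and shows how interest traffic could be estimated from an average hop count; the interest packet size used here is an assumption for illustration, not a measured value from the paper.

    CONTENT_SIZE_BYTES = 1 << 30          # one GB content item (2^30 bytes)
    CHUNK_SIZE_BYTES = 4096               # default chunk size in the CCNx prototype
    CHUNKS_PER_ITEM = CONTENT_SIZE_BYTES // CHUNK_SIZE_BYTES  # = 262,144 chunks

    INTEREST_PACKET_BYTES = 50            # per-interest size, assumed purely for illustration

    def interest_traffic_bytes(num_retrievals, avg_hop_count):
        """Rough estimate: one interest per chunk, each crossing avg_hop_count inter-AS links."""
        interests = num_retrievals * CHUNKS_PER_ITEM
        return interests * avg_hop_count * INTEREST_PACKET_BYTES

    print(CHUNKS_PER_ITEM)                     # 262144
    print(interest_traffic_bytes(1000, 2.0))   # illustrative figure only, in bytes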
7. Performance evaluation
7.1. Average hop count
The average hop counts for retrieving content are shown in Fig. 7, representing the results of 30 simulations of each scheme with random rendezvous points (RPs). In SC, only the on-path caches between the server and requesters are utilized, which lessens the number of cache hits because requesters at different locations use different paths. Requesters in RSS randomly select peers; this diversifies paths and utilizes more in-network caches than SC, but the topology-blind nature of RSS makes requesters frequently choose peers farther away than the closest available ones. As a result, requesters in SC and RSS experienced higher average hop counts than in PAR and OS. In PAR, the requested content is retrieved from the nearest source on the PAR tree, and thus the average hop counts of PAR and OS were similar even when few in-network caches were used. The difference between them comes from the fact that OS retrieves content over the shortest path from the nearest source in the network, whereas in PAR data comes from the nearest source along the PAR tree, which can be longer than the shortest path.
7.2. Inter-AS traffic analysis
The total amount of inter-AS traffic for content retrieval is shown in Fig. 8, and Fig. 9 shows the cumulative fraction of requests over the number of hops for content retrieval with various cache sizes. PAR and OS show much less inter-AS traffic than SC and RSS in Fig. 8, because about 70–80% of requests are handled within one hop in PAR and OS for each cache size, as shown in Fig. 9. Requesters in PAR retrieved content from the nearest source on the PAR tree, and the overall performance of PAR was similar to that of OS for all cache sizes. This suggests that PAR can work well with little assistance from in-network caches by utilizing peers' storage, and can support the traffic characteristics of P2P file
transfer (relatively infrequent requests and large file sizes) when competing with other types of traffic such as web and UGC.
As the cache size increased, the inter-AS traffic of SC and RSS decreased considerably, while that of PAR and OS was reduced only slightly. To take a closer look at the impact of the cache size, we measured which source was used for each content request at a cache size of 128 GB. About 37%, 62%, 50%, and 51% of requests were handled by in-network caches in SC, RSS, PAR, and OS, respectively. This means that PAR and OS also benefited from the larger caches by lowering the load on seeds, while the performance of SC and RSS was directly impacted by the larger cache size.
To allow analysis of interest traffic, post-processing was carried out to include chunk-level considerations, as mentioned in the previous section. Table 2 shows the interest traffic details. From the table, we can see that the total amount of interest traffic incurred by PAR is much less than that incurred by the other schemes. At the cost of flooding overhead, PAR reduces content retrieval hop counts, which results in less interest traffic.
7.3. Effect of the number of users
The PAR protocol is scalable with respect to the number of users and content items, as depicted in Fig. 10. The total users were counted from the traces, and 25%, 50%, and 75% of the users and the content they requested were chosen for simulation. With these chosen users and content, simulation was performed to measure the amount of inter-AS traffic. From the figure, we can see that as the number of users increases, the amount of traffic increases more rapidly with RSS and SC than with PAR and OS. This is because PAR and OS can find topologically closer seeds even with a small number of users, which generates less inter-AS traffic.
7.4. Average number of content-level entries
The average number of content-level entries was evaluated to determine the load of managing those entries. The 100 ASes with the largest average numbers of content-level entries in the FIB are represented in Fig. 11. For all cache sizes simulated, most ASes had fewer than 100 content-level entries. By creating content-level entries temporarily and then removing them on timeouts, the PAR protocol prevented the FIB from growing unnecessarily. Furthermore, the use of incremental interest flooding also inhibited meaningless creation of content-level entries.
Fig. 9. The number of hops for content retrieval. (a) Cache Size: 0 GBytes. (b) Cache Size: 8 GBytes. (c) Cache Size: 128 GBytes.
Table 2. Interest traffic details.
Cache size (GBytes)   Scheme   Content retrieval (GBytes)   Flooding (GBytes)   Total (GBytes)
0                     SC       3.32 × 10^5                  –                   3.32 × 10^5
0                     RSS      2.26 × 10^5                  –                   2.26 × 10^5
0                     PAR      8.05 × 10^4                  2.14 × 10^2         8.08 × 10^4
0                     OS       5.71 × 10^4                  –                   5.71 × 10^4
8                     SC       3.15 × 10^5                  –                   3.15 × 10^5
8                     RSS      2.14 × 10^5                  –                   2.14 × 10^5
8                     PAR      7.83 × 10^4                  2.09 × 10^2         7.85 × 10^4
8                     OS       5.57 × 10^4                  –                   5.57 × 10^4
128                   SC       2.14 × 10^5                  –                   2.14 × 10^5
128                   RSS      1.47 × 10^5                  –                   1.47 × 10^5
128                   PAR      6.63 × 10^4                  1.79 × 10^2         6.65 × 10^4
128                   OS       4.75 × 10^4                  –                   4.75 × 10^4
Fig. 11. Top 100 ASes in descending order on the average number of content-level entries.
Fig. 10. The amount of inter-AS traffic versus the number of users.
7.5. Effect of content popularity
Fig. 12 shows the average hop counts for retrieving content when the cache size was zero, plotted versus the content IDs sorted in descending order of popularity. The average hop counts for the SC and RSS schemes, without assistance from in-network caching, were around 3.3 and 2.3, respectively. On the other hand, in the PAR and OS schemes, requests for more than 80% of the content items were served within two hops. In the PAR and OS schemes, topology-aware sources were selected in the PAR tree and among all ASes, respectively, thereby allowing requesters to obtain the requested content with much fewer hops than in the SC and RSS schemes. In PAR, requests for exceptionally unpopular content do need to traverse the PAR tree, which led to greater hop counts in these cases; OS does not have the constraint of the PAR tree structure and thus showed the best performance among all the schemes simulated herein. However, OS needs accurate information about the locations of sources at global scale, which is hard to achieve in practice.
7.6. Effect of probabilities of becoming seeds
Different probabilities of becoming a seed are used to measure the effect of the number of seeds. We have assumed that all requesters become seeds between two consecutive snapshots, as depicted in Fig. 5. To identify the effect of relaxing this assumption, simulation is performed with varying probabilities of becoming a seed after requesting content. The probabilities are increased exponentially from 1% to 100%, and the result is shown in Fig. 13. From the result, we can see that PAR and OS show lower average hop counts than RSS and SC at higher becoming-seed probabilities. The average hop count of PAR becomes larger at probabilities below 10%, because interest packets traverse the PAR tree, which can result in longer paths than the shortest paths used in RSS and SC.
8. Conclusion
In this paper, we introduced PAR, a peer-aware routing protocol for file sharing in ICN architectures. PAR extends the scope of caches
to the storage of peers by constructing a shared tree. Without requiring additional name resolution services, PAR enables the discovery of available copies of the requested content. By employing incremental flooding of requests in the tree, the discovery overhead can be controlled. Because PAR is an inherently topology-aware routing protocol, it allows requests to reach the nearest source, either a peer or a cache in the PAR tree. The performance and overhead of PAR were evaluated with a number of large-scale simulations using a BitTorrent trace and an Internet-scale topology. The simulation results show that content can be efficiently retrieved with PAR even with extremely small in-network caches, and that PAR is suitable for scalable and efficient file sharing in ICN.
Fig. 12. Average hop count vs. sorted content ID, cache size: 0. (a) SC, (b) RSS, (c) PAR, and (d) OS.
Fig. 13. Average hop count with different probabilities of becoming seeds.
Acknowledgments
We thank Jinyoung Han, Ted Taekyoung Kwon, and Yanghee Choi at Seoul National University and Hyun-chul Kim at Sangmyung University for sharing BitTorrent traces. This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (GRL) (NRF-2014K1A1A2064649).
References
Ahlgren B, Dannewitz C, Imbrenda C, Kutscher D, Ohlman B. A survey of information-centric networking. IEEE Commun Mag 2012;50(7):26–36.
Anand A, Muthukrishnan C, Akella A, Ramjee R. Redundancy in network traffic: findings and implications. ACM SIGMETRICS Perform Eval Rev 2009;37(1):37–48.
Bittorrent and torrent software surpass 150 million user milestone, http://www.bittorrent.com/intl/es/company/about/ces_2012_150m_users.
CAIDA's internet topology data kit (ITDK), http://www.caida.org/data/active/ipv4_routed_topology_aslinks_dataset.xml.
Carofiglio G, Gallo M, Muscariello L, Perino D. Evaluating per-application storage management in content-centric networks. Comput Commun 2013;36(7):750–7.
Castro M, Druschel P, Kermarrec A-M, Rowstron AI. Scribe: a large-scale and decentralized application-level multicast infrastructure. IEEE J Sel Areas Commun 2002;20(8):1489–99.
Chai WK, He D, Psaras I, Pavlou G. Cache less for more in information-centric networks (extended version). Comput Commun 2013;36(7):758–70.
Chawathe Y, Ratnasamy S, Breslau L, Lanham N, Shenker S. Making gnutella-like p2p systems scalable. In: Proceedings of the 2003 conference on applications, technologies, architectures, and protocols for computer communications. ACM; 2003. p. 407–18.
Cisco. Cisco visual networking index: forecast and methodology. In: White paper. Cisco; 2013.
Dannewitz C, Kutscher D, Ohlman B, Farrell S, Ahlgren B, Karl H. Network of information (netinf)—an information-centric networking architecture. Comput Commun 2013;36(7):721–35.
Fricker C, Robert P, Roberts J, Sbihi N. Impact of traffic mix on caching performance in a content-centric network. In: Proceedings of IEEE Infocom workshop on emerging design choices in name-oriented networking. IEEE; 2012.
Han J, Kim S, Chung T, Kwon TT, Kim H-c, Choi Y. Bundling practice in bittorrent: what, how, and why. In: ACM SIGMETRICS performance evaluation review, vol. 40. ACM; 2012. p. 77–88.
Jacobson V, Smetters DK, Thornton JD, Plass MF, Briggs NH, Braynard RL. Networking named content. In: Proceedings of the 2009 ACM CoNEXT conference. ACM; 2009. p. 1–12.
Katsaros K, Xylomenos G, Polyzos GC. Multicache: an overlay architecture for information-centric networking. Comput Netw 2011;55(4):936–47.
Kim Y, Yeom I. Performance analysis of in-network caching for content-centric networking. Comput Netw 2013;57(13):2465–82.
Lagutin D, Visala K, Tarkoma S. Publish/subscribe for internet: PSIRP perspective. In: Future internet assembly; 2010. p. 75–84.
Le Blond S, Legout A, Dabbous W. Pushing bittorrent locality to the limit. Comput Netw 2011;55(3):541–57.
Lee H, Nakao A. User-assisted in-network caching in information-centric networking. Comput Netw 2013;57(16):3142–53.
Muscariello L, Carofiglio G, Gallo M. Bandwidth and storage sharing performance in information centric networking. In: Proceedings of the ACM SIGCOMM workshop on information-centric networking. ACM; 2011. p. 26–31.
Ns-3 based named data networking (NDN) simulator, http://ndnsim.net/.
Ns-3, http://www.nsnam.org/.
Pentikousis K, et al. ICN baseline scenarios and evaluation methodology. In: draft-pentikousis-icn-scenarios-04; 2013.
Perino D, Varvello M. A reality check for content centric networking. In: Proceedings of the ACM SIGCOMM workshop on information-centric networking. ACM; 2011. p. 44–9.
Project CCNx, https://www.ccnx.org/.
Rautio T, Mämmelä O, Mäkelä J. Multiaccess NetInf: a prototype and simulations. In: TRIDENTCOM'10; 2010. p. 605–8.
Rossi D, Rossini G. Caching performance of content centric networks under multipath routing (and more). Technical report, Telecom ParisTech.
Tyson G, Kaune S, Miles S, El-khatib Y, Mauthe A, Taweel A. A trace-driven analysis of caching in content-centric networks. In: Proceedings of the 21st international conference on computer communication networks (ICCCN). IEEE; 2012.
Xie H, Yang YR, Krishnamurthy A, Liu YG, Silberschatz A. P4P: provider portal for applications. In: ACM SIGCOMM computer communication review, vol. 38. ACM; 2008. p. 351–62.
Yi C, Afanasyev A, Moiseenko I, Wang L, Zhang B, Zhang L. A case for stateful forwarding plane. Comput Commun 2013;36(7):779–91.
Zhu Z, Wang S, Yang X, Jacobson V, Zhang L. ACT: audio conference tool over named data networking. In: Proceedings of the ACM SIGCOMM workshop on information-centric networking. ACM; 2011. p. 68–73.