Design and Implementation of Centrally-Coordinated Peer-to-Peer Live-streaming

ROBERTO ROVERSO

Licentiate Thesis Stockholm, Sweden 2011

TRITA-ICT/ECS AVH 11:03 ISSN 1653-6363 ISRN KTH/ICT/ECS/AVH-11/03-SE ISBN 978-91-7415-957-8

KTH School of Information and Communication Technology SE-100 44 Stockholm SWEDEN

Academic dissertation which, with the permission of Kungl Tekniska högskolan (KTH Royal Institute of Technology), is submitted for public examination for the degree of Licentiate in Electronic and Computer Systems on Thursday, 5 May 2011, at 14.00 in Sal D, ICT School, Kungl Tekniska högskolan, Forum 105, 164 40 Kista, Stockholm. © ROBERTO ROVERSO, May 2011. Printed by: Universitetsservice US AB


Abstract

In this thesis, we explore the use of a centrally-coordinated peer-to-peer overlay as a possible solution to the live streaming problem. Our contribution lies in showing that such an approach is indeed feasible, provided that a number of key challenges are met. The motivation behind exploring an alternative design is that, although a number of approaches have been investigated in the past, e.g. mesh-pull, tree-push, hybrids and best-of-both-worlds mesh-push, no consensus has been reached on the best solution to the problem of peer-to-peer live streaming, despite current deployments and reported successes.

In the proposed system, we model sender/receiver peer assignments as an optimization problem. Optimized peer selection based on multiple utility factors, such as bandwidth availability, delays and connectivity compatibility, makes it possible to achieve large source bandwidth savings and to provide a high quality of user experience. Clear benefits of our approach are observed when Network Address Translation constraints are present on the network. We have addressed key scalability issues of our platform by parallelizing the heuristic at the core of our optimization engine and by implementing the resulting algorithm on commodity Graphics Processing Units (GPUs). The outcome is a Linear Sum Assignment Problem (LSAP) solver for time-constrained systems which produces near-optimal results and can be used for any instance of LSAP, i.e. not only in our system.

As part of this work, we also present our experience with Network Address Translator (NAT) traversal in peer-to-peer systems. Our contribution in this context is threefold. First, we provide a semi-formal model of state-of-the-art NAT behaviours. Second, we use our model to show which NAT combinations can theoretically be traversed and which cannot. Last, for each of the combinations, we state which traversal technique should be used. Our findings are confirmed by experimental results on a real network.

Finally, we address the problem of reproducibility in the testing, debugging and evaluation of our peer-to-peer application. We achieve this by providing a software framework which can be transparently integrated with any already-existing software and which is able to handle concurrency, system time and network events in a reproducible manner.


Acknowledgements

I am extremely grateful to Sameh El-Ansary for the way he supported me during this work. This thesis would not have been possible without his patient supervision and constant encouragement. His clear way of thinking, methodical approach to problems and enthusiasm are of great inspiration to me. He also taught me how to do research properly given extremely complex issues and, most importantly, how to make findings clear for others to understand. In particular, I admire his practical approach to solving problems in an efficient and timely fashion given the very demanding goals and strict deadlines imposed by the industrial setting we have been working in. In these years, he has treated me more as a friend than a colleague/student, and I feel very privileged to have worked with him and hope to continue doing so in the future.

I would like to acknowledge my supervisor Seif Haridi for giving me the opportunity to work under his supervision. His vast knowledge of the field and experience was of much help to me. In particular, I take this opportunity to thank him for his understanding and guidance in the complex task of combining the work of a student with that of an employee of Peerialism.

I am grateful to Peerialism AB for funding my studies and to all the company's very talented members for their help: Johan Ljungberg, Andreas Dahlström, Mohammed El-Beltagy, Nils Franzen, Magnus Hedbeck, Amgad Naiem, Mohammed Reda, Jonas Vasell, Christer Wik, Riccardo Reale and Alexandros Gkogkas. Each one of them has contributed to this work in his own way and has made my stay at the company very enjoyable.

I would also like to acknowledge all my colleagues at the Swedish Institute of Computer Science and KTH: Vladimir Vlassov, for his patience, Cosmin Arad, Joel Höglund, Tallat Mahmood Shaffat, Ahmad Al-Shishtawy, Amir Payberah and Fatemeh Rahimian. In particular, I would like to express my gratitude to Jim Dowling for providing me with valuable feedback on my work.

Special thanks go to all the people who have supported me in the last years and have made my life exciting and worth cherishing: Stefano Bonetti, Tiziano Piccardo, Christian Calgaro, Jonathan Daphy, Tatjana Schreiber, to name a few, and, in particular, Maren Reinecke for her love. Finally, I would like to dedicate this work to my parents, Segio and Odetta, and my close relatives, Paolo and Ornella, who have at all times motivated and helped me in every possible way.

To my family

Contents

I Thesis Overview

1 Introduction
  1.1 Content Streaming
  1.2 Thesis Organization

2 Peer-To-Peer Live Streaming
  2.1 Tree-based
    2.1.1 Overlay maintenance and construction
  2.2 Mesh-based
  2.3 Hybrid Approaches

3 Problem Definition

4 Thesis Contribution
  4.1 List of publications
  4.2 Design and implementation of a centrally-managed peer-to-peer live streaming platform
    4.2.1 Solving Linear Sum Assignment Problems in a time-constrained environment
  4.3 NAT Traversal
  4.4 Highly reproducible emulation of P2P systems

5 Conclusion and Future Work

Bibliography

II Research Papers

6 On The Feasibility Of Centrally-Coordinated Peer-To-Peer Live Streaming

7 NATCracker: NAT Combinations Matter

8 GPU-Based Heuristic Solver for Linear Sum Assignment Problems Under Real-time Constraints

9 MyP2PWorld: Highly Reproducible Application-level Emulation of P2P Systems

Part I

Thesis Overview


Chapter 1

Introduction

Peer-to-peer (P2P) systems have shown a significant evolution since they were first introduced to the world by Gnutella [41] and Kazaa [27]. Nowadays, applications which use the peer-to-peer approach range from illegal file sharing to the distribution of game updates. It is safe to state that Bittorrent in particular has been a major force driving the bandwidth demands of most consumer networks throughout the last 5 years. It is estimated that in 2009 between one fourth and one third of the Internet traffic was in some way related to P2P applications [20]. The increased popularity of P2P platforms has coincided with efforts from academia to understand how such a large amount of traffic influences the Internet and what can be done to reduce its congesting impact, in particular with regard to Bittorrent [39][44]. Industry as well has applied the peer-to-peer approach to a number of areas, including VoIP, with Skype [18], distributed storage, with Wuala [19], and on-demand audio streaming, with Spotify [25]. All of the aforementioned are attempts to provide services which do not involve significant costs in terms of bandwidth consumption and infrastructure.

P2P-based software amounts to only a tiny part of all Internet-based services. The bulk of the industry instead relies on expensive but reliable solutions, namely content delivery networks (CDNs) and clouds. In particular, when considering video streaming, no commercial peer-to-peer software has been widely deployed. However, a number of free applications such as SopCast [2] and PPLive [14] have proven very effective for large-scale live streaming, mostly because of their limited bandwidth requirements at the distribution site. In fact, most of the sources of the streams in those systems are users which broadcast live content to thousands of others with limited upload bandwidth. On the other hand, the aforementioned applications provide a low quality of service which would not be acceptable in a commercial system.

Many solutions and different approaches have been proposed by the research community for the problem of streaming live video and audio content over the Internet using overlay networks. However, no consensus has been reached on which


one of them best solves the difficult task of guaranteeing a high quality of user experience while providing a large amount of bandwidth savings. In this thesis, we explore an alternative and novel approach to the aforementioned problem, that is, using central coordination to control the delivery of content on overlay networks, in the hope of showing the viability of such an approach for large-scale live content streaming. The next sections detail the types, requirements and technologies currently used in content streaming services. This will serve as background for a better understanding of the challenging problem of efficiently distributing content streams over the Internet.

1.1 Content Streaming

Streaming services can be classified into two main classes: video-on-demand and live.

Video-on-Demand (VOD). VOD allows users to select and watch pre-recorded video/audio material at any point in time. Users are usually presented with a catalog of streams to choose from; once a choice has been made, the stream is sent to the player as a flow of video/audio chunks. VOD allows the delivery to start at any point in the stream, and seeking operations are also allowed. VOD has an inherently sparse content popularity distribution. It is widely recognized that content follows a long tail, where the majority of the videos are not accessed very often, while a few popular ones are requested very frequently [8]. The complexity of VOD lies in guaranteeing the same quality of experience for popular and non-popular content items.

Live Streaming. The main difference of live streaming compared to VOD is that the content is not pre-recorded. The stream is instead created and broadcast in real time. As a requirement, every client receiving the live content should experience the minimum possible delay from the moment the content becomes available at the distribution point, i.e. the streaming server, to the point when it gets played at the receiver's end. A desirable feature is also to minimize the inter-client delay, i.e. the playback point should be synchronized, or within a reasonable time window, across all clients.

Live streaming is usually implemented using stateful network control protocols, such as RTSP [52], where clients establish and manage media sessions towards the streaming server by issuing VCR-like commands, e.g. play, stop and pause. The media delivery is carried out using a transport protocol such as RTP [51]; however, proprietary alternatives are also common, e.g. RDT [1]. In standard live streaming, it is the server that pushes content fragments at a specific rate to the client following a single play request. At the transport level, standard live streams are delivered through UDP, while TCP is used for control messages.


Recently, the industry has introduced a new technology for live streaming, called HTTP-live. In HTTP-live, the stream is split into a number of small HTTP files, i.e. fragments. The streaming server appears to the clients as a standard HTTP server. When a client first contacts the streaming server, it is presented with a descriptor file, called a Manifest, which outlines the characteristics of the stream: the paths of the stream's fragments on the HTTP server and the bitrate characteristics. As the content becomes available from an encoder or a capturing device, the streaming server creates new fragments and regenerates the Manifest accordingly. The player periodically requests a new copy of the Manifest to learn which fragments are available at the current time. The reasons behind the development of the HTTP-live protocol are the simplicity of management at the server side and the use of HTTP as a transport protocol, which makes it easier to achieve good connectivity in restrictive environments such as corporate networks. Examples of HTTP-live protocols are Apple's Live Streaming [16] and Microsoft's Smooth Streaming [17].

Bandwidth Requirements. Media streams are usually quite demanding in terms of bandwidth consumption. YouTube, for example, requires a minimum bitrate of 256 Kbit/s for normal-quality videos encoded with the H.264 codec. With the same video compression format, it is possible to achieve a quality comparable to digital satellite TV at 1.5 Mbit/s, whereas an HD-quality stream requires a minimum bitrate of 4 Mbit/s. The high bitrate requirements of video streaming raise obvious challenges: first, from the point of view of server infrastructure, since a single streaming server is typically capable of handling just a few thousand clients; and second, from the point of view of bandwidth consumption, because streaming requires a capacity of many Gbit/s towards the distribution site. Bandwidth capacity is by far the most expensive of the two aspects. Pricing as of Q4 2010 for streaming from a CDN is shown in Table 1.1.

Volume    Max Price ($)    Min Price ($)
50TB      0.45             0.40
100TB     0.25             0.20
250TB     0.10             0.06
500TB     0.06             0.02

Table 1.1: Highest and lowest prices per acquired volume in the CDN market as of Q4 2010. Data taken from [40].
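To make the Manifest-driven delivery loop described above concrete, the following is a minimal, illustrative client sketch. The manifest URL and its JSON layout (a list of fragment paths plus a fragment duration) are hypothetical stand-ins invented for illustration; real HTTP-live protocols such as Apple's Live Streaming or Smooth Streaming define their own playlist formats.

    import json
    import time
    from urllib.request import urlopen

    MANIFEST_URL = "http://example.com/stream/manifest.json"  # hypothetical endpoint

    def fetch_manifest(url):
        # The Manifest lists the fragments currently available on the server.
        with urlopen(url) as resp:
            return json.loads(resp.read())

    def play(fragment_bytes):
        pass  # hand the fragment to the decoder/player (stub)

    played = set()
    while True:
        manifest = fetch_manifest(MANIFEST_URL)
        for path in manifest["fragments"]:  # hypothetical field name
            if path not in played:
                with urlopen(manifest["base_url"] + path) as resp:
                    play(resp.read())
                played.add(path)
        # Poll again after roughly one fragment duration; the server
        # regenerates the Manifest as new fragments become available.
        time.sleep(manifest.get("fragment_duration", 2))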

Distribution Infrastructure. Live and VOD streams are mostly distributed using unicast towards a single content source or a CDN. Multicast is also exploited.


Given that providers of streaming services are usually ISPs, the quality of the service is guaranteed by means of network resource reservation. Alternatives to the ISP approach include proprietary application-level solutions, such as Voddler, Netflix and Hulu. The delivery of audio and video streams in this case happens without any guarantee of quality of service or prevention of service interruption. Despite their best-effort nature, these solutions have enjoyed a large amount of success in the last years. Internet-based services use different delivery strategies for streaming:

• Unicast: Single end-to-end connectivity through either TCP or best-effort UDP is used in this case. The load of multiple clients is usually shared among multiple locations of a Content Delivery Network. Server farms are placed in strategic geographic locations; proximity to clients allows for lower distribution delays. CDNs are usually organized so as to lower peering costs by placing servers with copies of the same content inside ASs and ISPs.

• IP Multicast: Support for efficient broadcasting is in this case implemented at the network layer. Multicast is the most efficient way to deliver streams of content, since the distribution happens using a single stream of data along a tree-like network path. However, IP Multicast setup and configuration is cumbersome and requires expensive hardware.

• Peer-To-Peer: As opposed to the aforementioned strategies, the peer-to-peer approach utilizes hosts as relays for content streams. A client plays a double role: it receives the content data and delivers it to the player, and it makes the data available to other peers. This approach allows the load of distribution to be shared among all involved hosts. Only a few peers need to receive content from the source, whilst the others can retrieve it from them. Obviously, if the peer-to-peer delivery is organized in the right way, this can lead to significant savings in terms of bandwidth utilization at the distribution site and to improved scalability of the service.

1.2 Thesis Organization

This chapter provides a general introduction to the thesis and to the problem of content streaming. Chapter 2 presents the state of the art of peer-to-peer live streaming. A definition of the problems addressed in this work is presented in Chapter 3. Our contributions to the defined issues are explained in Chapter 4; Section 4.1 provides the list of publications related to this thesis. Finally, Chapter 5 concludes the thesis and gives insight into the future directions of this work.

Chapter 2

Peer-To-Peer Live Streaming

Peer-to-peer content streaming can be a viable solution for providing scalability and distribution savings. However, P2P live streaming is subject to a number of challenges. Typically, in any live streaming architecture, the source splits the content into a number of data chunks of fixed or variable size which need to be delivered to the player at a fixed rate and within a strict deadline. Missing a deadline might cause loss of quality, temporary interruption or full termination of the playback. The behaviour of the content players, when such problematic events happen, can vary significantly according to the type of transport protocol, the codec used for encoding/decoding and the quality of the stream.

Challenges

The main challenge for a peer-to-peer live streaming system is to meet real-time constraints while coping with dynamicity, e.g. churn, network congestion and connectivity limitations. On top of that, the application should strive for efficient bandwidth utilization: the peers in the overlay network should contribute their upload bandwidth in order to offload the source of the content as much as possible.

Churn, defined as the process and rate of peers joining and leaving the peer-to-peer network, is an important issue in peer-to-peer systems. This is because nodes have very limited time to adapt to overlay network changes, as deadlines for video/audio fragments are in the order of seconds rather than minutes. Studies on peer-to-peer file-sharing systems have shown that node lifetimes follow a heavy-tailed Pareto distribution [6][50], where a few peers have very long lifetimes while all others have very short ones. In addition, the population size tends to remain stable over time. A sketch of a churn generator based on such a distribution is given below.
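Heavy-tailed session lengths of this kind are straightforward to reproduce when driving churn in a simulator. The following minimal sketch, with shape and scale parameters chosen purely for illustration, draws peer lifetimes from a Pareto distribution:

    import numpy as np

    rng = np.random.default_rng(seed=42)

    def sample_lifetimes(n_peers, shape=1.5, scale_minutes=5.0):
        # numpy's pareto() draws from a Lomax distribution; adding 1 and
        # multiplying by the scale yields classical Pareto-distributed
        # lifetimes with minimum value `scale_minutes`.
        return scale_minutes * (1.0 + rng.pareto(shape, size=n_peers))

    lifetimes = sample_lifetimes(10_000)
    # Heavy tail: most sessions are short, a few are very long.
    print(f"median: {np.median(lifetimes):.1f} min, max: {lifetimes.max():.1f} min")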

In live streaming, instead, churn behaviour is significantly different. In Figure 2.1(a) and Figure 2.1(b), we show the arrival and departure rates of a movie channel broadcast using the PPLive streaming platform over a 24-hour time span. As we can see, the join rate varies significantly during the day and peaks at noon and in the late evening. As this is a movie channel, we observe batch peer departures at the end of each movie, i.e. every second hour. Figure 2.1(c) and Figure 2.1(d) instead show the same type of statistics for a TV channel. The arrival rate is very similar to the previous case; however, the departure trend is totally different. Figure 2.2 shows the lifetime distribution of peers in the same system. As we can see, most of the users, both for the movie and the TV channel, stay online for less than 1.5 hours. It is therefore clear that a steady amount of dynamicity should be expected in a peer-to-peer live streaming system at all times. In addition, flash crowds and sudden departures of large numbers of peers are also observed.

[Figure 2.1: Peer arrival (a) and departure (b) rates of a movie channel, and peer arrival (c) and departure (d) rates of a TV channel. Figures taken from "A Measurement Study of a Large-Scale P2P IPTV System" [15].]

[Figure 2.2: Peer lifetime distribution [15].]

A further source of disruption in a peer-to-peer network is congestion. Although the last-mile segment is commonly believed to be the bottleneck along a network route, congestion can be experienced at different levels in the network. This is mainly caused by the fact that ISPs and ASs dimension their internal networks and border links for average traffic rather than for peak traffic. This leaves significant room for congestion scenarios when most of the users are downloading large amounts of data, e.g. when accessing high-definition live streaming events of particular interest. The effects of network congestion are longer transmission delays, packet loss and the blocking of new connections.

An additional requirement of a live streaming platform is that the stream must be received with a small delay from the source of the stream. Furthermore, users in the same geographical location should not experience a significant difference in playback point with respect to their neighbours. These requirements tie directly into the meaning of live broadcasting. Guaranteeing low delay from the source is particularly cumbersome in peer-to-peer networks since, in most cases, the stream has to traverse multiple hops. In order to address this issue, it is often necessary to introduce structure in the overlay such that the number of hops can be kept under control. Finally, a peer-to-peer live streaming application should strive to keep traffic inside a given network segment, i.e. an ISP or an Autonomous System (AS). Hosts which belong to the same segment are likely to have low communication delay between them, and promoting intra-segment communication also lowers the peering costs incurred by network operators.

There exist three main approaches to peer-to-peer live streaming; we detail them in the next sections.

2.1 Tree-based

The tree-based approach aims to recreate the network structure of IP multicast using an overlay network. The peers organize themselves in a tree whose root is the broadcast provider. The content is then pushed from the root along the tree: peers receive the video from their parents and forward it to their children.

The typical structure of a system based on the tree-based approach is shown in Figure 2.3. Peers with the highest upload bandwidth capacity are usually placed near the source of the stream, in this case peers number two and three, while the others are positioned in subsequent rows according to their capacity, in decreasing order. Examples of tree-based systems are Overcast [22], Climber [34] and ZigZag [54].

[Figure 2.3: Tree-based system]

The main advantage of a tree-based overlay is that the distribution delay is predictable and corresponds to the sum of the delays along the overlay path. There exist a number of ways to construct a peer-to-peer tree given a set of available peers. Two major aspects must be considered when building such a structure: the depth of the tree and the fan-out degree of the internal nodes. Since peers at lower levels of the tree receive content from others which are closer to the source, it is necessary to keep the number of rows in the tree to a bare minimum in order to avoid delays. As a consequence, the fan-out degree of the internal nodes in the tree should be as large as possible. However, the fan-out degree of a node is constrained by its upload capacity, so a node can provide to only a limited number of peers.

In order to improve bandwidth utilization, a multi-tree structure can be used. In a single-tree structure, the leaves of the tree do not contribute to the delivery. In a multi-tree configuration, instead, the content server splits the stream into multiple sub-streams which are then broadcast over separate overlay trees. As a consequence, a peer might be a provider of a certain sub-stream but only a receiver of another. Solutions using this approach are, for instance, SplitStream [7] and Orchard [30]. A study on a widely deployed system, GridMedia [61], has shown that using multi-tree-based systems leads to near-optimal bandwidth utilization within a certain degree of churn. Figure 2.4 shows a multi-tree overlay network structure. In the example, two sub-streams are broadcast by the streaming server along two trees, starting at peers 2 and 1.

[Figure 2.4: Multitree-based system]

[Figure 2.5: Mesh-based system]

Maintenance of the streaming tree is essential given the high dynamicity of peer-to-peer networks. When a peer abruptly leaves the overlay, all of its descendants get disconnected from the stream. In order to give the system room to recover from a failure, a peer typically keeps a buffer, which provides a way to compensate for disruptions in the delivery. Since the buffer itself introduces a delay from the point the data is received to the point it is provided to the player, its size is usually kept small. It is therefore very important for the system to be able to recover quickly from failures in order to avoid playback issues.
CHAPTER 2. PEER-TO-PEER LIVE STREAMING

2.1.1

11

Overlay maintenance and construction

A tree-based overlay construction and maintenance can be carried out either in a centralized or decentralized fashion. We describe both of the methods in the next paragraphs. Centrally-coordinated tree construction. In this case, peers rely on a central coordinator for instructions on which is the peer they should get the stream from. The central server keeps an overview of the system which includes characteristics of the peers and configuration of the tree at a certain point in time. Using this information, the central server makes decisions upon the join of a peer or a failure. For the purpose of load balancing, the central entity might enforce an explicit reconfiguration of the overlay based on some criteria. Load balancing operations may be carried out, for instance, to lower the number of rows in the overlay three or to improve efficiency. The coordinator might become the performance bottleneck of the system, limiting scalability. This considering the challenging task of both providing quick reaction in case of churn and coping with overlay management complexity. To the best of our knowledge, central coordination has been used exclusively for content distribution, e.g. in Antfarm [36], but never for live streaming. In this thesis, we will explore this approach and report our experience and findings. Decentralized tree construction. A number of distributed algorithms have been designed to construct and maintain a tree-based live streaming overlay. In this approach, peers negotiate by means of gossiping their placement in the tree using their upload bandwidth as a metric. Examples of systems using decentralized tree construction are: SplitStream [7], Orchard [30], ChunkySpread [57] and CoopNet [32].

2.2

Mesh-based

In mesh-based overlay networks no static overlay structure is enforced. Instead, peers create and lose peering relationships dynamically. Examples of systems using this kind of approach are: SopCast [2], DONet/Coolstreaming [62], Chainsaw [33], BiToS [58] and PULSE [37]. A typical structure of a mesh-based system is shown in Figure 2.5. In the pictured case, only few peers receive the stream from the source while the majority of them exchanges content chunks through overlay links, i.e. the black arrows in the picture. A mesh-based system is usually composed by three main parts: membership, partnership and chunk scheduling. The membership mechanism allows peers to discover others in the network receiving the same stream. This is usually achieved by means of a central discovery service where all peers report their address, e.g. the Bittorrent Tracker. Another way of discovering peers

12

2.2. MESH-BASED

is using gossip. Gossiping algorithms in this case are usually biased towards finding neighbours that have interesting characteristics, e.g. peers with similar playback point and available upload capacity [35] or that are geographically closer to the requester [23]. Upon the discovery of peers, the partnership service is used to establish temporary peering connections with a subset of peers which is considered suitable for the receipt of the stream. A partnership between two peers is commonly established according to the following considerations: • The load on the peer and resource availability at both ends. Possible load metrics include: available upload and download bandwidth capacity and CPU/memory usage. • Network Connectivity. The potential quality of the link between the two peers in terms of delay, packet loss characteristics, network proximity and firewall/NAT configurations, as in [24]. • Content Availability. The available chunks at both ends, i.e. the parts of the stream which have been already downloaded and are available locally. Availability of chunks is advertised periodically. Being the peer-to-peer network very dynamic, the partnership service continuously creates and terminates peering relationships. Control traffic generated by the partnership service is usually significant given that peers need frequent exchange of status information. On the other hand, less frequent updates might lead to increased latencies since pieces become available later. On top of that, a larger degree of partners gives more flexibility when requesting content, as peers have more choice to quickly change from a partner to the other if complications arise during the delivery. It is therefore important to find the right trade-off between partner set size and frequencies of updates. The chunk scheduling service is entitled with the task of requesting content chunks as the delivery progresses, making use of information about the available peers collected by the membership service. Differently from the tree-based case, content chunks are not delivered through an already established overlay network path. In fact, a peer might be downloading different chunks from different peers in parallel. Chunks may then take different paths to reach a peer and consequently be delivered in an out-of-order fashion. In order to guarantee continuous playback, a peer keeps a buffer of received chunks and re-orders them before delivering them out to the player. The content of the buffer is usually what is made available to other peers. For this purpose, a peer might keep chunks available for a longer time. Given their high peering degree and randomness, mesh-based systems are extremely robust both against churn and network-related disruptions, such as congestion. Since no static structure is enforced on the overlay, a peer can quickly switch between different providers if a failure occurs or the the necessary streaming rate cannot be sustained. That said, since every chunk is treated as a separate delivery unit, per-packet distribution paths and delivery times are not predictable

CHAPTER 2. PEER-TO-PEER LIVE STREAMING

13

and highly variable. Consequently, it is very challenging to design a mesh-based streaming platform able to guarantee playback deadlines. Another drawback of the mesh-based approach is sub-optimal bandwidth utilization of provider peers since chunk requests are mostly unpredictable.

2.3

Hybrid Approaches

A tree-based approach can be combined with a mesh-based to obtain better bandwidth utilization and lower delays. mTreebone[59] elects a subset of nodes in the system as stable and uses them to form a tree structure. The content is broadcasted from the source node along the tree structure. A second mesh overlay is then built comprising both of the peers in the tree and of the rest of the peers in the system. For content delivery, the peers rely on the few elected stable nodes but default to the auxiliary mesh nodes in case of a stable node failure. The drawback of this approach is that a few stable peers might become congested while the others are not contributing with their upload bandwidth. Thus, this solution clearly ignores the aspect of efficient bandwidth utilization. As an alternative approach, CliqueStream [3] creates clusters of peers using both delay and locality metrics. One or more peers are then elected in each cluster to form a tree-like structure to interconnect the clusters. Both in PRIME [29] and NewCoolsteaming [26], peers establish a semi-stable parent-to-child relationship to combine the best of push and pull mesh approach. Typically a child subscribes to a certain stream of chunks from the parent and the parent pushes data to the child as soon as it becomes available. It has been shown that this hybrid approach can achieve near-optimal streaming rates [38] but, in this case as well, no consideration has been given to efficient bandwidth utilization.

Chapter 3

Problem Definition In this work, we target the design and implementation of a live streaming platform which allows for large amount of bandwidth savings at the source of the stream and high quality of user experience. In order to achieve the aforementioned goal, we argue that a number of important challenges must be addressed: • Efficient Overlay Management. The system must be carefully crafted to enable management of peers in a way to enable high upload bandwidth utilization while coping with real-time streaming deadlines. User quality of experience must be guaranteed at all times, even if bandwidth savings need to be sacrificed for that purpose. As a consequence, low initial buffering time, small playback and distribution delays should be maintained throughout a streaming session. A further requirement on the platform is to preserve network locality of streams in order to limit peering costs for operators. • Scalability. The proposed system should scale to thousands of users. In order to achieve this, obvious bottlenecks should be resolved with targeted solutions or by providing an alternative design of the platform. • Circumventing NAT limitations. In peer-to-peer systems deployed on consumer network, it is common to observe a large amount of participating hosts behind NAT, typically up to 70% [47]. Obviously, limited accessibility to peers translates in limited efficiency of the system. In order to overcome this limitation, a number of NAT traversal techniques exist in order to enable connectivity even when peers are behind two different NAT boxes [10][53][42][43]. However, in our experience, their effectiveness is limited in the case direct connections must be achieved between peers behind different NAT boxes. When a direct connection fails to be established, content is simply relayed through other peers [21]. This is not feasible 15

16

when considering streaming systems where the amount of data to be relayed is extremely large. Research on NAT limitations has not known significant advancements in the last years. This, even though NAT constraints are one of the most relevant issues in peer-to-peer systems, for the simple fact that good connectivity is a precondition for any distributed algorithm to function correctly. • Deterministic Testing/Debugging/Evaluation Environment. Arguably, when designing large-scale systems, it is infeasible to cover all possible runtime scenarios with pure reasoning. For that reason, prototyping of peer-to-peer systems is often conducted in a controlled environment, such as Discrete Event Simulator. One of the main advantages in using a controlled environment is that, depending on the platform, it is possible to achieve partial reproducibility of executions, i.e. the ability to execute the same experiment many times while preserving the exact same sequence of events, at least network-related ones. This brings obvious advantages when it comes to debugging of code and detailed inspection of the application’s execution. A number of platforms have been developed to allow reproducibility of network events, both at kernel, e.g. [55], and application level, e.g. WiDS [28], ProtoPeer [11] and SSFNet [60]. However, to the best of our knowledge, no solution has been developed to manipulate operating system concurrency and thus enabling a fully deterministic application execution.

Chapter 4

Thesis Contribution In this chapter, we summarize the contributions of this thesis. First, we list the publications that were produced as a result of this work. After that, we provide a small summary of each publication’s contribution.

4.1

List of publications

• Roberto Roverso, Amgad Naiem, Mohammed Reda, Mohammed El-Beltagy, Sameh El-Ansary, Nils Franzen, and Seif Haridi. On The Feasibility Of Centrally-Coordinated Peer-To-Peer Live Streaming. In Proceedings of IEEE Consumer Communications and Networking Conference 2011, Las Vegas, NV, USA, January 2011 • Roberto Roverso, Sameh El-Ansary, and Seif Haridi. NATCracker: NAT Combinations Matter. In Proceedings of 18th International Conference on Computer Communications and Networks 2009, ICCCN ’09, pages 1–7, San Francisco, CA, USA, 2009. IEEE Computer Society. ISBN 978-1-4244-4581-3 • Roberto Roverso, Amgad Naiem, Mohammed El-Beltagy, Sameh El-Ansary, and Seif Haridi. A GPU-enabled solver for time-constrained linear sum assignment problems. In Informatics and Systems (INFOS), 2010 The 7th International Conference on, pages 1–6, Cairo, Egypt, 2010. IEEE Computer Society. ISBN 978-1-4244-5828-8 • Roberto Roverso, Mohammed Al-Aggan, Amgad Naiem, Andreas Dahlstrom, Sameh El-Ansary, Mohammed El-Beltagy, and Seif Haridi. MyP2PWorld: Highly Reproducible Application-Level Emulation of P2P Systems. In Proceedings of the 2008 Second IEEE International Conference on Self-Adaptive and Self-Organizing Systems Workshops, pages 272–277, Venice, Italy, 2008. IEEE Computer Society. ISBN 978-0-7695-3553-1 17

4.2. DESIGN AND IMPLEMENTATION OF A CENTRALLY-MANAGED PEER-TO-PEER LIVE STREAMING PLATFORM

18

List of publications of the same author but not related to this work • Roberto Roverso, Cosmin Arad, Ali Ghodsi, and Seif Haridi. DKS: Distributed K-Ary System a Middleware for Building Large Scale Dynamic Distributed Applications, Book Chapter. In Making Grids Work, pages 323–335. Springer US, 2007. ISBN 978-0-387-78447-2

4.2

Design and implementation of a centrally-managed peer-to-peer live streaming platform

The work describing the design and implementation of our live streaming platform has been published in a conference paper [49]. The paper appears as Chapter 6 in this thesis. In our work, we adopt the approach of a centrally coordinated overlay in order to provide significant amount of savings while coping with churn and limitations on connectivity of the underlying network. Our contribution lies in showing that the realization of a system based on the central coordination approach is indeed feasible. In this work we present a design where a central coordinator organizes the overlay network by issuing direct instructions to peers. Our results show that this approach leads to efficient delivery of streams by letting the central entity help peers in their provider’s choice, thus avoiding the time consuming trial-and-error process typical of fully decentralized solutions. The main challenge we faced was to design an efficient decision engine which can provide directions on how to organize the overlay in a very short period of time. If the population of the overlay network varies considerably due to churn while the decision engine is running, then its decisions might be of no use or even detrimental to the system performance. A decision is based on a snapshot of the overlay network status based on the information periodically sent by the peers to the central coordinator. The decision process is composed of three steps: I Row construction. Peers are placed in subsequent rows according to their available upload bandwidth. Peers with highest bandwidth capacity are placed in the row closest to the source of the stream so that they can provide to others. The rest of the rows are filled with peers with decreasing amount of upload bandwidth. During this process we make use of an heuristic based on the max-flow approach [13] in order to guarantee connectivity compatibility between peers in consecutive rows, in particular due to the presence of NAT. II Input definition. Every assignment between a peer a in an upper row and another b in a lower row is given a certain score and placed in an assignment matrix. The score depends on the following metrics: inter-peer delay, stickiness (connections that already exists are favoured), playback buffer matching, NAT compatibility probability and ISP friendliness.

CHAPTER 4. THESIS CONTRIBUTION

19

III Assignment problem solving. In an effort to provide the best possible set of interconnections between peers in the overlay, we model peer assignment between consecutive rows as an optimization problem, namely as a Linear Sum Assignment Problem. We make use of an heuristic called Deep Greedy Switching (DGS) which provides fast outcomes while attaining near-optimal solutions. In practice, we have seen it deviate by less than 0.6%. The input to the optimization engine is the assignment matrix produced in step II. Once an outcome of the decision has been achieved, orders are issued to the peers and the overlay re-organized accordingly. The process is repeated periodically or as need arises, i.e. when disruptions are observed on the overlay.

4.2.1

Solving Linear Sum Assignment Problems in a time-constrained environment

Another contribution of this thesis is the parallelization of the Deep Greedy Switching heuristic algorithm and its implementation on Graphic Processing Units (GPUs). This work is motivated by the fact that we identified the optimization engine to be the bottleneck of our peer-to-peer live streaming system. In fact, large instances of the Linear Sum Assignment problem need to be solved in order to couple peers of consecutive rows to each others. Sequential implementations of the DGS solver, although much faster than any other similar software using algorithms with optimal outcome, such as for the Auction algorithm [5][56][31], fell short of our needs in terms of time complexity. Given the time-constrained nature of the application, the solver must provide an outcome in a matter of seconds rather than minutes. This work was published in a conference [48] and can be found in Chapter 8 of this thesis. The conference paper details the parallelization of the original Deep Greedy Switching algorithm and the evaluation of the new heuristic’s implementation on GPUs using the NVIDIA CUDA language [12]. The solver is very generic and can be used outside of our streaming platform’s context for solving any instance of the LSAP problem.

4.3

NAT Traversal

The work on NAT traversal has been published in a conference paper [47] and appears in Chapter 7 of this thesis. In this work, we classify behaviours of Network Address Translators recently discovered by the community in 27 different types. Given this set of behaviors, we tackle two fundamental issues that were not previously solved in the state of the art. We provide a study which comprehensively states, for each of the possible 378 combinations, whether direct connectivity, without relaying of traffic, is possible or not between NAT types. The outcome is the first contribution of this work. In addition, we define a formal model of NAT which helps us describe how each type behaves when allocating new ports. This is the second outcome of our work. As a third contribution, we define through

20

4.4. HIGHLY REPRODUCIBLE EMULATION OF P2P SYSTEMS

an augmentation of currently-known traversal methods which techniques should be used to traverse all pairwise combinations of NAT types. This scheme is validated by reasoning using our formalism, simulation and implementation on a real test network.

4.4

Highly reproducible emulation of P2P systems

The work on application-level emulation has been published in a workshop [45] and appears as Chapter 9 of this thesis. In this work we describe our applicationlevel emulator which has been designed to tackle the problem of reproducibility for development and testing of an existing Peer-To-Peer application. The contribution in this case is a novel platform which can provide highly transparent emulation which can be used for reproducible debugging, testing and evaluation of an alreadydeveloped peer-to-peer software. We achieve transparency by injecting custom implementations of standard/widelyused APIs, such as Apache MINA, for networking, and Java Executors, for concurrency. Calls made to those APIs are redirected to a Discrete Event Simulator adapted for the purpose of emulation. A further contribution of this work is an implementation of an accurate flowlevel bandwidth model, based on the progressive filling algorithm [4], which is used to mimic transfers duration between peers according to the configured upload/download capacity of the emulated nodes.

Chapter 5

Conclusion and Future Work In this Chapter, we present conclusions for this thesis work. Conclusion are described separately for each problem area that we addressed. Finally, we present some of the future work related to this thesis.

Design and implementation of a centrally-managed peer-to-peer live streaming platform In this thesis, we have reported our efforts with addressing the issue of bandwidth savings and user experience in live streaming using a central coordination approach. We employed a design which models the issue of assigning peers to each others as a linear sum optimization problem (LSAP). Our main concern with scalability of the solution has been addressed using a fast LSAP heuristic called Deep Greedy Switching. We further improved on the performance of DGS by parallelizing the heuristic’s algorithm and implementing the resulting algorithm on commodity GPUs using the CUDA language. We show in our results, how the solver is able to handle 10, 000 problem instances in less than 3 seconds. The solver is generic and can used to solve any instance of LSAP given that a tolerance of 0.6%, or less than, on the outcome optimality is tolerated. The merit of our live streaming approach is that it provides high savings at the distribution site in real-world scenarios where realistic delay, packet-loss, bandwidth emulation and NAT constraints are modelled in the experiments. Our results also show that, in such difficult scenarios, the system is able to maintain low initial buffering time and playback delays. Concerns on scalability for overlays which require the assignment engine to scale over 10.000 can be addressed by either providing specialized GPU processing hardware, e.g. GPU racks, or by partitioning the overlay into multiple decision engines. An alternative approach is to manage a backbone of nodes with central coordination and let swarms of peers create along that backbone. 21


NAT Traversal

In this work, we proposed a comprehensive analysis of which combinations of NAT types are traversable using direct connectivity, without relaying connections. We have defined a formal model for describing NAT behaviors. Using that model, we covered all possible NAT combinations, describing which traversal techniques are applicable for each combination. With formal reasoning, we have shown that about 80% of all possible combinations are traversable. Our experimental results, however, show that only 50% of all possible combinations are encountered in reality, or at least in our test network. Of those, 79.4% are shown to be traversable with various degrees of success, which are also presented in our study. Data on the time needed for carrying out the traversal strategies is also reported in our findings. It is the first time that such a comprehensive study, including all new findings in Network Address Translation behaviors, has been conducted to provide a clear understanding of NAT traversal, both from a formal point of view and from experimental results on a real deployed network.

Highly reproducible emulation of P2P systems

We have presented our experience in developing an emulation framework which can be used for debugging, testing and evaluation of already-existing peer-to-peer software. Inspecting the state of the art, we could not find a tool which simultaneously satisfies the following requirements: minimal changes to the software under test, ease of deployment, and high reproducibility. Our approach in developing the platform was to adopt application-level emulation while ensuring that high reproducibility is met. We achieved this by controlling concurrency, system time and all network-related events. The resulting framework, called "MyP2PWorld", has been used for a number of months for testing our live streaming peer-to-peer application. It has resulted in large improvements in software quality and bug discovery rate, and it has become an integral part of our development process. While we initially developed MyP2PWorld to complement the implementation of a specific piece of software, we are now trying to make it available for other peer-to-peer software developers to use.
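To illustrate the kind of control this requires, the sketch below shows one way to make time and concurrency deterministic: application code is handed an Executor and a clock that are both backed by a single-threaded discrete-event loop, so identical runs produce identical schedules. All names are hypothetical; this is not the MyP2PWorld API, only a minimal illustration of the principle.

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of reproducible concurrency and time: tasks and timers are
// funneled into one deterministic event queue instead of real threads.
public class DeterministicRuntime implements Executor {

    private record Event(long time, long seq, Runnable task) {}

    private final PriorityQueue<Event> queue = new PriorityQueue<>(
            Comparator.comparingLong(Event::time).thenComparingLong(Event::seq));
    private long now = 0;   // simulated time, ms
    private long seq = 0;   // tie-breaker: FIFO for equal timestamps

    public long currentTimeMillis() { return now; }          // replaces System.currentTimeMillis()

    @Override public void execute(Runnable task) { schedule(task, 0); }

    public void schedule(Runnable task, long delayMs) {      // replaces timers/Executors
        queue.add(new Event(now + delayMs, seq++, task));
    }

    // Runs all pending events in deterministic order on one thread.
    public void run() {
        while (!queue.isEmpty()) {
            Event e = queue.poll();
            now = e.time();
            e.task().run();
        }
    }

    public static void main(String[] args) {
        DeterministicRuntime rt = new DeterministicRuntime();
        rt.schedule(() -> System.out.println(rt.currentTimeMillis() + ": connect"), 50);
        rt.execute(() -> System.out.println(rt.currentTimeMillis() + ": join"));
        rt.run(); // always prints "0: join" then "50: connect"
    }
}
```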

Future Work

As future work, we would like to investigate using central coordination, or other types of peer-to-peer approaches, with HTTP live streaming. HTTP live streaming differs significantly from other streaming protocols in that requests for chunks of the content are completely unrelated to each other.


The streaming player at the end of the distribution chain decides which segment to request and at which bitrate; the time of the request is also non-deterministic. This approach greatly simplifies the complexity of a unicast system, where the distribution site is composed of one or more standard HTTP servers. However, it poses significant issues when trying to efficiently organize the delivery of the content over a peer-to-peer overlay, since no assumptions can be made on which fragment of the content will be requested next. As far as we know, there exists no peer-to-peer system which provides support for HTTP live streaming. Related to the work conducted in this thesis, we would like to extend our NAT traversal findings by deploying our solution on a global scale to conduct further tests and, if possible, improve our approach. We would also like to implement a connectivity library which can be used in any peer-to-peer application and which transparently handles the traversal of NAT boxes. As an additional feature of the connectivity library, we would like to include support for different traffic prioritization policies. This would allow peer-to-peer applications to tweak the desired level of QoS. In the case of live streaming, for instance, traffic generated by the peer-to-peer application should have priority over all other transfers happening at the same time on the host. We think this is achievable using dynamic congestion control mechanisms, such as MulTFRC [9] or LEDBAT [44], over UDP. We are not aware of any other attempt at providing general traffic prioritization for peer-to-peer applications at the application level. We are currently in the process of extending the MyP2PWorld platform with support for a component model and an event-driven runtime. This would allow for faster prototyping and easier testing of single parts of the application. The event runtime should also provide a greater degree of scalability and better execution semantics when it comes to concurrency. We are also considering improving the scalability of the bandwidth model implementation used in MyP2PWorld.

Bibliography

[1] Real Data Transport. https://helixcommunity.org/viewcvs/server/protocol/transport/rdt/.

[2] SopCast. http://sopcast.com.

[3] Shah Asaduzzaman, Ying Qiao, and Gregor von Bochmann. CliqueStream: an efficient and fault-resilient live streaming network on a clustered peer-to-peer overlay. CoRR, abs/0903.4365, 2009.

[4] Dimitri Bertsekas and Robert Gallager. Data Networks. Prentice Hall, second edition, 1992.

[5] D. P. Bertsekas. The auction algorithm: A distributed relaxation method for the assignment problem. Annals of Operations Research, 14(1):105–123, 1988.

[6] Fabian Bustamante and Yi Qiao. Friendships that Last: Peer Lifespan and its Role in P2P Protocols. Web Content Caching and Distribution, pages 233–246, 2004.

[7] Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, Animesh Nandi, Antony Rowstron, and Atul Singh. SplitStream: high-bandwidth multicast in cooperative environments. In SIGOPS Oper. Syst. Rev., volume 37, pages 298–313, New York, NY, USA, October 2003. ACM. URL http://doi.acm.org/10.1145/1165389.945474.

[8] Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn, and Sue Moon. Analyzing the video popularity characteristics of large-scale user generated content systems. IEEE/ACM Trans. Netw., 17:1357–1370, October 2009. ISSN 1063-6692. URL http://dx.doi.org/10.1109/TNET.2008.2011358.

[9] D. Damjanovic and S. Gjessing. MulTFRC IETF draft. http://tools.ietf.org/html/draft-irtf-iccrg-multfrc-01, July 2010.

[10] Bryan Ford, Pyda Srisuresh, and Dan Kegel. Peer-to-peer communication across network address translators. In ATEC '05: Proceedings of the annual conference on USENIX Annual Technical Conference, pages 13–13, Berkeley, CA, USA, 2005. USENIX Association.


[11] Wojciech Galuba, Karl Aberer, Zoran Despotovic, and Wolfgang Kellerer. ProtoPeer: A P2P toolkit bridging the gap between simulation and live deployment. In 2nd International ICST Conference on Simulation Tools and Techniques, May 2009.

[12] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Yao Zhang, and V. Volkov. Parallel Computing Experiences with CUDA. Micro, IEEE, 28(4):13–27, 2008.

[13] A. V. Goldberg and R. E. Tarjan. A new approach to the maximum flow problem. In STOC '86: Proceedings of the eighteenth annual ACM symposium on Theory of computing, pages 136–146, New York, NY, USA, 1986. ACM. ISBN 0-89791-193-8.

[14] X. Hei, C. Liang, J. Liang, Y. Liu, and K. W. Ross. Insights into PPLive: A Measurement Study of a Large-Scale P2P IPTV System. In Proc. of IPTV Workshop, International World Wide Web Conference, 2006.

[15] Xiaojun Hei, Chao Liang, Jian Liang, Yong Liu, and Keith W. Ross. A measurement study of a large-scale P2P IPTV system. Multimedia, IEEE Transactions on, 2007.

[16] Apple Inc. HTTP Live Streaming. http://developer.apple.com/resources/http-streaming/.

[17] Microsoft Inc. Smooth Streaming. http://www.iis.net/download/SmoothStreaming.

[18] Skype Inc. Skype. http://www.skype.com/.

[19] Wuala Inc. Wuala. http://www.wuala.com/.

[20] Ipoque. Internet Study 2008/2009. http://www.ipoque.com/userfiles/file/ipoque-Internet-Study-08-09.pdf.

[21] J. Rosenberg, R. Mahy, and C. Huitema. Traversal Using Relays around NAT (TURN): Relay extensions to Session Traversal Utilities for NAT (STUN). Internet draft, November 2008. URL http://tools.ietf.org/html/draft-ietf-behave-turn-14.

[22] John Jannotti, David K. Gifford, Kirk L. Johnson, M. Frans Kaashoek, and James W. O'Toole, Jr. Overcast: Reliable Multicasting with an Overlay Network. In Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4, OSDI'00, pages 14–14, Berkeley, CA, USA, 2000. USENIX Association. URL http://portal.acm.org/citation.cfm?id=1251229.1251243.


[23] David Kempe, Jon Kleinberg, and Alan Demers. Spatial gossip and resource location protocols. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, STOC '01, pages 163–172, New York, NY, USA, 2001. ACM. ISBN 1-58113-349-9. URL http://doi.acm.org/10.1145/380752.380796.

[24] Anne-Marie Kermarrec, Alessio Pace, Vivien Quema, and Valerio Schiavoni. NAT-resilient Gossip Peer Sampling. Distributed Computing Systems, International Conference on, 0:360–367, 2009. ISSN 1063-6927.

[25] G. Kreitz and F. Niemela. Spotify – large scale, low latency, P2P music-on-demand streaming. In Proceedings of the Tenth IEEE International Conference on Peer-to-Peer Computing (P2P), pages 1–10, August 2010. URL http://dx.doi.org/10.1109/P2P.2010.5569963.

[26] B. Li, Y. Qu, Y. Keung, S. Xie, C. Lin, J. Liu, and X. Zhang. Inside the New Coolstreaming: Principles, Measurements and Performance Implications. In INFOCOM 2008. The 27th Conference on Computer Communications. IEEE, 2008.

[27] J. Liang, R. Kumar, and K. Ross. The KaZaA overlay: A measurement study. In Proceedings of the 19th IEEE Annual Computer Communications Workshop, 2004.

[28] Shiding Lin, Aimin Pan, Rui Guo, and Zheng Zhang. Simulating large-scale P2P systems with the WiDS toolkit. In Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2005. 13th IEEE International Symposium on, pages 415–424, 2005.

[29] N. Magharei and R. Rejaie. PRIME: Peer-to-Peer Receiver-drIven MEsh-Based Streaming. In INFOCOM 2007. 26th IEEE International Conference on Computer Communications. IEEE, pages 1415–1423, May 2007. URL http://dx.doi.org/10.1109/INFCOM.2007.167.

[30] J. J. D. Mol, D. H. J. Epema, and H. J. Sips. The Orchard Algorithm: P2P Multicasting without Free-Riding. Peer-to-Peer Computing, IEEE International Conference on, 0:275–282, 2006.

[31] A. Naiem and M. El-Beltagy. Deep greedy switching: A fast and simple approach for linear assignment problems. In 7th International Conference of Numerical Analysis and Applied Mathematics, 2009.

[32] Venkata N. Padmanabhan and Kunwadee Sripanidkulchai. The case for cooperative networking. In Revised Papers from the First International Workshop on Peer-to-Peer Systems, IPTPS '01, pages 178–190, London, UK, 2002. Springer-Verlag. ISBN 3-540-44179-4. URL http://portal.acm.org/citation.cfm?id=646334.758993.


[33] Vinay Pai, Kapil Kumar, Karthik Tamilmani, Vinay Sambamurthy, and Alexander Mohr. Chainsaw: Eliminating trees from overlay multicast. In Peer-to-Peer Systems IV, volume 3640/2005, pages 127–140. Springer Berlin / Heidelberg, 2005. URL http://dx.doi.org/10.1007/11558989_12.

[34] Kunwoo Park, Sangheon Pack, and Taekyoung Kwon. Climber: an incentive-based resilient peer-to-peer system for live streaming services. In Proceedings of the 7th international conference on Peer-to-peer systems, IPTPS'08, pages 10–10, Berkeley, CA, USA, 2008. USENIX Association. URL http://portal.acm.org/citation.cfm?id=1855641.1855651.

[35] Amir H. Payberah, Jim Dowling, Fatemeh Rahimian, and Seif Haridi. gradienTv: Market-based P2P Live Media Streaming on the Gradient Overlay. In Lecture Notes in Computer Science (DAIS 2010), pages 212–225. Springer Berlin / Heidelberg, January 2010. ISBN 978-3-642-13644-3.

[36] Ryan S. Peterson and Emin Gün Sirer. Antfarm: efficient content distribution with managed swarms. In Proceedings of the 6th USENIX symposium on Networked systems design and implementation, pages 107–122, Berkeley, CA, USA, 2009. USENIX Association. URL http://portal.acm.org/citation.cfm?id=1558977.1558985.

[37] Fabio Pianese, Diego Perino, Joaquin Keller, and Ernst W. Biersack. PULSE: An adaptive, incentive-based, unstructured P2P live streaming system. IEEE Transactions on Multimedia, 9(8):1645–1660, December 2007. ISSN 1520-9210. URL http://dx.doi.org/10.1109/TMM.2007.907466.

[38] Fabio Picconi and Laurent Massoulié. Is there a future for mesh-based live video streaming? In Proceedings of the 2008 Eighth International Conference on Peer-to-Peer Computing, pages 289–298, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3318-6. URL http://portal.acm.org/citation.cfm?id=1443220.1443468.

[39] Dongyu Qiu and R. Srikant. Modeling and performance analysis of BitTorrent-like peer-to-peer networks. In Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications, SIGCOMM '04, pages 367–378, New York, NY, USA, 2004. ACM. ISBN 1-58113-862-8. URL http://dx.doi.org/10.1145/1015467.1015508.

[40] Dan Rayburn. 2010 Q4 CDN Pricing Detailed. http://blog.streamingmedia.com/the_business_of_online_vi/2010/06/datafromq1showsvideocdnpricingstabilizingshouldbedown25fortheyear.html.

[41] M. Ripeanu. Peer-to-peer architecture case study: Gnutella network. In Peer-to-Peer Computing, 2001. Proceedings. First International Conference on, pages 99–100, August 2001.


[42] J. Rosenberg. Interactive Connectivity Establishment (ICE): A Protocol for Network Address Translator (NAT) Traversal for Offer/Answer Protocols. Internet draft, October 2007. URL http://tools.ietf.org/html/draft-ietf-mmusic-ice-19.

[43] J. Rosenberg, J. Weinberger, C. Huitema, and R. Mahy. STUN - Simple Traversal of User Datagram Protocol (UDP) Through Network Address Translators (NATs). RFC 3489 (Proposed Standard), March 2003. URL http://www.ietf.org/rfc/rfc3489.txt. Obsoleted by RFC 5389.

[44] Dario Rossi, Claudio Testa, Silvio Valenti, and Luca Muscariello. LEDBAT: The New BitTorrent Congestion Control Protocol. In Computer Communications and Networks (ICCCN), 2010 Proceedings of 19th International Conference on, pages 1–6, August 2010. URL http://dx.doi.org/10.1109/ICCCN.2010.5560080.

[45] Roberto Roverso, Mohammed Al-Aggan, Amgad Naiem, Andreas Dahlstrom, Sameh El-Ansary, Mohammed El-Beltagy, and Seif Haridi. MyP2PWorld: Highly Reproducible Application-Level Emulation of P2P Systems. In Proceedings of the 2008 Second IEEE International Conference on Self-Adaptive and Self-Organizing Systems Workshops, pages 272–277, Venice, Italy, 2008. IEEE Computer Society. ISBN 978-0-7695-3553-1.

[46] Roberto Roverso, Cosmin Arad, Ali Ghodsi, and Seif Haridi. DKS: Distributed K-Ary System, a Middleware for Building Large Scale Dynamic Distributed Applications (book chapter). In Making Grids Work, pages 323–335. Springer US, 2007. ISBN 978-0-387-78447-2.

[47] Roberto Roverso, Sameh El-Ansary, and Seif Haridi. NATCracker: NAT Combinations Matter. In Proceedings of the 18th International Conference on Computer Communications and Networks, ICCCN '09, pages 1–7, San Francisco, CA, USA, 2009. IEEE Computer Society. ISBN 978-1-4244-4581-3.

[48] Roberto Roverso, Amgad Naiem, Mohammed El-Beltagy, Sameh El-Ansary, and Seif Haridi. A GPU-enabled solver for time-constrained linear sum assignment problems. In Informatics and Systems (INFOS), 2010 The 7th International Conference on, pages 1–6, Cairo, Egypt, 2010. IEEE Computer Society. ISBN 978-1-4244-5828-8.

[49] Roberto Roverso, Amgad Naiem, Mohammed Reda, Mohammed El-Beltagy, Sameh El-Ansary, Nils Franzen, and Seif Haridi. On The Feasibility Of Centrally-Coordinated Peer-To-Peer Live Streaming. In Proceedings of the IEEE Consumer Communications and Networking Conference 2011, Las Vegas, NV, USA, January 2011.

[50] Stefan Saroiu, Krishna P. Gummadi, and Steven D. Gribble. A Measurement Study of Peer-to-Peer File Sharing Systems. In Multimedia Computing and Networking (MMCN), January 2002.


[51] H. Schulzrinne. RTP: A Transport Protocol for Real-Time Applications. RFC 3550 (Proposed Standard), July 2003. URL http://www.ietf.org/rfc/rfc3550.txt.

[52] H. Schulzrinne, A. Rao, and R. Lanphier. RTSP: Real-Time Streaming Protocol. RFC 2326 (Proposed Standard), 1998. URL http://www.ietf.org/rfc/rfc2326.txt.

[53] P. Srisuresh, B. Ford, and D. Kegel. State of Peer-to-Peer (P2P) Communication across Network Address Translators (NATs). RFC 5128 (Informational), March 2008. URL http://www.ietf.org/rfc/rfc5128.txt.

[54] D. Tran, K. Hua, and S. Sheu. ZIGZAG: An Efficient Peer-to-Peer Scheme for Media Streaming. In Proc. of IEEE INFOCOM, 2003.

[55] Amin Vahdat, Ken Yocum, Kevin Walsh, Priya Mahadevan, Dejan Kostic, Jeffrey S. Chase, and David Becker. Scalability and accuracy in a large-scale network emulator. In OSDI, 2002.

[56] Cristina Nader Vasconcelos and Bodo Rosenhahn. Bipartite Graph Matching Computation on GPU. In Daniel Cremers, Yuri Boykov, Andrew Blake, and Frank R. Schmidt, editors, EMMCVPR, volume 5681 of Lecture Notes in Computer Science, pages 42–55. Springer, 2009. ISBN 978-3-642-03640-8. URL http://dblp.uni-trier.de/db/conf/emmcvpr/emmcvpr2009.html#VasconcelosR09.

[57] Vidhyashankar Venkataraman, Kaoru Yoshida, and Paul Francis. Chunkyspread: Heterogeneous Unstructured Tree-Based Peer-to-Peer Multicast. In Proceedings of the 2006 IEEE International Conference on Network Protocols, pages 2–11, 2006. URL http://portal.acm.org/citation.cfm?id=1317535.1318351.

[58] A. Vlavianos, M. Iliofotou, and M. Faloutsos. BiToS: Enhancing BitTorrent for supporting streaming applications. In 9th IEEE Global Internet Symposium 2006, April 2006.

[59] Feng Wang, Yongqiang Xiong, and Jiangchuan Liu. mTreebone: A Hybrid Tree/Mesh Overlay for Application-Layer Live Video Multicast. In Proceedings of the 27th International Conference on Distributed Computing Systems, ICDCS '07, pages 49–, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2837-3. URL http://dx.doi.org/10.1109/ICDCS.2007.122.

[60] Sunghyun Yoon and Young Boo Kim. A Design of Network Simulation Environment Using SSFNet. In Proceedings of the 2009 First International Conference on Advances in System Simulation, pages 73–78, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-0-7695-3773-3. URL http://portal.acm.org/citation.cfm?id=1637862.1638154.


[61] Meng Zhang, Yun Tang, Li Zhao, Jian-Guang Luo, and Shi-Qiang Yang. GridMedia: A Multi-Sender Based Peer-to-Peer Multicast System for Video Streaming. In Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on, pages 614–617, 2005. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1521498.

[62] Xinyan Zhang, Jiangchuan Liu, Bo Li, and Y. S. P. Yum. CoolStreaming/DONet: a data-driven overlay network for peer-to-peer live media streaming. In INFOCOM 2005. 24th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings IEEE, volume 3, pages 2102–2111, March 2005. URL http://dx.doi.org/10.1109/INFCOM.2005.1498486.

Part II

Research Papers


Paper A

Chapter 6

On The Feasibility Of Centrally-Coordinated Peer-To-Peer Live Streaming Roverso Roberto, Naiem Amgad, Reda Mohammed, El-Beltagy Mohammed, El-Ansary Sameh, Franzen Nils and Haridi Seif In Proceedings of IEEE Consumer Communications and Networking Conference 2011, January 2011, Las Vegas, NV, USA.


On The Feasibility Of Centrally-Coordinated Peer-To-Peer Live Streaming

Roberto Roverso1,2, Amgad Naiem1,3, Mohammed Reda1,3, Mohammed El-Beltagy1,3, Sameh El-Ansary1,4, Nils Franzen1, Seif Haridi2
1 Peerialism Inc., Sweden, 2 KTH-Royal Institute of Technology, Sweden, 3 Cairo University, Egypt, 4 Nile University, Egypt
{roberto, amgad, sameh, mohammed}@peerialism.com

Abstract—In this paper we present an exploration of central coordination as a way of managing P2P live streaming overlays. The main point is to show the elements needed to construct a system with that approach. A key element in the feasibility of this approach is a near real-time optimization engine for peer selection. Peer organization in a way that enables high bandwidth utilization, plus optimized peer selection based on multiple utility factors, make it possible to achieve large source bandwidth savings and provide a high quality of user experience. The benefits of our approach are seen most clearly when NAT constraints come into play.

I. INTRODUCTION

Peer-to-peer live streaming (P2P IPTV) is a challenging problem that has attracted the focus of many academic and industrial research communities. Debates have been going on about two rival approaches: mesh-pull and tree-push [1]. Hybrid [2] and best-of-both-worlds mesh-push [3] approaches have also been investigated. That said, some still see that, despite current deployments and reported successes, P2P IPTV is still at its early stages [4]. In this paper, we report our experience with a third approach to P2P IPTV, where a central entity plays a bigger role in overlay structuring. Central entities in P2P systems are a taboo from a P2P purist's perspective, but are major components of prominent P2P systems, such as the tracker in BitTorrent. Previous works aimed to make BitTorrent's tracker [5] or the seed [6] more intelligent. Central coordination in our case can be perceived as an attempt to take the tracker's role to the limit. The main challenge we faced was to design an efficient optimization engine which can provide decisions in a very short period of time, or else the outcome might be of no value: if the overlay network changes considerably due to peer dynamics while the optimization engine is running, then the connectivity recommendations of the engine might even be detrimental to the system's performance. Our main contribution is that we show that a system employing such an approach is feasible and that the required central computing resources are not by any means prohibitive.

Subparts of this system have been published before as generic components, such as the idea of our optimization engine [7] and its parallelization on GPUs [8], as well as our NAT connectivity work [9], but this is the first time we describe how these subparts work together.

II. SYSTEM ARCHITECTURE

The main entities in the system are: i) clients (peers), who want to watch the live stream and are normally behind home or corporate NAT gateways; ii) the streaming source, connected to the streaming server but otherwise exactly like any normal peer; iii) the tracker, which centrally coordinates the system; iv) the optimization engine, which has a snapshot of the overlay and handles joins, failures and restructuring of the overlay; v) the connectivity server, which facilitates connection establishment between peers behind NAT; vi) the bandwidth measurement server, which peers use to get an approximate guess of their upload capacity.

A typical scenario for a client is as follows: the client contacts the bandwidth measurement server to estimate its upload capacity, and then requests a video stream from the tracker, providing information about its upload bandwidth. The tracker forwards the request to the optimization engine, which selects providing peers for the requesting client. The tracker notifies both the requesting and the providing peers involved in the operation. The providing peers then start to push the stream to the requesting peer, after using the connectivity server to traverse NAT gateways if needed. Periodically, the optimization engine restructures the overlay, and reconfiguration orders are sent to the clients.

III. TERMINOLOGY: SEATS & PERSONS

We assume that a stream is divided into a number of stripes. For instance, if the stream rate is 1 Mbps and we use 4 stripes, each stripe is a sub-stream of 256 Kbps. Given a peer with an upload capacity of 1.5 Mbps, we say that this peer has 6 "seats", because it can upload 6 stripes to other peers simultaneously.


Fig. 1. Overlay before reconfiguration

Each client will have to find a seat for each of the 4 stripes, so we say that it has to feed 4 "persons". That is, we discretize the upload capacity into units called seats and the download capacity into units called persons. It is the task of each client to request seats for each of its persons. This division of bandwidth into persons and seats is made so that the optimization engine can have a simple model of the upload/download capacities of the peers.

IV. OVERLAY STRUCTURE

Basic peer joins. Peers with different numbers of seats join the network in an arbitrary order. The most basic join mechanism is to start by seating the peers on the source's seats until, eventually, all the source seats are occupied. At this stage, we say that the first row is full, and the seats of the joined peers form a second row. New joiners are seated at the second row until it is full; in their turn, they form a third row, and so forth. Naturally, the order of joins matters. The extreme worst case would be that a number of peers, all with zero seats, come and occupy the first row, with no second row created. The extreme best case would be that peers with a large number of seats join before the peers with fewer seats. In reality, we get an order that is somewhere between the two extremes. The tracker periodically runs a reconfiguration process and tries to bring the overlay into the best possible state.

Reconfiguration phase 1: row construction. The process starts by sorting all the peers according to their seat count in descending order. The peers with high seat counts are seated at the first row, and the process continues as above. This leads to maximum upload bandwidth utilization and a minimum number of rows, which directly translates into smaller playback delays. At this stage, we are sure that every row can provide enough seats for the rows below.

Fig. 2. Overlay after reconfiguration
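To make the discretization and the phase-1 procedure concrete, the following is a minimal sketch under the stated assumptions (4 stripes of 256 Kbps); all names are hypothetical, and this is not the production tracker code:

```java
import java.util.*;

// Minimal sketch of seat/person discretization and phase-1 row construction.
public class RowConstruction {

    static final int STRIPES = 4;            // sub-streams per stream
    static final int STRIPE_KBPS = 256;      // e.g. a 1 Mbps stream in 4 stripes

    static class Peer {
        final String id;
        final int seats;                      // floor(uploadKbps / STRIPE_KBPS)
        final int persons = STRIPES;          // every viewer must feed 4 persons
        Peer(String id, int uploadKbps) {
            this.id = id;
            this.seats = uploadKbps / STRIPE_KBPS;
        }
    }

    // Sort by seat count (descending) and cut rows so that the seats of each
    // row cover the persons of the next row.
    static List<List<Peer>> buildRows(List<Peer> peers, int sourceSeats) {
        List<Peer> sorted = new ArrayList<>(peers);
        sorted.sort((a, b) -> Integer.compare(b.seats, a.seats));
        List<List<Peer>> rows = new ArrayList<>();
        int seatsAbove = sourceSeats;        // row 0 is fed by the source
        int i = 0;
        while (i < sorted.size()) {
            List<Peer> row = new ArrayList<>();
            int personsInRow = 0, seatsInRow = 0;
            // Fill the row while the row above can still feed one more peer.
            while (i < sorted.size() && personsInRow + STRIPES <= seatsAbove) {
                Peer p = sorted.get(i++);
                row.add(p);
                personsInRow += p.persons;
                seatsInRow += p.seats;
            }
            if (row.isEmpty()) break;        // not enough capacity above
            rows.add(row);
            seatsAbove = seatsInRow;
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Peer> peers = Arrays.asList(
                new Peer("a", 2500), new Peer("b", 1500),
                new Peer("c", 512), new Peer("d", 0));
        buildRows(peers, 8).forEach(row ->
                row.forEach(p -> System.out.println(p.id + " seats=" + p.seats)));
    }
}
```

Sorting by seat count before cutting rows is what pushes high-capacity peers toward the source, which is the property the text attributes to phase 1.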

The problem now reduces to the assignment of the persons of every row to the seats of the row above. At this point, even a random assignment between peers could be carried out, because an important part of the decision has already been made by sorting the peers and compacting the rows. We explain next how a more optimized assignment is achieved. To avoid confusion, we stress that this reorganization happens first on the tracker's internal snapshot of the overlay; then a batch of messages is sent out to rewire the peer connections. Figures 1 and 2 show an example of the overlay before and after reconfiguration, respectively.

Reconf. phase 2: happiness matrix. The different assignment combinations between persons and seats form the columns and rows of a matrix. We call this matrix the "Happiness Matrix" A(i, j); it represents all possible interconnections between peers in two consecutive rows, and its values express the worthiness of such connections. A happiness value aij in the matrix is a weighted sum of all characteristics observed and/or expected from a certain connection between person i and seat j. The weighted characteristics are: a) Inter-peer delay: peers with lower delay are favored. b) Stickiness: connections that already exist are favored, to minimize interruptions. c) Buffer matching: seats whose buffers hold data that is useful to the persons are favored, where usefulness is proportional to how much the seat of the uploading peer is more advanced in the sub-stream compared to the person of the downloading peer. d) NAT compatibility: the probability that a connection between the two peers will succeed; a higher value means a higher probability of connection. We elaborate on NAT issues shortly. e) ISP friendliness: peers in the same autonomous system (AS) are favored.


In fact, we have an engine which computes the hop count between ASs, and smaller distances are favored. The resulting value can be considered as a grade for a certain combination. If many uploaders are available for a certain peer to download from, the one with the highest grade among all the uploading peers will be chosen. This means that, for a certain person A, a seat B will be selected which is expected to provide the best performance in the future transfer between A and B. This mechanism of assigning persons to seats has been modelled as a Linear Sum Assignment Problem, and the task of solving the optimization problem is carried out by the optimization engine. Once a result is produced, the tracker notifies the peers so that the new transfers can be established. There are two sensitive steps in this process: the calculation of the happiness values and the actual solving of the optimization problem. In the first case, the choice of the happiness values directly impacts the performance of the system. For instance, one of the parameters is stickiness: the weight chosen for it might result in a more stable system, but the overall bandwidth utilization might suffer, since the system will give higher priority to the preservation of successful connections than to load balancing. The values of the A(i, j) matrix must therefore be calculated carefully, as the choice of weights of the different characteristics determines the happiness value.

Reconf. phase 3: solving the assignment problem. The last step in the optimization process is the actual solving of the linear optimization problem between pairs of rows, to assign seats to persons. The computation associated with it might take a long time to execute, since the number of potential peer combinations is typically quite large. In the presence of high churn, disruptions in the network which happen as the optimization is taking place might invalidate the initial information which the ongoing computation is based upon; this would cause the results of the calculation to be of limited or no value. It is therefore vital for the calculation to happen as fast as possible. For this purpose, we initially used the Auction algorithm [10] to solve the optimization problem, which is known to be one of the fastest algorithms for solving complex Linear Sum Assignment Problems (LSAP). However, its performance fell far short of our needs, given the size of the problems to be solved. Consequently, we have developed a new heuristic solver based on a local-search approach, called Deep Greedy Switching (DGS), which has been published in [7]. It sacrifices very little in terms of optimality for a huge gain in running time over other methods. The DGS algorithm provides no guarantee of attaining an optimal solution, but in practice we have seen it deviate by less than 0.6% from the solutions reported by the Auction algorithm. Such a minor sacrifice in optimality is acceptable in our system, where speed is the most important factor: an optimal solution that is delivered too late is practically useless. Compared with the Auction algorithm, DGS has the added advantage that it starts out with an initial assignment and keeps improving that assignment during the course of its execution; the Auction algorithm attains a full assignment only at termination. Hence, if a deadline is reached where an assignment must be produced, DGS can be interrupted to get the best assignment it has attained so far. We were also able to parallelize the DGS heuristic on commodity GPUs [8] and solve instances of 10,000-peer overlays in less than 3 seconds.

Churn. Beyond the basic join described above, we use some other techniques. For instance, in each row we reserve some slack seats for future use. This helps us during peer joins and failures. During the join process, we try not to put a peer at the last available row as described in the basic join above; instead, we try to predict which row the peer would be placed in should a reconfiguration take place. This results in peers with high seat counts ending up closer to the root of the tree even if they came late. Moreover, it reduces the possibility of disruption caused by reconfiguration, as we place a new peer in its deserved row directly. If for any reason the reserved seats are all occupied, we revert to the basic join. If that also does not help, due to lack of seats in the last rows, we resort to what we call fallback source seats; and if these are also totally occupied, we have one final strategy, which is a waiting list where the peer has to wait until the next reconfiguration to be admitted to the network, when some seats become available. For failures, the person facing the failure of its providing seat will try to fall back to the slack seats of the row above. If there is not enough slack, the slack seats of a higher row are attempted. In a sense, handling a failure is like a partial join process, but not necessarily for all persons of a peer. Churn usually reduces the optimality of the overlay, and reconfigurations take care of bringing it back into shape.
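To make phases 2 and 3 above concrete, the sketch below builds a small happiness matrix as a weighted sum of utility factors and improves a greedy assignment with pairwise switches. The weights, the factor set and the plain 2-swap loop are illustrative only; the published DGS heuristic [7] is a considerably more elaborate local search, and the real matrix is rectangular (persons × seats) rather than square.

```java
import java.util.*;

// Sketch of a happiness matrix plus a naive switching improvement pass in the
// spirit of DGS. Weights and factors are illustrative, not the system's.
public class AssignmentSketch {

    // a[i][j]: happiness of assigning person i to seat j, as a weighted sum
    // of normalized utility factors (here: delay, stickiness, NAT compat).
    static double[][] happiness(double[][] delay, boolean[][] existing,
                                double[][] natCompat) {
        int n = delay.length;
        double[][] a = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i][j] = 0.3 * (1.0 - delay[i][j])       // lower delay favored
                        + 0.2 * (existing[i][j] ? 1 : 0)  // stickiness
                        + 0.5 * natCompat[i][j];          // connection success prob.
        return a;
    }

    // Start from a greedy assignment, then keep applying pairwise switches
    // (swap the seats of two persons) while they improve total happiness.
    static int[] solve(double[][] a) {
        int n = a.length;
        int[] seatOf = new int[n];
        boolean[] taken = new boolean[n];
        for (int i = 0; i < n; i++) {             // greedy initial assignment
            int best = -1;
            for (int j = 0; j < n; j++)
                if (!taken[j] && (best < 0 || a[i][j] > a[i][best])) best = j;
            seatOf[i] = best;
            taken[best] = true;
        }
        boolean improved = true;
        while (improved) {                        // 2-swap local search
            improved = false;
            for (int i = 0; i < n; i++)
                for (int k = i + 1; k < n; k++) {
                    double cur = a[i][seatOf[i]] + a[k][seatOf[k]];
                    double swp = a[i][seatOf[k]] + a[k][seatOf[i]];
                    if (swp > cur + 1e-12) {
                        int tmp = seatOf[i]; seatOf[i] = seatOf[k]; seatOf[k] = tmp;
                        improved = true;
                    }
                }
        }
        return seatOf; // seatOf[i] = seat assigned to person i
    }

    public static void main(String[] args) {
        double[][] delay = {{0.1, 0.9}, {0.8, 0.2}};
        boolean[][] existing = {{true, false}, {false, false}};
        double[][] nat = {{1.0, 0.5}, {0.5, 1.0}};
        System.out.println(Arrays.toString(solve(happiness(delay, existing, nat))));
        // prints [0, 1]: each person keeps its best-graded seat
    }
}
```

Like DGS, such a switching loop always holds a complete assignment, so it can be interrupted at a deadline and still return its best solution so far.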


Fig. 3. Initial buffering time (Initial Buffering Time (sec) vs. Source Seats; curves: Row Construction with/without NAT, Row Construction + Optimization with/without NAT)

NAT heuristic. With NAT in the picture, the effective upload capacity of a row is determined by the compatibility of the NAT types of the peers in it with the ones in the row below. For instance, a client behind a very restrictive NAT can have a huge upload capacity, but effectively it is much smaller, because it can only be used to upload to a limited subset of peers. In our previous work, entitled "NATCracker" [9], we described a rigorous classification of the various behaviors of NAT gateways, beyond the traditional classification into only four types. We also provided an analysis of which traversal strategy should be used to connect peers to each other according to their types. In this context, we make use of this model to find an optimal placement of peers in rows, in order to satisfy the download demands of as many peers as possible while maximizing pairwise connectivity between them. To achieve this trade-off, we formulated a heuristic based on the max-flow approach [11], where s is the source of the flow (our streaming server) and t is a virtual sink node collecting all spare capacity in the network. The heuristic works as follows: first, we carry out row construction as previously described. For each row r, we aggregate peers into sets Ntr according to their NAT type t. Each set Ntr is considered as a virtual node in the max-flow network, with a cumulative capacity of uNtr seats, i.e. the sum of all seats of the peers in the set. Then, we establish edges between a set Ntri in the current row and all sets Ntr+1j in the row below it, such that ti is compatible with tj. This process is carried out for all the rows in the tree. When the placement process is completed, we execute a standard max-flow algorithm to push as much flow as possible from the source s to the sink t through the virtual nodes. After that, we proceed to switch peers between rows according to their outbound and inbound flow. For instance, we choose the set Ntrl with the least flow in a row, and we identify the peer with the weakest upload bandwidth in it. We then swap this peer with the peer in the row below which has the biggest upload capacity but whose NAT type tc is different from tl. We then run the max-flow algorithm again. The process is repeated multiple times, until each row has enough bandwidth capacity to provide for the row below it, given the connectivity constraints in place between provider and receiver peers.
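As a rough illustration of the heuristic's core, the sketch below aggregates two rows into per-NAT-type virtual nodes and runs a small max-flow (Edmonds-Karp, in the spirit of [11]) to see how much of the lower row's demand the upper row can satisfy. The network, capacities and compatibility edges are toy values, and the peer-swapping loop of the full heuristic is omitted.

```java
import java.util.*;

// Sketch of the NAT-aware capacity check: per-NAT-type virtual nodes, edges
// only between compatible types, and a max-flow from source to sink.
public class NatFlowSketch {

    // Edmonds-Karp max-flow on a capacity matrix (with residual updates).
    static int maxFlow(int[][] cap, int s, int t) {
        int n = cap.length, flow = 0;
        while (true) {
            int[] parent = new int[n];
            Arrays.fill(parent, -1);
            parent[s] = s;
            Deque<Integer> q = new ArrayDeque<>(List.of(s));
            while (!q.isEmpty() && parent[t] < 0) {       // BFS for a path
                int u = q.poll();
                for (int v = 0; v < n; v++)
                    if (parent[v] < 0 && cap[u][v] > 0) { parent[v] = u; q.add(v); }
            }
            if (parent[t] < 0) return flow;               // no augmenting path left
            int bottleneck = Integer.MAX_VALUE;
            for (int v = t; v != s; v = parent[v])
                bottleneck = Math.min(bottleneck, cap[parent[v]][v]);
            for (int v = t; v != s; v = parent[v]) {      // update residual network
                cap[parent[v]][v] -= bottleneck;
                cap[v][parent[v]] += bottleneck;
            }
            flow += bottleneck;
        }
    }

    public static void main(String[] args) {
        // Nodes: 0 = source s, 1..2 = row-0 virtual nodes (one per NAT type),
        // 3..4 = row-1 virtual nodes, 5 = sink t.
        int[][] cap = new int[6][6];
        cap[0][1] = 6; cap[0][2] = 4;   // seats of the row-0 NAT-type sets
        cap[1][3] = 6; cap[1][4] = 6;   // type of node 1 compatible with both
        cap[2][4] = 4;                  // type of node 2 compatible with node 4 only
        cap[3][5] = 4; cap[4][5] = 4;   // demand (persons) of the row-1 sets
        System.out.println("satisfied demand: " + maxFlow(cap, 0, 5)); // prints 8
    }
}
```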

Fig. 4. Playback Delays (Playback Buffering Time (sec) vs. Source Seats)

V. SIMULATION RESULTS

We present a simulation of our system over a discrete-event simulator. It is worth mentioning that we tried to make sure we model bandwidth accurately [12]. In addition, NAT semantics, with real-world probabilities of connection establishment and of encountering each type, are also simulated, based on the NATCracker work [9]. We simulate 1000 peers watching a 30-minute stream. Half of the network is present when the stream starts, and the rest join at an average rate of 2 peers per second. One tenth of the network fails during the stream playback. The streaming rate is 600 Kbps, divided into two stripes. The seat distribution is: 10% of peers with 10 seats, 40% with 2 seats, 40% with 1 seat, and 10% with 0 seats. We also have to stress that the scalability of our simulation is not limited by the optimization engine, but rather by the simulator's modelling of peer bandwidth allocation and deallocation. The simulation is done as follows: we measure the performance of our approach with and without optimization after row construction, considering the NAT heuristic part of the optimization process. Then we repeat the experiment, solving an easier problem where peers are not behind NAT. The idea is to show how much the optimization contributes to user experience, and to show that the problem is much harder when peers are behind NAT. Additionally, we want to see how much of the problem is solved by row construction alone and how much depends on the optimized assignment.

Initial buffering time. This is the time before stream playback starts. As shown in Figure 3, in the absence of NAT constraints, and as long as a peer can find available seats, playback can start immediately; optimization is therefore not the key element.

Fig. 5. Savings (Savings (%) vs. Source Seats)

However, with NAT constraints, a peer might find available seats but be unable to use them, because the NAT types of the seat and the person might not be compatible. In that case, persons will be randomly assigned to NAT-incompatible seats, the connection will fail, and the person will try rejoining until eventually getting admitted. That is why maximizing the probability of successful connection plays a very significant role; this is what optimized assignment provides, and in its absence the initial buffering time is higher by orders of magnitude. The number of source seats can drastically affect the results as well, because in general the source is a last resort for peers with severely constrained NATs. When enough source seats are provided, the optimization process can bring the initial buffering time of a network with NAT constraints down to that of a network without NAT constraints. In all cases, we varied the source bandwidth between 10 and 100 Mbps.

Playback delay. This is the sum of the time periods where playback is paused due to the buffers running out of content. As shown in Figure 4, with or without NAT constraints, in the absence of optimization the performance lags by at least one order of magnitude. We also see that, even in the presence of NAT constraints and with some extra help from source seats, the optimization can bring the user experience close to that of a network without NAT constraints.

Source bandwidth savings. In Figure 5 we can clearly see a trade-off between savings, i.e. client-uploaded stream volume versus all uploaded volume, and quality of experience. With a user experience of around 2.5 seconds of initial buffering time and 2 seconds of playback delay, the savings are at around 85%. More importantly, one can see that row construction, not optimized assignment, is the primary driver of savings.

VI. SCALABILITY DISCUSSION & CONCLUSION

We have reported in this paper our experience with tackling the problem of P2P live streaming using central coordination. We were able to provide a feasible solution by using a heuristic linear sum assignment problem

solver capable, after parallelization on commodity GPUs, of handling 10,000-peer overlays in less than 3 seconds. The merits of the solution are high source bandwidth savings, due to high bandwidth utilization of the peers, and low initial buffering times and playback delays. The main point of our approach, compared to other decentralized approaches, is that we let a central entity help the peers in the peer selection process, thereby avoiding the trial-and-error process of discovering the best overlay configuration. Beyond the scale of 10,000, one can partition the network into multiple trackers. Another approach is to manage a backbone of nodes using central coordination and let a group of swarming peers stream from each backbone node. The approach is generic enough to be modified for other problems. For future work, we are considering replica placement in distributed storage, as well as a NAT traversal facilitator for generic overlays.

REFERENCES

[1] F. Picconi and L. Massoulie, "Is there a future for mesh-based live video streaming?" in Peer-to-Peer Computing, 2008. P2P '08. Eighth International Conference on, 2008.
[2] F. Wang, Y. Xiong, and J. Liu, "mTreebone: A hybrid tree/mesh overlay for application-layer live video multicast," in ICDCS '07, Washington, DC, USA.
[3] R. J. Lobb, A. P. Couto da Silva, E. Leonardi, M. Mellia, and M. Meo, "Adaptive overlay topology for mesh-based P2P-TV systems," in NOSSDAV '09, 2009.
[4] X. Hei, Y. Liu, and K. W. Ross, "IPTV over P2P streaming networks: the mesh-pull approach," in Communications Magazine, IEEE, March 2009.
[5] A. R. Bharambe, C. Herley, and V. N. Padmanabhan, "Analyzing and improving a BitTorrent network's performance mechanisms," in IEEE INFOCOM 2006.
[6] F. Esposito, I. Matta, P. Michiardi, N. Mitsutake, and D. Carra, "Seed scheduling for peer-to-peer networks," in 8th IEEE International Symposium on Network Computing and Applications.
[7] A. Naiem and M. El-Beltagy, "Deep greedy switching: A fast and simple approach for linear assignment problems," in 7th International Conference of Numerical Analysis and Applied Mathematics, 2009.
[8] R. Roverso, A. Naiem, M. El-Beltagy, S. El-Ansary, and S. Haridi, "A GPU-enabled solver for time-constrained linear sum assignment problems," in Informatics and Systems (INFOS), 2010 The 7th International Conference on, 2010.
[9] R. Roverso, S. El-Ansary, and S. Haridi, "NATCracker: NAT combinations matter," Computer Communications and Networks, International Conference on, vol. 0, pp. 1–7, 2009.
[10] D. Bertsekas, "The auction algorithm: A distributed relaxation method for the assignment problem," Annals of Operations Research, vol. 14, no. 1, pp. 105–123, 1988.
[11] A. V. Goldberg and R. E. Tarjan, "A new approach to the maximum flow problem," in STOC '86: Proceedings of the eighteenth annual ACM symposium on Theory of computing. New York, NY, USA: ACM, 1986, pp. 136–146.
[12] R. Roverso, M. Al-Aggan, A. Naiem, A. Dahlstrom, S. El-Ansary, M. El-Beltagy, and S. Haridi, "MyP2PWorld: Highly reproducible application-level emulation of P2P systems," in Decentralized Self Management for Grid, P2P, User Communities workshop, SASO 2008, 2008.

Paper B

Chapter 7

NATCracker: NAT Combinations Matter Roverso Roberto, El-Ansary Sameh and Haridi Seif In Proceedings of 18th International Conference on Computer Communications and Networks 2009, August 2009, San Francisco, CA, USA.

NATCracker: NAT Combinations Matter

Roberto Roverso1,2, Sameh El-Ansary1,3, and Seif Haridi2
1 Peerialism Inc., Sweden, 2 KTH-Royal Institute of Technology, Sweden, 3 Nile University, Egypt
{roberto,sameh}@peerialism.com, [email protected]

Abstract—In this paper, we report our experience in working with Network Address Translators (NATs). Traditionally, there were only 4 types of NATs, and for each type the (im)possibility of traversal is well-known. Recently, the NAT community has provided a deeper dissection of NAT behaviors, resulting in at least 27 types, and has documented the (im)possibility of traversal for some types. There are, however, two fundamental issues that were not previously tackled by the community. First, given the more elaborate set of behaviors, it is incorrect to reason about traversing a single NAT; instead, combinations must be considered, and we have not found any study that comprehensively states, for every possible combination, whether direct connectivity with no relay is feasible. Such a statement is the first outcome of this paper. Second, there is a serious need for some kind of formalism to reason about NATs, which is a second outcome of this paper. The results were obtained using our own scheme, which is an augmentation of currently-known traversal methods. The scheme is validated by reasoning using our formalism, by simulation, and by implementation in a real P2P network.

I. INTRODUCTION

Dealing with Network Address Translators (NATs) is nowadays an essential need for any P2P application. The techniques used to deal with NATs have been more or less "coined", and there are several widely-used methods [1][2]; some of them are de-facto standards, like STUN [3], TURN [4] and ICE [5]. In the context of our P2P live video streaming application PeerTV, we are mainly concerned with media streaming using UDP, and therefore the scope of this paper is UDP NAT traversal. Moreover, we are strictly interested in solutions that do not use relaying, such as TURN, due to the high bandwidth requirements of video streaming. We have found much previous work on the subject that aims to answer the following question: for every t in the set of NAT types T, which s in the set of traversal strategies S should be used to traverse t? The answer is of the form f : T → S, for example f : {Full-Cone, Symmetric} → {Simple Hole Punching, Port-Prediction} [6]. However, the point which we found not gaining enough attention is that the existence of a feasible traversal technique enabling two peers behind NAT to communicate depends on the "combination" of the NAT types, and not on the type of each peer separately. Thus, the question should be: given two peers pa and pb with respective NAT types t(pa) and t(pb), which traversal strategy s is needed for pa and pb to talk? The answer is of the form f : T × T → S, i.e. we need to analyze traversable combinations rather than traversable types.

Most works contain a few examples of combinations for explanation purposes [6][7]. However, we have failed to find any comprehensive analysis that states, for every possible combination of NAT types, whether direct (i.e. with no relay) connectivity is possible and how. The analysis is all the more topical given that the NAT community is switching from the classical set of NAT types Tclassic = {Full-Cone, Restricted-Cone, Port-Restricted, Symmetric} [3] to a more elaborate set that defines a NAT type by a combination of three different policies, namely port mapping, port allocation and port filtering [8]. With that, a statement like "two peers behind symmetric NAT can not communicate" becomes imprecise; as we will show, in many cases it is possible, given the nuances available in the presently wide spectrum of NAT types.

II. RELATED WORK

The work in [7] includes a matrix for a number of combinations, however mostly drawn from Tclassic rather than the more elaborate classification in [8]. The work in [6] is probably the closest to ours; one can see our work as a superset of the set of combinations mentioned in that work.

III. NAT TYPES AS COMBINATIONS OF POLICIES

In this section we semi-formally summarize the more elaborate classification of NATs known as "BEHAVE-compliant" [8] and craft the notation that we will use in the rest of the paper.

Notation. Let na and nb be NAT gateways. For i ∈ {a, b}, let Pi = {pi, p′i, p′′i, ...} be the set of peers behind ni. An "endpoint" e is a host-port pair e = (h, p), where h(e) is the host of e and p(e) is its port. Let Vi = {vi, v′i, v′′i, ...} denote the set of all private endpoints of all peers behind ni, and Ui = {ui, u′i, u′′i, ...} the set of public endpoints of ni, i.e. ∀v ∈ Vi, h(v) ∈ Pi and ∀u ∈ Ui, h(u) = ni. When a packet is sent out from a certain private endpoint vi of a peer pi behind a gateway ni to some public endpoint d, a rule in the NAT table of ni is created. We define the set of NAT table rules Ri = {ri, r′i, r′′i, ...} at ni; a rule records the fact that some public port ui and some private port vi are associated, e.g. ra = (va ↔ ua). The behavior of a gateway ni is defined by three policies, namely port mapping, port filtering and port allocation. We use the notation m(ni), f(ni), a(ni) to denote the respective policies of gateway ni.

A. Mapping Policy

The mapping policy is triggered every time a packet is sent from a private endpoint vi behind the NAT to some external public endpoint d. The role of the mapping policy is to decide whether a new rule will be added or an existing one will be reused. We use the notation:

1) $\overrightarrow{v_i, d} \leadsto r_i$ to specify that the sending of a packet from vi to d resulted in the creation of a new NAT table rule ri at ni, that is, the binding of a new public port on ni. We say that a rule was created, rather than just a port bound, because we care not only about the binding of the port but also about the constraints on using this new port.
2) $\overrightarrow{v_i, d} \Rightarrow r_i$ to specify that the sending of the packet reused an already existing rule ri.
3) $\overrightarrow{v_i, d} \overset{\textit{reason}}{\not\Rightarrow} r_i$ to specify that the sending of the packet did not reuse some ri in particular, because of some "reason".

Irrespective of the mapping policy, whenever a packet is sent from a private endpoint vi to an arbitrary public destination endpoint d, and ∄ri ∈ Ri of the form ri = (vi ↔ ui) for an arbitrary ui, the following holds: $\overrightarrow{v_i, d} \leadsto r_i$. However, if such a mapping exists, the mapping policy makes the reuse decision based on the destination. For all subsequent packets from vi to d, naturally, $\overrightarrow{v_i, d} \Rightarrow r_i$. However, for any d′ ≠ d, there are 3 different behaviors:

• Endpoint-Independent, m(ni) = EI: $\overrightarrow{v_i, d'} \Rightarrow r_i$, for any d′
• Host-Dependent, m(ni) = HD: $\overrightarrow{v_i, d'} \Rightarrow r_i$ iff h(d) = h(d′); $\overrightarrow{v_i, d'} \leadsto r_i'$ iff h(d) ≠ h(d′), where r′i = (vi ↔ u′i) and u′i ≠ ui
• Port-Dependent, m(ni) = PD: $\overrightarrow{v_i, d'} \leadsto r_i'$

Having introduced the different policies, we decorate the notation of the rule to include the criteria that will be used to decide whether a certain rule will be reused, as follows:

$$ r_i = \begin{cases} v_i \xleftrightarrow{\,m:\, v_i \to *\,} u_i & \text{if } m(n_i) = \text{EI} \\[4pt] v_i \xleftrightarrow{\,m:\, v_i \to (h(d),\, *)\,} u_i & \text{if } m(n_i) = \text{HD} \\[4pt] v_i \xleftrightarrow{\,m:\, v_i \to d\,} u_i & \text{if } m(n_i) = \text{PD} \end{cases} $$

Here the syntax m : x → y means that the rule will be reused if the source endpoint of the packet is x and the destination is y; the ∗ denotes any endpoint. Order. We impose the order EI < HD < PD according to the increasing level of restrictiveness.
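The mapping behaviors above translate directly into code. The following compact sketch models a NAT type as a triple of policies and implements the EI/HD/PD reuse decision; all names are illustrative rather than taken from any real library.

```java
// Sketch of the paper's policy model: a NAT type is the triple (m, a, f),
// and the mapping policy decides whether a rule created towards destination
// d is reused for a new destination d2.
public class NatModel {

    enum Mapping    { EI, HD, PD }   // endpoint-independent, host-, port-dependent
    enum Allocation { PP, PC, RD }   // port-preservation, contiguity, random
    enum Filtering  { EI, HD, PD }

    record Endpoint(String host, int port) {}
    record NatType(Mapping m, Allocation a, Filtering f) {}  // the full type

    // True iff a rule created for destination d is reused when sending to d2.
    static boolean reuse(Mapping m, Endpoint d, Endpoint d2) {
        switch (m) {
            case EI: return true;                         // any destination
            case HD: return d.host().equals(d2.host());   // same host only
            default: return d.equals(d2);                 // same host and port
        }
    }

    public static void main(String[] args) {
        Endpoint d = new Endpoint("z.example", 3478), d2 = new Endpoint("z.example", 9000);
        System.out.println(reuse(Mapping.EI, d, d2)); // true
        System.out.println(reuse(Mapping.HD, d, d2)); // true  (same host)
        System.out.println(reuse(Mapping.PD, d, d2)); // false (port differs)
    }
}
```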

B. Allocation Policy

The allocation policy decides which public port p(ui) is chosen when a new rule is created. There are 3 behaviors:

1) Port-Preservation, a(ni) = PP: given $\overrightarrow{v_i, d} \leadsto r_i$, where ri = (vi ↔ ui), it is always the case that p(ui) = p(vi). Naturally, this may cause conflicts if any two peers pi and p′i behind ni decide to bind private endpoints with a common port.
2) Port-Contiguity, a(ni) = PC: given any two sequentially allocated public endpoints ui and u′i, it is always the case that p(u′i) = p(ui) + ∆, for some ∆ = 1, 2, ...
3) Random, a(ni) = RD: ∀ui, p(ui) is allocated at random.

Order. We impose the order PP < PC < RD according to the increasing level of difficulty of handling.

C. Filtering Policy

The filtering policy decides whether a packet from the outside world to a public endpoint of a NAT gateway should be forwarded to the corresponding private endpoint. Given an existing rule ri = (vi ↔ ui) that was created to send a packet from vi to d, we use the notation:

1) $r_i \Leftarrow \overleftarrow{u_i, s}$ to denote that the receival of a packet from the public endpoint s to ni's public endpoint ui is permitted by ri.
2) $r_i \overset{\textit{reason}}{\not\Leftarrow} \overleftarrow{u_i, s}$ to denote that the receival is not permitted, because of some "reason".

There are 3 filtering policies, with the following conditions for allowing receival:

• Endpoint-Independent, f(ni) = EI: $r_i \Leftarrow \overleftarrow{u_i, s}$, for any s
• Host-Dependent, f(ni) = HD: $r_i \Leftarrow \overleftarrow{u_i, s}$ iff h(s) = h(d)
• Port-Dependent, f(ni) = PD: $r_i \Leftarrow \overleftarrow{u_i, s}$ iff s = d

We also decorate the rules to include the conditions for accepting packets, as follows:

$$ r_i = \begin{cases} v_i \xleftrightarrow{\,f:\, u_i \leftarrow *\,} u_i & \text{if } f(n_i) = \text{EI} \\[4pt] v_i \xleftrightarrow{\,f:\, u_i \leftarrow (h(d),\, *)\,} u_i & \text{if } f(n_i) = \text{HD} \\[4pt] v_i \xleftrightarrow{\,f:\, u_i \leftarrow d\,} u_i & \text{if } f(n_i) = \text{PD} \end{cases} $$

Order. We impose the order EI < HD < PD according to the increasing level of restrictiveness.

EI and f (nx′ 6=x ) < PD. Proof: We consider the most restrictive case where f (na ) = f (nb ) = m(na ) = m(nb ) = PD and a(na ) = a(nb ) = RD and show the minimum relaxations that we need to do for SHP to work. By looking at the steps in section VI, and considering all the very restrictive mapping and filtering

on both sides, we can see that after steps 5 and 6, ra and rb will be     as follows: f :ub ←ua f :ua ←uz ra = va ←−−−−−−→ ua , rb = vb ←−−−−−−→ ub m:vb →ua

m:va →uz

Which will cause the following problems: ub 6=uz ←−,− − 6 u In step 7: ra ⇐= b ua and there is nothing that we can relax at nb which can help. Instead, we have to relax the filtering at pa to indulge receiving on ua from ub while it was initially opened for receiving from uz . i.e, ra has to tolerate host change which is not satisfied by PD nor HD filtering,  therefore f (na ) = EI is necessary, resulting into f :ua ←∗ ra = va ←−−−−−−→ ua m:va →uz u 6=u

−→ b6 z r and − −→ ′ ′ In step 8: − v− v− a a , ub =⇒ a , ub  ra where ra =   ′ u = 6 u ′ a a ←′−−− f :u ←∗ va ←−−−a−−−→ u′a . Consequently, rb ⇐= 6 ua , ub . To m:va →ub

solve this, we have two solutions, the first is to let the mapping reuse ra and not create ra′ which needs relaxing m(na ) to be EI, in which case we can keep f (nb ) as restrictive. The second solution is to keep na as restrictive and relax f (nb ) to tolerate receiving from u′a . In the second solution, there is a minor subtlety that needs to he handled, where pb has to be careful to keep sending to pa on ua despite the fact that it is receiving from u′a . Similarly pa should always send to pb on ub despite the fact it is receiving from u′b . That is an asymmetry that is not in general needed. C. Coverage of SHP Since |τ | = 27 types, we have a 27×28 = 378 distinct com2 binations of NAT types of two peers. Using Theorem 6.1, we find that 186 combinations, i.e. 49.2% of the total number of possible ones are traversable using the Simple Hole Punching approach. That said, this high coverage is totally orthogonal to how often one is likely to encounter combinations in the covered set in practice, which we discuss in our evaluation (Section IX-A). Traversable SHP combinations are shown in Figure 1 with label SHP(*). To cover the rest of the cases, we use port prediction which enables a peer to punch a hole by sending to the opposite peer instead of z, which makes it possible to tolerate more restrictive filtering and mapping policies, as explained below. VII. P REDICTION A. Prediction using Contiguity (PRC) The traversal process consists in the following steps: 1) pa sends two consecutive messages: • from some va to z through na dum • from va to ub , an arbitrary endpoint of nb 2) na creates the following two rules: ′ ′ • ra = (va ↔ ua ) and forwards to z. dum • ra = (va ↔ ua ) and forwards to ub . Actually, the whole point of sending udum is to open ua by b sending to nb but be able to predict it at z. 3) The messages are received as follows:

Fig. 1. All possible distinct NAT types combinations for two peers a and b with the technique needed to traverse the combination and X for un-traversable combinations. SHP(*), PRC(*) and PRP(*) stand respectively for Simple Hole Punching, Port Prediction using Contiguity and Port Prediction using Preservation. Combinations of NAT behaviors mandated by RFC 4787 are identified by the label BEHAVE in the table’s legend.

a) z receives and consequently knows u′a and additionally predicts ua = u′a + ∆ where ∆ is known during the discovery process. b) nb drops the message since no endpoint udum was b ever bound. 4) z informs pb about ua (Out-of-Band). 5) Steps 5 − 9 follow the same scheme as in simple hole punching. Port scanning.The process is susceptible to failure if another peer p′a happens by coincidence to send a packet between the two consecutive packets. For that, a technique called port scanning [6] is used such that when pb tries to connect to ua , pb will try ua + ∆, ua + 2∆, ua + 3∆, etc.. until a reply is received. Some gateways might identify this as a malicious UDP port scan and block it as is the case in some corporate firewalls. Port scanning might be used only when pb connecting to pa where a(na ) = P C has m(nb ) < P D, as shown by[6]. B. Prediction using Preservation (PRP) Another technique is to exploit the port-preservation allocation policy. However, to do that, we assume that when a peer with port-preservation policy registers at z, the peer supplies a pool of free candidate ports to z. The main point here is to avoid conflicts with ports of other peers behind the same NAT. The rendez-vous server z is stateful regarding which ports are bound by each NAT and chooses from the pool of the ports supplied by the peer a port which is not already bound. 1) z chooses some arbitrary port ρ and tells pa (Out-ofBand) to bind ρ 2) pa sends from va where p(va ) = ρ to udum through na . b 3) na creates a new rule ra = (va ↔ ua )udum and b forwards to udum and since a(pa ) = PP, p(ua ) = b p(va ) = ρ.

4) z informs p_b about u_a (Out-of-Band).
5) Steps 5-9 follow the same scheme as in SHP.

Note that the process is shorter than prediction by contiguity, and that z chooses the port for the peer behind the NAT, instead of the NAT deciding it and z observing it. However, for the sake of reasoning the two are equivalent, because what matters is what happens after the opposite peer learns about the punched port, irrespective of how the port was predicted.

C. Prediction-on-a-Single-Side Feasibility

Theorem 7.1: Prediction using contiguity or preservation on a single side is feasible for establishing direct communication between two peers p_a and p_b respectively behind n_a and n_b if:
• Condition 1: there exists n_x ∈ {n_a, n_b} such that a(n_x) < RD and f(n_x) < PD;
• Condition 2: either m(n_x) < PD, or m(n_x) = PD and f(n_x') < PD for x' ≠ x.

Proof: As in Theorem 6.1, we start with the most restrictive policies and relax them until prediction is feasible. The allocation policy of the side to be predicted (n_a in Sections VII-A and VII-B) cannot be random, because the whole idea of prediction relies on a predictable allocation policy; thus the needed relaxation is a(n_a) < RD. In both prediction techniques, the dummy packet from p_a punches a hole by sending to p_b, in contrast to SHP, which punches by sending to z; nevertheless, it is sent to a dummy port of p_b. After steps 5 and 6:

$$ r_a = \Big( v_a \xleftrightarrow[m:\, v_a \to u_b^{dum}]{f:\, u_a \leftarrow u_b^{dum}} u_a \Big), \qquad r_b = \Big( v_b \xleftrightarrow[m:\, v_b \to u_a]{f:\, u_b \leftarrow u_a} u_b \Big) $$

In step 7, since u_b ≠ u_b^dum, the packet (u_b, u_a) is dropped by r_a. We have to relax the filtering of n_a to tolerate the port difference from u_b, while host sensitivity remains tolerable. The needed relaxation is f(n_a) < PD, resulting in:

$$ r_a = \Big( v_a \xleftrightarrow[m:\, v_a \to u_b^{dum}]{f:\, u_a \leftarrow (n_b, *)} u_a \Big) $$

In step 8, the reasoning about relaxing the mapping of n_a or the filtering of n_b is identical to Theorem 6.1, except that host sensitivity is tolerable; thus either m(n_a) < PD, or m(n_a) = PD is kept, in which case the needed relaxation is f(n_b) < PD.

Fig. 2. Distribution of encountered NAT types in τ as (m, f, a). [Bar chart omitted; y-axis: Total %.]

D. Coverage of PRP & PRC

PRP and PRC together cover another 18% of the combinations. Moreover, PRP is as good as SHP in terms of traversal time and success rate (see Section IX), which means that, in addition to the cases in Figure 1 where single-side PRP is required, we can also use PRP instead of SHP whenever the allocation policy is port preservation.
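To make the prediction mechanics above concrete, the following minimal Python sketch shows the two halves of single-side PRC: the rendez-vous side predicting the next allocated port from the observed one, and the opposite peer falling back to the port scanning of [6]. All function names (send_udp, recv_reply) and the bounds are hypothetical; the real protocol additionally involves timeouts, retries and the out-of-band signalling through z.

    # Sketch of single-side prediction using contiguity (PRC). The callbacks
    # send_udp/recv_reply are hypothetical stand-ins for a UDP socket API.

    DELTA = 1        # assumed per-NAT port increment, learned during discovery
    MAX_SCAN = 16    # bound on the port-scanning fallback

    def predict_port(observed_port, delta=DELTA):
        """Rendez-vous side: the port the NAT is expected to allocate next."""
        return observed_port + delta

    def connect_with_scanning(send_udp, recv_reply, peer_ip, predicted_port,
                              delta=DELTA, max_scan=MAX_SCAN):
        """Opposite-peer side: try the predicted port, then predicted + delta,
        predicted + 2*delta, ... until a reply arrives or the bound is hit."""
        for k in range(max_scan):
            port = predicted_port + k * delta
            send_udp(peer_ip, port, b"punch")
            if recv_reply(timeout=0.2):
                return port      # hole punched on this port
        return None              # traversal failed (or flagged as a port scan)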

VIII. INTERLEAVED PREDICTION ON TWO SIDES

The remaining combinations are those covered neither by SHP nor by single-side prediction. The final stretch is to do simultaneous prediction on both sides. This is a seemingly tricky deadlock situation, because every peer needs to know the port that will be opened by the other peer without the other peer sending anything. We solve it as follows.

Interleaved PRP-PRP. In this case, double prediction is actually very simple, because the rendez-vous server can pick a port for each side and instruct the involved peers to simultaneously bind it and start the communication process.

Interleaved PRP-PRC. This case is also easily solvable thanks to preservation, because z can inform the peer with a port-contiguity allocation policy about the specific endpoint of the opposite peer. The latter, in turn, runs a port prediction process using the obtained endpoint in the second consecutive message.

Interleaved PRC-PRC. This one is the trickiest, and it needs a small modification in the way prediction by contiguity is done. The idea is that the two consecutive packets, the first to z and the second to the opposite peer, cannot be sent immediately after each other. Instead, both peers are commanded by z to send a packet to z itself. From that, z deduces the ports that will be opened on each side in the future and informs both peers about the opposite peer's predicted endpoint. Both peers, in turn, send a punching packet to each other. The problem with this scheme is that there is more time between the consecutive packets, which makes it more susceptible to another peer behind either of the NATs sending a packet in between. As in the case of single PRC, port scanning is the only resort, but in general this combination has a lower success rate compared to single PRC (see Section IX). For our reasoning, we will work on the last case (PRC-PRC), since it is a harder generalization of the first two.

A. Traversal Process
1) z tells p_a & p_b to start prediction (Out-of-Band)

2) p_a & p_b both send to z through n_a & n_b respectively, resulting in the new rules r_a' = (v_a ↔ u_a'), r_b' = (v_b ↔ u_b')
3) z receives from p_a & p_b, observing u_a' & u_b' and deducing u_a = u_a' + ∆ & u_b = u_b' + ∆
4) z informs p_a & p_b about u_b & u_a respectively (Out-of-Band)
5) p_a sends to u_b through n_a, and p_b sends to u_a through n_b
6) n_b receives and forwards to v_b, and n_a receives and forwards to v_a

A race condition can take place where step 6 for one of the peers happens before the opposite peer starts to run step 5, resulting in a packet drop. However, the dropped packet opens the hole for the opposite peer, and retrying the send is enough to take care of this issue.

B. Interleaved Prediction Feasibility

Theorem 8.1: Interleaved prediction is feasible for establishing direct communication between two peers p_a and p_b respectively behind n_a and n_b if both a(n_a) and a(n_b) are < RD.

Proof: Similar to Theorem 6.1, we start with the most restrictive policies and relax until prediction is feasible. Since we need to predict both sides, we need a(n_a) < RD and a(n_b) < RD. After step 5 in Section VIII-A, we have:

$$ r_a = \Big( v_a \xleftrightarrow[m:\, v_a \to u_b]{f:\, u_a \leftarrow u_b} u_a \Big), \qquad r_b = \Big( v_b \xleftrightarrow[m:\, v_b \to u_a]{f:\, u_b \leftarrow u_a} u_b \Big) $$

In step 6, r_a accepts the packet from u_b and r_b accepts the packet from u_a, without the need for any relaxation of the filtering or the mapping on either side.
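For illustration, the rendez-vous side of interleaved PRC-PRC might look like the following Python sketch. The packet and notify abstractions are hypothetical, and a real implementation also handles the retries needed for the race condition discussed above.

    # z's role in interleaved PRC-PRC: observe u'_a and u'_b, predict the next
    # allocations, and tell each peer the opposite predicted endpoint.

    def interleaved_prc_prc(pkt_a, pkt_b, delta_a, delta_b, notify):
        ip_a, port_a = pkt_a.source_endpoint    # externally observed u'_a
        ip_b, port_b = pkt_b.source_endpoint    # externally observed u'_b
        u_a = (ip_a, port_a + delta_a)          # predicted hole on n_a
        u_b = (ip_b, port_b + delta_b)          # predicted hole on n_b
        notify(pkt_a.peer_id, u_b)              # step 4: out-of-band GO(u_b)
        notify(pkt_b.peer_id, u_a)              # step 4: out-of-band GO(u_a)
        # Step 5 then happens on the peers: p_a sends to u_b and p_b sends to
        # u_a; a packet dropped by the race in step 6 still opens the hole.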

C. Interleaved Prediction Coverage

Interleaved prediction covers another 11.9% of the combinations, namely the ones shown in Figure 1, leaving 20.6% of the cases untraversable. That is, approximately 79.4% of all NAT type combinations are traversable, and for

each combination, we know which technique to use. More importantly, not all of them have the same likelihood of being encountered, which we discuss in the next section. That said, it is worth mentioning that there is a technique in [9] which performs a brute-force search on all possible ports after reducing the search space using the birthday paradox; we ignored it due to its low success probability, high traffic and long time requirements.

IX. EVALUATION

Apart from the reasoning above, we have done a sanity check on our logic using our emulation platform [10]. That is, we wrote our own NAT boxes, which behave according to the semantics defined in Section III. We also implemented the rendez-vous server and nodes capable of performing all the traversal techniques in Section V. For each case in Figure 1, we ran the suggested traversal technique and made sure that direct communication is indeed achievable. Real-life evaluation was needed to gain insights on other aspects, such as the probability of encountering a given type, the success rates of the traversal techniques, and the time needed for the traversal process to complete.

A. Distribution of Types

We wanted to know how likely it is to encounter each of the types in τ. We have collected cumulative results for peers who have joined our network over time. As shown in Figure 2: i) we encountered 13 out of the 27 possible types; ii) we found that (m = EI, f = PD, a = PP) is a rather popular type (approx. 37% of all encountered types), which is fortunate because port preservation is quite friendly to deal with, and it comes with a very relaxed mapping; iii) about 11% are of the worst kind to encounter, because when two peers of this type need to talk, interleaved prediction is needed, with a shaky success probability.

B. Adoption of BEHAVE RFC

By looking at each policy alone, we can see to what extent the recommendations of the BEHAVE RFC [8] (f = EI/HD, m = EI) are adopted. As shown in Table I, for filtering the majority adopt the policy discouraged by the RFC, while for mapping the majority follow the recommendation. For allocation, the RFC does not make any specific relevant recommendation. The percentage of NATs following both recommendations was 30%.

TABLE I
DISTRIBUTION OF ENCOUNTERED NAT POLICIES

    Mapping      EI 80.21%    HD 0%        PD 19.79%
    Filtering    EI 13.54%    HD 17.45%    PD 69.01%
    Allocation   PP 54.69%    PC 23.7%     RD 21.61%

Fig. 3. Success rate of each technique averaged over all applicable combinations. [Bar chart omitted; y-axis: Success Rate (%); x-axis: Traversal Strategies.]

Fig. 4. Time taken (in msec) for the traversal process to complete. [Bar chart omitted; y-axis: Time (ms); x-axis: Traversal Strategies.]

C. Success Rate

Given the set of peers present in the network at one point in time, we conduct a connectivity test where all peers try to connect to each other. We group the results by traversal technique, e.g. SHP is applicable to 186 combinations, so we average the success rate over all such combinations; the whole process is repeated a number of times. We found (Figure 3), as expected, that SHP is rather dependable, as it succeeds 96% of the time. We also found that PRP is as good as SHP, which is quite positive given the high probability of occurrence of port preservation reported in the last section. Interleaved PRP-PRP is also rather good, with a slightly worse success rate. The three remaining techniques, involving PRC in one way or another, cause the success rate to drop significantly, especially for PRC-PRC, mainly because of the additional delay introduced by interleaving.

D. Time to traverse

When it comes to the time needed for the traversal process to complete (Figure 4), we find two main classes, SHP and

PRP in one class and PRC in another; even when we do PRC-PRP, it is faster than PRC alone because the number of messages is smaller.

X. CONCLUSION & FUTURE WORK

In this paper, we have presented our experience in seeking a comprehensive analysis of which combinations of NAT types are traversable. We have done that using semi-formal reasoning that covers all cases, and we have provided slightly augmented versions of the well-known traversal techniques, showing which ones are applicable to which combinations. We have shown that about 80% of all possible combinations are traversable. Using our deployment base for P2P live streaming, we have shown that only 50% of all possible types are encountered in practice. We have also reported our findings on the success probability and the time needed to traverse the different combinations. For future work: a) Modeling: we would like to enrich the model to capture real-life aspects such as expiration of NAT rules, multiple levels of NAT, subtleties of conflicts between many peers behind the same NAT, NATs that use different policies in different situations, and support for UPnP and TCP; b) Real-life evaluation: more insight into the trade-off between success probability and timing, and preventing the techniques from being identified as malicious actions by some corporate firewalls; c) Dissemination: releasing our library and simulator as open source for third-party improvement and evaluation.

XI. ACKNOWLEDGMENTS

We would like to thank all members of Peerialism's development team for their help and collaboration on the implementation of our NAT traversal techniques, in particular Magnus Hedbeck for his patience and valuable feedback. The anonymous reviewers of ICCCN provided a really inspiring set of comments that helped us improve the quality of this paper.

REFERENCES

[1] B. Ford, P. Srisuresh, and D. Kegel, "Peer-to-peer communication across network address translators," in ATEC '05: Proceedings of the Annual Conference on USENIX Annual Technical Conference. Berkeley, CA, USA: USENIX Association, 2005, pp. 13-13.
[2] P. Srisuresh, B. Ford, and D. Kegel, "State of Peer-to-Peer (P2P) Communication across Network Address Translators (NATs)," RFC 5128 (Informational), Internet Engineering Task Force, Mar. 2008. [Online]. Available: http://www.ietf.org/rfc/rfc5128.txt
[3] J. Rosenberg, J. Weinberger, C. Huitema, and R. Mahy, "STUN - Simple Traversal of User Datagram Protocol (UDP) Through Network Address Translators (NATs)," RFC 3489 (Proposed Standard), Internet Engineering Task Force, Mar. 2003, obsoleted by RFC 5389. [Online]. Available: http://www.ietf.org/rfc/rfc3489.txt
[4] J. Rosenberg, R. Mahy, and C. Huitema, "Traversal Using Relays around NAT (TURN): Relay extensions to Session Traversal Utilities for NAT (STUN)," Internet draft, November 2008. [Online]. Available: http://tools.ietf.org/html/draft-ietf-behave-turn-14
[5] J. Rosenberg, "Interactive Connectivity Establishment (ICE): A protocol for Network Address Translator (NAT) traversal for offer/answer protocols," Internet draft, October 2007. [Online]. Available: http://tools.ietf.org/html/draft-ietf-mmusic-ice-19

[6] Y. Takeda, "Symmetric NAT traversal using STUN," Internet draft, June 2003. [Online]. Available: http://tools.ietf.org/html/draft-takeda-symmetric-nat-traversal-00
[7] D. Thaler, "Teredo extensions," Internet draft, March 2009. [Online]. Available: http://tools.ietf.org/html/draft-thaler-v6ops-teredo-extensions-03
[8] F. Audet and C. Jennings, "Network Address Translation (NAT) Behavioral Requirements for Unicast UDP," RFC 4787 (Best Current Practice), Internet Engineering Task Force, Jan. 2007. [Online]. Available: http://www.ietf.org/rfc/rfc4787.txt
[9] A. Biggadike, D. Ferullo, G. Wilson, and A. Perrig, "NATBLASTER: Establishing TCP connections between hosts behind NATs," in Proceedings of ACM SIGCOMM ASIA Workshop, Apr. 2005.
[10] R. Roverso, M. Al-Aggan, A. Naiem, A. Dahlstrom, S. El-Ansary, M. El-Beltagy, and S. Haridi, "MyP2PWorld: Highly reproducible application-level emulation of P2P systems," in Decentralized Self Management for Grid, P2P, User Communities Workshop, SASO 2008, 2008.

APPENDIX
TRAVERSAL SEQUENCE DIAGRAMS

[Message sequence charts omitted. Four diagrams illustrate the exchanges between p_a, n_a, z, n_b and p_b for: Simple Hole Punching (SHP); Prediction using Preservation (PRP); Prediction using Contiguity (PRC); and Interleaved Prediction (PRC-PRC).]

Paper C

Chapter 8

GPU-Based Heuristic Solver for Linear Sum Assignment Problems Under Real-time Constraints

Roverso Roberto, Naiem Amgad, El-Beltagy Mohammed, El-Ansary Sameh and Haridi Seif

In Proceedings of the 7th International Conference on Informatics and Systems (INFOS 2010), March 2010, Cairo, Egypt.


GPU-Based Heuristic Solver for Linear Sum Assignment Problems Under Real-time Constraints

Roberto Roverso∗†, Amgad Naiem∗‡, Mohammed El-Beltagy∗‡, Sameh El-Ansary∗§ and Seif Haridi†
∗ Peerialism Inc., Sweden  † KTH-Royal Institute of Technology, Sweden  ‡ Cairo University, Egypt  § Nile University, Egypt
{roberto,amgad,mohammed,sameh}@peerialism.com, [email protected]

Abstract—In this paper we modify a fast heuristic solver for the Linear Sum Assignment Problem (LSAP) for use on Graphical Processing Units (GPUs). The motivating scenario is an industrial application for P2P live streaming that is moderated by a central node which is periodically solving LSAP instances for assigning peers to one another. The central node needs to handle LSAP instances involving thousands of peers in as near to real-time as possible. Our findings are generic enough to be applied in other contexts. Our main result is a parallel version of a heuristic algorithm called Deep Greedy Switching (DGS) on GPUs using the CUDA programming language. DGS sacrifices absolute optimality in favor of low computation time and was designed as an alternative to classical LSAP solvers such as the Hungarian and auctioning methods. The contribution of the paper is threefold: First, we present the process of trial-and-error we went through, in the hope that our experience will be beneficial to adopters of GPU programming for similar problems. Second, we show the modifications needed to parallelize the DGS algorithm. Third, we show the performance gains of our approach compared to both a sequential CPU-based implementation of DGS and a parallel GPU-based implementation of the auctioning algorithm.

I. INTRODUCTION

In order to deal with hard optimization or combinatorial problems in time-constrained environments, it is often necessary to sacrifice optimality in order to meet the imposed deadlines. In our work, we have dealt with a large-scale peer-to-peer live-streaming platform where the task of assigning n providers to m receivers is carried out by a centralized optimization engine. The problem of assigning peers to one another is modelled as a linear sum assignment problem (LSAP). However, in our P2P system, the computational overhead of minimizing the cost of assigning n jobs (receivers) to n agents (senders) is usually quite high, because we are often dealing with tens of thousands of agents and jobs (peers in the system). We have seen our implementation of classical LSAP solvers take several hours to provide an optimal solution to a problem of this magnitude. In the context of live streaming we can only afford a few seconds to carry out this optimization. It was also important for us not to sacrifice optimality too much in the pursuit of a practical optimization solution. We hence opted for a strategy of trying to discover a fast heuristic near-optimal solver for LSAP that is also amenable to parallelization in such a way that it can exploit the massive

computational potential of modern GPUs. After structured experimentation on a number of ideas for a heuristic optimizer, we found a simple and effective heuristic we called Deep Greedy Switching (DGS) [1]. It was shown to work extremely well on the instances of LSAP we were interested in, and we never observed it deviate from the optimal solution by more than 0.6% (cf. [1, p. 5]). Seeing that DGS has parallelization potential, we modified and adapted it to run on any parallel architecture and consequently also on GPUs. In this work, we chose CUDA [2] as the GPU programming language in which to implement the DGS solver. CUDA is a sufficiently general C-like language which allows for the execution of any kind of user-defined algorithm on the highly parallel architecture of NVIDIA GPUs. GPU programming has become increasingly popular in the scientific community during the last few years. However, the task of implementing any nontrivial mathematical procedure in a GPU-specific language still involves a fair amount of effort in understanding the hardware architecture of the target platform. In addition, implementation efforts must take into consideration a set of best practices to achieve the best performance. For this reason, we provide an introduction to CUDA in Section II, covering its advantages, best practices and limitations, so that it will later be easier to appreciate the solver implementation's design choices in Section V. We also detail the inner workings of the DGS heuristic in Section III, and in Section VI we show results comparing different versions of the DGS solver, as well as the final GPU DGS solver against an implementation of the auction algorithm running on GPUs. We conclude the paper with a few considerations on the achievements of this work and our future plans in Section VII.

II. GPUS AND THE CUDA LANGUAGE

Graphical Processing Units are mainly accelerators for graphical applications, such as games and 3D modelling software, which make use of the OpenGL and DirectX programming interfaces. Given that many of the calculations involved in those applications are amenable to parallelization, GPUs have been architected as massively parallel machines. In the last few years GPUs have ceased to be exclusively fixed-function


devices and have evolved to become flexible parallel processors accessible through programming languages [2][3]. In fact, modern GPUs such as NVIDIA's Tesla and GTX are fundamentally fully programmable many-core chips, each having a large number of parallel processors. The cores are organized into Streaming Multiprocessors (SMs), whose number can vary from one, for low-end GPUs, to as many as thirty. Each SM in turn contains 8 Scalar Processors (SPs), each equipped with a set of registers, and 16KB of on-chip memory called Shared Memory. This memory has lower access latency and higher bandwidth than the off-chip memory, called Global Memory, which is usually of the DDR3/DDR5 type and of a size between 512MB and 4GB. We chose CUDA as the GPU computing language for implementing our solver because it best accomplishes a trade-off between ease of use and the required knowledge of the hardware platform's architecture. Other GPU-specific languages, such as AMD's Stream [4] and Khronos' OpenCL standard [3], look promising but fall short of CUDA either in support and documentation or in the quality of the development platform, in terms of the stability of the provided tools, such as compilers and debuggers. Even though CUDA provides a sufficient degree of abstraction from the GPU architecture to ease the task of implementing parallel algorithms, one must still understand the basics of the functioning of NVIDIA GPUs to be able to fully utilize the power of the language. The CUDA programming model requires the application to be organized in a sequential part running on a host, usually the machine's CPU, and parallel parts called kernels that execute code on a parallel device, the GPU(s). Kernels are blocks of instructions which are executed across a number of parallel threads. These are logically grouped by CUDA in a grid whose sub-parts are the thread blocks. The sizes of the grid and of the thread blocks are defined by the programmer. This organization of threads derives from the legacy purpose of GPUs, where the rendering of a texture (the grid) is parallelized by assigning one thread to every pixel, and threads are executed in batches (the thread blocks). A thread block is a set of threads which can cooperate among themselves exclusively using barrier synchronization; no other synchronization primitives are provided. Each block has access to an amount of Shared Memory which is exclusive to its group of threads. The blocks are therefore a way for CUDA to abstract the physical architecture of Streaming Multiprocessors and Scalar Processors away from the programmer. Management of Global and Shared Memory must be handled explicitly by the programmer through primitives provided by CUDA. Although Global Memory is sufficient to run any CUDA program, it is advisable to use Shared Memory in order to obtain efficient cooperation and communication between threads in a block. It is particularly advantageous to let the threads in a block load data from Global Memory into Shared on-chip Memory, execute the kernel instructions, and later copy the result back to Global Memory.

when assigning agent i to job j. The optimal assignment of agents to jobs is the one that yields the maximum total benefit, while respecting the constraint that each agent can only be assigned to only one job, and that no job is assigned to more than one agent. The assignment problem can be formally described as follows max

n n X X

aij xij

i=1 j=1

n X

i=1 n X

xij = 1

∀j ∈ {1 . . . n}

xij = 1

∀i ∈ {1 . . . n}

j=1

xij ∈ {0, 1}

∀i, j ∈ {1 . . . n}

There are many applications that involve LSAP, ranging from image processing to inventory management. The two most popular algorithms for LSAP are the Hungarian method [5] and the auction algorithm [6]. The auction algorithm has been shown to be very effective in practice, for most instances of the assignment problem, and it is considered to be one of the fastest algorithms that guarantees a very near optimal solution (in the limit of nǫ). The algorithm works like a real auction where agents are bidding for jobs. Initially, the price for each job is set to zeros and all agents are unassigned. At each iteration, unassigned agents bid simultaneously for their “best” jobs which causes the jobs’ prices (pj ) to rise according. The prices work to diminish the net benefit (aij − pj ) an agent attains when being assigned a given job. Each job is awarded to the highest bidder, and the algorithm keeps iterating until all agents are assigned. Although the auction algorithm is quite fast and can be easily parallelized, it is not well suited to situations where large instances of the assignment problem are involved and there is deadline after which a solution would be useless. Recently, a novel heuristic approach called Deep Greedy Switching (DGS) [1] was introduced for solving the assignment problem. It sacrifies very little in terms of optimality, for a huge gain in the running time of the algorithm over other methods. The DGS algorithm provides no guarantees1 for attaining an optimal solution, but in practice we have seen it deviate with less than 0.6% from the optimal solutions, that are reported by the auction algorithm, at its worst performance. Such a minor sacrifice in optimality is acceptable in many dynamic systems where speed is the most important factor as an optimal solution that is delivered too late is practically useless. Compared with the auction algorithm, DGS has the added advantage that it starts out with a full assignment of jobs to agents and keeps improving that assignment during the course of its execution. The auction algorithm, however, attains full assignment only at termination. Hence, if a deadline has been reached where an assignment must be produced, DGS can interrupted to get the best assignment solution it has attained thus far. The DGS algorithm, shown in Algorithm 1, starts with a random initial 1 The authors are still working on a formal analysis of DGS that would help explain its surprising success.


The DGS algorithm, shown in Algorithm 1, starts with a random initial solution and then keeps moving within a restricted 2-exchange neighborhood of this solution, according to certain criteria, until no further improvements are possible.

A. Initial Solution

The simplest way to obtain an initial solution is to randomly assign jobs to agents. An alternative is to do a greedy initial assignment where the benefit a_ij is taken into account. In our experiments with DGS we found no clear advantage to either approach. Since the greedy initial assignment takes a bit longer, we opted to use a random agent/job assignment for the initial solution.

B. Difference Evaluation

Starting from a full job/agent assignment σ, each agent tries to find the best 2-exchange in the neighborhood of σ. For each agent i we consider how the objective function f(σ) would change if it were to swap jobs with another agent i' (i.e. a 2-exchange). We select the 2-exchange that yields the best improvement δ_i and save it as agent i's best configuration NA_i. This procedure is called the agent difference evaluation (ADE) and is described formally in Algorithm 2. Similarly, a job difference evaluation (JDE) is carried out for each job, but in this case we consider swapping agents.

C. Switching

Here we select the 2-exchange that yields the greatest improvement in objective function value and modify the job/agent assignment accordingly. We then carry out JDE and ADE for the jobs and agents involved in that 2-exchange. We repeat the switching step until no further improvements are attainable.

We define an assignment as a mapping σ : J → I, where J is the set of jobs and I is the set of agents. Here σ(j) = i means that job j is assigned to agent i. Similarly, another assignment mapping τ : I → J maps agents to jobs, where τ(i) = j means that agent i is assigned to job j. There is also an assignment mapping function to construct τ from σ, defined as τ = M(σ), and the objective function value of an assignment σ is given by f(σ). We make use of a switching function SWITCH(i, j, σ), which returns a modified version of the assignment σ after agent i has been assigned to job j; i.e. a 2-exchange has occurred between agents i and σ(j). For agent i, the job j_i that yields the largest increase in objective function when assigned to agent i can be expressed as

$$ j_i = \operatorname*{arg\,max}_{j = 1,\dots,n,\ j \neq \tau(i)} \big[ f(\mathrm{SWITCH}(i, j, \sigma)) - f(\sigma) \big], $$

and the corresponding improvement in objective function value is

$$ \delta_i = \max_{j = 1,\dots,n,\ j \neq \tau(i)} \big[ f(\mathrm{SWITCH}(i, j, \sigma)) - f(\sigma) \big]. $$

We similarly define for each job j the best agent i_j that it can be assigned to and the corresponding improvement δ_j. Using this terminology, the algorithm is formally described in Algorithm 1.

Algorithm 1: DGS

    DGS(σ, f)
      repeat
        σ_start ← σ;  τ ← M(σ);  δ ← ∅;  δ' ← ∅
        ADE(i, f, τ, σ, NA, δ)  ∀i ∈ I        ⊲ Difference Evaluation
        JDE(j, f, τ, σ, NJ, δ') ∀j ∈ J        ⊲ Difference Evaluation
        while ∃ δ_i > 0 ∨ ∃ δ'_j > 0 do       ⊲ Switching phase
          i* ← arg max_{i=1...n} δ_i;  j* ← arg max_{j=1...n} δ'_j
          if δ_{i*} > δ'_{j*} then
            δ_{i*} ← 0
            σ' ← SWITCH(i*, j_{i*}, σ);  τ' ← M(σ')
            agents ← {i*, σ'(τ(i*))};  jobs ← {τ(i*), τ'(i*)}
          else
            δ'_{j*} ← 0
            σ' ← SWITCH(i_{j*}, j*, σ)
            agents ← {σ(j*), σ'(j*)};  jobs ← {j*, τ(σ'(j*))}
          if f(σ') > f(σ) then
            σ ← σ';  τ ← M(σ)
            ADE(i, f, τ, σ, NA, δ)  ∀i ∈ agents
            JDE(j, f, τ, σ, NJ, δ') ∀j ∈ jobs
      until f(σ_start) = f(σ')
      output σ'

Algorithm 2: ADE

    ADE(i, f, τ, σ, NA, δ)
      j ← τ(i);  σ_i* ← σ;  δ_i ← 0
      foreach j' ∈ {J | j' ≠ j} do
        i' ← σ(j')
        σ_i' ← σ;  σ_i'(j) ← i';  σ_i'(j') ← i
        if f(σ_i') > f(σ_i*) then
          σ_i* ← σ_i'
      if σ_i* ≠ σ then
        NA_i ← σ_i*
        δ_i ← f(σ_i*) − f(σ)
      else
        NA_i ← 0
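As a companion to Algorithm 2, here is a sequential Python sketch of ADE. It exploits the fact that f is a sum of benefits, so the difference caused by a candidate 2-exchange can be computed in constant time instead of re-evaluating f(SWITCH(i, j, σ)) from scratch; the data layout is our own choice, and unlike Algorithm 2 it returns the best swap rather than storing a full candidate assignment NA_i.

    def ade(i, a, sigma, tau):
        """Best 2-exchange for agent i. sigma[j] = agent holding job j;
        tau[i] = job held by agent i; a[i][j] = benefit of agent i on job j."""
        j = tau[i]
        best_delta, best_swap = 0.0, None
        for jp in range(len(sigma)):
            if jp == j:
                continue
            ip = sigma[jp]
            # Objective change if agents i and ip swap jobs j and jp:
            delta = (a[i][jp] + a[ip][j]) - (a[i][j] + a[ip][jp])
            if delta > best_delta:
                best_delta, best_swap = delta, (i, jp)
        return best_delta, best_swap   # delta_i and the 2-exchange achieving it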

IV. EVALUATION

While explaining the process of realizing the CUDA solver in the next section, we also show the impact of the various steps that we went through to implement it and enhance its performance. The experimental setup for the tests consists of a consumer machine with a 2.4 GHz Core 2 Duo processor equipped with 4GB of DDR3 RAM and an NVIDIA GTX 295 graphics card with 1GB of DDR5 on-board memory. The NVIDIA GTX 295 is currently NVIDIA's top-of-the-line consumer video card and boasts a total of 30 Streaming Multiprocessors and 240 Scalar Processors, 8 per SM, which run at a clock rate of 1.24 GHz. In the experiments, we use a thread block size of 256 when executing kernels which do not make use of Shared Memory, and of 16 when they do. Concerning the DGS input scenarios, we use dense instances of the GEOM type defined by Bus and Tvrdík [7], generated as follows: first we generate n points randomly in a 2D square of dimensions [0, C] × [0, C]; then each a_ij value is set to the Euclidean distance between points i and j. We define the problem size to be equal to the number of agents/jobs. For the sake of simplicity, we use problem sizes which are multiples of the thread block size. Note that every experiment is the result of averaging a number of runs executed using differently seeded DGS instances.

Fig. 1. Computational time comparison between Difference Evaluation implementations (CPU, GPU Global Memory, GPU Shared Memory); time in ms versus problem size. [Plot omitted.]

Algorithm 3: Parallel DGS

    DGS(σ, f)
      repeat
        σ_start ← σ;  τ ← M(σ);  δ ← ∅;  δ' ← ∅
        start parallel ∀i ∈ I, ∀j ∈ J        ⊲ Difference Evaluation phase starts
          ADE(i, f, τ, σ, NA, δ)
          JDE(j, f, τ, σ, NJ, δ')
        stop parallel                         ⊲ Difference Evaluation phase ends
        while ∃ δ_i > 0 ∨ ∃ δ'_j > 0 do       ⊲ Switching phase
          CC(I, J, NA, NJ, C)
          d_i ← 0 ∀i ∈ I;  d_{n+j} ← 0 ∀j ∈ J   ⊲ merged improvement vector of size 2n
          d_i ← δ_i for all i ∈ I with i ∉ C
          d_{n+j} ← δ'_j for all j ∈ J with σ(j) ∉ C
          start parallel ∀t with d_t > 0
            if t ≤ n then
              i ← t;  σ' ← SWITCH(i, j_i, σ)
            else
              j ← t − n;  σ' ← SWITCH(i_j, j, σ)
            if f(σ') > f(σ) then
              σ ← σ';  τ ← M(σ)
          stop parallel
          start parallel ∀i ∈ {I | i ∉ C}, ∀j ∈ {J | σ(j) ∉ C}
            ADE(i, f, τ, σ, NA, δ)
            JDE(j, f, τ, σ, NJ, δ')
          stop parallel
      until f(σ_start) = f(σ')
      output σ'

V. THE DGS CUDA SOLVER

The first prototype of the DGS solver was implemented in the Java language. However, its performance did not meet the demands of our target real-time system. We therefore ported the same algorithm to the C language in the hope of obtaining better performance. The outcome of this effort was the first production-quality implementation of DGS, which was sufficiently fast up to problem sizes of 5000 peers. The Difference Evaluation step of the algorithm, as described in Section III-B, amounted to as much as 70% of the total computational time of the solver. Luckily, all JDE and ADE evaluations for jobs and agents can be done in parallel, as they are completely orthogonal and do not need to be executed in a sequential fashion. Hence, our first point of investigation was to implement a CUDA ADE/JDE kernel which could execute both the ADE and JDE algorithms on the GPU. We developed two versions of the ADE/JDE kernel: the first runs exclusively on the GPU's Global Memory, and the second makes use of the GPU's Shared Memory to obtain better performance. For ease of exposition we will only discuss ADE going forward. This is without any loss of generality, as everything that applies to ADE also applies to JDE, with the proviso that we talk of jobs instead of agents.

A. Difference Evaluation on Global Memory

As mentioned earlier, Global Memory is fully addressable by any thread running on the GPU, and no special operation is needed to access data in it. Therefore, in the first version of the kernel, we decided to simply upload the full a_ij matrix to the GPU memory, together with the current agent-to-job assignments and all the data needed to run the ADE algorithm on the GPU. We then let the GPU spawn a thread for each of the agents involved. Consequently, a CUDA thread ct_i associated with agent i executes the ADE algorithm only for agent i by evaluating all its possible 2-exchanges. The agent-to-thread allocation on the GPU is trivial and is made by assigning the thread identifier ct_i to agent i.

B. Difference Evaluation on Shared Memory

Difference Evaluation using Shared Memory assigns one thread to each 2-exchange evaluation for agent i and job j. This implies that the number of created threads equals the number of cells of the a_ij matrix. Each thread ct_ij then proceeds to load into Shared Memory the data needed for the single evaluation between agent i and job j. Once the 2-exchange evaluation is computed, every thread ct_ij stores the resulting value in a matrix located in Global Memory at position (i, j). After that, another small kernel is executed



which causes a thread, for each row i of the resulting matrix, to find the best 2-exchange value along that row over all indexes j. The outcome of this operation is the best 2-exchange value for agent i.

In Figure 1, we compare the results obtained by running the aforementioned Shared Memory GPU kernel implementation and its Global Memory counterpart against the pure C implementation of the Difference Evaluation for different problem sizes. For evaluation purposes, we used a CUDA-enabled version of DGS where only the Difference Evaluation phase of the algorithm runs on the GPU, so that it can be evaluated separately from the other phases. This implies that we need to upload the input data for the ADE/JDE phase to the GPU at every iteration of the DGS algorithm, and that we need to download its output in order to provide input for the Switching phase. These memory transfers are accounted for in the measurements. As we can observe, there is a dramatic improvement when passing from the CPU implementation of the difference evaluation to either GPU implementation, even though multiple memory transfers occur. In addition, the Shared Memory version behaves consistently better than the Global Memory one. Furthermore, the trend for increasing problem sizes is linear for both GPU versions of the Difference Evaluation, as opposed to the much steeper, superlinear growth of the CPU version's curve.

C. Switching

Considering the Switching phase of the DGS algorithm described in Section III-C, we found that in many cases the computational time necessary to apply the best 2-exchanges is fairly high. Our experience is that the Switching phase may account for between 35% and 60% of the total computation time of the solver. In order to improve the performance of this phase, we modified the Switching algorithm so that a subset of the best 2-exchanges computed in the Difference Evaluation phase can be applied concurrently. The modified DGS algorithm is shown in Algorithm 3.

In order to execute some of the switches in parallel, we need to identify which among them are not conflicting. For that, we designed a function called CC, shown in Algorithm 4, which serves this purpose. Once the non-conflicting 2-exchanges are determined by CC, we identify the corresponding agents and jobs and apply the exchanges in parallel. After this operation completes, we re-evaluate the differences for the agents and jobs whose 2-exchanges were identified as conflicting, since there might be better improvements for those that were not applied. At the next iteration of the DGS algorithm, conflicting 2-exchanges may be resolved and applied in parallel. In order to execute the parallel Switching phase on the GPU, we simply let the GPU spawn a number of threads equal to the number of non-conflicting 2-exchanges and let them perform the switches.

Algorithm 4: Check Conflicts

    CC(I, J, NA, NJ, C)
      CR ← ∅;  C ← ∅
      foreach i ∈ {I | NA_i ≠ 0} do
        σ ← NA_i
        i' ← σ(j_i)
        if i ∈ CR or i' ∈ CR then
          C ← {C, i}
        else
          CR ← {CR, i, i'}
      foreach j ∈ {J | NJ_j ≠ 0} do
        σ ← NJ_j
        i ← σ(j)
        if i ∈ CR or i_j ∈ CR then
          C ← {C, i}
        else
          CR ← {CR, i, i_j}
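The conflict-detection idea of Algorithm 4 can be sketched in a few lines of Python; the encoding of a proposal as the pair of agents it touches is our own simplification of the separate agent/job bookkeeping in CC.

    def check_conflicts(proposals):
        """proposals: list of (i, i_prime) agent pairs touched by each proposed
        2-exchange. Returns indices that may run in parallel and those that
        conflict and must be re-evaluated on the next iteration."""
        claimed = set()
        parallel, conflicted = [], []
        for idx, (i, ip) in enumerate(proposals):
            if i in claimed or ip in claimed:
                conflicted.append(idx)
            else:
                claimed.update((i, ip))
                parallel.append(idx)     # safe to apply concurrently
        return parallel, conflicted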

Fig. 2. Computational time comparison between DGS implementations (CPU, Mixed GPU-CPU, GPU); time in ms versus problem size. [Plot omitted.]

VI. RESULTS

In Figure 2 we show the results obtained by comparing three different implementations of the DGS heuristic: a pure C implementation labelled "CPU"; the "Mixed GPU-CPU" implementation, where only the Difference Evaluation phase of the algorithm is executed on the GPU using Shared Memory; and the "GPU" implementation, where all three main phases of DGS, including the Switching, are executed on the GPU. As we can observe, the gain in performance of the "GPU" implementation over the two others is substantial. There are two fundamental reasons for this. The first is the speed-up obtained by applying all non-conflicting 2-exchanges in parallel. The second is a direct consequence of the fact that most of the operations are executed directly on the GPU and few host-device operations are needed. Such operations, e.g. memory transfers, can be expensive and certainly contribute to the absolute time needed for the solver to reach an outcome. In fact, it is interesting to observe that the total termination time for big problem sizes is less than the total time needed to execute just the ADE/JDE phase, as shown in Figure 1, where multiple memory transfers occur at every iteration of the algorithm.


Fig. 3. Speed-up comparison of the DGS implementations relative to GPU auctioning (GPU Auctioning / GPU DGS and GPU Auctioning / CPU DGS), versus problem size. [Plot omitted.]

In order to assess the improvement in performance of our GPU DGS solver with respect to other LSAP solvers, we compare it to an implementation of the auction algorithm published by Vasconcelos et al. [8], which is also implemented on GPUs using the CUDA language. Figure 3 shows the outcome of this analysis. As we can observe, the speed-up obtained can be as high as 400 times. Furthermore, we note that even the CPU version of the sequential DGS algorithm performs considerably better than the GPU auctioning solver, as much as 20 times faster for large problem sizes.

VII. CONCLUSION & FUTURE WORK

In this paper we presented the realization of a GPU-enabled LSAP solver based on the Deep Greedy Switching heuristic and implemented using the CUDA programming language. We detailed the process of implementing and enhancing the two main phases of the algorithm, Difference Evaluation and Switching, and we provided results showing the impact of each enhancement on performance. In particular, we showed how parallelizing parts of the solver with CUDA can lead to substantial speed-ups. We also suggested a modification to the Switching phase of the DGS algorithm which enables the solver to run entirely on the GPU. In the last part of the paper, we compared the performance of the final version of the solver to a pure C language DGS implementation and to an auction algorithm implementation on GPUs, concluding that the time needed for the DGS solver to reach an outcome is one order of magnitude lower than the "C" implementation for big scenarios, and on average three orders of magnitude lower than the GPU auction solver across almost all problem sizes. For future work, we would like to formally analyze the modified version of the DGS algorithm to theoretically assess its lower bound on optimality. We would also like to see our solver applied in different contexts and to explore possible applications involving LSAP that have yet to be investigated due to computational limitations.

REFERENCES

[1] A. Naiem and M. El-Beltagy, "Deep greedy switching: A fast and simple approach for linear assignment problems," in 7th International Conference of Numerical Analysis and Applied Mathematics, 2009.
[2] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov, "Parallel computing experiences with CUDA," Micro, IEEE, vol. 28, no. 4, pp. 13-27, 2008. [Online]. Available: http://dx.doi.org/10.1109/MM.2008.57
[3] Khronos Group. OpenCL: The open standard for parallel programming of heterogeneous systems. [Online]. Available: http://www.khronos.org/opencl/
[4] A. Bayoumi, M. Chu, Y. Hanafy, P. Harrell, and G. Refai-Ahmed, "Scientific and engineering computing using ATI Stream technology," Computing in Science and Engineering, vol. 11, pp. 92-97, 2009.
[5] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, pp. 83-97, 1955.
[6] D. Bertsekas, "The auction algorithm: A distributed relaxation method for the assignment problem," Annals of Operations Research, vol. 14, no. 1, pp. 105-123, 1988.
[7] L. Bus and P. Tvrdík, "Distributed memory auction algorithms for the linear assignment problem," in Proceedings of the 14th IASTED International Conference on Parallel and Distributed Computing and Systems, 2002, pp. 137-142.
[8] C. N. Vasconcelos and B. Rosenhahn, "Bipartite graph matching computation on GPU," in EMMCVPR, ser. Lecture Notes in Computer Science, D. Cremers, Y. Boykov, A. Blake, and F. R. Schmidt, Eds., vol. 5681. Springer, 2009, pp. 42-55. [Online]. Available: http://dblp.uni-trier.de/db/conf/emmcvpr/emmcvpr2009.html#VasconcelosR09

Paper D

Chapter 9

MyP2PWorld: Highly Reproducible Application-level Emulation of P2P Systems

Roverso Roberto, Al-Aggan Mohammed, Naiem Amgad, Dahlström Andreas, El-Ansary Sameh, El-Beltagy Mohammed, Franzen Nils and Haridi Seif

In Proceedings of the 2008 Second IEEE International Conference on Self-Adaptive and Self-Organizing Systems Workshops (SASOW), October 2008, Venice, Italy.

MyP2PWorld: Highly Reproducible Application-level Emulation of P2P Systems

Roberto Roverso¹,², Mohammed Al-Aggan¹, Amgad Naiem¹, Andreas Dahlstrom¹, Sameh El-Ansary¹,³, Mohammed El-Beltagy¹,⁴ & Seif Haridi²
¹ Peerialism Inc., Sweden, ² The Royal Institute of Tech. (KTH), Sweden, ³ Nile University, Egypt, ⁴ Cairo University, Egypt
{roberto,sameh}@peerialism.com

Abstract

In this paper, we describe an application-level emulator for P2P systems with a special focus on high reproducibility. We achieve reproducibility by taking control over the scheduling of concurrent events from the operating system. We accomplish that for inter- and intra-peer concurrency. The development of the system was driven by the need to enhance the testing process of an already-developed industrial product. Therefore, we were constrained by the architecture of the overlying application. However, we managed to provide highly transparent emulation by wrapping standard, widely-used networking and concurrency APIs. The resulting environment has proven to be useful in a production environment. At this stage, it has become general enough to be used in the testing process of applications other than the one it was created to test.

1 Introduction

The Case. MyP2PWorld is an application-level emulator with a focus on high reproducibility and simple integration with production code. The need for yet another emulation/simulation package arose from the fact that we needed to provide an environment for debugging, testing, and evaluation of an already-developed product. Thus MyP2PWorld had to conform to the application rather than the converse. Existing emulators either did not provide enough features for our needs or required major re-engineering of the existing product. Our approach was to adapt an expressive-enough Discrete Event Simulator (DES) that was initially used in the algorithm design phase, and to develop a translation layer that enables the production code to run on top of it.

The Product Under Test. Peerialism's product is a content distribution platform which performs audio and video streaming directly to the customer's home computer. It does so by building an ad-hoc overlay network between all hosts requesting a certain stream. This network is organized in such a way that the load of the content distribution is shared among all participating peers. The main entities in the system are:

• The Clients, which are the peers where Peerialism's client application has been installed, i.e. the customers' home computers. The installed application requests audio and video streams according to the input received from the customer. It then receives streams from other peers, delivers them to the local media player and streams them once more to other customers.

• The Source. It represents a host which has all the data of a certain stream. The Source itself is a peer: a peer becomes a source for a specific stream when it has received all the data of that same stream.

• The Tracker. It is the central coordinator of the system. It is not part of the overlay network, but it organizes it. It receives requests from the clients, forwards them to an optimization engine and issues directions to the peers once the request has been satisfied.

• The Optimization Engine. It receives the forwarded requests from the tracker and makes decisions according to the overall state of the network. In addition, it periodically redefines the structure of the overlay network to normalize the load of the delivery among the peers.

2 Our Requirements

Our requirements for a testing environment are:

• Single code base. This is a widely sought-after goal in P2P systems research, mainly due to the fact that the initial design of algorithms and parameter trade-offs are studied on a discrete-event simulator, which uses a totally separate code base from the production code. The need for a single code base has even more value in an industrial context, where the people who design and simulate the protocol (Researchers) are different from those who deliver the production-quality software (Developers). The main issue, while scientifically unprovable but anecdotally evident, is that when one designs a protocol and specifies it for others to implement, some intuitive or trial-and-error-based design decisions remain implicit. When handed to another person, the question "Why don't we do it the other way?" always becomes an issue, and there is no fast way to answer it except rapid prototyping, especially when it comes to non-obvious second-order effects. A single code base is a valuable catalyst for the rapid prototyping process.

• High reproducibility. We need to be able to execute the same experiment many times while preserving the same sequence of events and the same output every single time. This is mainly for debugging and inspection purposes rather than evaluation purposes.

• Ease of deployment. The ability to use the testing tool on every development and testing machine. That is, we want to avoid the slow cycle of develop-deploy-inspect using different development and deployment machines, especially if the deployment infrastructure needs to be shared among many developers.

• Minimal changes. We are testing software that was already developed; therefore we are constrained by the way it was built. That is, whatever tool we choose, we want it to have minimal impact (preferably none) on the present software architecture.

Having explained our requirements and our constraints, we will show in the next section that, despite the abundance of existing tools, we were not able to find one which simultaneously addresses our requirements and constraints.

3 Existing Tools

The testing of P2P systems production code (ideally the same code as the simulation code) has been the motivation behind many tools in the research community. We enumerate here some of these tools and explain their desirable properties as well as their shortcomings.

TestBeds. The prominent example in this category is the Planet-Lab testbed [9]. It is one of the most widely used tools and an indispensable one. It is probably as close

as one can get to a real P2P deployment. The main problem is the difficulty of debugging due to the lack of reproducibility. The problem is also exacerbated by the huge fluctuation of connectivity and computational resources. A testbed like Planet-Lab cannot be replaced by other tools; however, there is a strong need to complement it.

Kernel-Level Emulators. Examples include systems like Modelnet [10] and NCTUns [11]. The main idea is to use the kernel to intercept network traffic and manipulate it to emulate the conditions of a physical topology. Total transparency to the overlying application is one of the strongest advantages of this approach. The main disadvantages are: i) a rather involved deployment process and the need for a dedicated infrastructure; ii) while the emulated network behavior is repeatable in terms of delay, congestion, packet loss, etc., the fact that each peer lives in a separate (and most likely multi-threaded) OS process violates the high-reproducibility requirement.

Application-Level Emulators. Examples include systems like EmuSocket [1] and WiDS [8]. The main idea here is similar to kernel-level emulators. Interception of network events is accomplished by providing to the application an interface that resembles the standard network APIs. Thus, transparency is partial due to the need for slight modifications of the application code. The approach retains the lack-of-reproducibility problem for the same reasons as kernel-level emulators, namely the operating system's control over concurrency. However, deployment is much easier and does not need any dedicated infrastructure.

Replay Debugging. Such tools tackle the issue of reproducibility by recording the execution of all network and concurrency events. The recorded events can be replayed in a deterministic way, thus enabling complete reproducibility. The way of achieving this may vary. For instance, in Liblog [3], calls to libc are intercepted and recorded in a causality-preserving fashion. In that way, Liblog is a perfect complement to Planet-Lab for recording and replaying a specific test run; however, it cannot be used for replaying the same experiment under different network conditions after code changes. In [7], an internal Microsoft tool, real code is generated from a model written in a specification language. Executions of the generated code can then be recorded and replayed as in the case of Liblog. However, adopting it would require a complete rewrite of the application using the WiDS model, which is not a feasible solution in our case. Moreover, the main disadvantage of both is that they are restricted to the C/C++ programming language.

Figure 1. MyP2PWorld Architecture. [Diagram omitted; it depicts Scenario and Configuration Management driving multiple Application Instances, the Translation layer (Network, Concurrency, Context and Time Services, Reflection), and the DES with its Network Model (Delay & Bandwidth) and Timers.]

4 Our Approach

Our approach resolves the lack-of-reproducibility problem of application-level emulation. The main novelty is that we do not stop at emulating the network behavior, but go further into taking control over concurrency and system time. The main idea is that the same code can be executed in emulated and real mode. Real mode means that network events are sent to a real network, and concurrency and time are provided by the OS. Emulated mode means that network events, concurrency and time are all controlled by a discrete-event simulator. To explain how we realized our approach, we briefly outline how network communication, concurrency and system time are realized by the already-existing production code, and what is needed to create a corresponding emulation environment.

Network Communication. The application depends on Apache MINA [5], a high-performance Java networking framework. It provides an event-driven API on top of the Java non-blocking I/O libraries. It has many advantages, such as filter chains and the decoupling of marshalling formats from communication logic, among other things. It provides a threading model to control the number of threads dedicated to network I/O. Creating a corresponding emulation environment requires that we keep all application code that depends on the MINA interface intact while providing an alternative implementation. This is very similar in nature to the typical case of providing an emulated socket implementation, except that it is done on the level of MINA rather than on the level of TCP/UDP sockets.

Concurrency. Aside from the MINA threads, the application has a number of threads for the scheduling of periodic activities and timeouts. To emulate concurrency, we need to preserve the programming interface for creating and running threads while redirecting their scheduling to the discrete-event simulator.

System Time. During emulation mode, time is measured in simulated time units. However, the application code defines time quantities, such as the length of a timeout period, in real time units. Therefore, again for transparency's sake, care has to be taken to provide a proper correspondence between simulated and real time units.

5 System Architecture

MyP2PWorld is organized into four layers (a minimal sketch of the event-scheduling mechanism they implement together follows the list):

Discrete-Event Simulation (DES) Layer: Provides simulation time and the network model; it is not visible to the real application.

Translation Layer: Provides to the real application an interface that looks like the real network/OS services, but whose calls are routed to the DES instead of the corresponding network/OS services.

Real Application Under Test: Multiple instances of the real application, minimally modified to use the translation layer.

Scenario Management Layer: The main execution entry point. It takes as input a scenario file and configures all layers, e.g. forking and killing instances of the peers at specified times, configuring network behavior, etc.
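As announced above, the essence of the Translation and DES layers working together is that every timer, thread activation and network delivery becomes an event in a single ordered queue, making a run a pure function of the scenario. The following minimal Python sketch illustrates the mechanism only; the actual system wraps the Java/MINA APIs.

    import heapq, itertools

    class DES:
        def __init__(self):
            self.now = 0
            self._seq = itertools.count()   # tie-breaker: deterministic ordering
            self._queue = []

        def schedule(self, delay, action):
            heapq.heappush(self._queue, (self.now + delay, next(self._seq), action))

        def run(self):
            while self._queue:
                self.now, _, action = heapq.heappop(self._queue)
                action()   # all "concurrency" is executed one event at a time

    des = DES()
    des.schedule(10, lambda: print("timeout fired at t =", des.now))
    des.schedule(3, lambda: des.schedule(7, lambda: print("delivery at t =", des.now)))
    des.run()   # identical output on every run: timeout first, then delivery at t = 10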

5.1

Discrete-Event Simulator

This layer could be (and in fact has been) used on its own as a traditional simulator. Every simulated node has access to a timer abstraction with which it can schedule events in the future. As for the network model, the most important feature is the bandwidth model, because it is crucial for studying content-distribution protocols. Our work has mainly been inspired by BitTorrent simulators such as [2] and [12]. However, we have worked on providing a compact, explicitly-specified model with an efficient implementation. We will not delve into the traditional details of the DES; instead, we briefly describe our bandwidth model below, after a short sketch of the simulator core.
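To make the timer abstraction concrete, the following is a minimal sketch of a DES core; SimpleDES, Event and their methods are illustrative names of ours, not the actual implementation:

    import java.util.PriorityQueue;

    // Hypothetical sketch of a DES core: a simulated clock and an event
    // queue ordered by firing time (all names here are illustrative).
    class SimpleDES {
        private long now = 0; // simulated time; one unit models a millisecond
        private final PriorityQueue<Event> queue = new PriorityQueue<Event>();

        // Timer abstraction: schedule an action delayMillis into the future.
        void schedule(Runnable action, long delayMillis) {
            queue.add(new Event(now + delayMillis, action));
        }

        void run() {
            while (!queue.isEmpty()) {
                Event e = queue.poll();
                now = e.time;   // advance simulated time to the event
                e.action.run(); // fire it (it may schedule further events)
            }
        }

        static class Event implements Comparable<Event> {
            final long time; final Runnable action;
            Event(long time, Runnable action) { this.time = time; this.action = action; }
            public int compareTo(Event o) {
                return time < o.time ? -1 : (time > o.time ? 1 : 0);
            }
        }
    }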

5.1.1 Bandwidth Model

Given a peer, we assume that its upload and download bandwidths are independent. Consequently, we logically split the peer into two separate entities: a sender S, which controls the upload bandwidth, and a receiver R, which controls the download bandwidth. Once the sender starts sending a block of data, the network should try to send the block at the maximum possible speed between the two parties. While the block is in transit, we say that S and R have an ongoing "transfer". Naturally, the transfer of a certain block is affected by other transfers taking place between S or R and any third party. The main quantities needed for the description of the model are: β, the maximum bandwidth of a party; α, the available (free) bandwidth of a party; and τ, the set of ongoing transfers of a party.

Bandwidth allocation. Each time a block is sent, i.e. a new transfer t is started, the amount of bandwidth bw(t) given to the new transfer is:

    bw(t) = min{ max(α_S, β_S/(|τ_S|+1)), max(α_R, β_R/(|τ_R|+1)) }    (1)

Having determined bw(t), allocating it might require the "squeezing" of ongoing transfers at one side, at both sides, or at neither. At a given side where squeezing is needed, a certain amount of bandwidth π = bw(t) − α must be collectively deducted from the ongoing transfers to make room for the new transfer t. A transfer is deducted from only if it is using more than its fair share f = β/(|τ| + 1). Note that f = bw(t) on at least one side, but this might not hold for the other side. Let τ′ = {x ∈ τ : bw(x) > f} be the set of transfers that are taking more than their fair share; we deduct only from transfers in τ′. To decide how to collectively deduct π from the members of τ′, let e_x = bw(x) − f, for all x ∈ τ′, be the extra amount of bandwidth that a transfer x is taking beyond its fair share. The new bandwidth bw′(x) of a transfer x after deduction is then bw′(x) = bw(x) − (e_x / Σ_{y∈τ′} e_y)·π. That is, transfers are squeezed in proportion to their extra bandwidth, which guarantees that no transfer is squeezed below its fair share.

Needless to say, when the bandwidth of a transfer is squeezed, the delivery time of the transferred block is rescheduled to a later point in time, proportional to the amount of squeezed bandwidth and to the time the block has already been in transit before the squeeze occurred. The effect of allocating a new transfer goes beyond the two involved parties, because a transitive chain of readjustments is triggered: a squeeze of a transfer at the sender or the receiver frees some bandwidth at some third party. Consequently, the third party experiences an effect similar to a transfer deallocation, because bandwidth was freed at its end. This in turn boosts some of its ongoing transfers, which affects a fourth party, and so forth. This process can take some iterations to converge. Ultimately, all bandwidth that can be utilized (respecting the nodes' configurations) will be allocated. However, the process can suffer from the fact that later adjustments become very small quantities; accepting a threshold of as little as 2% unutilized bandwidth usually results in quick convergence.
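To make the squeezing step concrete, the following is a minimal sketch of one side's proportional deduction once Equation (1) has determined bw(t); the Party and Transfer classes and all field names are ours, not the simulator's actual code:

    import java.util.List;

    // Hypothetical sketch of one side's proportional squeeze; bw(t) is
    // assumed to have been computed from Equation (1) already.
    class Party {
        double beta;              // maximum bandwidth of this party
        double alpha;             // currently available (free) bandwidth
        List<Transfer> transfers; // ongoing transfers at this party

        // Deduct pi = bw(t) - alpha from transfers above their fair share.
        void squeeze(double bwT) {
            double pi = bwT - alpha;
            if (pi <= 0) return;                          // enough free bandwidth
            double fair = beta / (transfers.size() + 1);  // fair share f
            double extraSum = 0;
            for (Transfer x : transfers)                  // total excess over f
                if (x.bw > fair) extraSum += x.bw - fair;
            for (Transfer x : transfers) {                // squeeze proportionally
                if (x.bw > fair) {
                    double ex = x.bw - fair;
                    x.setBandwidth(x.bw - (ex / extraSum) * pi);
                }
            }
            alpha = 0; // the freed bandwidth is consumed by the new transfer
        }
    }

    class Transfer {
        double bw;
        void setBandwidth(double newBw) { bw = newBw; /* reschedule delivery event */ }
    }

Since the model guarantees π ≤ Σ e_x when squeezing is needed, no transfer drops below its fair share in this sketch.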

5.2 Translation Layer

This layer is actually the core layer of MyP2PWorld and provides three core functionalities:

5.2.1 Network Services

Apache MINA is situated between the application and the Java NIO network APIs and exposes to the application an event-driven interface. We preserved this style of interaction with the application and redirected all interactions to/from the real network, which initially passed through Java NIO, to the DES layer instead. Listing 1 shows the skeleton of a minimal TCP server. As we can see, the changes are limited to modifying the import line from the original to the modified version of the MINA APIs.

Listing 1. MINA TCP server with the minimal changes that enable switching between real and emulated modes.

    // import org.apache.mina.transport.nio.SocketAcceptor;  // original import
    import org.apache.mina.common.IoAcceptor;
    import com.peerialism.simpipe.SocketAcceptor;            // emulated drop-in
    import java.net.InetSocketAddress;
    import java.net.SocketAddress;
    ....
    SocketAddress serverAddress = new InetSocketAddress("localhost", 1234);
    IoAcceptor acceptor = new SocketAcceptor();
    acceptor.bind(serverAddress, new IoHandlerAdapter() {
        public void messageReceived(IoSession session, Object message) { ... }
        public void messageSent(IoSession session, Object message) { ... }
        public void sessionClosed(IoSession session) { ... }
        public void sessionCreated(IoSession session) { ... }
        public void sessionIdle(IoSession session, IdleStatus status) { ... }
        ...
    });
    ....
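For illustration, the emulated acceptor could be realized roughly as sketched below; everything except the MINA types is an assumed name (DES.getInstance(), registerEndpoint()), not the actual simpipe implementation:

    package com.peerialism.simpipe;

    import java.net.SocketAddress;
    import org.apache.mina.common.IoHandler;

    // Hypothetical sketch: the drop-in acceptor registers endpoints with the
    // DES instead of opening OS-level sockets. The remaining IoAcceptor
    // methods are omitted for brevity.
    public class SocketAcceptor {
        private final DES des = DES.getInstance(); // assumed simulator handle

        public void bind(SocketAddress address, IoHandler handler) {
            // The DES later delivers sessionCreated/messageReceived callbacks
            // as simulation events in the owning node's context.
            des.registerEndpoint(address, handler);
        }

        public void unbind(SocketAddress address) {
            des.unregisterEndpoint(address);
        }
    }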

5.2.2 Concurrency Services

The main issue with taking control over concurrency is eliminating all OS threads while in emulation mode, without changing the production code of the already-developed application. The approach to support this requirement is to have all concurrent events as atomic non-blocking actions. This style has been supported and advocated in Java since version 1.5 through futures and executors. Futures are abstractions representing the results of asynchronous operations. For instance, instead of writing a periodic activity as a loop in a blocking thread, one uses a future and schedules its execution through an executor after a certain delay. The executor itself can incorporate a single thread or a thread pool. This means that if one has n periodic activities, instead of having n threads one can use a single executor incorporating one or more threads. We have wrapped the Java Future and Executor classes to provide support for transparently switching between real and emulation modes. Listing 2 outlines this programming pattern and shows the minimal change needed: substituting the original import lines with import lines that load our wrapped future and executor. Having said that, we have to report that not all developers had adopted this style in the production code, so in fact a bit of refactoring was necessary. However, the style was embraced by the development team and regarded as an improvement rather than an unnecessary change made just to support emulation. It resulted in cleaner code and simplified, among other things, the process of tuning the number of threads dedicated to periodic activities and timeouts.

Listing 2. Wrapping of the Future, Executor and System Time classes.

    import java.lang.Runnable;
    // import java.lang.System;                                  // original
    import com.peerialism.SimulableSystem;                       // wrapped system time
    // import java.util.concurrent.ScheduledFuture;              // original
    import com.peerialism.ScheduledFuture;                       // wrapped future
    // import java.util.concurrent.ScheduledThreadPoolExecutor;  // original
    import com.peerialism.ScheduledExecutor;                     // wrapped executor
    ...
    class SomeActivity implements Runnable {
        public void run() {
            // Activity
        }
    }
    ....
    ScheduledExecutor executor = new ScheduledExecutor();
    SomeActivity activity = new SomeActivity();
    ....
    // Periodic activity
    long delay, period;
    ScheduledFuture periodicFuture = executor.scheduleAtFixedRate(
            activity, delay, period, TimeUnit.MILLISECONDS);
    // Timeout
    ScheduledFuture timeoutFuture = executor.schedule(
            activity, delay, TimeUnit.MILLISECONDS);
    ....
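To illustrate how such a wrapper can switch between modes, the following is a minimal sketch; the mode flag and the DES scheduling call are assumed names, not the actual Peerialism API:

    import java.util.concurrent.ScheduledThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch of the wrapped executor: in real mode it delegates
    // to a standard executor; in emulated mode it enqueues a DES event.
    public class ScheduledExecutor {
        private final ScheduledThreadPoolExecutor real =
                new ScheduledThreadPoolExecutor(1);

        public ScheduledFuture schedule(Runnable task, long delay, TimeUnit unit) {
            if (Mode.isEmulated()) { // assumed global mode flag
                // One simulated time unit models a millisecond, so real-time
                // delays map directly onto the simulated clock.
                return DES.getInstance().scheduleEvent(task, unit.toMillis(delay));
            }
            return new ScheduledFuture(real.schedule(task, delay, unit));
        }
    }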

5.2.3 System Time Services

As mentioned earlier, in emulation mode events happen on the simulated time scale, where a simulated time unit models a millisecond. We have wrapped System.currentTimeMillis() to provide transparent support for working with system time. Listing 2 shows that the specification of time units is transparent to the mode of operation, again by importing the wrapped libraries.
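For illustration, the wrapped time class (SimulableSystem from Listing 2) could look roughly as follows; the mode flag and DES accessor are assumed names:

    // Hypothetical sketch: return the DES clock in emulated mode and the
    // OS clock otherwise (Mode and DES names are ours).
    public final class SimulableSystem {
        private SimulableSystem() {}

        public static long currentTimeMillis() {
            if (Mode.isEmulated()) {
                return DES.getInstance().currentSimulatedTimeMillis();
            }
            return System.currentTimeMillis();
        }
    }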

5.2.4 Context Services

Unfortunately, controlling the threads inside the application is not sufficient for providing high reproducibility. The main problem is that our application (like most other P2P applications) was not designed for many nodes to run in the same OS process; global data structures like singletons and loggers are examples of major issues in this category. For that reason, we introduced to the DES layer the concept of a "context": when a node is created, it has to request from the DES layer the creation of a context labeled by the node's unique id. When the time comes for an event to be fired, the scheduler switches to the context of the executing node, and we expose to the application the service of querying the emulation layer about the current context. Using the context services, the singletons and loggers of all nodes were able to coexist in the same OS process, as described below. A singleton, in real mode, stores one instance of an object. In emulated mode, singletons were made to store sets of objects indexed by context ids: every time a singleton is asked for an instance, it queries the scheduler for the context in which it is running and returns the corresponding instance. This was a quick solution which does not satisfy the transparency requirement; we are working on a better solution using Java class loaders. For logging, the product was using the Slf4j[6] package, whose purpose is to provide a standard logging interface to the application, behind which different implementations may be used. We produced our own context-aware implementation; therefore, this was a totally transparent change from the application's point of view. Other minor issues, like port numbers and file locations, were solved using configuration parameters.
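For illustration, a context-aware singleton along these lines could look as follows; DES.getInstance() and currentContextId() are assumed names for the context query service:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: one instance per simulated node, indexed by the
    // context id obtained from the scheduler.
    public class SomeSingleton {
        private static final Map<String, SomeSingleton> instances =
                new HashMap<String, SomeSingleton>();

        private SomeSingleton() {}

        public static synchronized SomeSingleton getInstance() {
            String ctx = DES.getInstance().currentContextId();
            SomeSingleton instance = instances.get(ctx);
            if (instance == null) {
                instance = new SomeSingleton();
                instances.put(ctx, instance);
            }
            return instance;
        }
    }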

6 Related Work

As we explained in Section 3, the main approach in application-level emulation is to focus on taking control over network communication, leaving concurrency and system time in the hands of the operating system. The exception is the work on RealPeer[4], which independently raised the need to take control over concurrency and system time. The difference between our work and RealPeer is that we try to achieve this goal by wrapping APIs that are either standard or already widely used by developers, while RealPeer tries to achieve the same goal by requiring the application to use its framework, which has been designed to be as comprehensive and generally applicable as possible. One can argue that both approaches have their merits, depending on the conditions of each project.

7 Conclusion & Future Work

In this work, we have provided a case study summarizing our experience with improving the testing and evaluation process of an already-developed P2P application. Our requirements for such an environment were: minimal changes to the production code, ease of deployment, and high reproducibility. Inspecting the state of the art, we could not find a tool that simultaneously satisfies all these requirements; therefore, we created our own. Our approach was to adopt application-level emulation while ensuring high reproducibility by controlling concurrency and system time. The resulting environment, entitled "MyP2PWorld", has been used for a number of months for testing Peerialism's P2P live-streaming solution. MyP2PWorld has resulted in large improvements in product quality and bug-discovery rate and has become an integral part of the testing process. While we initially developed MyP2PWorld as a tool to complement the testing and evaluation process of a particular product at Peerialism, we are now working to provide it as an open-source tool in its own right that can be used in other projects. We are currently in the process of adding the following features: achieving complete transparency for the context services, improving performance by adding a parallel scheduler, augmenting our bandwidth model to provide enhanced behavior for UDP communication, and adding recording and selective-replay features.

8 Acknowledgments

We would like to thank Nils Franzen and Magnus Hedbeck, who were the primary users of our work and whose valuable input was indispensable for making MyP2PWorld a usable tool in a production environment.

References

[1] Marco Avvenuti and Alessio Vecchio. Application-level network emulation: the EmuSocket toolkit. Journal of Network and Computer Applications, 29(4):343–360, 2006.

[2] Ashwin R. Bharambe, Cormac Herley, and Venkata N. Padmanabhan. Analyzing and improving a BitTorrent network's performance mechanisms. In INFOCOM. IEEE, 2006.

[3] Dennis Geels, Gautam Altekar, Scott Shenker, and Ion Stoica. Replay debugging for distributed applications. In ATEC '06: Proceedings of the USENIX '06 Annual Technical Conference, pages 27–27, Berkeley, CA, USA, 2006. USENIX Association.

[4] Dieter Hildebrandt, Ludger Bischofs, and Wilhelm Hasselbring. RealPeer – a framework for simulation-based development of peer-to-peer systems. In PDP '07: Proceedings of the 15th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, pages 490–497, Washington, DC, USA, 2007. IEEE Computer Society.

[5] The Apache MINA Java Networking Library. http://mina.apache.org.

[6] The Slf4j Java Logging Library. http://www.slf4j.org.

[7] Shiding Lin, Aimin Pan, Zheng Zhang, Rui Guo, and Zhenyu Guo. WiDS: an integrated toolkit for distributed system development. In HOTOS '05: Proceedings of the 10th Conference on Hot Topics in Operating Systems, pages 17–17, Berkeley, CA, USA, 2005. USENIX Association.

[8] Kazuyuki Shudo, Yoshio Tanaka, and Satoshi Sekiguchi. Overlay Weaver: An overlay construction toolkit. Computer Communications, 31(2):402–412, 2008.

[9] The Planet-Lab Testbed. http://www.planet-lab.org.

[10] Amin Vahdat, Ken Yocum, Kevin Walsh, Priya Mahadevan, Dejan Kostic, Jeffrey S. Chase, and David Becker. Scalability and accuracy in a large-scale network emulator. In OSDI, 2002.

[11] S. Y. Wang, C. L. Chou, and C. C. Lin. The design and implementation of the NCTUns network simulation engine. Simulation Modelling Practice and Theory, 15(1):57–81, 2007.

[12] Weishuai Yang and Nael B. Abu-Ghazaleh. GPS: A general peer-to-peer simulator and its use for modeling BitTorrent. In MASCOTS, pages 425–434. IEEE Computer Society, 2005.