Lehrstuhl für Informatik 6 Datenmanagement

Quality-of-Service-Aware Configuration of Distributed Publish-Subscribe Systems
A Massive Multiuser Virtual Environment Perspective

Thomas Fischer

Dissertation

Quality-of-Service-Aware Configuration of Distributed Publish-Subscribe Systems
A Massive Multiuser Virtual Environment Perspective

Dienstgütebezogene Konfiguration von verteilten Publish-Subscribe-Systemen
Eine Perspektive für virtuelle Welten

Der Technischen Fakultät der Friedrich-Alexander Universität Erlangen-Nürnberg zur Erlangung des Doktorgrades DOKTOR-INGENIEUR (Dr.-Ing.) vorgelegt von

Klaus Peter Thomas Fischer aus Nürnberg

Als Dissertation genehmigt von der Technischen Fakultät der Friedrich-Alexander Universität Erlangen-Nürnberg
Tag der mündlichen Prüfung: 17.06.2014

Vorsitzende des Promotionsorgans: Prof. Dr.-Ing. habil. Marion Merklein
Gutachter:

Prof. Dr.-Ing. habil. Richard Lenz
Jun.-Prof. Dr. Peter Fischer
Prof. Dr. Dirk Riehle, MBA

Kurzfassung

Mit der Verbreitung von hochskalierbaren Anwendungen wie Massively Multiplayer Online Games (MMOG) entstanden einige einzigartige Herausforderungen bezüglich der Skalierbarkeit und Wartbarkeit von komplexen Systemen. Publish-Subscribe-Systeme bieten ein skalierbares und lose gekoppeltes Entwurfsparadigma, um diese Herausforderungen anzugehen. Doch die Vielzahl von technischen Ansätzen mit ihren unterschiedlichen Optimierungszielen auf der einen Seite und die Vielfalt der Semantik und Dienstgüteanforderungen verschiedener Anwendungen auf der anderen Seite führen zu einer Kluft, die nicht einfach durch den typischen Anwendungsentwickler überbrückt werden kann. Diese Arbeit untersucht eine Methodologie für die Konfiguration von Publish-Subscribe-Systemen, welche diese Kluft mit Hilfe eines automatisierten Workflows schließt. Hierbei übersetzt dieser die Anforderungen und Semantik einer Anwendung, die ein Entwickler in seiner domänenspezifischen Terminologie formuliert, in die optimierte Konfiguration einer Publish-Subscribe-Middleware. Dazu beschreibt diese Arbeit ein Framework für konfigurierbare Middleware mit dem Fokus auf Konfigurierbarkeit zur Entwurfszeit. Außerdem wird ein flexibles und erweiterbares Modell für die domänenspezifische Konfiguration verteilter ereignisbasierter Systeme vorgeschlagen, begleitet von dem entsprechenden Workflow für die Bereitstellung einer maßgeschneiderten Publish-Subscribe-Middleware. Kombiniert ergeben diese drei Teile eine ganzheitliche, entwicklerfreundliche Methodologie für die dienstgütebezogene Konfiguration von verteilten Publish-Subscribe-Systemen. Die Arbeit schließt mit einer Bewertung der Methodologie in Form einer Diskussion über die erreichten Möglichkeiten und einer quantitativen Analyse ihrer Leistung und Qualität.

Abstract

With the rise of internet-scale applications like massively multiuser virtual environments, some unique challenges were introduced regarding the scalability and maintainability of complex systems. Distributed event-based systems in general, and publish-subscribe systems in particular, offer a scalable and loosely coupled paradigm to address these challenges. However, the large variety of existing technical solutions, with their different optimization targets on the one hand, and the variety of semantics and quality-of-service requirements of different applications on the other hand, introduces a gap that is not easily bridged by the application developer. This thesis explores a methodology for the configuration of publish-subscribe systems that closes this gap by providing an automated workflow that translates the requirements and semantics a developer formulates in his domain-specific terminology into a suitable configuration of a publish-subscribe middleware. Hereby, this work covers the design of such a configurable middleware with a focus on design-time configuration, as this configuration method promises to introduce the least overhead. Moreover, a flexible and extensible model for the domain-specific configuration of distributed event-based systems is suggested, accompanied by the corresponding workflow for the provisioning of a customized publish-subscribe middleware. Combined, these three parts provide a holistic, developer-friendly methodology for the quality-of-service-aware configuration of publish-subscribe systems. The thesis concludes with an evaluation of the methodology in the form of a discussion of its capabilities and a quantitative analysis of its performance and quality.

Acknowledgements

First and foremost, I offer my sincerest gratitude to Prof. Dr.-Ing. Richard Lenz, whose invaluable help and advice guided me through this whole PhD project. He also gave me the necessary freedom to pursue my personal research interests in this thesis, and for that I am most grateful. I wish to thank my second reviewer, Prof. Dr. Peter Fischer, for his valuable feedback during the final stages of this project. Prof. Dr. Dirk Riehle deserves my gratitude for serving as third reviewer. I also want to express my thankfulness to all the students who contributed to this thesis. Two students deserve special appreciation: Daniel Bonrath and Michael Baer. They will not only continue the Tri6 project in the future, they also unstintingly squashed any bugs that came up during the implementation phase of the prototype. Special gratitude also goes to my current colleagues Johannes Held and Andreas M. Wahl for the implementation of the library's core and the configuration component during their master theses. In addition, they were always valuable discussion partners. I am also very grateful for the always sympathetic ear of my former colleagues Christoph Neumann, Michael Daum, and especially my former master thesis advisor Florian Irmert, who initially introduced me to research. The event processing group of the chair for data management was always a great place for scientific discussion. Representative of the whole group, I want to mention Niko Pollner and Frank Lauterwald, who were always a great asset when tackling tricky problems. Last but not least, I wish to thank my family and friends for their endless support, especially my wife Eva, who always endured my lack of time, the effects of sleep deprivation, and my mental absence during the formation of this thesis.

Publications

Some parts of this thesis are based on previous publications. Chapters 3 and 10 are influenced by joint work with Michael Daum, Florian Irmert, Christoph Neumann, and Richard Lenz published in [FL10a, FDI+10, FL10b]. Chapters 12 and 13 are partially based on publications with Johannes Held, Frank Lauterwald, and Richard Lenz [FHLL11, FHL11]. Section 15.1 is based on joint work with Andreas M. Wahl and Richard Lenz [WFL14]. Additional publications by the author have not contributed to this thesis [IFMW08, Fis08, ILB+08, IFLMW08, IFLMW09, FHM+09, NFL10].

Table of Contents

Acknowledgements
Publications

Part I Prologue

1 Introduction
  1.1 Motivation
    1.1.1 The Quest for Holy Scale
    1.1.2 The Speed of Stock Markets
    1.1.3 The Lack of Resources in Sensor Networks
    1.1.4 Facets of Big Data
  1.2 Open Research Questions
  1.3 Focus and Contribution
  1.4 Structure

2 Methods
  2.1 Case Study
  2.2 Literature Analysis
  2.3 Proof of Concept
    2.3.1 Design-Time Configuration
    2.3.2 Multidimensional Classification
    2.3.3 Non-Parametric Cost Estimation
  2.4 Discrete-Event Simulation

3 Scenario
  3.1 Application Model
  3.2 Tri6
  3.3 Use Cases
    3.3.1 Movement
    3.3.2 Collision
    3.3.3 Chat
    3.3.4 Match Coordination
  3.4 Summary

Part II State of the Art

4 Overlay Networks
  4.1 Unstructured Overlays
  4.2 Structured Overlays
    4.2.1 Chord
    4.2.2 Pastry
    4.2.3 CAN
  4.3 Network Characteristics
    4.3.1 Network Topology Models
    4.3.2 Network Metrics

5 Event-based Systems
  5.1 Data Models
  5.2 Filter Mechanisms
    5.2.1 Channels
    5.2.2 Topic-Based Filter
    5.2.3 Content-Based Filter
    5.2.4 Type-Based Filter
    5.2.5 Advanced Filter Concepts
  5.3 Routing
    5.3.1 Broker-Based Routing
    5.3.2 Hierarchical and Rendezvous-Based Routing
    5.3.3 Semantic Routing Concepts
    5.3.4 Application-Layer Multicast
  5.4 Reliability
  5.5 Quality-of-Service
    5.5.1 Latency
    5.5.2 Throughput
    5.5.3 Delivery
    5.5.4 Order
    5.5.5 Timeliness
    5.5.6 Security
  5.6 Reconfiguration and Adaptability
  5.7 Existing Publish-Subscribe Middleware

6 CAP Theorem
  6.1 Partition Tolerance
  6.2 Consistency
  6.3 Consistency vs. Availability Tradeoffs
  6.4 Discussion

Part III QoS-Aware Configuration of Publish-Subscribe Systems

7 Requirements and Limitations
  7.1 Initial Assumptions
  7.2 Requirements for a Design-Time Configurable Framework
  7.3 Requirements for a QoS-aware Configuration Description
  7.4 Requirements for an Automated Design-time Configuration

8 Basic Reference Architecture
  8.1 Application Layer
  8.2 Overlay Network Layer
  8.3 Notification Service Layer
  8.4 Summary

9 A Design-Time Configurable Publish-Subscribe Framework
  9.1 Channels
  9.2 Strategies
    9.2.1 Routing
    9.2.2 Filter
    9.2.3 Partition
    9.2.4 Rendezvous
    9.2.5 Order
    9.2.6 Delivery
    9.2.7 Timeliness
  9.3 Interaction Model
  9.4 Configuration
  9.5 Summary

10 QoS-Aware Configuration
  10.1 Event Semantics
    10.1.1 Dimensions
    10.1.2 Examples of event semantics
  10.2 Configuration Model
    10.2.1 Multidimensional Application Classification
    10.2.2 System Model
    10.2.3 Class Spaces
    10.2.4 Artifacts
  10.3 Summary

11 Automating Design-Time Configuration
  11.1 Problem Statement
  11.2 Basic Solution Framework
  11.3 Naive Workflow
  11.4 Optimized Workflow
    11.4.1 Space-Filling Experiments and Meta-Modeling
    11.4.2 Generate Meta-Models
  11.5 Summary

Part IV M2etis: Prototypic Implementation

12 M2etis: Architecture

13 M2etis: Library
  13.1 Overlay Network Layer
  13.2 Notification Service Layer
  13.3 Processing Model
  13.4 Available Strategies

14 M2etis: Simulator

15 M2etis: Configurator
  15.1 MATINEE: M2etis QoS-aware Semantics Modeling Language
  15.2 MAESTRO: M2etis Adaptive System Configurator

Part V Evaluation

16 Comparison of Capabilities
  16.1 Validation of Requirements
  16.2 Classification of m2etis

17 Quantitative Evaluation
  17.1 Evaluation Setup
  17.2 Resource Consumption of M2etis
  17.3 Limitations of the Simulation Model
  17.4 Simulated Scalability of Selected Configurations
    17.4.1 One-to-many distribution
    17.4.2 Many-to-many distribution
    17.4.3 Impact of Strategies on QoS Metrics
  17.5 Required Effort for Configuration
  17.6 Performance Impact of Configurability
    17.6.1 Design-Time Configuration vs. Run-time Configuration
    17.6.2 Design-Time Configuration vs. Non-Configurable Solutions
  17.7 Quality of configuration decisions
  17.8 Expenditure of Time for Configuration Automation

18 Discussion

Part VI Epilogue

19 Further Work
  19.1 Framework Enhancements
    19.1.1 Semantic-Aware Filter and Routing
    19.1.2 Security Aspects
    19.1.3 Software Development
  19.2 Methodology Enhancements
    19.2.1 Content-aware decision model
    19.2.2 Parameter Optimization and Adaptiveness
    19.2.3 Refinement of the System Model
  19.3 Mobile Platforms

20 Conclusion

Bibliography
Symbols
Acronyms
Glossary
List of Figures
List of Tables

Part I Prologue


1 | Introduction

“One of the most difficult tasks men can perform, however much others may despise it, is the invention of good games.”
Carl Gustav Jung, 1875-1961, Psychologist

Computer games have become the latest addendum to the great history of games and have ever since been a fast-growing part of our social culture. They are now fueling a whole branch of the computer industry with a global market revenue of $65 billion in 2011, up from $62.7 billion in 2010. By 2016 the revenue is expected to be about $81 billion [DFC12]. This large economic impact has changed the way this industry works. From garage startups with a few people hacking games, it has evolved into today's large software studios with hundreds of artists and computer scientists. The budget of so-called next-gen[1] titles is often larger than a Hollywood blockbuster movie's. The average development costs of a multi-platform next-gen game are between $18 million and $28 million, but currently peak at around $50 million [Cro10]. Massively Multiplayer Online Games (MMOGs) like the renowned World of Warcraft (WoW)[2] are one of the most challenging genres the gaming industry currently offers. Such games scale beyond classical multiplayer games and enable thousands of players to populate a common virtual world. However, direct interaction between players

[1] The video game industry works in iterations, driven by the hardware the games run on. In 2012 we are in the seventh generation of hardware, on the brink of the eighth iteration, which began with personal computers in the 1970s. The term next-gen refers to games running on the next generation of hardware, which is not yet commonly adopted by the industry and conveniently supported by tools.
[2] World of Warcraft is the most successful MMOG ever produced until 2012. It had a maximum of over 12 million paying subscribers in 2011 (cf. Section 3).


is still limited to the lower hundreds. To include serious games, virtual spaces, and simulations in the discussion, Massively Multiuser Virtual Environments (MMVEs) is the more general term, of which MMOGs are a specialization.

The underlying architectures encounter many advanced and partly unique challenges. They require a scaling infrastructure to cope with the enormous number of concurrent clients. This scalability must be ensured while maintaining some very restrictive quality-of-service requirements in order to enable a fluent, responsive illusion of a virtual world. Current MMVEs are mainly developed as centralized client-server systems with a request-response communication paradigm. To scale such architectures, cluster or grid approaches are employed, leading to large data-centers operating one single virtual world. For example, Eve Online, an MMOG in a science-fiction setting, requires around 195 servers to provide a virtual world for 300,000 players [FCBS08].

Besides the huge economic impact and the challenges in engineering, games are more and more recognized as an art form, comparable to the emergence of radio, movies, and television, as Poole discussed in [Poo00]. He states that computer games are an art just like pictures, movies, and television; they are merely the newest type of art, which has always been the source of heated debates about good and evil. But beyond these discussions about the social and cultural influence of games, they undoubtedly provide huge potential for research in computer science, as many authors suggest. For example, Demers et al. [DGK+09] and White et al. [WKG+07], both working in the same group at Cornell University, identify various challenges in the field of databases. These range from consistency issues and indexing challenges to a data-driven design methodology, covering a large portion of current database research challenges. Zhao [Zha11] focuses on the graphical aspects and the complexity of virtual worlds.
Anderson [AECM08] makes the case for best practices and middleware solutions for game development and identifies the corresponding challenges. However, despite their different foci, one challenge is prominently mentioned by all authors: the strict performance requirements introduced by the gaming domain. From databases to software engineering, computer games affect many research fields and even drive some of them: research in computer graphics, for instance, is mainly propelled by the gaming industry. The hardware industry for personal computers is influenced by gaming as well; nobody would claim that a bleeding-edge quad-core CPU is required just for office applications.

Research should not only try to answer unique questions that arise during the development of games. Computer games can also be used to illustrate already existing research challenges. For example, they can serve as a scenario to draw use cases from in a large variety of research fields. This could help to reach a broader audience and spark students' interest in joining a certain research field. In the field of data management and distributed systems, only a handful of researchers exploit the opportunity to instrument gaming for their research. In contrast, the industry has already learned the value of games, or at least of game elements, in day-to-day life. Nowadays, you can earn badges for repeatedly posting in a forum, auctioning items on eBay feels like a game, and even applying for a loan feels like some kind of game: you use sliders to play with the sum and the rate of a loan, while battling anonymous opponents on eBay for an item of interest. This exploitation of game elements has defined the term gamification.

“Gamification refers to the use of design elements (rather than full-fledged games), characteristic for games, in non-game contexts.” [DDKN11]

Even in research areas where a full-fledged game might not be the application of choice, gamification could, for example, enhance the acceptance or return rate of a survey, or improve the usage statistics of an information system in general. If we change the viewpoint from gaming as a motor for research to how research contributes to application domains, one architectural pattern has influenced the design of distributed gaming more than others: distributed publish-subscribe rose to significance as an alternative to the request-response communication paradigm for distributed applications in general, as surveyed in [EFGK03, LP03]. Specialized MMVE architectures based on the publish-subscribe paradigm are suggested by Bharambe with Mercury [BRS02] and Donnybrook [BDL+08]. Knutsson [KLXH04] also suggests an architecture using a popular peer-to-peer overlay network. These architectures enable a loosely coupled design with a data-driven view on communication.
A publisher creates events and sends them via an event service. A subscriber is only interested in certain data and subscribes to the event service, stating its interest. The subscriber is notified of all events it has subscribed to. A publisher does not know its subscribers and vice versa. A subscriber need not even be online while the publisher sends an event. Moreover, neither the subscriber nor the publisher experiences blocking behavior when creating or consuming events. This describes the three properties each publish-subscribe system provides according to Eugster [EFGK03]: space decoupling, time decoupling, and asynchrony[1].

In an application using publish-subscribe, the participants therefore merely distribute data to each other. They cannot request a certain piece of data and wait for the response. This enables a variety of optimizations in terms of scalability that are not possible in request-response architectures (e.g. loose coupling or one-to-many communication). Even though distributed publish-subscribe is well researched in many contexts, challenges arising in the context of MMVEs have not been completely answered. Especially maintaining scalability while ensuring low latency and other Quality-of-Service (QoS) requirements still poses an interesting challenge.

This thesis focuses on the publish-subscribe communication paradigm and the quality of service it can provide in large-scale environments. In an application domain like MMVEs, many different event-types are required for communication between the clients and the servers. Event-types range from position events and chat messages to meta-events like login or server-change events. They all require different data fields, target different receiver groups, and occur at different times and frequencies. Moreover, the semantics of each event-type demands different guarantees regarding its delivery, order, etc. The requirements of next-gen MMVEs are even more challenging, as they introduce more and more dynamic components in the gameplay that lead to even more different event-types. Current distributed publish-subscribe research exploits the different semantics of these event-types only on an elementary level. That means there are approaches that consider and exploit some QoS metrics, but only very few allow for the adaptability of the system with respect to those metrics.
This is the essence this thesis discusses: How can event semantics be exploited for adaptation in order to enhance the scalability of a distributed publish-subscribe system? This challenge is addressed in this work by exploiting two assumptions: First, a homogeneous system may be configured at design-time in order to instrument the compilation process. This leads to design-time configurability. Second, semantic and QoS properties of event-types can be interpreted to derive knowledge about the optimal configuration of an event dissemination system.

[1] Of course, the state of a distributed game has to be as consistent as possible, which suggests synchronous communication. But for a distributed system, prone to a variety of different failures, synchronous communication could violate the illusion of a fluent game in cases of failure. We discuss the limitations and resulting tradeoffs exhaustively in the context of the CAP Theorem in chapter 6.


Online games themselves were the inspiration for this thesis, as well as the application domain chosen to demonstrate the usability of the presented approach for QoS-aware configuration of distributed publish-subscribe systems. However, the scope of applicability is not limited to this particular application domain, as section 1.1 details. Still, in the spirit of gamification, games will be used as an instrument of illustration and as a source of use cases.

1.1 Motivation

The previous introductory words gave a small insight into the motivation and outlined the ideas and the background that inspired this thesis. In the following chapter, these motivating thoughts are detailed. It also provides some viewpoints from domains other than MMVEs that actuate the approach on distributed publish-subscribe discussed later. Hinze et al. [HSB09] depict many applications and technologies that motivate event-based systems, among which publish-subscribe systems are counted. Some of those application domains and technologies are taken up in the following sections and put into the context of this thesis.

First, some detailed insight into the “quest for the holy scale” is given. As Skibinsky argued in his identically titled article [Ski05], this quest is one of the main challenges in online computer games. They strive for more and more players to compete against each other, or against a computer-controlled environment. Second, this narrow view is broadened by some facets of Big Data, a current topic that spans multiple research fields and communities. This short introduction to big data is exemplified by an in-depth look at the infrastructure driving today's stock markets, where speed is everything and money does not matter (in terms of hardware). To contrast this world of big iron, another viewpoint is given: wireless sensor networks (WSNs) as a world driven by shortcomings. Limited energy and resources as well as unreliable connections provide unique motivational aspects compared to large-scale data-centers.

1.1.1 The Quest for Holy Scale

The fastest growing segment in the computer game industry is online games [MM06]. This promising market poses some unique challenges for game designers and researchers. MMVEs define a distributed virtual world shared by thousands of participants, each represented by an avatar, who compete and cooperate in one enormous persistent world.


This world may be a game world as in MMOGs, a large-scale simulation, e.g. for military training, or a virtual environment like Second Life[1]. On the one hand, the design of such a living virtual world requires armies of artists. On the other hand, the software backing a world of such enormous dimensions must satisfy several hard-to-achieve requirements. These requirements are widely publicized and discussed in the literature [KLXH04, WKG+07, SZ99, PW02b, HY05, NPVS07, Zha11] and depicted informally below:

Consistency: The MMVE must be as consistent as possible for all participants. Every event that happens inside the virtual world has to be recognizable for all affected avatars in a timely manner. Hereby, the common consistency models apply, as discussed in section 6.2.

Availability: The MMVE should be available 24 hours a day, 7 days a week. Users expect a high availability for the fee they pay. Therefore, down-time and login problems may decrease or even inhibit the success of the project.

Persistence: In MMVEs, an environment that preserves the actions of its users has to be provided.

Interactivity: Interactivity is a crucial factor of success for MMVEs because it provides the desired experience. The illusion of a fluent, responsive environment has to be maintained. Depending on the genre, avatars communicate with each other, e.g. to give tactical orders or just for social reasons.

Security: With millions of users, the likelihood of customers trying to cheat increases.

MMVEs must be able to maintain high quality regarding these requirements even if the user-base grows beyond any predictions. Scalability of an MMVE's architecture is thus the crucial requirement. This states the challenge of MMVE architectures: preserve scalability while all other requirements are met to the desired quality standard [FDI+10]. Obviously, these requirements push against the limits of the CAP Theorem (cf. Section 6) and require complex and sophisticated architectures to meet the introduced requirements as well as possible.

Current industry-strength MMVE architectures favor a client/server architectural style [JWB+04]. In the industry, tremendous effort is invested to make these architectures

1 Second Life (http://secondlife.com, visited on 2013-07-28) provides a living virtual world, in which users live, chat and interact represented by avatars. The virtual world itself is also enhanced by the users, creating a vibrant, always changing virtual reality.


scalable. Examples are the usage of grid approaches in Linden Lab's Second Life or the deployment of a large hierarchical cluster for the operation of CCP's Eve Online1. Currently, these architectures serve a confirmed maximum of about 47 thousand users (Eve Online) in one persistent virtual world [FCBS08]. They are only capable of serving such an amount of users as long as their avatars are somewhat evenly distributed in the virtual world. When avatars start to flock, the servers' load reaches critical levels very fast. This problem is well-known amongst users as the "crowding" problem. Moreover, Skibinsky showed in [Ski05] that this type of architecture has "high operational cost, capable of serving in the low thousands of users in the same world and having scalability limits for future growth". As a consequence, nearly all currently available MMVEs design their virtual worlds in a way that encourages participants to distribute evenly over the whole world and tries to avoid an avatar concentration in one region. Another approach is to limit the number of avatars allowed on one server and introduce overflow servers in order to maintain a fluent simulation. The player limits in recent games have shown that game design and client/server architectures alone are not sufficient to solve the challenge of scalability. Existing classical multiplayer game architectures often use a broadcast mechanism to keep the distributed world state consistent. Ideally, all events are securely delivered and processed in the same order at nearly the same time on all clients. In a distributed environment, this requires a messaging effort of O(n²), n being the number of participating clients of the world. It is obvious that such architectures do not scale well beyond a certain number of clients. Multiplayer games like Quake 3 Arena2 have shown that without any optimizations, the limit is reached at about 64 clients, depending on the required update rate of the game.
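The quadratic growth of this messaging effort is easy to make concrete. The following back-of-the-envelope sketch (purely illustrative; the function name is invented) counts the messages per simulation tick for a full-mesh broadcast:

```python
def broadcast_messages_per_tick(n_clients: int) -> int:
    """Messages per simulation tick if every client sends its state
    update directly to every other client (full-mesh broadcast)."""
    return n_clients * (n_clients - 1)

# A server-relayed broadcast fans every received update back out to
# the remaining clients, so the total stays in the same order, O(n^2).
for n in (8, 64, 1024):
    print(n, broadcast_messages_per_tick(n))
```

Already at 1024 clients, more than a million messages would have to be delivered per tick, which explains the hard client limits observed in practice.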
With current grid and cluster approaches, which shard and partition the virtual world into many instances and regions, this limit is pushed to the lower thousands. Enabling scalability beyond that limit still presents a challenge. Distributed systems using specialized peer-to-peer (P2P) architectures for MMVEs, as e.g. surveyed by Yahyavi [YK13], are a promising attempt to handle the snowballing number of players. In other domains like file sharing these distributed architectures

1 Eve Online (http://www.eveonline.com, visited on 2013-07-28) is a large-scale space simulation with many galaxies to explore. It is the only large-scale MMVE that allows all players to populate one enormous single world. It provides a player-driven economy as well as whole galaxies reigned by players.
2 Quake 3 Arena (http://www.idsoftware.com/games/quake/quake3-arena, visited on 2013-07-28) is a fast-paced ego-shooter game, where players engage each other in an arena using a vast arsenal of weaponry.



already have proven their incomparable scalability, as BitTorrent shows [Coh03, IUKB04]. But even in such systems, the main challenge in handling player concentration in one region is the enormous amount of messages (events) that have to be delivered to every single player in a timely fashion. Therefore, the processing and dissemination of messages needs optimization in order to increase the throughput of messages or reduce the latency of message delivery. The consideration of QoS properties or event semantics in order to handle the workloads more intelligently suggests some potential. The road for the quest for holy scale is paved, but there is still a long way to go. Exploitation of event semantics and QoS awareness in distributed systems is a promising research area on the way to solving this quest.

1.1.2 The Speed of Stock Markets

Most of today's quotes and trades conducted at stock exchanges are processed digitally. They generate enormous data streams, which have to be distributed to all traders as fast as possible. In the context of algorithmic trading, these traders are computer programs, because humans react too slowly to perform trades in fractions of a second. The idea behind this trading strategy is to exploit small fluctuations in the price of a stock and place orders with high volume and a very short duration. Despite the current ethical discussion about investment banking in general and this form of trading in particular, this domain presents a unique challenge for hardware and software architectures. The information systems backing the major stock exchanges must cope with an extremely high volume of trades per second. For example, the New York Stock Exchange (NYSE) has an infrastructure to publish all quotes and trades called the Securities Information Processor (SIP). This messaging system processes an average of 215,162 quotes and 28,375 trades per second.
The peak load the system handled was 308,705 quotes and 49,570 trades per second in the fourth quarter of 2011. Despite the enormous number of messages, the average latency is still less than 3 milliseconds, depending on the network connectivity [Cla11]. The middleware behind this infrastructure uses a publish-subscribe paradigm with Remote Direct Memory Access (RDMA) and application-layer multicast (ALM) protocols [A-T09, New12]. RDMA was invented in the context of high performance computing (HPC). It provides zero-copy semantics: it bypasses the kernel, cache hierarchies and the normal protocol stack and directly writes into remote memory. It requires high dedicated bandwidth and specialized networks



to operate on, as for example InfiniBand1. This architecture obviously does not scale globally or even regionally at reasonable cost. Therefore, ALM products like TIBCO Rendezvous are used for the distribution beyond the borders of a dedicated network, where IP multicasting or RDMA is not possible due to the router restrictions on the internet. TIBCO Rendezvous, for example, uses a proprietary reliable ALM protocol (cf. section 5.3.4) based on UDP for message dissemination called TRDP. In this area, the optimization of routing protocols and the adaptation of QoS requirements to the exact needs are crucial to maintain a high throughput and low latency, as every delay or incorrect information can cost millions of dollars. Currently, these architectures scale by iron, meaning the hardware determines the limit of messages per second that can be processed. Detailed exploration of QoS parameters and exploitation of the event semantics of those quote and order events could lead to optimization potential beyond system-wide optimization and reduce hardware cost, as one can focus on single event-types and optimize their dissemination independently.

1.1.3 The Lack of Resources in Sensor Networks

Besides large-scale architectures, another field of application motivates the exploitation of event semantics and taking QoS parameters into account for message dissemination. In the last years, the miniaturization of sensor nodes has made major progress. Sensor nodes became smaller and more powerful. Therefore, they have enough spare resources to even do some processing on the node itself. They may be deployed in large networks, rendering even remote areas accessible for area-wide surveillance. Dargie [DP10] defines such WSNs as networks of small autonomous nodes that do not only consist of a sensor but also of a processing part. They communicate with each other to exchange and disseminate data through the network.
With those enhancements, nodes are able to perform in-network data processing and analysis. The raw or preprocessed data eventually reaches a base station that controls the WSN and provides a gateway to other networks. Their application areas range from tracking in various domains (e.g. habitat tracking of animals) to monitoring purposes like patient monitoring in healthcare [YMG08]. The major challenges in this field are scalability as well as dealing with the constraints regarding energy management and the lack of processing resources [ASSC02, YMG08]. Addressing these challenges requires research regarding the hardware

1 InfiniBand (http://www.infinibandta.org, visited on 2013-07-28) is an industry-standard specification for input/output architectures with data rates up to 120 gigabits per second.



as well as the whole software stack, including the transport layer. Yick [YMG08] states that WSNs should support different QoS parameters, e.g. reliability or congestion control, depending on the application. This motivates the configurability of a middleware for WSNs. Depending on the application, the capabilities of the WSN should be configurable regarding the required QoS guarantees. But due to the resource constraints on such sensor nodes, a one-size-fits-all solution is not applicable. Design-time configuration of suitable middleware solutions offers some potential to keep the overhead vs. adaptability tradeoff as small as possible.

1.1.4 Facets of Big Data

Events are not only required to manage virtual worlds but also in the real world, as the examples of stock markets and sensor networks show. Actually, events are ubiquitously present in many real-world applications. If those applications reach a certain scale, the exchanged data streams have an enormous volume and, assuming they have to be preserved, produce very large sets of data. In this context, one very popular buzzword emerged in the early second decade of the new millennium: Big Data. The rage for collecting data has increased tremendously over the last years, and continues with no decline; estimates suggest a growth of about fifty percent per year [Loh12]. But what does the term Big Data actually mean? Besides being a marketing term, Big Data subsumes some interesting research topics that are not uniquely new, but the viewpoint on them may have changed. Jacobs [Jac09] sketches the implications for application design when dealing with Big Data. The bigger the data, the closer you get to hardware and application limits. It is inevitable to consider every step in the processing pipeline: from disk speed and memory access patterns up to the range of values variables support in the application, everything matters.
Stonebraker [Sto12] tries to subsume all the different aspects of Big Data in four use cases, all with different research challenges:

Big volumes - small analytics Very large data sets must be query-able, but the complexity of the queries is relatively small. This means only simple operations are performed on the data like sum, count or average.

Big volumes - big analytics The large data set is examined by complex analysis. In this case we speak of data mining algorithms and statistical exploration like clustering, machine learning or regression analysis.

Big velocity Data arrives at a high volume and with a high velocity. This firehose of data must be absorbed and processed.



Big variety The data sources that form the data set vary in their format, structure, location and semantics. For example, the mixture of XML, RDBMS, Web sources and spreadsheets forms the data set of many companies nowadays.

All those use cases have in common that they deal with an abundance of data and that this data has to be analyzed to gather insights. However, these use cases all yield different solution spaces. Some are not new, for example storage concepts for large data sets like column-store databases. Nevertheless, one use case states a generic and interesting motivation for this thesis. According to Stonebraker, big velocity data may be processed in two ways: As long as there are no real-time requirements, the challenge is merely a database problem and may be addressed by classical means of RDBMS. But if sub-second answers or decisions are required, CEP or data-stream systems like Esper1, StreamBase2 or TIBCO Rendezvous3 are the only available choice, as they process the data on the fly. The downside is the loss of a persistent storage of the raw data stream. The challenge of such event-based systems is twofold: On the one hand, high-performance processing of the queries' operators is needed. Recent research in this field targets distributed processing or the incorporation of specialized hardware like FPGAs [MTA09] or GPUs [BFH04]. On the other hand, a sophisticated transport layer or event bus that distributes the data to the processing nodes is required. Many high-performance publish-subscribe systems provide such a fast message dissemination infrastructure. At this particular point, the discussion about exploitation of the transmitted data's semantics can take place to further improve the performance of such an infrastructure.
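The on-the-fly processing characteristic of such systems can be illustrated with a minimal sketch (this is not the API of Esper, StreamBase, or TIBCO Rendezvous; all names are invented): a sliding-window aggregate answers queries instantly, while the raw stream itself is discarded:

```python
from collections import deque

class SlidingAverage:
    """Keeps only the last `size` values; everything older is dropped,
    mirroring the loss of persistent raw-stream storage noted above."""
    def __init__(self, size: int):
        self.window = deque(maxlen=size)

    def push(self, value: float) -> float:
        """Absorb one stream element and return the current aggregate."""
        self.window.append(value)
        return sum(self.window) / len(self.window)

sa = SlidingAverage(size=3)
for price in (10.0, 20.0, 30.0, 40.0):
    avg = sa.push(price)
print(avg)  # average over the last 3 values only: 30.0
```

The answer is available after every single element, but reconstructing the full history is impossible by design.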

1 Esper (http://esper.codehaus.org, visited on 2013-07-28) is a component for complex event processing (CEP), available for Java as Esper, and for .NET as NEsper.
2 The StreamBase Complex Event Processing platform (http://www.streambase.com, visited on 2013-07-28) is a high-performance system for rapidly building applications that analyze and act on real-time streaming data.
3 TIBCO Rendezvous® is the industry-leading – and most widely deployed – low latency messaging and data distribution solution on the market (http://www.tibco.de).

1.2 Open Research Questions

In the previous section, we illuminated several different domains, all suitable for the deployment of publish-subscribe systems. In those domains, system design and architecture strive for a certain optimization target. Big data applications like MMVEs and the dissemination of stock market events strive for shorter latencies and higher data volume, while



sensor networks favor smaller energy consumption. They all have in common that they are systems which require a scalable architecture. This scalability must be maintained while meeting the applications' different optimization goals and QoS requirements to the best degree possible. Motivated by those different viewpoints, we can argue about open research questions for event dissemination architectures in general and especially for publish-subscribe systems. A selection of those, relevant for this work, is introduced in the remainder of this section.

Scalability

Scalability is one of the main concerns when designing distributed systems. Despite the fact that there is no common definition of scalability, a common sense has been formed, and Speed-up1 as well as Efficiency2 are well-known metrics. Amdahl's law [Amd67] provides the theoretical boundary for the speed-up of algorithms achievable by scaling out. In the context of publish-subscribe systems, many different approaches have been taken in order to build scalable systems. Mühl [Mü02, MFP06] focuses on different routing approaches and their impact on the scalability of a system. Bharambe suggests different architectures like Mercury [BRS02], Colyseus [BPS06], and Donnybrook [BDL+08]. Those systems address unique challenges that arise in systems with spatial contexts like online games. Systems like Hermes [PB02, Pie04] examine the usage of overlay networks to increase scalability. They all have in common that they strive to reduce unnecessary routing of messages and distribute work or resource consumption as much as possible. Nevertheless, each of them comes with a certain tradeoff. For example, a P2P system like Mercury or Hermes has a higher path latency than a centralized solution, but provides better scalability. Another example of a tradeoff comes with the employed routing algorithms.
Routing table optimizations like merging or covering only provide a clear benefit if the rate of notifications is high compared to the rate of subscriptions [MD10]. These examples suggest that the characteristics of different algorithms and systems make them optimal for different use cases and event-types. This offers the potential for a custom-tailored distribution of each event-type, according to its requirements. It can be expected that this assumption leads to a better speed-up as well as efficiency than uniform event handling.
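The metrics mentioned above can be made concrete with a small sketch of Amdahl's law, where p denotes the parallelizable fraction of the work and n the number of processors:

```python
def speedup(p: float, n: int) -> float:
    """Amdahl's law: upper bound on speed-up with parallel
    fraction p of the work on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

def efficiency(p: float, n: int) -> float:
    """Speed-up per processor."""
    return speedup(p, n) / n

# Even with 95% parallelizable work, speed-up saturates below
# 1 / (1 - p) = 20, no matter how many processors are added.
for n in (10, 100, 1000):
    print(n, round(speedup(0.95, n), 2), round(efficiency(0.95, n), 3))
```

This is why reducing the sequential share of the work, such as unnecessary routing of messages, can matter more than adding machines.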

1 Speed-up metrics measure how the rate of doing work increases with the number of processors [JWM00].
2 Efficiency measures the Speed-up per processor [JWM00].


QoS in publish-subscribe systems

Large-scale publish-subscribe systems have many different subscriptions to handle. All of them expect different guarantees regarding the dissemination of their notifications. Therefore, e.g. Mühl [Mü02] suggested the examination of QoS in the context of publish-subscribe. Since then, some research has been done on QoS. From the exploration of the meaning of different QoS parameters [AR02, BFM06, CQ06, TKK11] to focused research on system-wide characteristics like reliability [BSBA02, CF04, MBCK12, ECR13] or selected QoS parameters like order [ZMJ12, BBPQ12], the interest in this topic has been and still is high. But there are still no approaches that consider QoS holistically and incorporate all currently suggested parameters. Most approaches are limited to latency reduction or bandwidth conservation. Mahambre [MKB07] underpins this assumption with an analysis of existing QoS-aware systems and their considered parameters. Despite the existence of approaches and algorithms for the optimization of nearly all identified parameters, no configurable approach that enables their usage in one system has been proposed. As initially stated, the requirements of different event-types vary. Gilbert [GL12] argues that, especially in the spirit of the CAP Theorem, uniform handling of messages is inferior to a partitioning of concerns. That suggests some parts of the system should provide certain QoS guarantees, while other parts do not guarantee anything at all and use a "best effort" approach. Current work reflects this aspect only to a certain degree, as it only allows for the adaptation of parameters in order to optimize algorithms. The next step is to adapt the algorithm itself, based on the QoS requirements of each event type, in order to partition the capabilities of such systems.
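Such a partitioning of concerns could, as a purely hypothetical sketch (topic names, QoS keys, and algorithm names are all invented), look like a per-topic mapping from QoS requirements to a dissemination algorithm:

```python
# Hypothetical per-topic QoS requirements; names are illustrative only.
TOPIC_QOS = {
    "position_update": {"reliable": False, "ordered": False, "max_latency_ms": 100},
    "trade_executed":  {"reliable": True,  "ordered": True,  "max_latency_ms": 5},
    "chat_message":    {"reliable": True,  "ordered": False, "max_latency_ms": 1000},
}

def pick_algorithm(qos: dict) -> str:
    """Pay for a guarantee only where it is actually required."""
    if qos["reliable"] and qos["ordered"]:
        return "reliable-ordered-multicast"
    if qos["reliable"]:
        return "reliable-multicast"
    return "best-effort-udp"

for topic, qos in TOPIC_QOS.items():
    print(topic, "->", pick_algorithm(qos))
```

Each topic is thus handled by the cheapest algorithm that still satisfies its guarantees, instead of forcing all topics through one uniform channel.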
In the context of publish-subscribe, this partitioning means each topic or subscription should be handled individually with exactly the amount of guarantees it requires. This is only possible with algorithmic as well as parametric adaptability of the system. However, adaptability and configurability of publish-subscribe systems are only sparsely researched. Only very few approaches exist, and even fewer allow for configuration based on QoS parameters (cf. section 5.7). Only one distributed approach, the ADAMANT project [HMS10], addresses this challenge for embedded systems, but it only allows adapting the underlying network implementation.

Integration of optimization approaches

Current research in distributed publish-subscribe systems lacks the integration of optimization approaches from other research areas as well as publish-subscribe-related



optimization techniques. Application-layer multicast (ALM)1 is an example of a related area in which many optimization approaches were published, which are only sparsely considered in publish-subscribe research, as Martins [MD10] observed. Despite this fact, there are different proposals for publish-subscribe middleware solutions in the literature [EFGK03, LP03, MKB07]. However, they lack the proposal of a framework integrating different algorithms that guarantee different QoS capabilities, while considering the performance overhead this adaptability introduces. Most solutions were designed for a special use case, which leads to a specialized optimization characteristic. Most of them lack general applicability in other contexts. Even if the approaches support a certain kind of configurability to adapt to different use cases, the gained flexibility comes with some overhead, as it is performed at run-time (cf. Section 5.6). There are only a few existing approaches that examine design-time configuration in order to minimize the run-time overhead. However, all existing design-time configurable approaches either employ run-time techniques, like GT [AGG09] or ADAMANT [HMS10], or they are centralized, like FAMUOSO [SF11]. Moreover, the design of these few adaptable approaches is not rigorously formalized, at least not regarding their adaptability, which hinders the extensibility of their optimization strategies. In order to provide an extensible framework that is generally applicable, a formalized reference model for extendability and adaptability is required.

Support of application developers

The field of publish-subscribe is a research field that creates communication infrastructures for complex distributed systems. Developers of such complex systems currently have to scan a vast amount of literature that describes different aspects of publish-subscribe. Each approach has its own advantages and disadvantages in terms of its optimal application field.
Without a huge evaluation phase, in which the different approaches are compared to each other, no profound answer can be given to the question of which algorithm is better for a particular constellation. Such a phase is required, as the suitability of a certain algorithm for a designated scenario depends on a variety of parameters like the network characteristics (e.g. topology, data rate, latency, etc.), the required guarantees for event delivery (e.g. reliability, timeliness, etc.), and the semantics of the events themselves (e.g. spatial and temporal context). Current research lacks mechanisms to support

1 ALM deals with multicasting protocols on the application layer. Many different routing algorithms have been proposed in this field (cf. Section 5.3.4).


developers in this jungle. For example, if a developer wants to build an application for WSNs, the applicable approaches vary vastly from those used for building an MMVE. Also, the requirements in terms of QoS guarantees may be completely different. Even the terminology used in these domains may differ completely. A method with which a developer can specify his needs in his own terminology, and which relieves the developer of the need for manual optimization, promises a speed-up in the development of distributed publish-subscribe architectures. Currently, only ADAMANT [HMS09] allows for the application-specific deduction of configurations for distributed publish-subscribe. This approach employs neural networks for the deduction process and aims for autonomous adaptation and configuration. Other approaches like Green [SBC05], YANCEES [FR05], or REBECA [Mü02] only support developers by providing components that may be composed to form the desired system. However, these approaches still require profound technical knowledge for configuration.

Evaluation

Despite many existing evaluations in the literature, there are shortcomings regarding an exhaustive evaluation of the cost of the different QoS guarantees. Mostly, comparisons of certain algorithms or whole systems like [Mü02, Hen10] exist, but only a few try to quantify and compare the cost of single QoS parameters. For example, Pongthawornkamol [Pon11] examines the timeliness and reliability of whole systems. However, for an evaluation on an algorithmic level, there are no middleware solutions available that would provide a sufficient framework to evaluate single characteristics. Moreover, there are only a few comparisons between different flavors of distributed publish-subscribe and their appropriate fields of application [Mü02]. For example, where is the break-even point between topic- and content-based publish-subscribe for a certain scenario, or how much overhead does an ordered communication channel cost?
Nevertheless, as a foundation for such comparisons, benchmarks have already been suggested, for example by Sachs in [SAKB10], but no standard has emerged yet. Besides that, many evaluations, especially propositions of new algorithms, rely on simulation models, which are unrealistic when it comes to implementation details. Results that rely on a model and do not employ an implementation cannot give insight into how the realization of an algorithm behaves in detail; they can only provide the magnitude of how the approach behaves.



1.3 Focus and Contribution

This work strives to contribute to the research questions introduced in section 1.2. Due to the broad nature of the questions, which span many research communities, a holistic discussion of all introduced questions is not possible in the narrow space of one thesis. Therefore, this work focuses on the examination of a few corresponding aspects and their impact on the solution of the introduced research questions. Three coarse goals can be formulated: Firstly, how can event semantics be described in a way that easily conveys enough knowledge about an event type for optimization decisions? Secondly, what architecture is required to build a scalable, design-time configurable distributed event-based system (DEBS)? Thirdly, how can the configuration decision based on the semantic description of event-types be automated? The studied goals are briefly described and concretized in the form of hypotheses to narrow down the focus of this thesis. These hypotheses are the coarse challenges this thesis addresses. They will be broken down into requirements in chapter 7, before the suggested solution is discussed.

Exploitation of Event Semantics

The first examined challenge is the exploitation of event semantics and QoS properties. An event-type can be described by various semantic properties. However, QoS-related properties are mostly motivated by technical constraints. They range from latency over notification order to delivery guarantees (cf. Behnel [BFM06]). In order to grasp those aspects in a way that requires less technical knowledge from the developer, a classification of the semantic requirements is a suitable mechanism. For example, it may be easier for an application developer to define an event-type as a "high velocity event with a one-to-many distribution characteristic" than to quantify technical characteristics of an event-type like "30 Hz frequency, a rendezvous-based routing with a maximum latency of 300 ms".
The classification should, on the one hand, abstract from the need to give numerical values to describe an event-type. On the other hand, it should limit the search space of suitable algorithms that fulfill the class of an event-type. Based on such a classification, it should be possible to support a decision regarding the question of which algorithm provides the best performance regarding a certain QoS metric. The challenge to design such a classification model is formulated in Hypothesis 1.
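The contrast sketched above, domain-level classification instead of raw technical figures, could look like the following hypothetical snippet (all class names, attribute values, and algorithm names are invented for illustration and are not part of the proposed solution):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EventType:
    """Hypothetical classification of an event-type in domain terms,
    instead of figures such as '30 Hz, max. 300 ms latency'."""
    name: str
    velocity: str        # e.g. "high" instead of "30 Hz"
    distribution: str    # e.g. "one-to-many"
    delivery: str        # e.g. "best-effort" or "guaranteed"

avatar_moved = EventType(
    name="avatar_moved",
    velocity="high",
    distribution="one-to-many",
    delivery="best-effort",
)

def candidate_algorithms(et: EventType) -> list:
    """The classification narrows the search space of algorithms."""
    if et.delivery == "best-effort" and et.velocity == "high":
        return ["best-effort-multicast", "gossip"]
    return ["reliable-broker-routing"]

print(candidate_algorithms(avatar_moved))
```

The developer only states domain-level properties; the mapping to concrete candidate algorithms happens behind the scenes.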



Hypothesis 1: A model can be found that allows the classification of event-types in a developer-friendly way and may be used to reduce the search space for algorithms with respect to a certain optimization goal in the form of QoS requirements.

However, not only the semantics of event-types themselves influences the decision for a certain algorithm; the application domain also contains valuable information for the configuration decision. For example, network characteristics like topology, throughput or link latency influence the decision on the best suitable algorithm to a certain degree. Moreover, each domain usually has its own terminology. In order to define the semantics of event-types in a user-friendly way, it should be possible to use this terminology for the description of event-types. Hence, Hypothesis 2 addresses the exploitation of domain-specific terminologies and their characteristics for the description of event semantics.

Hypothesis 2: Application domains can be described to provide a domain-specific terminology that reflects the characteristics of the domain. This terminology can then be used to describe and classify event-types.

This hypothesis could be reflected by the introduction of different profiles that structure the semantics rooted in the application domain and provide a domain-specific terminology. These profiles can then be used for the description of event-types. A domain-specific language (DSL) that provides the syntax for these profiles and event-type descriptions seems to be a possible realization of both hypotheses 1 and 2. Hence, this thesis aims to provide a language for the classification of event-types that allows the description of their QoS requirements and their semantics based on domain-specific terminologies.

Configurable Integration Framework

Such semantic descriptions can be used to support or even automate the configuration of a publish-subscribe middleware.
However, before any decision on optimal configurations can be made, a framework has to be provided that integrates existing algorithms with respect to their different optimization goals. As we discussed in section 1.2, certain algorithms



are advantageous in different constellations. Typically, a developer has to decide which algorithm is optimal for his particular scenario and then choose the appropriate library or implement it accordingly. The advantage of a configurable framework is that he only has to choose the correct algorithm for the scenario and does not have to care about integration. Moreover, it would enable the developer to change his decision without much effort, because the algorithms follow a common interface. Currently, no existing middleware provides a framework that covers all popular QoS properties. For a framework that also allows the integration of future approaches, extensibility must be examined and respected. This challenge is formulated in Hypothesis 3.

Hypothesis 3: It is possible to build an extensible and configurable framework that integrates all major aspects of publish-subscribe with respect to QoS requirements. Hereby, routing and filtering aspects as well as delivery, order, timeliness, and security requirements are covered.

The variety of existing event-dissemination approaches, with their different architectures and focus, calls for a formal and rigorous specification of such a framework in order to ease the integration process. This thesis provides a formal model for such a framework and discusses the extensibility as well as the completeness of the proposed reference architecture with respect to the analysis of event semantics and QoS parameters. One major challenge in the design of configurable frameworks is the overhead generated by the generalization. Overhead includes longer processing times due to indirections caused by components, or larger messages founded in their generic header design. In large-scale scenarios, such overhead is often not tolerable. Hypothesis 4 reflects this fact and proposes design-time configuration as a suitable method.
The omission of run-time configuration allows for light-weight abstractions and the employment of compilers to produce efficient binaries. On a technical level, design-time configuration results in template meta-programming and policy-based design.

Hypothesis 4: The limitation of configurability to design-time can significantly reduce the run-time overhead compared to run-time adaptable or configurable frameworks.



Such an approach to middleware design is applicable in a variety of application scenarios with requirements and constraints like those of MMVEs (cf. chapter 3).

Methodology for the support of application developers

Even with a configurable integration framework, the decision a developer has to face is still on a technical level. He still has to choose the suitable algorithm from the set of algorithms available in the framework. A methodology striving to ease developers’ work at design time is the logical consequence. It begins with the design of the application’s event model and supports the developer until the final deployment. With the help of the classification model for event-types in a domain-specific terminology, it should be possible to design a workflow that automatically derives the configuration of the framework.

Hypothesis 5: Based on a description of the event-types and the application domain, an automated decision on the configuration of a publish-subscribe middleware for a particular application can be made. Limited by the precision of the description of the application domain, this decision should be optimal with respect to QoS requirements.

This enables developers to concentrate on the development of their application and eases the burden of technical decisions. The technical decision which dissemination strategy fits the scenario best is automated by a decision component that uses black-box testing and simulation to derive an optimal configuration. The remaining challenge is that a brute-force measurement strategy results in thousands of simulation experiments, depending on the number of available algorithms. In addition, the experiments have to be repeated every time the developer changes the event model. Therefore, a reduction of the number and frequency of experiments is necessary. A method showing some potential is to measure the whole application domain once, using sparse sampling with a suitable interpolation mechanism.
The result is a cost model for a certain application domain, which can be used for an instant decision for a particular event model.

Hypothesis 6: A whole application domain can be sampled once and used for multiple automated decisions, without unreasonable error.
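The sparse-sampling idea behind Hypothesis 6 can be sketched in a few lines of C++. The one-dimensional parameter space and the sample values below are purely illustrative (a real measurement database would be multidimensional), but the mechanism is the same: measure a few points once, then interpolate for any event model in between:

```cpp
#include <cassert>
#include <iterator>
#include <map>

// Hypothetical pre-measured cost database: maps a single domain parameter
// (e.g. publish rate in events/s) to a measured cost (e.g. mean latency
// in ms). Assumes a non-empty sample database; values outside the sampled
// range are clamped to the nearest sample.
double estimate_cost(const std::map<double, double>& samples, double x) {
    auto hi = samples.lower_bound(x);
    if (hi == samples.begin()) return hi->second;           // clamp below range
    if (hi == samples.end()) return std::prev(hi)->second;  // clamp above range
    auto lo = std::prev(hi);
    double t = (x - lo->first) / (hi->first - lo->first);   // linear blend
    return lo->second + t * (hi->second - lo->second);
}
```

With two measured samples at publish rates 10 and 100, a query at rate 55 is answered instantly by interpolation instead of a new simulation run.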


With a pre-measured application domain, a developer only experiences a minor slowdown caused by the decision component, making this a desirable goal for the decision component.

Evaluation and applicability

Although different application scenarios have different requirements, this work focuses on one scenario: MMVEs. As the assumptions made in chapter 3 apply to a class of scenarios, not only to MMVEs, it is sufficient to restrict the evaluation to one representative of the class. This representative scenario is thoroughly discussed and provides illustrative use cases. Each of these use cases generates a unique workload with different event semantics and can therefore be used to evaluate the applicability as well as the performance of QoS-aware design-time configuration for distributed publish-subscribe. The applicability is exemplified by a concrete implementation of a simple game that uses the proposed approach, as well as by the simulation of general performance metrics.

Summarized Contribution

A solution that addresses the introduced hypotheses results in an approach that provides a methodology for the easy-to-use configuration of event-based middleware. Starting with a classification of event types and their requirements in a domain-specific terminology, an optimized configuration with respect to QoS requirements is automatically derived, compiled and delivered as a ready-to-use, custom-tailored library. Hence, the contribution conceptually consists of an extensible event description model based on a multidimensional classification scheme, a reference architecture tailored towards design-time configuration, and a methodology that uses both to provide an automated configuration workflow.
For all three parts a reference implementation is given, in the form of a DSL for the semantic description of event-types, a configurable distributed publish-subscribe library focusing on minimal run-time overhead, and a configuration component that coordinates the automated configuration workflow.

1.4 Structure

This section gives an overview of the structure of this thesis. In addition, it gives the reader a guideline on how to read this work, depending on his proficiency in the different covered topics. The thesis is divided into six parts. Each part serves a different purpose and is briefly described in the following.


Prologue: The prologue consists of some introductory words as well as some motivating thoughts in section 1.1. Moreover, open research questions in fields related to this work are briefly sketched in section 1.2. These open questions lead to the overview of the focus and contribution of this thesis in section 1.3. Hereby, the hypotheses addressed in this work are identified and briefly explained. In chapter 7, they are later broken down into smaller requirements. Chapter 2 sketches how these hypotheses are addressed in this work and which scientific methods are applied. The prologue closes with the discussion of the scenario in chapter 3. Hereby, the domain of MMVEs with its unique challenges is briefly introduced. Based on a simple multiplayer game, called Tri6, some exemplary use cases are identified that serve as examples throughout the thesis. As the scenario is primarily used for motivation, illustration and as a source of examples, it is not considered state of the art required for the understanding of this work. Therefore, it is part of the prologue rather than the state of the art.

State of the Art: This part describes the required knowledge of related research areas as well as recent research in those fields. The addressed topics are overlay networks in chapter 4 and event-based systems in chapter 5. The discussion on overlay networks covers the generations of overlay networks with a focus on structured P2P systems. Moreover, network characteristics and topologies are discussed as a foundation for the design of network simulations. Event-based systems are discussed with all their fundamental aspects like routing and filtering, and advanced topics like QoS (section 5.5) and adaptability (section 5.6). The chapter on event-based systems closes with a corresponding taxonomy of existing systems.

QoS-Aware Configuration of Publish-Subscribe Systems: This part suggests a reference architecture for the QoS-aware configuration of publish-subscribe systems.
For the discussion of this reference architecture, two prerequisites are described. In chapter 7, the assumptions and limitations for the remaining discussion are introduced and justified. Chapter 8 covers a basic reference architecture, which is subsequently enhanced by configurability in chapter 9, by a QoS-aware configuration model in chapter 10, and by a corresponding configuration workflow in chapter 11. The result is a methodology for the automated QoS-aware configuration of publish-subscribe systems.

M2etis: Prototypic Implementation: This part discusses the architecture and the challenges faced during the implementation of the Massive Multiuser Event Integration System (M2etis), a prototype that realizes the reference architecture for a QoS-aware configurable publish-subscribe system. The technical discussion is divided by the major components: the library is described in chapter 13, a simulator for the library in chapter 14, and a component for the automatic configuration of the library in chapter 15.

Evaluation: The evaluation assesses to which degree the introduced hypotheses could be validated. Hereby, each requirement that has been associated with a hypothesis is discussed and, if possible, also confirmed by measurements or simulations in chapter 17.

Epilogue: The epilogue contains some concluding remarks in chapter 20. It also identifies remaining or new challenges in chapter 19 that could enhance the suggested reference architecture by contributing either to the methodology in section 19.2 or to the framework itself in section 19.1.

The proficient reader, only interested in the research contribution and its challenges, can easily skip the state of the art and the part describing the implemented prototype. In contrast, a developer interested in the challenges on a technical level will find a discussion of the technical challenges in part IV during the discussion of the prototypic implementation. The chapter on the MMVE scenario can also be skipped, either if the reader is proficient in the domain, or if the background of the examples used throughout the thesis is of no interest.


2 | Methods

After the introduction of relevant research questions in section 1.2 and the focus of this work in section 1.3, this chapter covers in detail the methods applied to contribute to the introduced hypotheses. Due to the broad nature of the hypotheses, a variety of scientific methods has to be employed to validate them. Each method is put into context, and it is argued how it contributes to the validation of the different hypotheses.

2.1 Case Study

In order to abstract the configuration of the publish-subscribe system from a technical level to a more intuitive, developer-friendly approach, the characteristics and terminologies of different domains have to be described in order to provide a domain-specific classification for event-types. Such an abstraction is believed to ease the development process, as it translates from the terminology of the application domain, which a developer normally understands, to the technical level of publish-subscribe implementations, which only experts in the field can fully grasp. This fact is reflected in Hypotheses 1 and 2. Hypothesis 1 focuses on the classification of event-types, while Hypothesis 2 aims for the semantic description of whole application domains. In order to underpin those two, a case study that analyzes an application domain seems a suitable method. As this work relies on MMVEs for illustration, this application domain is analyzed as a case study in order to gain insight into the semantic properties of event-types and application domains themselves. MMVEs are representative of a class of application domains whose requirements are addressable by QoS-aware design-time configuration: they are large scale applications with strict QoS requirements. Moreover, they are homogeneous installations, meaning all clients run the same software, and they nearly always provide a patching mechanism. The former two requirements imply the need for high performance solutions as well as their QoS-awareness. The latter two requirements allow for design-time configuration, because a redeployment of the software is unproblematic. In addition, MMVEs are complex distributed architectures that can benefit from some facilitation of the development process. They usually also employ a huge variety of different event-types that may be studied for their semantics. Therefore, a case study of MMVEs seems promising to derive assumptions on event semantics and general requirements for QoS-aware design-time configuration that can later be generalized for a whole class of application domains. Chapter 3 contains a general introduction to MMVEs and provides an in-depth example of how to build a simple game application. Section 10.1 discusses event semantics in the context of MMVEs in order to derive assumptions and parameters for a generally applicable model of event semantics. This is not sufficient to validate the two hypotheses, but supports them enough to formalize the found characteristics and parameters in the form of a reference model and to build a proof-of-concept prototype. Such a prototype can be used to measure the influence of the identified parameters on configuration decisions and therefore validate the initial hypotheses and the resulting model.

2.2 Literature Analysis

Hypothesis 3 states that it is possible to integrate all popular aspects of publish-subscribe. In order to contribute to the goal set by this hypothesis, a thorough literature analysis and overview is necessary to grasp at least the most popular conceptual approaches and to build the foundation for an abstraction that enables their integration. An extensive overview can be found in part II of this work. The thorough literature analysis enables the definition of a formal reference model in chapter 8 that defines the constraints and abstractions for an integrated publish-subscribe framework, which underpins the validity of Hypothesis 3. A complete validation is not possible at this point, because we can only reason about the extensibility and completeness of such a reference architecture (cf. chapter 16). Only time can tell whether future approaches can be integrated into the proposed reference architecture. What can be done in the narrow scope of this work is to validate the reference architecture to its current extent by a proof-of-concept implementation.


2.3 Proof of Concept

The central scientific method used to contribute to the validation of all introduced hypotheses is building a proof-of-concept prototype. The proof of concept is discussed on a conceptual level in the form of a reference architecture in part III. The prototype that implements the reference architecture is discussed in part IV. The whole suggested approach to QoS-aware configuration at design-time is a combination of well-known techniques used to build information systems. First, design-time configuration has to be mentioned as a method to make a system configurable at design-time, before the system is deployed. This technique promises to reduce the required overhead at run-time while providing adaptability to a wide range of scenarios. This fact is represented in Hypothesis 4, which states that design-time configuration minimizes the overhead of an integration framework. This hypothesis can be underpinned by a proof of concept showing that such a framework can be built. But a thorough validation requires a measurement of the overhead induced by the framework in comparison to the reference implementations of related approaches. This is done as far as the available source code or publications permit, and discussed in chapter 16 and chapter 17. Generally speaking, scenarios have to fulfill some prerequisites in order to be design-time configurable. These prerequisites are described along with the exemplary MMVE scenario in chapter 3. Moreover, design-time configuration, or configurable systems in general, leads to a repository of components implementing different optimization algorithms. The prototype’s repository is filled with algorithms chosen based on the literature analysis. A developer can then select the best-fitting composition of components by configuration, without having to touch source code at all.
To be able to easily configure a system for different scenarios, the scenarios’ parameters must be identified, which is done by the case study and described in section 10.1. A formalized generalization of the identified event semantics uses a classification of the different parameters in order to make the resulting model simple to use. As the semantics of events exhibit some orthogonal characteristics, a multidimensional classification of the relevant event semantics will be employed. With such a classification at hand, a description of the system’s required behavior should be possible, which would further support Hypotheses 1 and 2. This technique promises to provide the needed expressiveness


and simplicity to model the semantics of event types in a way that they can be exploited to reason about the optimal configuration of a system. Such a description can be used to deduce a configuration of a system, easing the decision process to determine the optimal composition. A developer does not have to struggle with technical details but only has to define the semantics and the QoS requirements of the event types. This requires an automated estimation of how the system behaves given the desired semantic description. As Hypothesis 5 suggests, this estimation automates the technical decision on the optimal configuration. In the domain of publish-subscribe systems, each message dissemination has costs in terms of resources like memory, latency, etc. It is not possible in every case to define a parametric cost model for the used algorithms, especially if extensibility, and therefore currently unknown algorithms, are taken into account. Consequently, a configuration must be seen as a “black box”, which requires measurement to gain enough data to decide on the optimal algorithm. As measurement in real distributed systems at the scale of an MMVE is rather impracticable and requires an abundance of hardware, this work employs discrete-event simulation. In order to underpin Hypothesis 5, a simulation model must be developed (cf. section 11.2), implemented (cf. chapter 14) and validated (cf. chapter 17) that can be parametrized to approximate different scenarios without unreasonable error. The more configuration options the prototype has, the more measurements are required to decide on the optimal algorithm. Consequently, more simulations have to be executed to gain the measurements. As these simulations have to be performed every time the event type description changes, an optimization of the decision process seems inevitable.
Hypothesis 6 suggests measuring a whole application domain once and providing a domain-specific, reusable database by sparsely sampling the simulation parameter space. This leaves non-parametric cost estimation as a suitable candidate to predict the behavior of a configuration. It uses the precomputed measurement database for calibration and the description of the event types to perform a decision. Chapter 15 discusses the prototype of the configuration component, which consists of the implementation of a non-parametric cost estimator and a decision component. In chapter 17 the estimations are evaluated to support Hypothesis 6. In the following, each of the techniques used for the proof of concept is described in more detail.
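Conceptually, the decision step reduces to constrained cost minimization: among all candidate configurations whose measured (or estimated) behavior satisfies the QoS requirements, pick the cheapest. The following C++ sketch illustrates this idea; the struct, metric names and candidate values are invented for illustration and do not reflect the actual configuration component:

```cpp
#include <cassert>
#include <limits>
#include <string>
#include <vector>

// Hypothetical summary of one candidate configuration after simulation:
// names and metrics are illustrative, not the actual M2etis interface.
struct CandidateConfig {
    std::string name;
    double mean_latency_ms;   // measured via "black box" simulation
    double bandwidth_kbps;    // cost metric to minimize
};

// Pick the cheapest configuration that still satisfies the QoS bound.
// Returns an empty string if no candidate is feasible.
std::string decide(const std::vector<CandidateConfig>& candidates,
                   double max_latency_ms) {
    std::string best;
    double best_cost = std::numeric_limits<double>::infinity();
    for (const auto& c : candidates) {
        if (c.mean_latency_ms <= max_latency_ms && c.bandwidth_kbps < best_cost) {
            best_cost = c.bandwidth_kbps;
            best = c.name;
        }
    }
    return best;
}
```

A tighter latency bound forces the decision towards more expensive dissemination strategies, which is exactly the trade-off the automated workflow has to resolve per event type.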


2.3.1 Design-Time Configuration

Design-time configuration has its origin in component-based software engineering (cf. [Som07]), where functionality is divided into standardized, independent, composable, deployable and documented components. All components must adhere to a common component model describing their structure, e.g. their interfaces and implementation guidelines. In the corresponding development process, the development of an application is the configuration and composition of components rather than the implementation of functionality. Each of the components implements a certain aspect or feature of the software. Components that implement the same aspect or feature are later exchangeable. Usually, the implementations vary in some behavioral aspects, which allows the composition of an architecture tailored to the specific requirements of one application. This technique for software development fosters the reusability of code and supports product-line development, where one software product is configurable according to the requirements of the customer. The challenging questions are: How can a system be cut into components to cover a huge configuration variety, and how can existing algorithms be integrated into this component model? However, the configuration itself can happen either at run-time or at design-time. At run-time, the wiring between the interfaces of different components is done by the component framework, as for example in OSGi [HPM09] or J2EE [Keo02]. In those modern component frameworks a service is provided to register available components, which are made searchable for other components via a directory service. This enables run-time adaptation by exchanging certain components with different behavior [IFMW08]. Of course, this technique introduces a certain amount of run-time overhead for each function call across component boundaries.
In the domain of high-throughput and low-latency publish-subscribe systems such an overhead is unacceptable, and other ways to modularize and configure software are needed. Design-time configuration is a way to gain configurability without the run-time overhead, but at the cost of adaptability at run-time. This is achieved by configuring the source code in a preprocessing step [KRT97], before the compiler generates the actual application. Either custom code generation or compiler features like the C preprocessor are employed for such tasks. For example, compiler features are exploited in C++ Template-Metaprogramming, while domain-specific languages use custom code generation extensively. In this work a combination of both, custom code generation and


C++ Template-Metaprogramming, is used to achieve configurability with a minimum of run-time overhead.

2.3.2 Multidimensional Classification

The semantics and QoS requirements of event types show many orthogonal characteristics (cf. section 10.1). Therefore, a multidimensional classification of the relevant event semantics seems to be the adequate method to structure the configuration on the application level. Strictly speaking, a classification only allows event types to be classified according to one property. A multidimensional classification, also called colon classification [Gau05], enhances this strict definition by classifying an event type according to more than one property. Each property defines another classification dimension, and each dimension should independently provide a classification itself. That means a classification of an event type is a combination of classes, one from each dimension. A multidimensional classification therefore allows for a more precise classification of each event type, taking more different aspects into account. With such a classification, the required expressiveness should be easily achievable, as long as the relevant aspects are orthogonal. However, the general assumption of independent dimensions is inadmissible: independence can only be determined for known dimensions. But recent research on event semantics and QoS in publish-subscribe systems [CQ06, BFM06, AR02] suggests that this orthogonality can be assumed for already examined characteristics. Hence, a multidimensional classification of event semantics and QoS requirements is a suitable method to describe event types for the configuration of publish-subscribe systems in a concise and developer-oriented way.

2.3.3 Non-Parametric Cost Estimation

The two previously described methods introduce a mapping problem between the application domain and the technically motivated configuration of the actual publish-subscribe system.
In other words, the costs in technical terms like latency, throughput and scale-out have to be estimated based on the multidimensional classification of the application. As it is not possible to cover all existing and future algorithms in a cost model, parametric cost estimation is not an option in this case (cf. section 11.4.1). It would require finding a cost function for each algorithm that is added to the system. The complexity of this task depends on the complexity of the algorithm: it is relatively easy if the literature provides a simulation model describing the behavior of an algorithm, but poses a tedious task if this is not the case. Moreover, it requires knowledge of the inner workings of an algorithm and


2.4 Discrete-Event Simulation

hinders, or in some cases even prevents, the extensibility of the system. Another approach is to view an algorithm as a “black box” and measure its behavior in a target domain model. This requires a cost estimation technique that is generally applicable and requires as few samples as possible for a precise estimation. Therefore, non-parametric cost estimation is employed in order to provide a generic common estimation technique that covers all current and future algorithms. An in-depth discussion of this topic, with its different approaches, can be found in section 11.4.1.
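One simple instance of such a non-parametric technique is k-nearest-neighbour regression over the measured samples. The following C++ sketch is illustrative only (the 2-D feature space, names and weighting are assumptions, not the estimator of section 11.4.1): it needs no knowledge of the algorithm's inner workings, only measured samples:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// One measured sample of a "black box" configuration: a point in a
// (here two-dimensional) scenario parameter space plus the observed cost.
struct Sample {
    double x1, x2;   // e.g. node count, publish rate
    double cost;     // measured metric, e.g. mean latency
};

// Estimate the cost at an unmeasured point as the unweighted average of
// the k nearest measured samples. Assumes a non-empty sample database.
double knn_estimate(std::vector<Sample> samples, double x1, double x2, size_t k) {
    auto dist = [&](const Sample& s) {
        return std::hypot(s.x1 - x1, s.x2 - x2);
    };
    std::sort(samples.begin(), samples.end(),
              [&](const Sample& a, const Sample& b) { return dist(a) < dist(b); });
    k = std::min(k, samples.size());
    double sum = 0.0;
    for (size_t i = 0; i < k; ++i) sum += samples[i].cost;
    return sum / static_cast<double>(k);
}
```

Because the estimator treats every algorithm uniformly as a set of measured points, adding a new, previously unknown algorithm to the repository requires no new cost function, only new measurements.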

The most exact method to evaluate the prototype would be to measure in a real-world scenario on the internet, or on a network with the intended characteristics. This is hardly possible due to the number of nodes that scenarios at the scale of MMVEs would require. Therefore, a model of a target domain must be developed that represents internet-scale applications. A simulation of this model replaces the need for large hardware clusters. However, in order to use simulation as a method to validate proof-of-concept implementations, the simulation model must be validated against the measurement of a real-world scenario. Therefore, the validation of the simulation model requires real-world measurements at least once. Due to the lack of access to a large scale network during this work, this validation is only done by spot tests on a small network in chapter 17. Consequently, the validity of the evaluation is limited by the correctness of the simulation model. Despite this limitation, with a simulation model the prototype of the framework can be compared against the published metrics of the integrated algorithms to support Hypotheses 3 and 4. Moreover, simulation is part of the process automating the technical decision defined in Hypotheses 5 and 6. The measurements are used by the decision component to find the optimal configuration. To prove the validity of Hypothesis 5, automated decisions are compared to manual decisions based on simulation experiments. In order to validate Hypothesis 6, non-parametric cost estimation is compared to the measurements of a directly simulated event type description. To gain an advantage over real-world measurements, the simulations themselves should only require a realistic amount of hardware. Therefore, discrete-event simulation [BCNN01] is a suitable method to simulate such distributed systems in a resource-protecting way. Discrete-event simulation uses a discrete time-advance algorithm and an


event-scheduling mechanism. This has the advantage that simulations do not run in real-time; they are faster or slower than real-time, depending on the complexity of the simulation. The samples required for non-parametric cost estimation each correspond to a single simulation run, so the simulations may easily be parallelized and performed independently. The availability of simulation models and software for the discrete-event simulation of networks1 endorses the usage of discrete-event simulation even further and eases the validation process. Therefore, discrete-event simulation seems to provide a suitable method for both the evaluation of this work and a part of the decision process. The discussion of the developed simulator and a further in-depth discussion of the used simulation model can be found in chapter 14. Part V covers the evaluation results achieved with discrete-event simulation.
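The event-scheduling mechanism mentioned above fits in a few lines of C++. The names are illustrative (this is not the simulator of chapter 14): a priority queue orders pending events by simulated time, and the clock jumps from one event to the next, which is why simulated time is decoupled from wall-clock time:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Minimal discrete-event simulation core: a simulated clock plus a
// time-ordered event queue. Each event carries an action that may in turn
// schedule further events (e.g. a message arriving after a network delay).
struct Event {
    double time;
    std::function<void()> action;
    bool operator>(const Event& other) const { return time > other.time; }
};

class Simulator {
public:
    void schedule(double t, std::function<void()> action) {
        queue_.push(Event{t, std::move(action)});
    }
    void run() {
        while (!queue_.empty()) {
            Event e = queue_.top();
            queue_.pop();
            now_ = e.time;   // advance the clock directly to the next event
            e.action();      // the action may schedule further events
        }
    }
    double now() const { return now_; }
private:
    double now_ = 0.0;
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> queue_;
};
```

Since the clock jumps over the idle time between events, simulating a sparse workload of many nodes costs only as much work as there are events, not as much as the simulated time span.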

1 Currently, two open-source projects are available for the simulation of networks: OMNET++ (http://www.omnetpp.org) and ns-3 (http://www.nsnam.org). Both support a variety of network simulation models and have a large community.

3 | Scenario

When my dad was young he shot marbles. When I was young I played Marble Madness on my Nintendo Entertainment System.
Kevin James Breaux, Science Fiction Author

A Massively Multiuser Virtual Environment (MMVE), also called Distributed Virtual Environment (DVE) or Networked Virtual Environment (NVE), is a large scale virtual environment with thousands of avatars interacting according to some rules. We will use the term MMVE throughout this work, as it grasps the focus on large scale systems more precisely. Such a virtual environment can be a large scale simulation, or just a game with the sole purpose of entertainment. Virtual worlds of the latter kind are called Massively Multiplayer Online Games (MMOGs). As the differentiation between MMVEs and their subforms is mainly one of purpose, the terms are mostly interchangeable when it comes to technical and architectural discussions. Therefore, we will only speak of MMOGs when the gaming aspect is important. MMOGs cover many different genres, each with different play styles and therefore unique challenges for their world model. Their system architectures, however, are very similar, at least on an abstract level. The most successful MMOGs are role-playing games (RPGs) like WoW, Rift1 or Star Wars: The Old Republic2. These so-called MMORPGs have their roots in table-top or pen-and-paper RPGs like Dungeons & Dragons [APS08]. All RPGs have in common that a hero – the player – explores a dungeon or area filled with obstacles and enemies together with a group of other

1 http://www.riftgame.com
2 http://www.swtor.com


players in order to fulfill a quest. Basically, only the limits of imagination constrain the experience. Dungeons & Dragons defines a basic mathematical ruleset to guide the role-playing experience, like how a player character develops over time and how the outcomes of fights are resolved. These game concepts can be found in the corresponding computer games as well. A player controls an avatar trying to overcome obstacles in the form of computer-controlled enemies, with the goal of getting stronger and finishing quests that tell a story. These kinds of games have proven to attract a large audience. Figure 3.1 shows the trend of the subscriptions to WoW from 2005 to 2012 for each quarter, according to Activision Blizzard1.

Figure 3.1: Trend of subscriptions for World of Warcraft, 1st quarter 2005 to 1st quarter 2013, subscribers in millions (Source: Activision Blizzard)

It is obvious that a game with a user-base and a life-cycle of this dimension has some hard-to-achieve requirements regarding system evolution and its architecture. MMOGs of this scale require enormous server clusters to operate. The reason for these cluster designs is the requirement of those games to enable players in the lower thousands to share the same area of the virtual world, while maintaining the illusion of a fluent real-time game. The third-generation games (e.g. Rift or Star Wars: The Old Republic) further increase the communication overhead, as the world itself changes dynamically from time to time. First, we dive into the architectural challenges, before evolutionary aspects of such systems are discussed.

1 The numbers were published by Activision Blizzard in May 2013 (retrieved via Statista: http://www.statista.com/statistics/208146/number-of-subscribers-of-world-of-warcraft/).

MMVEs in general, among which MMOGs count, are large scale distributed systems which consist of a large number of interconnected nodes. These nodes act as clients or servers, depending on the services they consume or provide. The most common architecture of industry-strength MMVEs is a client-server based system with different scaling strategies on the server side. To achieve a larger scale, two approaches are possible: scale-up and scale-out. Scale-up, or vertical scaling, means adding more resources (e.g. memory or CPUs) to one node, while scale-out means adding more processing nodes to the system. Both approaches have their benefits and drawbacks, depending on the workload the system has to withstand. In most cases a scale-out solution is more efficient than a scale-up approach, especially if the workload can be suitably partitioned and parallelized (cf. Michael [MMSW07]). As this is the case for MMVEs, most systems employ cluster or grid architectures. An MMVE as a whole represents a virtual world, which is the simulation of a model that describes an artificial or natural system. This virtual world may exist exactly once, as in the game Eve Online, or many times, as in most other commercial games like WoW. Each replica of the virtual world is called a shard. A shard contains all services required for an independent operation of the virtual world. An entity located on one shard cannot move to another shard or interact with an entity in another shard without greater effort. Moreover, if the virtual world itself is so large that the processing resources of one node do not suffice, this world is partitioned into mostly independent regions. For each region, all services required for operation are provided on dedicated machines. This forms two levels of scale-out: on the one hand, whole worlds may be replicated so that different instances can be populated with different avatars. On the other hand, one world is partitioned into smaller simulation cells (one per region).
Another possible approach to scale-out is the separation of different services. As there are many different services required for the operation of a virtual world, each service may be placed on different machines. Figure 3.2 shows the reference architecture for current MMVE architectures [JWB+04]. In this architecture both scalability measures can be observed: different services are located on different machines and a group of machines is employed per shard (depicted by the “shard line”). To reduce complexity, Figure 3.2 does not contain region-based partitioning, although the architecture may equally be applied to a region-based partitioning. Each of the machines in Figure 3.2 provides a different service for the virtual world: the Login/Account Server coordinates the authentication of clients willing to join the world. This service exists only once or only a few times per MMVE, as it is not shard-specific and is used for accounting and web services as well.



3 Scenario

Figure 3.2: MMVE Reference Architecture following [JWB+04]

It is backed by the Account DB containing all data required for authentication and accounting. It is often secured by its own firewall due to the security requirements of the Payment Card Industry Data Security Standard1. Web services like a forum, a blog or statistics on the virtual world are also centralized and not shard-dependent. However, they may interact with the Game World databases of all shards, for example to generate statistics like “all players together walked 30000 kilometers today”. Behind the “shard line” are the services required once per shard. The central service is the Game World. This is a database containing all relevant information that forms the state of a virtual world shard. This service is shielded from the clients by World View machines. They provide the view on the virtual world as required by the clients, ensure security and balance the load on the game world database via caching. The Action Processors are the interface to the clients and accept the users’ actions. They delegate the actions to the appropriate Game Logic machines, which may be partitioned according to their provided services, e.g. Player Actions, Artificial Intelligence (AI) or

1 A proprietary security standard for companies operating with cardholder information. https://www.pcisecuritystandards.org/security_standards/index.php

Chat. This reference architecture may be applied, for example, to a game world in an MMOG.

Figure 3.3: Architecture of Eve Online following [FCBS08]

Figure 3.3 shows an actual architecture as employed for Eve Online’s Tranquility Cluster [FCBS08], which forms the single shard of Eve Online. The central database is partially replicated to the so-called SOL Servers, which act as world view as well as game logic servers. At least one SOL Server is deployed for each simulated solar system1. These servers are load-balanced across different proxies that act as action processor and world view for the clients. A proxy may hold connections to many SOL Servers and vice versa. So the whole universe of the Eve Online world is partitioned and each partition represents a solar system. In 2008 the whole cluster comprised 195 machines with more than 420 CPUs and more than 7.5 terabytes of RAM [FCBS08]. Sizing such an architecture up-front is a difficult task, as there are many factors to consider [Dol07]:

1 Eve Online is a space simulation that provides many galaxies, each consisting of many solar systems.

• The average amount of data transmitted per second per player must be estimated, as it has an impact on the infrastructure needed for the action processors.

• The target number of total players must be estimated. This is especially difficult as it depends on the success of the game.

• The total number of concurrent players must be defined, setting the goal for the scalability of the system.

• The target number of total and concurrent players per geographic location has to be estimated. This is required to design the size of each data center, especially for fast-paced games, as they require latencies lower than 150 ms1.

• The size of the “world data” and player-related data must be estimated in order to plan the central database(s) and their architecture.

Even if a good estimate of the expected player population is made and an appropriate architecture is deployed, two problems still remain: crowding and zoning. The crowding problem occurs if the player density in one partition of the world, or on one server, gets higher than the infrastructure allows. Zoning is the change of a player from one partition to another, meaning a player’s state is transferred from one server to another. Crowding is often caused by special events in the virtual world, like the opening of a special dungeon2 or an epic player vs. player (PvP) battle in the open world3. Crowding can be fought in two ways: on the one hand, designers can avoid events that attract a larger number of players than the infrastructure is designed for; on the other hand, architectural optimizations can be employed, for example the usage of an optimized communication infrastructure. Zoning is usually handled by a certain zoning protocol that ensures a flawless transition from one partition to another. Depending on the design of the virtual world, the problem may be a bit more complex. A world may be divided into different maps, each representing a partition.
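A rough back-of-envelope sizing based on such estimates might look as follows. This is a sketch, not a sizing method used by any actual operator; all input numbers (2 KiB/s per player, 1 Gbit/s NICs at 50% target utilization) are illustrative assumptions.

```python
# Back-of-envelope sizing sketch for an action-processor tier.
# All numbers are illustrative assumptions, not measurements.

def required_action_processors(concurrent_players: int,
                               bytes_per_player_per_sec: int,
                               nic_capacity_bytes_per_sec: int,
                               utilization_target: float = 0.5) -> int:
    """Number of action-processor nodes needed to carry the player traffic."""
    total = concurrent_players * bytes_per_player_per_sec
    usable = nic_capacity_bytes_per_sec * utilization_target
    # Round up: even a fraction of a node means one more machine.
    return -(-total // int(usable))

# Example: 50,000 concurrent players, 2 KiB/s each, 1 Gbit/s NICs at 50%.
nodes = required_action_processors(50_000, 2048, 125_000_000)
```

Such a calculation has to be repeated per data center once the geographic player distribution is estimated.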
A transition between maps is visualized by a loading screen. That is the easy case. Otherwise a seamless world can be designed, meaning there is no visible transition when one partition is left and another is joined. Such a design increases the requirements for the zoning protocol, as it has to be completely transparent to the player and the illusion of fluent gameplay must not be disturbed.

1 One data center for world-wide operation means crossing the Atlantic Ocean, which introduces a latency of at least 150 ms.
2 In WoW a one-time event, the opening of the Gates of Ahn’Qiraj (http://us.battle.net/wow/en/game/the-story-of-warcraft/chapter9), took place. Nearly the whole population of each shard took part in the event. This crowding led to major framerate drops and enormous lags, rendering the game unplayable.
3 Some MMOGs allow for so-called open-world PvP, which means players may attack each other anywhere in the virtual world, not only on especially designed PvP maps.

Another aspect of designing an MMOG is the long lifecycle of such a game. Usually computer games have a lifecycle of about 1 year [Dob07], after which about 85% of the revenue has been created and the producer focuses on the next title. An MMOG, however, stays in the market at least 5-10 years. WoW was first published in 2004, Eve Online in 2003, and there are even older titles still running like Ultima Online1, which was published in 1997. That means the design of an MMOG equals the design of an evolutionary system. The infrastructure must be scalable over time, content still has to be delivered after release, and bugs must be continuously squashed in order to keep players happy and the revenue rolling. That not only requires the employment of an easy-to-handle patching mechanism; the frequency and the time of day for the rollout of patches must also be ingeniously planned. Of course the software itself must be carefully built and designed to be maintainable over such a long time. One approach is to employ middlewares and libraries to reduce the codebase the developers are responsible for and therefore ease the burden of maintenance. But these aspects are not different from other major software products with long-term support, e.g. operating systems, and will therefore not be discussed any further. The required communication infrastructure, however, provides some unique aspects. A distributed system of this scale requires a robust and optimized communication infrastructure. Generally two messaging paradigms have found their way into current architectures: request-response and publish-subscribe.
Request-response architectures send queries to their destination and expect a suitable answer. This is the classical operation mode of many applications like database systems and many distributed protocols. Publish-subscribe architectures on the other hand decouple the request and the response. A requesting node subscribes to a certain topic and gets all responses to that topic from that point in time on. These characteristics lead to more scalable behavior, and therefore publish-subscribe is the paradigm employed most in MMVEs.

1 Ultima Online (http://www.uo.com) is a 2D fantasy RPG with an isometric perspective on the world.

Applications like MMVEs show certain characteristics that may be exploited for optimization. The following points summarize these characteristics. They form the fundamental assumptions about MMVEs used in the remainder of this work:

• The distributed system is a homogeneous system with all nodes running the same version of a software. Thus, in contrast to architectures including legacy components, there is no need to provide specialized functionality to deal with heterogeneous formats and message transformation.

• One can assume that a suitable patching mechanism is available which consistently updates the software version on all communicating nodes and avoids errors due to version mismatches.

• The event types required for the application are well known at design time and therefore it is feasible to reason about the best-suited optimization strategy at compile time.

• The anticipated users and their distribution can be roughly estimated and therefore a model of the required hardware and the network infrastructure can be formulated. Even if those estimates are exceeded or the error is high, they can be made, formulated and refined as appropriate.

• Publish-subscribe is a suitable paradigm to model the messaging requirements of an MMVE.

3.1 Application Model

After the initial reasoning about the domain of MMVEs, in this section a simple application model is developed. It aims to precisely formulate the dependencies of the different components and to provide a common terminology for the remainder of this work. An MMVE creates a virtual world V. Such a world has a certain dimensionality dimensions(V). In most cases developers try to limit the model to a 2-dimensional space as this eases the physical calculations in the world. Even if the world appears 3-dimensional, the model for state interactions and physical calculations is often 2-dimensional with some graphical extensions to create the illusion of a 3-dimensional world.


V itself contains a set of entities En = O ∪ A, composed of relevant interactable objects (passive entities) O = {o1, ..., oi} and a set of actors (active entities) A = {a1, ..., aj}. The contents of these constitutive sets En, O and A may vary over time as the virtual world V continuously evolves. Objects are entities that do not actively interact with V, for example flowers, treasure chests, monsters etc. Actors are entities that perform actions that influence V and therefore change the state of V. Each entity en ∈ En has a state state(en,t) at application-time1 t. For example, a part of the state could be the position of the entity in V at application-time t. Each entity en can be uniquely identified by an ID: id(en). The state of the world state(V,t) = ⋃en∈En {state(en,t)} is formed by the state of all entities. The state of the world changes over time by state transitions. Each transition trans(e, en, t) requires an event e, an entity and a point in time t to calculate the new state state(en,t) of that entity based on state(en,t−1).
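The state and transition functions just defined can be sketched in a few lines. This is a minimal illustration of the formal model, not an implementation from this work; the movement semantics of the example event (a position delta) are an assumption.

```python
# Minimal sketch of the application model: state(en, t) and trans(e, en, t).
from dataclasses import dataclass, field

@dataclass
class Entity:
    id: int
    # state(en, t): here simply a position, keyed by application-time t.
    states: dict = field(default_factory=dict)

def trans(event: dict, en: Entity, t: int) -> None:
    """Compute state(en, t) from state(en, t-1) and an event."""
    x, y = en.states[t - 1]
    dx, dy = event["payload"]["dx"], event["payload"]["dy"]
    en.states[t] = (x + dx, y + dy)

def world_state(entities, t):
    """state(V, t) is the union of all entity states at time t."""
    return {en.id: en.states[t] for en in entities}

bike = Entity(id=1, states={0: (0.0, 0.0)})
trans({"payload": {"dx": 1.0, "dy": 2.0}}, bike, 1)
```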

Figure 3.4: Application model with partitions

If V is reasonably large, it is typically partitioned into a number of partitions P = {p1, ..., pn}, each representing a part of the world (Figure 3.4). Within each partition pi there is only one coordinator per entity en at a certain time t. That means exactly one application node in the network controls an entity by maintaining its state and managing its events. Therefore, we can define a function coord(en,t) which returns the coordinating application node lApp. At one point in time t, each object and each actor is associated with exactly one partition. The set of entities associated with a particular partition varies over application-time. The transition of an entity from one partition to another is called zoning. It may be defined by a zoning function zoning(en, psource, ptarget) that changes the coordinating node of entity en. Figure 3.4 shows the zoning of entity en2 from p3 to p2. For simplicity of the discussion, we initially assume V is small enough not to be partitioned and therefore no zoning occurs. Consequently, we can assume that the coordinator of an entity does not change.

Events e are triggered by actions and result in state changes of entities. Et := {e1, e2, ..., ei} is the set of i events e that occurred until application-time t. Moreover, an event e has an origin(e) that returns the application node the event originated on, a type(e) that returns the type τ of the event, a timestamp(e) returning the time of occurrence, a header(e) containing all processing-related information and a payload(e) containing the attribute-value pairs1 of the event.

1 Application-time is a time defined by the application itself. It may be discrete and may differ from physical time significantly. Moreover, it is synchronized over a distributed system. Each application must define a time-stepping function incrementing application-time t. A time-step may be triggered by events or wall-clock time, depending on the application.
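The event structure defined above (origin, type, timestamp, header and payload) can be pictured as a plain record. The concrete field contents below are illustrative assumptions, not the actual wire format of any middleware.

```python
# Sketch of the event structure: origin(e), type(e), timestamp(e),
# header(e) and payload(e) become fields of an immutable record.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    origin: str      # application node the event originated on
    type: str        # event type tau
    timestamp: int   # application-time of occurrence
    header: dict     # processing-related information
    payload: dict    # attribute-value pairs

e = Event(origin="node-7", type="Movement", timestamp=42,
          header={"ttl": 3},
          payload={"entity_id": 1, "x": 1.0, "y": 2.0, "z": 0.0})
```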

Figure 3.5: Area-of-Interest and event distribution

All events in V are processed by coordinating nodes. They decide the resulting actions for the entities they are controlling. An event nearly always affects more than one entity. That means an entity is interested in all events occurring, e.g., within a certain distance or caused by a certain subset of entities; in the simplest case this is needed just for displaying the entity to the user. The corresponding area is called the area-of-interest (AoI) of an entity. It can be formulated as a function AoI(en,t) returning the set of all entities in whose events entity en is interested at a certain time t. Figure 3.5 shows the AoI of entity en1 when an event e occurs at en2. en2 lies in AoI(en1) and therefore en1 must receive event e, while events produced at en3 do not reach en1.
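A radius-based AoI(en, t) can be sketched as follows. The circular interest shape is an assumption made for illustration; grid cells or view frustums are equally possible realizations.

```python
# Sketch of a radius-based AoI(en, t): all other entities within
# `radius` of `en`, i.e. the entities whose events `en` must receive.
import math

def aoi(en, entities, positions, radius):
    ex, ey = positions[en]
    return {other for other in entities
            if other != en
            and math.dist((ex, ey), positions[other]) <= radius}

# en2 (distance 5) lies inside AoI(en1), en3 (distance 50) does not.
positions = {"en1": (0.0, 0.0), "en2": (3.0, 4.0), "en3": (30.0, 40.0)}
interested = aoi("en1", positions.keys(), positions, radius=10.0)
```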

1 We assume that the content of an event is modeled as attribute-value pairs. A pair consists of an attribute name and a value. For example, name: Fischer is the pair of the attribute name and the value Fischer.


If we switch our view to the level of application nodes and e influences a state, a replica of the state of an entity enj has to be maintained on coord(eni,t) as long as enj is in the AoI of eni. The result is the replication of the state of an entity on each application node that is interested in events of that entity. For a simple scenario with n coordinators where every entity is interested in all other entities, the computation of a new world state state(V,t+1) scales with O(n²), as all coordinators need to send messages to all other coordinators in order to get their state transitions broadcasted. Based on this application model of a virtual world, a few exemplary use cases are discussed to exemplify such an application and provide some profound examples that are used for illustration in the remainder of this work. In order to give the reader a more intuitive picture of the use cases, the game Tri6 is depicted and then abstracted to generally applicable use cases in the domain of MMVEs.
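The quadratic growth of the all-to-all broadcast among coordinators is easy to make concrete: each of the n coordinators sends its state transitions to the n−1 others per time-step.

```python
# Message count for the all-to-all broadcast of state transitions:
# n coordinators each send to n-1 others, i.e. n*(n-1) = O(n^2) messages.

def messages_per_step(n_coordinators: int) -> int:
    return n_coordinators * (n_coordinators - 1)

growth = [messages_per_step(n) for n in (2, 10, 100)]  # [2, 90, 9900]
```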

3.2 Tri6

Tri6 is a simple multiplayer racing game that was developed during a student project over the course of three years. It is based on an open-source game engine named i6engine1, developed in the same student project. This game engine uses the prototypic implementation of the proposed approach, called M2etis, which is thoroughly discussed in Sections 12-14. The idea behind the development of this simple game was to demonstrate the applicability of the whole approach. The decision against an MMOG to study the applicability is based on the enormous complexity a “massive” number of players introduces into development. To build such an architecture is simply not possible in the limited context of a student project. But as described in Section 3, there are many scale-out techniques available, ready to be applied to this simple game. However, even this simple game provides enough use cases and design problems to study the applicability of this work. The architecture of the game employs a client-server model. The coordination of all passive entities as well as collision detection is performed on that one server. The game itself is inspired by Tron2, as it uses the idea of bikes racing against each other with the goal to get other players to crash into walls. It is played on an arena-like

1 Available at http://sourceforge.net/projects/i6engine/. Tri6 is also available under this URL.
2 A film about a hacker, produced in 1982, that presents a virtual reality setting in which bikes race against each other as a metaphor for battle.

Figure 3.6: Screenshots of the Tri6 game

map large enough for 10 to 20 players. The difficulty of not crashing is increased as each player has a number of capabilities. Each player has the basic capability to create a wall following him. Each player has a different color for his walls. The wall exists for a certain time and then disappears. Moreover, a player has two different slots for special skills: one for an active skill and one for a passive skill. Those slots are filled by picking up power containers. Each of those containers holds a random skill. Active skills are, for example, the ability to build ramps or throw mines. Those active skills consume energy, which is automatically recovered over time. Passive skills are, for example, faster driving speed or invisibility. Each passive skill lasts for a certain time. Figure 3.6 shows two screenshots of a typical gaming situation. The bottom left corner contains some statistics: the current ranking and how often the player crashed and caused others to crash. The game is played in rounds. Each round lasts for 5 to 10 minutes and the winner is the player with the fewest crashes. Even such a simple game has many different event types to synchronize the states between server and clients. Based on the definition of use cases, the event types can be specified with their different semantics and non-functional requirements. Some of the use cases that can be identified in such a game are discussed in the next section.

3.3 Use Cases

Based on the informal description of Tri6, some essential use cases can be illustrated in more detail. The first use case, which is essential to virtual worlds, is Movement. Each avatar is moving in the virtual world. This of course changes the distance to entities


in the virtual world and, if it gets close enough, it may then interact with entities by issuing actions. Two types of actions are distinguished: Target Actions and Chat. A target action has a predefined target, which is affected by the action. For example, a druid targets a friendly player to cast a healing spell, or one entity collides with another, like a bike hitting a wall in Tri6. Chat is another use case with relevant characteristics for optimization. Each player is able to communicate via text messages with a subset of other players or one single player. In the following sections each use case is described in more detail. A protocol suggestion, as well as the resource consumption of the associated messages and their occurrence frequency, is discussed. In order to make the resource consumption figures more representative, the consumption estimated for each use case is not based on measurements in Tri6 but on game traffic analysis of different popular MMOGs [CW05, CHHL05, PHH+12]. Latency and its impact on fluent gameplay has been examined by Pantel [PW02b] for a fast-paced racing game similar to Tri6. He states that fluent gameplay with minimal impact can be achieved up to a latency of 100 ms, assuming an event frequency of 25 Hz. Above that, playability degraded: up to 200 ms the game was still playable, but with observable impact on realism and responsiveness; beyond that, the measured game became increasingly unplayable. Therefore, this upper bound on latency is adopted for the following use cases. For simplicity of the discussion and the designed protocols, it is assumed that there exist no malicious nodes in the game. Hence, no measures have to be taken in order to detect or even prevent cheating, griefing1 or other demeaning activities. Such security-related issues are briefly discussed in Section 5.5.6, although they are not in the focus of this work.

Moreover, for simplicity reasons, as already informally mentioned in the description of Tri6, there exists only one coordinator per region controlling all entities in that region (∀en1, en2 ∈ En : coord(en1,t) = coord(en2,t) at time t). This results in a client/server behavior per region. The messaging paradigm is publish-subscribe. To be exact, at least type-based publish-subscribe is assumed, but some of the described use cases require content-based capabilities (cf. Section 5.2.3). Each entity can publish events of a certain event-type as well as receive them, if it is subscribed to the event-type.

1 Grief play or griefing describes a play style in which the player deliberately annoys or harasses other players by using game mechanics in an unintended way and finds pleasure in doing so.

3.3.1 Movement

Movement is the most basic action an actor can perform in a virtual environment. It is also the cause of most of the messages transferred. The rate at which movement messages are distributed depends on the type of the virtual environment, but ranges in most cases between 3 and 60 messages per second. The frequency of these messages often defines the rate of the whole game simulation, as this event-type is the most frequent one in virtual environments. Therefore, movement events have been subject to many optimizations on application level in order to reduce their frequency or size. Size reduction is often done by compression of the payload or by delta encoding of the position change [SZ99]. Another optimization is Dead Reckoning [SZ99, SKH02, PW02a], which allows to reduce the number of messages required to synchronize the position at the cost of precision. The basic idea is for the participating nodes to agree upon a common extrapolation algorithm and a threshold for the deviation of the extrapolated position from the real position. Only if the threshold is exceeded, a message is distributed to correct the position.

metadata:
  id: Movement

payload:
  - entity_id:
      type: int
  - x:
      type: float
  - y:
      type: float
  - z:
      type: float
  - region:
      type: string
      values:
        - Arena
        - Hub
        - Racetrack

Listing 3.1: Schema of a movement event-type
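The dead-reckoning idea sketched above can be illustrated in a few lines: sender and receivers share an extrapolation algorithm, and a correction event is only published when the true position deviates from the prediction by more than the agreed threshold. Linear extrapolation and the threshold value are assumptions for this sketch; the actual algorithm is negotiated between the nodes.

```python
# Dead-reckoning sketch: publish a correction only when the real position
# deviates from the shared linear extrapolation by more than `threshold`.
import math

def extrapolate(last_pos, last_vel, dt):
    return (last_pos[0] + last_vel[0] * dt, last_pos[1] + last_vel[1] * dt)

def needs_update(true_pos, last_pos, last_vel, dt, threshold):
    predicted = extrapolate(last_pos, last_vel, dt)
    return math.dist(true_pos, predicted) > threshold

# Bike drove straight as predicted: no event needed.
straight = needs_update((10.0, 0.0), (0.0, 0.0), (1.0, 0.0), 10.0, 0.5)
# Bike turned: deviation exceeds the threshold, publish a correction.
turned = needs_update((10.0, 3.0), (0.0, 0.0), (1.0, 0.0), 10.0, 0.5)
```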


A simple movement event-type as used in Tri6 may look like Listing 3.1. It is actually an excerpt of the description of an event-type thoroughly discussed in Section 15.1. Movement is the name of the event-type. In the payload section, entity_id is the unique identifier of an entity; x, y and z are the coordinates of the position in a 3-dimensional space; and region defines the region the entity is in. Each attribute has an integral type required for C++ code generation and, if restricted, a codomain (cf. the region attribute). If a player joins a certain region, he subscribes to the event-type Movement for that region. That means he subscribes with a predicate restricting the region he is interested in, like region = Arena. Additionally, an AoI can be specified by extending the predicate, for example to region = Arena ∧ 10.0 < x < 20.0 ∧ 10.0 < y < 20.0, if a grid-based 2-dimensional AoI is assumed. He then gets all movement events from all entities in his AoI. The protocol to achieve this is rather simple. Each actor publishes his movement events at a certain frequency, containing his current absolute position coordinates, his current region and his unique id. All interested entities, including the coordinator, actors as well as passive entities, are subscribed to the event-type Movement, with or without an AoI. This leads to the different parameters for this use case, as shown in Table 3.1. Two basic use cases are distinguished: the Simple Movement case and the Movement with AoI case. In the simple movement case all subscribers only subscribe to the topic Movement, whilst in the Movement with AoI case the subscription contains filters representing the AoI. Therefore, depending on the use case, different publish-subscribe capabilities are required: in the first case topic-based publish-subscribe and in the second case content-based publish-subscribe. Moreover, the table contains two different game genres.
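Evaluating the content-based subscription predicate from the text (region = Arena ∧ 10.0 < x < 20.0 ∧ 10.0 < y < 20.0) against a movement event payload can be sketched as a simple function. A real middleware would compile and index such predicates; this inline check is purely illustrative.

```python
# Sketch of a content-based AoI predicate over a movement event payload:
# region = Arena AND 10.0 < x < 20.0 AND 10.0 < y < 20.0

def matches_aoi(payload: dict) -> bool:
    return (payload["region"] == "Arena"
            and 10.0 < payload["x"] < 20.0
            and 10.0 < payload["y"] < 20.0)

inside = matches_aoi({"entity_id": 1, "x": 15.0, "y": 12.0,
                      "z": 0.0, "region": "Arena"})
outside = matches_aoi({"entity_id": 2, "x": 25.0, "y": 12.0,
                       "z": 0.0, "region": "Arena"})
```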
RPGs usually run with a slower simulation speed than first-person shooters (FPSs) [CHHL05]. This results in different event rates and is mostly owed to the different number of players and the speed of gameplay. 6 events per second per publisher is the average event rate found in many MMOGs. An average of 30 events per second per publisher is a good estimate for FPSs (cf. Chang [CW05]).

Use Case            Game genre  Publishers  Subscriptions  Payload   Events/sec
Simple Movement     RPG         all actors  topic-based    24 bytes  6
Movement with AoI   RPG         all actors  content-based  24 bytes  6
Simple Movement     FPS         all actors  topic-based    24 bytes  30
Movement with AoI   FPS         all actors  content-based  24 bytes  30

Table 3.1: Event-type characteristics for the movement use case

The size of the payload of one event is 24 bytes, if the schema in Listing 3.1 is used and it is assumed that the string attribute has a length of 8 bytes. Of course this is not the actual size that is transmitted over the wire. To calculate that size, the TCP header (20 bytes), the IPv6 header (40 bytes) and middleware as well as serialization overhead have to be considered. With this parametrization, the reader should have an understanding of how movement in Tri6 and generally in MMVEs works.

3.3.2 Collision

As the intention of the game is to get other players to crash into walls, collision is a fundamental use case. Two entities collide if their polygon models intersect, based on their position. These intersection tests are performed pairwise by the coordinators of the corresponding entities. In the case of Tri6, these coordinators are one host, significantly reducing the complexity of this problem. If such an intersection test succeeds, a collision event is generated and published. Each entity that receives such an event reacts accordingly, either by just showing the appropriate effect like an explosion or by an actual reaction. For example, if a bike collides with a wall, the bike shatters and is reset to the spawn point, while the wall-producing player is awarded a point.

metadata:
  id: Collision

payload:
  - coll_id1:
      type: int
  - coll_id2:
      type: int
  - x:
      type: float
  - y:
      type: float
  - z:
      type: float
  - region:
      type: string
      values:
        - Arena
        - Hub
        - Racetrack

Listing 3.2: Schema of a collision event-type
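The pairwise intersection test performed by the coordinator can be sketched as follows. Tri6 tests polygon models; axis-aligned bounding boxes (AABBs) are a simplifying assumption here, chosen only to keep the sketch short.

```python
# Sketch of the coordinator's pairwise intersection test over all
# coordinated entities; each intersecting pair yields one collision event.
from itertools import combinations

def aabbs_intersect(a, b):
    """a, b: (min_x, min_y, max_x, max_y) axis-aligned bounding boxes."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def detect_collisions(boxes: dict):
    """Return the id pairs of all intersecting entities."""
    return [(i, j) for (i, a), (j, b) in combinations(boxes.items(), 2)
            if aabbs_intersect(a, b)]

boxes = {"bike": (0, 0, 2, 2), "wall": (1, 1, 3, 3), "ramp": (10, 10, 11, 11)}
events = detect_collisions(boxes)
```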

Listing 3.2 shows the schema of such a collision event. It contains the position of the collision defined by x, y, z and region. Moreover, the colliding entity ids are transmitted in coll_id1 and coll_id2. The position of the collision is only necessary if content-based subscriptions should be possible; otherwise the two ids suffice, as each subscriber knows the position of each entity. Table 3.2 takes this into account when modeling the payload size. An event of type Collision must be propagated to both participating entities and to all entities that are able to see the collision. In Tri6, this is done by the central coordinator for all entities, which also performs the intersection tests. All entities subscribe to their viewing range for collision events and so get notified if a collision occurs they may have to react to. Of course, it is also possible to just subscribe to all events, simplifying the use case and reducing it to a topic-based subscription type. Table 3.2 shows an overview of the use cases the collision event type results in. The event rate is an estimate of how often a collision takes place in Tri6 on average. For this event type the use cases do not vary over game genres, as the pace of gameplay barely influences the rate of collisions. Of course the number of players and their concentration influences the rate, but the traces of Tri6, which is a fast-paced game, show one event per second is a good average estimate.

Use Case            Game genre  Publishers    Subscriptions  Payload   Events/sec
Simple Collision    all         all entities  topic-based    8 bytes   1
Collision with AoI  all         all entities  content-based  28 bytes  1

Table 3.2: Event-type characteristics for the collision use case

3.3.3 Chat

Chat, or more generally speaking communication, is just like movement a very basic action an actor performs in games. Actors may exchange information in text form in a variety of chat-channels, restricting the recipients of the messages. For example, in WoW there are chat-channels for trade, general chat etc. for each partition (called region in WoW). Moreover, chat-channels exist for different group sizes as well as for the private communication of two parties, in the following called whisper chat.
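The payload figures quoted for the movement and collision event types (24, 8 and 28 bytes) can be cross-checked with a short calculation, together with the minimum on-wire size once TCP and IPv6 headers are added. The attribute sizes (4-byte int, 4-byte float, 8-byte string) follow the assumptions in the text; middleware and serialization overhead are left out, as they depend on the concrete implementation.

```python
# Cross-check of the payload sizes and the minimum on-wire event size.
INT, FLOAT, STRING8 = 4, 4, 8          # assumed attribute sizes in bytes
TCP_HEADER, IPV6_HEADER = 20, 40       # transport overhead per event

movement = INT + 3 * FLOAT + STRING8             # 24 bytes (Listing 3.1)
simple_collision = 2 * INT                       # 8 bytes (ids only)
collision_aoi = 2 * INT + 3 * FLOAT + STRING8    # 28 bytes (Listing 3.2)

def on_wire(payload: int) -> int:
    return payload + TCP_HEADER + IPV6_HEADER

wire_sizes = {p: on_wire(p) for p in (movement, simple_collision, collision_aoi)}
```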


Chat is predestined for distributed processing via publish-subscribe. Each chat-channel may be represented by a multicast tree, either managed in a completely distributed fashion by P2P protocols, or with a dedicated root node. There is no need for centralized processing or security checks. Of course spam, denial-of-service or tapping foreign whisper chats may require security measures, but in this use case we assume these attack scenarios are either unlikely or prevented by other means like firewalling etc. Therefore, an actor just subscribes to the channel he wants to listen to and posts events to channels he wants to write on. An event type for chat may look like Listing 3.3. It contains a sender_id and a receiver_id representing the sending and receiving actor, and a message.

metadata:
  id: whisper_chat

payload:
  - sender_id:
      type: int
  - receiver_id:
      type: int
  - message:
      type: string

Listing 3.3: Schema of a whisper chat event-type
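The whisper-chat delivery rule can be sketched directly: every actor subscribes to the whisper_chat event type with the filter receiver_id == own id, so only the addressed actor's subscription matches. The broker-side matching loop below is an illustrative assumption, not the structure of any concrete middleware.

```python
# Sketch of content-based whisper-chat delivery: an event matches only
# the subscription whose filter equals the event's receiver_id.

def deliver_whisper(event: dict, subscriber_ids) -> list:
    """Return the subscribers whose content-based filter matches the event."""
    return [sid for sid in subscriber_ids
            if event["payload"]["receiver_id"] == sid]

event = {"type": "whisper_chat",
         "payload": {"sender_id": 1, "receiver_id": 3, "message": "gg"}}
recipients = deliver_whisper(event, subscriber_ids=[1, 2, 3, 4])
```

For partition chat the filter simply disappears: every subscriber of the partition's event type receives every event.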

All actors are subscribed to the whisper_chat event type with a filter containing their own id as receiver_id. Despite the simplicity of this use case, the event-type resulting from this scenario is interesting because of its different characteristics compared to the previous examples. If an actor whispers to another actor, only that actor's subscription matches and he receives the event. An event type for a partition-wide chat looks very similar to the one in Listing 3.3, except that receiver_id is not required. In this simple scenario, partition chat has its own event type per partition. Therefore, no filter is required for subscriptions.

Use Case        Game genre  Publishers  Subscriptions  max. Payload  Events/sec
Whisper Chat    all         all actors  content-based  263 bytes     0.1
Partition Chat  all         all actors  topic-based    259 bytes     0.1

Table 3.3: Event-type characteristics for the chat use case

Table 3.3 shows an overview of the characteristics of the resulting use cases. The use cases do



not distinguish between different game genres, as communication is independent of the genre. In both cases all actors act as publishers because they all potentially communicate. The whisper chat use case requires content-based subscriptions, as a subscriber is only interested in events addressed to him. In contrast, the partition chat use case is topic-based, because all subscribers should receive all messages published partition-wide. The payload sizes of both use cases result from the ids and a maximum message length of 255 bytes. The maximum number of characters in chat messages of existing systems varies widely. Twitter, for example, limits messages to 140 characters, while WoW limits an ingame message to 255 characters. TeamSpeak even uses 1024 characters as a limit for text chat messages. The event rate is an estimate of the average chat frequency per actor. Of course this rate varies, but for this initial use case modeling it should be sufficient. The intention is to give the reader a sense of the magnitude of the event rate.

3.3.4 Match Coordination

Besides event-types that directly influence the virtual world, a computer game also requires a number of so-called meta-events. These event-types coordinate game mechanisms like match coordination or the transfer of statistical information for leaderboards etc. For this use case a simple match coordination event type is described. Such events are distributed by the coordinator at the beginning of a match and after the time for a match has been exceeded. During a match, a time event with the remaining time is distributed each second. Each client receives these events and acts accordingly, either by starting a match round, updating its timer, or by finishing the round. The schema of this event type is rather simple. Listing 3.4 shows a field coordination_message that defines the type of coordination event sent. If a time event is sent, the time_left field is filled with the remaining seconds.

metadata:
  id: match_coordination

payload:
  - coordination_message:
      type: enum
      values: [BEGIN, FINISH, TIME]
  - time_left:
      type: int



Listing 3.4: Schema of the match coordination event-type

Table 3.4 shows the characteristics of the match coordination event-type. One publisher, the match coordinator, publishes events to all participating clients. Therefore, the subscription is type-based. The payload consists of two integers and is 8 bytes. The event frequency results from the update rate of one time event per second.

Use Case             Publishers   Subscriptions   max. Payload   Events/sec
Match Coordination   1            type-based      8 bytes        1

Table 3.4: Event-type characteristics for the match coordination use case
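A client-side handler for these events could be sketched as follows. The handler and the state layout are illustrative; only the field names and the enum values BEGIN, FINISH, and TIME come from the schema in Listing 3.4.

```python
# Sketch of a client dispatching match_coordination events: start a
# round on BEGIN, update the local timer on TIME, and stop on FINISH.

def handle_coordination(event, state):
    msg = event["coordination_message"]
    if msg == "BEGIN":
        state["running"] = True                    # start a match round
    elif msg == "TIME":
        state["time_left"] = event["time_left"]    # update the local timer
    elif msg == "FINISH":
        state["running"] = False                   # finish the match
    return state

state = {"running": False, "time_left": None}
state = handle_coordination({"coordination_message": "BEGIN"}, state)
state = handle_coordination({"coordination_message": "TIME", "time_left": 42}, state)
# state is now {"running": True, "time_left": 42}
```

Since the subscription is type-based, every client receives every coordination event and only the dispatch on coordination_message differs per event.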

3.4 Summary

This chapter gave an insight into the domain of MMVEs. Beginning with the larger picture, we discussed the architecture of such systems. Each service required for the operation of an MMVE was introduced with some examples from industry-scale architectures. First, some of the unique challenges arising when sizing cluster architectures for games were discussed, as well as the zoning and crowding problems of current architectures. This bird's-eye view concluded with a summary of certain characteristics that may be exploited for optimization. Second, a simple application model was described, starting from cluster architectures, to clarify the terminology used and the interaction of different application components. The model was applied to a simple multiplayer game and then exemplified by the informal description of Tri6, a fast-paced racing game. Based on this formal and informal notion of game development, some exemplary use cases required in games were defined and put into the context of publish-subscribe systems. Each use case defines a corresponding event type. For these event-types, basic characteristics are defined. These characteristics form basic parameters which are part of the type's semantics. These use cases with their parameters will serve as examples throughout the thesis. Step by step, their semantics will be extended accordingly. The intention behind the thorough discussion of this scenario was to give the reader a sense of how MMVEs work at large scale and to drill all the way down to simple use cases that can be used to illustrate the research throughout this work.


Part II State of the Art


This part gives an overview of the current state of research and the fundamental techniques related to the hypotheses addressed in this thesis (cf. section 1.3). Its structure is organized bottom-up, starting from physical networks, building up to the work done in the field of distributed event-based systems (DEBSs), and concluding with a bird's-eye view on the theoretical limitations of distributed systems like Massively Multiuser Virtual Environments (MMVEs). First, the concept of overlay networks, which provides an abstraction from physical networks, is introduced. The section details the different flavors of overlay networks and their major representatives. Because MMVEs are commonly modeled as event-based systems, an exhaustive overview of their components and important aspects is given in chapter 5. It contains the techniques and concepts influenced by the different fields of research that cope with event dissemination. The focus hereby lies on publish-subscribe systems, as the research in this field directly relates to the approach proposed in part III. With the understanding of the fundamental concepts and representatives of overlay networks and publish-subscribe, more advanced aspects of publish-subscribe are discussed. One aspect of interest is Quality-of-Service (QoS) in general and how publish-subscribe systems incorporate it. Finally, reconfiguration and adaptability of event-based systems, as a relatively new research aspect, is discussed. The chapter on event-based systems concludes with a taxonomy of the last two decades of publish-subscribe systems. Their respective capabilities regarding routing and filter aspects, QoS, and adaptability are considered and structured. As MMVEs have a distributed character, e.g. a client/server, grid, or peer-to-peer architecture, we briefly discuss theoretical foundations for distributed systems in chapter 6.
There, the CAP Theorem is detailed, describing the tradeoffs between consistency, availability, and resilience against network partitions. It can be seen as a motivation, on a theoretical level, for the approach proposed later. This exhaustive state-of-the-art discussion of overlay networks, DEBSs, and the CAP Theorem lays the foundation for the understanding and discussion of the central hypotheses in part III.


4 | Overlay Networks

Overlay networks form a logical network on top of the physical network they abstract from. More than one overlay network may exist on a single physical network. Their purpose is to provide a service not available in the underlying network. They are often used to implement new capabilities on top of the existing network stack. For example, IP can be seen as an overlay network on top of a layer 2 network (ISO-OSI layer 2, e.g. Ethernet or ATM) and provides routing of IP packets beyond the edges of a network consisting of layer 2 switches. Other popular examples are virtual private networks (VPNs), which build a private IP address space on top of the existing IP-based network. The applications of overlay networks are not restricted to routing or security, but span a variety of application domains like pure addressing, mobility, and multicast (cf. [CLBF06]). One application domain rose to public significance: content delivery networks (CDNs). They are used to distribute content to the end-user. Those overlays range from commercial providers like Akamai [1] to popular peer-to-peer protocols used for file sharing like BitTorrent [Coh03, IUKB04] or Gnutella [2]. As the scope of this thesis suggests, I will restrict my discussion to protocols relevant to building overlay networks for multicasting.

The term peer-to-peer is often used in the context of overlay networks. Steinmetz [SW05] defines such systems as follows:

"A peer-to-peer system is defined as a self-organizing system of equal, autonomous entities (peers) which aims for the shared usage of distributed resources in a networked environment avoiding central services."

[1] Akamai is one of the largest CDN providers world-wide. Akamai's Solar Sphere network consists of over 100,000 servers in over 2000 locations (cf. http://www.akamai.com/dl/brochures/sola_media_experience_product_brief.pdf, visited on 2013-07-25).
[2] Gnutella was one of the first decentralized peer-to-peer (P2P) protocols. The Gnutella Protocol Specification is available at http://web.archive.org/web/20090331221153/http://wiki.limewire.org/index.php?title=GDF



Therefore, each peer-to-peer system forms an overlay network but not vice versa, as not all overlay networks fulfill the definition of a peer-to-peer system. In this discussion solely peer-to-peer overlay networks are considered, because they scale well and avoid centralized servers, characteristics that physical networks inherently lack. Based on the structure of the overlay network, structured and unstructured overlay networks are differentiated [SW05, CPSL05]. Unstructured overlay networks organize peers in a random flat or hierarchical graph. Structured overlay networks assign keys to resources and structure the peer graph according to a function that maps those keys onto peers. Wehrle compares the different resource lookup concepts in [WGR05]. He distinguishes between central servers, flooding-based search as used in unstructured overlays, and distributed hash tables (DHTs), which are the foundation of structured overlay networks. The complexity depending on the number of nodes is depicted in Table 4.1. The column node state describes how the size of the required information on each node grows with the number of nodes N, and communication overhead the overhead resulting from a query. Fuzzy queries allow for imprecise resource lookup, which is not possible in networks using DHTs. Robustness addresses the resilience of the network against node churn.

Another differentiation is provided by Wang [WL03], who introduces different generations of overlay networks in the context of peer-to-peer overlays:

1st Generation: Unstructured peer-to-peer overlay networks like Gnutella and BitTorrent.
2nd Generation: Structured peer-to-peer overlay networks like Pastry [RD01], Chord [SMK+01], or CAN [RFH01].
3rd Generation: Structured peer-to-peer overlay networks with a focus on anonymity, authentication, and role diversity of particular nodes, like Low-Diameter [PRU03] or Butterfly [Dat02].

Concept           Node state   Communication overhead   Fuzzy queries   Robustness
Central server    O(N)         O(1)                     yes             no
Flooding search   O(1)         >= O(N^2)                yes             yes
DHT lookup        O(log N)     O(log N)                 no              yes

Table 4.1: Comparison of lookup concepts following [WGR05]
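To get a feeling for these orders of magnitude, the costs from Table 4.1 can be evaluated for a concrete overlay size. Hidden constants are omitted; the numbers are purely illustrative.

```python
# Illustrative evaluation of the lookup-concept costs from Table 4.1
# for an overlay of one million nodes.

import math

N = 1_000_000                  # nodes in the overlay

central_state = N              # central server: O(N) index entries
flooding_msgs = N ** 2         # flooding: >= O(N^2) messages in the worst case
dht_cost = math.log2(N)        # DHT: O(log N) hops and per-node routing state

# A DHT query touches on the order of 20 nodes, where worst-case
# flooding may generate on the order of 10^12 messages.
```

This gap is the main reason the remainder of the chapter focuses on DHT-based structured overlays.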



This rather historical classification is also coarse-grained and merely refines the initial differentiation. Therefore, we stick to the initial distinction between unstructured and structured overlay networks. In the following sections both manifestations are covered and exemplified by popular representatives.

4.1 Unstructured Overlays

Unstructured overlays are normally random flat graphs of interconnected nodes. In some overlays these graphs may be hierarchical, e.g. through the introduction of a super-peer layer as used by KaZaA/FastTrack [1]. A typical mode of operation in such an overlay is as follows: A client joins the network and requests a certain resource by querying the network. If one or more suitable peers are found that provide the requested resource, a direct connection is established in order to transmit the response. This response may be the transfer of a file or just an answer message.

Unstructured overlays may be further distinguished by their search paradigm [WL03, CPSL05]. They either use a central server for the lookup of the queried resource or flooding techniques to distribute queries. Centralized unstructured overlay networks like Napster employ a centralized server to which all peers connect in order to fulfill resource queries. This directory server returns the nodes containing the requested resource. These central nodes often become bottlenecks, depending on the number of queries. Moreover, they limit scalability and hinder the robustness of such protocols, as they constitute a single point of failure.

Decentralized unstructured overlay networks like Gnutella employ a flooding technique to propagate queries. This makes them very robust against node failures, but slows down the time between query and reply. Another aspect is the excessive bandwidth consumption of flooding queries throughout the network. Lv et al. [LCC+02] examine some improvements like random walks or expanding ring searches on different network topologies. Random walk searches use an algorithm in which a query "walks" from node to node. This reduces the network load, but increases search times significantly. Expanding ring searches flood a query with a small TTL [2]. If the resource is not found, the TTL is increased and the query is flooded again. This improves the average resource consumption of flooding. Nevertheless, search in unstructured overlay networks either requires a huge amount of time or increases the network's load significantly. Even if those aspects are put aside, the quality of the reply is not deterministic. It is possible that, due to TTL limitations, not all resources, or not the desired ones, are found. For example, in a file-sharing application only unpopular files that lie within the TTL range might be found. The actually desired file might never be found, because its node is outside the TTL range. Based on these characteristics, unstructured overlays are not suitable for building scalable and reliable publish-subscribe architectures. Therefore, an in-depth discussion and classification of different unstructured overlay protocols is omitted in this work.

[1] FastTrack was a proprietary P2P protocol used by the KaZaA client software. It was very popular in 2003. Protocol specifications are only partly available; its coarse functionality is described in [CPSL05].
[2] In this context, Time-To-Live means that a search query can only live for a certain number of hops. Afterwards the query packet is discarded.
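The expanding ring search just described can be sketched as follows. The graph, the resource placement, and all function names are made up for illustration; a real overlay would flood over live network connections instead of an in-memory adjacency list.

```python
# Sketch of expanding ring search: flood a query with a small TTL and
# retry with a larger TTL until the resource is found.

from collections import deque

GRAPH = {                      # a tiny unstructured overlay (undirected)
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
    "D": ["B", "C", "E"], "E": ["D"],
}
HOLDERS = {"E"}                # nodes providing the requested resource

def flood(start, ttl):
    """Breadth-first flooding limited to `ttl` hops; returns found holders."""
    seen, frontier, found = {start}, deque([(start, 0)]), set()
    while frontier:
        node, dist = frontier.popleft()
        if node in HOLDERS:
            found.add(node)
        if dist < ttl:                         # query dies after ttl hops
            for nb in GRAPH[node]:
                if nb not in seen:
                    seen.add(nb)
                    frontier.append((nb, dist + 1))
    return found

def expanding_ring(start, max_ttl=5):
    """Grow the ring until a holder is found or max_ttl is exhausted."""
    for ttl in range(1, max_ttl + 1):
        found = flood(start, ttl)
        if found:
            return ttl, found
    return None, set()

ttl, found = expanding_ring("A")   # the resource at E is 3 hops from A
```

The sketch also shows the non-determinism discussed above: with max_ttl below 3, the resource at E would never be found from A.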

4.2 Structured Overlays

Unstructured overlays with their flooding-based lookup concept do not scale for every application. Structured overlay networks address this problem and introduce another concept for resource lookup. In order to speed up resource lookups, a key space is defined; Pastry, for example, uses a 128-bit key space. On the one hand, each node that joins the overlay is assigned a random NodeID from this key space, e.g. calculated by a hash function [1] over the node's IP address. On the other hand, each resource is also assigned a key from the same key space. Such a resource may be a multicast group or a content object, depending on the service the overlay network should provide. Both the resource key and the NodeID can be used to implement search efficiently. The overlay network sorts the NodeIDs into a graph that maps each resource key to a node. Therefore, search is reduced to routing the query to the mapped node, which holds the resource itself or at least a pointer to the resource. Overlay algorithms are distinguished by the different structures such mapping graphs can attain. In the following we discuss three selected algorithms that gained some prominence over the last years: Pastry [RD01], Chord [SMK+01], and CAN [RFH01]. These representatives were selected because they are widely accepted and employed in industry and research.

[1] Of course the possibility of hash collisions exists, but their probability is negligible (in the range of 10^-18 using SHA-1) for sufficiently few hash operations, which is always true for hashing IPv4 addresses with a theoretical maximum of 2^32 addresses. For example, a collision attack on a full SHA-1 hash still has a complexity of about 2^69 hash operations [WYY05].
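The NodeID assignment just described can be sketched as follows. SHA-1 and the truncation to 128 bits are illustrative choices matching the Pastry example above; the function name is an assumption.

```python
# Sketch of deriving a NodeID from a node's IP address via a hash
# function, mapping nodes uniformly into the overlay's key space.

import hashlib

def node_id(ip: str, bits: int = 128) -> int:
    """Hash the IP and truncate the 160-bit SHA-1 digest to the key space."""
    digest = hashlib.sha1(ip.encode()).digest()            # 160-bit digest
    return int.from_bytes(digest, "big") >> (160 - bits)   # keep top `bits`

nid = node_id("10.0.0.5")          # deterministic 128-bit NodeID
```

Resource keys are produced the same way, e.g. by hashing a group name, so that keys and NodeIDs share one key space.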


In order to give an exhaustive comparison of different structured overlay algorithms, Lua et al. [CPSL05] and Götz et al. [GRW05] introduced relevant characteristics each overlay must fulfill. Table 4.2 shows an overview of these characteristics for selected structured overlay networks. We discuss each overlay algorithm in depth in sections 4.2.1-4.2.3, but first, each characteristic used in Table 4.2 is shortly introduced.

Decentralization describes how decentralized the system operates and whether centralized components are needed for operation. For example, Napster requires a centralized index for search, while Pastry employs a DHT to address nodes and index content.
Architecture describes the network's architecture, focusing on the structure of the graph the nodes are organized into. These range from random flat graphs over multidimensional spaces to rings.
Lookup protocol describes how the protocol that performs the lookup of a key works. This is mostly dependent on the architecture.
System parameters are the parameters relevant for performance and state sizes. N is usually the number of nodes in the overlay; all other parameters are specific to the algorithm.
Routing performance describes the performance of routing in terms of the hops required to reach the target key.
Routing state describes the size of the routing information kept on a node depending on the system parameters.
Node arrival and departure discusses how the system handles node churn and how expensive self-reorganization after a join is.
Security is an important aspect, but not discussed in this work. Most of the introduced algorithms do not consider security issues themselves. For the interested reader, Urdaneta et al. give a survey in [UPS11] on existing techniques for security in DHT-based overlay networks.
Reliability and fault resiliency describes how the system reacts to various faults, like node churn or message loss in the network.

These characteristics describe the behavior and the induced structure of the overlays as well as their expected performance and scalability sufficiently to get a notion of their inner workings.



Decentralization: CAN, Chord, and Pastry all provide DHT functionality at internet scale.
Architecture: CAN: multidimensional ID coordinate space. Chord: uni-directional and circular NodeID space. Pastry: forms a Plaxton-Mesh [PRR99] network.
Lookup protocol: CAN: {key, value} pairs are mapped to a point in the coordinate space using a uniform hash function. Chord: matching key and NodeID. Pastry: matching key and prefix in NodeID.
System parameters: CAN: N peers in the network and d dimensions in the coordinate space. Chord: N peers in the network. Pastry: N peers in the network and b bits (B = 2^b) used as the base of the chosen identifier.
Routing performance: CAN: O(d * N^(1/d)). Chord: O(log N). Pastry: O(log_B N).
Routing state: CAN: 2d. Chord: log N. Pastry: log_B N * (B - 1).
Node arrival and departure: CAN: O(d/2 * N^(1/d)) on arrival and O(2d) on departure. Chord: O(log^2 N) for both. Pastry: O(log_B N).
Security: low level in all three schemes; they suffer from man-in-the-middle and Trojan attacks.
Reliability and fault resiliency: in none of the three schemes will the failure of peers cause network-wide failure. CAN: multiple peers are responsible for each data item; on failures, the application retries. Chord: data is replicated on multiple consecutive peers; on failures, the application retries. Pastry: data is replicated across multiple peers, and multiple paths to each peer are tracked.

Table 4.2: Comparison of selected structured overlay schemes following [CPSL05, GRW05]



What remains is a discussion of the interface such an overlay should provide. Dabek et al. [DZD+03] introduced a common application programming interface (API) for key-based routing (KBR), which is the abstraction covering all structured overlay networks. Basically, an API for KBR consists of three functions:

Route(key, message, hint) forwards a message to its destination key. hint is an optional parameter giving a key for the first hop.
Forward(key, message, nextHop) is called on each hop a message is forwarded on, including the source and the destination node. nextHop is optional and indicates the next hop the message should be forwarded to.
Deliver(key, message) is called on the destination node for the key, delivers the message to the application, and therefore terminates the routing process.

[Figure: a route call on the source application, followed by forward upcalls on each intermediate application, terminated by a deliver upcall on the destination application.]

Figure 4.1: KBR routing scheme
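A minimal sketch of this upcall chain follows. The logging application and the precomputed hop list are illustrative simplifications standing in for a real overlay's routing decisions; only the forward/deliver upcall names follow the KBR API.

```python
# Sketch of the KBR upcall chain: route() starts the routing, forward()
# is upcalled on every hop, deliver() terminates on the key's owner.

class LoggingApp:
    """Records the upcalls it receives, mimicking an application."""
    def __init__(self):
        self.log = []

    def forward(self, key, message, next_hop):
        self.log.append(("forward", next_hop))

    def deliver(self, key, message):
        self.log.append(("deliver", key))

def route(key, message, hops, app):
    """Drive the upcall chain along a precomputed list of hops."""
    for hop in hops:
        app.forward(key, message, hop)   # upcall on every traversed node
    app.deliver(key, message)            # upcall on the destination node

app = LoggingApp()
route(12, "hello", hops=[58, 10, 13], app=app)
# app.log ends with ("deliver", 12)
```

In a real deployment the hop list is not precomputed; each overlay's routing algorithm (discussed next) determines the next hop on every node.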

The chain of invocation leads to the behavior shown in Figure 4.1. The sending application calls route on the source node. This leads to a chain of forward upcalls depending on the structure of the overlay network. The routing process is terminated by the deliver upcall on the destination node. Moreover, Dabek et al. define some auxiliary methods for routing state and neighborhood control that are not relevant for this discussion. With these general characteristics and the API proposal in mind, we discuss the structure of the three selected overlay algorithms in a concise way, illustrating the basic ideas behind them. The reader proficient with the routing algorithms of Chord, CAN, and Pastry may skip the following sections.

4.2.1 Chord

Chord [SMK+01] was developed in 2001 by Stoica et al. at the Massachusetts Institute of Technology (MIT). It is a simple algorithm, but nevertheless shows scalable routing characteristics (cf. Table 4.2). The used key-space consists of l bits, i.e. integers in the range [0, 2^l - 1]. This key-space forms a one-dimensional ring modulo 2^l with clockwise increasing keys. Each key that identifies a resource and each NodeID is chosen from this key-space. A node is responsible for all keys before or equal to its NodeID along the modulo ring. As a node only gets queries for keys larger than the NodeID of its predecessor on the ring, each node is responsible for a range of keys bounded by its own NodeID and the previous node on the ring.

The basic routing algorithm is based on the successor of a node. Basically, a message is forwarded to the neighbor node on the ring until the node responsible for the target key is found. In the worst case, this inefficient routing mechanism requires a number of messages equal to the number of nodes on the ring. Therefore, a finger table has been introduced, pointing to other nodes on the ring besides the successor node. The finger table has a maximum of l entries. The first row in the finger table is always the immediate successor of the table's node. The granularity increases by a factor of 2 with each row, leading to larger hops in the key-space.
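The finger-table construction and lookup can be sketched as follows. The NodeIDs match the Chord example of Figure 4.2; all function names are illustrative, and a real Chord node would of course only hold its own finger table rather than a global node list.

```python
# Sketch of Chord finger tables and iterative key lookup on a 6-bit ring.

L = 6                                   # bits in the identifier space
RING = 2 ** L                           # 64 possible keys
NODES = sorted([0, 4, 6, 10, 13, 23, 31, 39, 42, 58])

def successor(key):
    """First node clockwise at or after `key` (with wrap-around)."""
    for n in NODES:
        if n >= key:
            return n
    return NODES[0]

def finger_table(node):
    """Row i points to successor((NodeID + 2^i) mod 2^l)."""
    return [successor((node + 2 ** i) % RING) for i in range(L)]

def between_open(x, a, b):
    """x in the open ring interval (a, b)."""
    return a < x < b if a < b else x > a or x < b

def between_half_open(x, a, b):
    """x in the half-open ring interval (a, b]."""
    return a < x <= b if a < b else x > a or x <= b

def lookup(start, key):
    """Follow closest-preceding fingers until the key's owner is reached."""
    path, node = [start], start
    while True:
        succ = NODES[(NODES.index(node) + 1) % len(NODES)]
        if between_half_open(key, node, succ):
            path.append(succ)           # the successor owns the key
            return path
        for f in reversed(finger_table(node)):
            if between_open(f, node, key):
                node = f                # largest finger still preceding the key
                break
        else:
            return path                 # sketch: no closer finger known
        path.append(node)

table = finger_table(58)   # [0, 0, 0, 4, 10, 31], matching the figure
path = lookup(58, 12)      # [58, 10, 13]: via finger 10 to the owner 13
```

The lookup of key 12 starting at node 58 reproduces the walkthrough discussed with Figure 4.2: finger row 4 points to node 10, whose successor 13 is responsible for the key.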

[Figure: Chord ring over the 6-bit key-space with the nodes 0 (10.0.0.1), 4 (10.0.0.34), 6 (10.0.0.2), 10 (10.0.0.5), 13 (10.0.0.23), 23 (10.0.0.198), 31 (10.0.0.45), 39 (10.0.0.67), 42 (10.0.0.102), and 58 (10.0.0.98). For node 58 the finger table is shown (Idx 0-5 with TrgtIDs 59, 60, 62, 2, 10, 26 and successors 0, 0, 0, 4, 10, 31), together with the route of a message for key 12.]

Figure 4.2: Chord routing example (6-bit key-space)

Figure 4.2 shows a Chord ring with a 6-bit key-space. Each node has its IP address and an associated NodeID. For node 58 the finger table is also depicted. This finger table is used to reduce the required routing hops. It consists of an index Idx describing the exponent to the base 2. The resulting ID is depicted as TrgtID, which can be calculated as (NodeID + 2^Idx) mod 2^l. It is used to search for the next routing hop by selecting the closest ID to the message's ID. The message is then routed to the Succ in this row, which is the node responsible for this key. In our example a message is to be routed to key 12. On node 58 the lookup in the finger table results in a TrgtID of 10, which is therefore the next hop. On node 10 the key of the message lies between the node's own ID and its successor. Therefore, the final hop is found and the message is routed to node 13. The



structure of the finger table in conjunction with the simple routing algorithm results in an average of O(log N) routing hops for a ring with N participating nodes.

Node arrival in Chord originally requires the joining node to choose an arbitrary key k. A bootstrapping node already in the ring is queried for k. This results in the successor of k, which is then informed about the new node. Afterwards, the finger table is built by successively querying the keys required for the finger table. This leaves the predecessor of k and existing entries in finger tables with successor pointers that need updating. For this reason Chord employs a so-called stabilization protocol [GRW05], which is performed periodically. This protocol ensures the consistency of predecessor-successor pairs on the ring by comparing the entries in finger tables. Node failures are detected by timeouts when contacting nodes. Subsequently, the corresponding entries in the finger table are removed. If a node fails, all associated keys are taken over by the successor of the failed node. This behavior makes replication for robustness easy, as the successor poses an ideal replication target.

4.2.2 Pastry

Pastry [RD01] is one of the most popular DHT-based overlay networks. It was proposed by Microsoft Research in 2001 and is completely decentralized, scalable, and self-organizing. Tapestry [ZKJ01], developed at Berkeley in 2001, has similar characteristics. Both overlays are based on Plaxton-Meshes [PRR99], and therefore only one representative is discussed in this work. Each node in a Pastry network is identified by a unique 128-bit NodeID, assigned randomly when the node joins. NodeIDs are typically generated by a hash function over the nodes' IP addresses. When a message is to be routed to a certain key (which is from the same key-space as the NodeIDs), Pastry does so by routing the message to the NodeID numerically closest to the key. This is performed by prefix-based routing.

Therefore, a key is defined as a number of digits to the base 2^b. In each routing step, a node forwards the message to a node whose NodeID shares a prefix with the key that is at least one digit (or b bits) longer than the prefix the key shares with the current node's NodeID. If no such node is known, the message is forwarded to a node whose NodeID shares a prefix of equal length but is numerically closer to the target key [RD01]. To make these routing decisions, a Pastry node keeps track of a routing-table, a neighborhood-set, and a leaf-set. Figure 4.3 shows an exemplary routing state. The leaf-set contains the direct neighbors of node 34F1 (cf. Figure 4.4). Typically, the leaf-set has 2^b entries; it is composed of the 2^b/2 preceding and the 2^b/2 succeeding nodes. The routing-table consists of



[Figure: routing state of node 34F1, showing its leaf-set (smaller neighbors such as 23BC and 3523, larger neighbors such as 37AC) and its routing-table with levels 1-3 holding prefix entries (0x-Fx, 30x-38x, 340x-347x) and the IP addresses of the corresponding nodes.]

Figure 4.3: Pastry routing state example for node 34F1 (16-bit key-space, b=4)

different levels. Each level indicates the prefix shared between key and target NodeID. The table is sparsely filled, as not all prefixes can be found in NodeIDs actually assigned to nodes. On average, the routing-table contains about log_(2^b)(N) x (2^b - 1) entries for N nodes. The neighborhood-set (not depicted in Figure 4.3) typically contains the 2^b nodes closest to the local node. How close a node is, is determined by a proximity metric like IP routing hops or geographic distance. Pastry assumes the application provides such a distance metric. Figure 4.4 shows an exemplary routing path.

[Figure: a message for key 37FB routed along nodes sharing successively longer prefixes (3x, 37x, 37Fx) until node 37FC is reached via the leaf-set; depicted nodes include F3BC, C316, 23BC, 387C, 3523, 34F1, 37AC, 37A3, 37F4, and 37FC, each with its IP address.]

Figure 4.4: Pastry routing example (16-bit key-space, b=4)



In the example, a message is routed towards the key 37FB. In each routing step the local leaf-set is checked for the key. If a matching node is found, the message is delivered and the routing is finished. If not, the routing-table is consulted, the appropriate level for the shared prefix is chosen, and the message is sent to the IP found in the routing-table. If we use the table from Figure 4.3 to reenact the routing decision on node 34F1, we can see that the target key is not in the range of the leaf-set. Therefore, the routing-table is used. The common prefix is 3, so the second level of the routing-table applies, leading to the target IP 1.0.0.30 for the prefix 37. As a result, the next hop is node 37AC. This routing algorithm leads to log_(2^b)(N) expected routing steps, given that there were no recent node failures and the routing tables are accurate.

An overlay network has to cope with node churn as nodes arrive and leave. Pastry distinguishes between node arrival and node failure. Node departure is handled like a failure in Pastry, so only two cases have to be discussed. The join procedure needs a bootstrapping operation. A known node k is required that is located near the joining node n based on the proximity metric. For the initialization of node n, its routing state must be filled. The neighborhood-set of k is a reasonably good initial choice, as it is not located too far away. The routing- and leaf-sets require information from nodes close to n in the NodeID space. Therefore, a join message is routed from n via k to a key equal to n. Based on the routing rules, this message passes a number of nodes and finishes on the node numerically closest to n. The leaf-set of this particular node is suitable for n and thus is copied to n. Each hop on the route of the join message sends its routing-table to n. Combined with the routing-table of k, this leads to a sufficient initialization of node n's state. Afterwards, n sends its routing state to all nodes in its routing-table in order to enable them to update their routing information.
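The prefix-based next-hop rule can be sketched as follows. The NodeIDs follow the routing example for key 37FB; the per-node candidate sets and all helper names are assumptions, not Pastry's actual level/column routing-table layout.

```python
# Sketch of Pastry's next-hop decision (b = 4, i.e. hex digits): prefer a
# node with a strictly longer shared prefix; otherwise fall back to an
# equally long prefix that is numerically closer to the key.

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading hex digits two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(node: str, key: str, known: list[str]) -> str:
    own = shared_prefix_len(node, key)
    longer = [n for n in known if shared_prefix_len(n, key) > own]
    if longer:
        return max(longer, key=lambda n: shared_prefix_len(n, key))
    closer = [n for n in known
              if shared_prefix_len(n, key) == own
              and abs(int(n, 16) - int(key, 16)) < abs(int(node, 16) - int(key, 16))]
    return min(closer, key=lambda n: abs(int(n, 16) - int(key, 16))) if closer else node

# Reenacting the example: node 34F1 routes a message for key 37FB.
hop = next_hop("34F1", "37FB", ["23BC", "3523", "387C", "37AC"])  # -> "37AC"
```

From 37AC the message continues to a node with prefix 37F, and the final step to 37FC is covered by the equal-prefix, numerically-closer fallback, mirroring the leaf-set delivery of the example.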
Failures of nodes are detected by Pastry during routing, when it attempts to contact nodes in the routing-table or leaf-set. The neighborhood-set, since it is not involved in routing, is checked periodically. If an entry cannot be reached, it is replaced by querying neighbor nodes in the appropriate set for a replacement. Hence, joining and failing nodes only affect a relatively small number of nodes [GRW05].

4.2.3 CAN

The Content Addressable Network (CAN) [RFH01] was introduced by Ratnasamy et al. in 2001. CAN generalizes the approaches taken in Chord and Pastry. It introduces a d-dimensional identifier space. Each key is a tuple, like (x, y, z) for d = 3. The key-space for each dimension forms a ring, analogous to Chord. Geometrically, this leads to a d-dimensional torus for a CAN key-space. The key-space is partitioned among the



participating nodes. Each node is responsible for a certain interval along each dimension and maintains a neighbor set for routing purposes. The neighbor set contains nodes which have an overlapping interval in one dimension and are adjacent in all other dimensions. Figure 4.5 shows a simple two-dimensional key-space with 5 nodes. For simplicity reasons

[Figure: the 2-dimensional key-space from (0,0) to (63,63), divided into the zones ([0,31],[48,63]) of 10.0.0.119, ([32,63],[32,63]) of 10.0.0.34, ([0,15],[32,47]) of 10.0.0.15, ([16,31],[32,47]) of 10.0.0.2, and ([0,63],[0,31]) of 10.0.0.1; the neighbor set of 10.0.0.34 and the route of a message to (3,35) are highlighted.]

Figure 4.5: 2-dimensional CAN key-space with 6-bit keys along each dimension.

the key-space is depicted as a plane. Each node has an IP-address and an interval, it is responsible for. For node 10.0.0.34 the neighbor set is depicted. Routing in CAN is rather simple and uses solely the neighbor set. A message is always routed to the numerically closest node in the neighbor set until the local node is responsible for the target coordinates. In Figure 4.5 a message is to be routed to the coordinates (3,35). Using the neighbor set, it requires one intermediate hop to reach the destination in this 1 case. In general, for a d-dimensional space the number of hops grows as O(d(n d )) with n nodes in the overlay. Götz et al. [GRW05] describe the join process of an arriving node concisely as follows: The procedure for arriving nodes in CAN is divided in three consecutive steps: bootstrap by using a node already part of the CAN overlay; find the suitable zone for the new node; and update the neighbor lists, so the new node is considered in routing decisions. Ratnasamy et al. do not define a definite mechanism for bootstrapping in [RFH01], but suggest to use a dynamic DNS-based mechanism to infer the IP of a bootstrapping node. For a randomized key in the key-space, the join message of an arriving node n is sent to the bootstrapping node and from there regularly routed to its destination



node s. Node s splits its assigned zone in half and assigns one half to n. Both nodes, n and s, exchange neighborhood information, so that they learn of each other and n gets an initial neighbor list. Afterwards n immediately informs all neighbors of its presence. Moreover, via an update mechanism, neighborhood information is exchanged periodically between direct neighbors. Therefore, the join procedure only affects a small portion of nodes in the overlay. The number of affected nodes depends on the dimensionality of the CAN overlay, but stays constant with respect to the number of nodes in the overlay. Failed nodes are also detected through the update mechanism. A node is considered failed if it stops sending update messages. Nodes maintain timers for all neighbors and send takeover messages if a timer fires because of missing update messages. The election of the best-suited node to take over the zone of a failed node is done via different timer lengths depending on the size of the zone a node manages. Nodes with smaller zones have shorter timeouts and therefore send their takeover messages first. This ensures that nodes with smaller zones are preferred when taking over zones of failed nodes. After the takeover the zones are either merged or managed separately, depending on the zone ranges. Departing nodes notify neighbors about their departure. Neighbors that are merge candidates are thereby preferred. If no merge candidate exists, the neighbor with the smallest zone is selected.
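The greedy forwarding step of CAN can be sketched as follows. This is a hedged illustration, not the exact rules from [RFH01]: zones are axis-aligned intervals, and as a simplification the message is forwarded to the neighbor whose zone center is closest to the target coordinates.

```python
# Sketch of CAN-style greedy forwarding in a 2-d key-space (illustrative;
# the distance rule is a simplification of the original protocol).
import math

def in_zone(zone, point):
    """zone = ((x_lo, x_hi), (y_lo, y_hi)), inclusive bounds."""
    return all(lo <= p <= hi for (lo, hi), p in zip(zone, point))

def zone_center(zone):
    return tuple((lo + hi) / 2 for lo, hi in zone)

def next_hop(local_zone, neighbors, target):
    """Deliver locally if responsible, else forward to the neighbor
    whose zone center is closest to the target coordinates."""
    if in_zone(local_zone, target):
        return None  # local node is responsible: deliver
    return min(neighbors,
               key=lambda n: math.dist(zone_center(neighbors[n]), target))

# Zones from Figure 4.5; node 10.0.0.34 routes a message to (3, 35).
neighbors_of_34 = {
    "10.0.0.119": ((0, 31), (48, 63)),
    "10.0.0.2":   ((16, 31), (32, 47)),
    "10.0.0.1":   ((0, 63), (0, 31)),
}
print(next_hop(((32, 63), (32, 63)), neighbors_of_34, (3, 35)))  # 10.0.0.2
```

Node 10.0.0.2 is the intermediate hop from the worked example; from there the message reaches node 10.0.0.15, which is responsible for (3,35).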

4.3 Network Characteristics

In the previous sections we discussed the flavors of peer-to-peer overlay networks. Their performance can be described theoretically by the analysis of their computational complexity. The other possibility is measurement, either in real-world scenarios or by simulation. Overlay networks consist of hundreds or thousands of nodes, which makes it rather difficult to measure the characteristics of an algorithm in a real-world deployment on the internet. The deployment on large testbeds scattered over the internet, like the well


known PlanetLab¹, is one option but is usually the last step in the development cycle and requires significant effort. Calvert [CDZ97] suggests simulation as the appropriate method for initial measurement and development of large-scale distributed applications. But he also argues that, in order to get accurate measurements for networks at internet scale, their topology must be realistically modeled. Therefore, in the following sections we will not only discuss relevant metrics to evaluate overlay networks, but begin with models that generate realistic network topologies for different scenarios.

4.3.1 Network Topology Models

Network topology models are used to describe the topology of real-world networks. In this section only a short overview is given, with the focus on the simulation of internet topologies, following Xiao Fan Wang [XG03]. The interested reader finds an exhaustive discussion of complex networks by Cohen in [CH10]. The focus on internet-scale topologies is motivated by the scale of MMVEs, with their hundreds of thousands of participating nodes that arbitrarily form overlay networks of internet scale. But before models of the internet topology and other network topologies are further discussed, a short overview of the structure of the internet is given.

Internet Topology

Calvert [CDZ97] gives a brief summary of the structure of the internet on domain level: Figure 4.6 shows an exemplary simple topology of the internet. The internet consists of a large number of routing domains, also called autonomous systems (AS). All those domains are interconnected and form the internet. Each routing domain is a group of nodes (routers, switches, hosts) under a single administration that share routing information and policy. Routing domains can be classified either as stub domains or transit domains. A stub domain is only the origin or destination of traffic. A transit domain also carries traffic for which it is neither; its purpose is to efficiently interconnect stub domains.
Transit domains consist of backbone nodes that connect to a number of stub domains via their gateway nodes. Stub domains can be further distinguished into single- and multi-homed stubs. Multi-homed stub domains connect to more than one transit domain.

¹ PlanetLab is a global research network that supports the development of new network services. Since the beginning of 2003, more than 1,000 researchers at top academic institutions and industrial research labs have used PlanetLab to develop new technologies for distributed storage, network mapping, peer-to-peer systems, distributed hash tables, and query processing. (http://www.planet-lab.org, accessed on 2013-08-20)


Figure 4.6: Exemplary internet topology on domain level following [CDZ97]. [The figure shows transit domains interconnecting single- and multi-homed stub domains, including a direct stub-stub edge.]

Network Concepts and Models

The structure of the internet and other natural and artificial networks raises the question of their properties, how to classify them, and of a suitable way to simulate their behavior. In contemporary simulations two levels of abstraction are popular for the simulation of internet topologies: AS-level topologies and router-level topologies [LCGM05]. On the AS-level each node represents a domain and each link a connection between domains. On router-level each node models a router, e.g. in an ISP's network. In the last decades, according to [XG03], three interesting measurements emerged in network theory: average path length, clustering coefficient and degree distribution. These three properties of complex networks were discovered along with the emergence of today's most popular network models.

Average path length is a basic measurement to describe networks. It is based on the distance d_{i,j} between two nodes i and j, which counts the edges along the shortest path between them. The diameter D of a network is the maximal distance among all distances between any pair of nodes in the network. The average path length L of a network with n nodes is defined as the mean distance between two nodes, averaged over all pairs of nodes (cf. Equation 4.1) [XG03].

L = \frac{1}{n(n-1)} \sum_{i,j} d_{i,j} \qquad (4.1)
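For an unweighted graph, Equation 4.1 and the diameter D can be computed directly with breadth-first search. The following is a toy sketch, assuming a connected, undirected graph given as an adjacency dictionary; all names are illustrative.

```python
# Toy computation of average path length L (Equation 4.1) and diameter D
# via breadth-first search over an unweighted, connected, undirected graph.
from collections import deque

def distances_from(graph, src):
    """Hop distances from src to every reachable node (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def avg_path_length_and_diameter(graph):
    n = len(graph)
    total, diameter = 0, 0
    for src in graph:
        d = distances_from(graph, src)
        total += sum(d.values())          # distance to src itself is 0
        diameter = max(diameter, max(d.values()))
    return total / (n * (n - 1)), diameter

ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}  # 6-node ring
print(avg_path_length_and_diameter(ring))  # (1.8, 3)
```

For the 6-node ring each node sees distances 0, 1, 1, 2, 2, 3, which yields L = 54/30 = 1.8 and D = 3.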


Figure 4.7 shows different network models. Beginning from regular networks (a) and random networks (d), small-world networks (b) and scale-free networks (c) emerged. Regular and random networks pose the two extremes regarding the randomness of the edges between the nodes. However, it is an interesting fact that most real networks have a relatively short average path length, even if most of their nodes are not neighbors. This led to the concept of small-world networks. A small-world network is a network in which most nodes are not neighbors but which nevertheless shows a small average path length. The small-world property of a network is measured by its clustering coefficient C and its average path length L. Scale-free networks are ultra small-world networks¹ with the additional property that the distribution of the number of edges a node has follows a power law.

Figure 4.7: Illustrations of the different network models: (a) regular network, (b) small-world network, (c) scale-free network, (d) random network.

Simple networks with small-world properties can be constructed by the algorithm Watts and Strogatz suggested in [WS98]. The algorithm produces networks ranging from a regular network up to random networks, depending on a configuration parameter. This parameter also determines the clustering coefficient C of the resulting network, as defined in Equation 4.2. The clustering coefficient measures the degree to which the nodes in a network cluster together. It is defined as the average of the local clustering coefficients C_i, which are calculated for each node i. The local clustering coefficient is defined based on the neighborhood set E_i of a node i containing all edges to neighboring nodes. Each node i has k_i edges connecting it to k_i other nodes. All these nodes are the neighbors of node i. At most there can be k_i(k_i - 1)/2 edges between them (assuming an undirected graph).

¹ In [CH03], Cohen defines ultra small-world networks as networks in which the average path length L grows proportionally to log log N with N nodes in the network. In contrast, in regular small-world networks L grows proportionally to log N.


The number of actually existing edges |E_i| between the neighboring nodes divided by the number of possible edges between them defines the local clustering coefficient C_i (cf. Equation 4.3).

C = \frac{1}{n} \sum_{i=1}^{n} C_i \qquad (4.2)

C_i = \frac{2|E_i|}{k_i(k_i - 1)} \qquad (4.3)
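Equations 4.2 and 4.3 translate directly into code. The following sketch assumes an undirected graph as an adjacency dictionary with comparable (here: integer) node identifiers; the helper names are illustrative.

```python
# Sketch of the clustering coefficient from Equations 4.2/4.3 for an
# undirected graph given as an adjacency dict.

def local_clustering(graph, i):
    """C_i = 2|E_i| / (k_i (k_i - 1)); zero for degree < 2."""
    neighbors = graph[i]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # |E_i|: edges actually present between the neighbors of i
    links = sum(1 for a in neighbors for b in neighbors
                if a < b and b in graph[a])
    return 2 * links / (k * (k - 1))

def clustering_coefficient(graph):
    """C = average of all local clustering coefficients (Equation 4.2)."""
    return sum(local_clustering(graph, i) for i in graph) / len(graph)

# A triangle (0, 1, 2) with a tail node 3 attached to node 2:
triangle_plus_tail = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(clustering_coefficient(triangle_plus_tail))  # 7/12 ≈ 0.583
```

Nodes 0 and 1 have C_i = 1 (their two neighbors are connected), node 2 has C_i = 1/3, and the tail node contributes 0, giving C = 7/12.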

That means the more connections between the neighbors of a node i exist, the higher C_i becomes. The third interesting measurement to describe networks is the degree distribution. It is based on a simple characteristic of a node: its degree. The degree k_i of a node i is the number of edges connected to it. Intuitively, the more connections a node has, the more important it probably is. The distribution of node degrees over a network is denoted by a distribution function P(k), the probability that a randomly selected node is of degree k. The distribution function of a regular network shows a single spike (delta distribution), because all nodes are of the same degree. Randomness in the network broadens the spike, ultimately yielding a Poisson distribution for random networks [XG03]. But many empirical results have shown that most large-scale real networks deviate significantly from the Poisson distribution. A power law in the form of Equation 4.4 provides a better approximation [SFFF03].

P(k) \sim k^{-\gamma} \qquad (4.4)

The larger γ gets, the smaller the probability for a high-degree node in the network. Because these power laws are scale invariant, i.e. scale-free, networks following this characteristic are also often called scale-free networks. The introduced characteristics can be found in many real-world networks. Table 4.3 shows a few selected examples and their characteristics, and provides some insight into the structure of those large-scale networks. For example, the clustering coefficient of the domain-level internet suggests a higher global interconnectivity of domains than on router level. This makes sense, as domains only have a few edge routers and many internal routers.

4.3.2 Network Metrics

Besides topology-related metrics, the performance of overlay networks, and generally speaking of routing algorithms, is measured by a variety of metrics. Baumann et al. give


Network                  Size     C     L     γ
Internet, domain level   32711    0.24  3.56  2.1
Internet, router level   228298   0.03  9.51  2.1
Email                    56969    0.03  4.95  1.81

Table 4.3: Selected network characteristics following [XG03]

an exhaustive survey of common routing metrics in [BHSW07]. Lao et al. [LCGM05] and Fahmy [FK04, FK07] focus their work on metrics for multicast protocols. Following their work, the most popular and relevant metrics for the evaluation of multicast architectures are briefly described. The metrics refer to one link between two nodes, but it is also stated how they concatenate to paths of multiple hops.

Delay

Delay (also called latency or transmission time) is one of the central network metrics. It measures the time a message takes from the sender to the receiver. The minimization of the delay is a common optimization goal for distributed applications. Especially fast-paced MMVEs require a low latency throughout the network. The overall delay D_l between two nodes connected by a link l can be divided into phases. It is composed of the processing delays P_s and P_r and the queuing delays Q_s and Q_r, each for the sending node s and the receiving node r. With the propagation delay P and the transmission delay, the delay can be computed as shown in Equation 4.5. The propagation delay is the distance divided by the wave propagation speed. Hence, it is the time the head of the signal takes to reach its destination and depends on the used medium. The transmission delay is the size of the message b divided by the data-rate p of the link.

D_l = P_s + Q_s + P + \frac{b}{p} + Q_r + P_r \qquad (4.5)

Delay is measured as a unidirectional delay by sending a timestamp to the receiver, but this requires synchronized clocks. To circumvent synchronization, often the round trip time (RTT) is measured instead. RTT measures the time until a message reaches its destination and is sent back to the sender. In order to get stable latencies, an exponentially weighted moving average (EWMA) can be employed to flatten the variance of delay measurements [BHSW07]. The variance itself is also measured and is called jitter. It is defined as the variation around an average delay and is often calculated using the classical statistical variance. The concatenation of delay along a path is additive.
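The EWMA smoothing mentioned above can be sketched in a few lines. This is a minimal illustration in the style of TCP's smoothed-RTT estimator; the weight alpha and the sample values are free, made-up parameters.

```python
# Minimal sketch of RTT smoothing via an exponentially weighted moving
# average; alpha controls how strongly new samples influence the estimate.

def ewma(samples, alpha=0.125):
    """Fold a list of delay samples into one smoothed estimate."""
    estimate = samples[0]
    for s in samples[1:]:
        estimate = (1 - alpha) * estimate + alpha * s
    return estimate

rtts_ms = [100, 104, 98, 250, 102]   # one outlier, e.g. a queuing spike
print(round(ewma(rtts_ms), 1))       # 116.8 — the spike is damped
```

With alpha = 0.125 the single 250 ms outlier shifts the estimate only moderately, which is exactly the flattening effect described in the text.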


Throughput

Throughput (also called bandwidth, data-rate or capacity) defines the amount of data that can be transmitted over a link in a given time. The throughput can be measured on different layers. On the physical layer the nominal physical link capacity defines the theoretical maximum amount the link can support under ideal circumstances. Above, the IP-layer capacity and the transport-layer capacity are the corresponding metrics on the IP layer and transport layer. The throughput of a path is determined by the link with the minimal throughput along the path.

Packet Loss Ratio

The packet loss ratio defines the percentage of lost messages on a link. Such losses occur for example due to overloaded links where routers have to drop messages. This metric is essential for every unreliable protocol, e.g. for video and audio streaming. But also for reliable transport protocols, a high packet loss ratio increases the number of resends and as a result the usable throughput shrinks. The packet loss ratio can be measured similarly to the delay metric, either unidirectionally or by a roundtrip. But as routes and queuing may differ in each direction, unidirectional measurement is more advisable in this case. The packet loss ratio for a whole path is calculated as defined in Equation 4.6 [BHSW07].

PLR_{path} = 1 - \prod_{l \in path} (1 - PLR_l) \qquad (4.6)
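Equation 4.6 simply multiplies the per-link delivery probabilities, assuming losses on different links are independent. A one-line sketch:

```python
# Sketch of Equation 4.6: the loss ratio of a path is one minus the
# product of the per-link delivery probabilities (independence assumed).
import math

def path_loss_ratio(link_loss_ratios):
    return 1 - math.prod(1 - plr for plr in link_loss_ratios)

# Three links losing 1 %, 2 % and 5 % of their packets:
print(round(path_loss_ratio([0.01, 0.02, 0.05]), 4))  # 0.0783
```

Note that the path loss ratio is always at least as large as the worst single link, and grows with every additional lossy hop.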

Protocol Overhead

The protocol overhead measures the additional messages that are required for a certain protocol to operate correctly. In most protocols this overhead consists of handshakes, join and leave messages, tree restructuring etc. The overhead is usually measured by the number of messages. It is an important metric in order to evaluate the scalability of certain protocols [BB02].

Link Stress

Link stress indicates the number of identical copies sent over one link. Of course this metric only makes sense if an overlay protocol is evaluated. It measures how many overlay links are bundled on one underlay link in terms of messages. For example


IP-multicast has a link stress of one on all links, as there is no packet replication and the tree structure reflects the physical network topology [BB02].

Link Stretch

Link stretch is also a metric to measure the mapping of overlay networks onto an underlying network. It is defined per node n and describes the ratio of the path length from the source to the node n on the overlay to the length of the direct path from the source to n on the underlying network. Of course a direct unicast to all destinations has an average link stretch of one. An average link stretch larger than one means the average path length on the overlay is longer than the direct path, which in most cases results in a higher path delay.
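Both overlay metrics can be illustrated with a toy calculation; the topologies and paths below are made-up examples, not measurements.

```python
# Toy calculation of link stress and link stretch for overlay edges
# mapped onto underlay links (all example data is hypothetical).
from collections import Counter

def link_stress(overlay_paths):
    """overlay_paths: for each overlay edge, the underlay links (as node
    pairs) it traverses. Stress per underlay link = number of copies."""
    return Counter(link for path in overlay_paths for link in path)

def link_stretch(overlay_hops, direct_hops):
    """Ratio of overlay path length to direct underlay path length."""
    return overlay_hops / direct_hops

# Two overlay edges that both traverse the underlay link (A, B):
stress = link_stress([[("A", "B"), ("B", "C")],
                      [("A", "B"), ("B", "D")]])
print(stress[("A", "B")])                           # 2 copies over (A, B)
print(link_stretch(overlay_hops=4, direct_hops=2))  # stretch of 2.0
```

A stress of 2 on link (A, B) means the same message crosses that physical link twice; a stretch of 2.0 means the overlay route is twice as long as the direct underlay path.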


5 | Event-based Systems

The conventional mode of interaction in many existing distributed systems is request/response. But, despite the easy programming model request/response provides, this interaction mode leads to tightly coupled systems. Moreover, it only offers synchronous operation, which results in blocking behavior. In contrast, event-based computing takes a fundamentally different approach. Event-based systems¹ provide inherently decoupled system components. Mühl provides a concise overview of such systems in [MFP06]. Subsequently, the structure of this overview as well as its terminology is adopted and extended where necessary. Alternative introductions to event-based systems are given by Luckham in [Luc02] and by Etzion in [EN10]. Both Luckham and Etzion have a similar view on event processing, but with a different focus: they focus on the detection of complex derived events from an enterprise-application-based view, rather than a bottom-up, formal view like Mühl, who motivates event-based systems as an extension of event notification services, namely publish-subscribe systems. In his introductory book [Tar12], Tarkoma lays special focus on the design principles of publish-subscribe systems.

Figure 5.1 shows the basic components an event-based system consists of and how they interact. An event is any occurrence of interest that can be observed on a node participating in the system. The nature of these events may be physical, e.g. the detection of a car passing a camera, or temporal like the progression of time. Generally speaking, any status change in a physical or artificial system can result in an event. We only consider events that are detectable by computer systems, as event detection itself is out of scope of this thesis. If we talk about periodic events, we have to distinguish between type and

¹ The discussion about event-based systems inherently applies to publish-subscribe systems, as an event-based system's notification service is often realized by a publish-subscribe middleware (cf. Mühl [MFP06]).


Figure 5.1: Event-based system components following Mühl [MFP06]. [The figure shows producers publishing notifications to and consumers subscribing at a notification service via publish() and subscribe(); the notification service's communication implementation (RPC, multicast, gossip, pub/sub) exchanges messages on the network level.]

instance, which means for example "item pickup" is an event type and "player x picks up flower y at position z" is a corresponding event instance. Notifications are artificial representations of detected events that conform to the world model the application built as a simplification of the real or virtual world. They contain a description of the event, like its type, timestamp and attributes, and optionally some more information about the circumstances of the occurrence. They may also be encoded in different data models like XML, name/value pairs or objects. On the communication layer, we speak about messages. Messages are just containers on the network level transmitting the notifications between the different nodes. A message consists of a network header and a payload, which is the notification in this case. This differentiation clearly distinguishes the abstraction level of the current discussion: on the application layer we speak about events, on the layer of event-based systems about notifications, and on the communication layer about messages. The components of an event-based system act as producers and/or consumers of notifications¹. As producers detect an event, they decide whether to publish a notification of that event via a notification service or not. The decision is inherently made by the application logic of the producer. The publication of a notification is not addressed to a specific set of consumers; it is only published. The routing and delivery to the consumer is the responsibility of the notification service. Neither the producer knows which consumers receive a published notification, nor does the consumer know which producer

¹ On the level of publish-subscribe middleware systems, a producer is often referred to as a publisher and a consumer as a subscriber (cf. [EFGK03]). In this thesis both terms are used synonymously.

published or will publish a certain notification. A component may act as both a producer and a consumer. It reacts to notifications originating from producers as well as to observed events. Based on the logic of the component, these notifications may result in the publication of new notifications. The consumer states its interest in a certain kind of notification by a subscription. A filter, as a part of the subscription, selects which notifications are delivered to the consumer. Basically, a filter is a predicate evaluating to either true or false. Different kinds of filter mechanisms are discussed in Section 5.2. Advertisements are announcements from producers indicating the kind of notifications they will publish in the future. They are mostly required to guide routing decisions on network level. The expressiveness of subscriptions regarding filtering capabilities is determined by the combination of the filter model and data model and is called the notification classification scheme [MFP06]. The notification service itself is a mediator between the producer and the consumer. It is often implemented by a publish-subscribe middleware, but can also be realized by other techniques like message passing or remote procedure calls. All techniques provide different characteristics in terms of decoupling. We focus on publish-subscribe notification services: Eugster [EFGK03] describes a publish-subscribe notification service¹ as a mediating component that decouples producers from consumers in three dimensions.

Space decoupling: The producers do not know any consumers. The notifications are disseminated by the notification service without the participants knowing of each other.

Time decoupling: Producers and consumers do not need to participate in the notification process simultaneously. A producer may continue to publish notifications even if one or all consumers are unsubscribed.
Conversely, a consumer can get notified of an event even if the originating producer is currently offline.

Synchronization decoupling: The notification service provides non-blocking behavior. That means a producer is not blocked during the dissemination of a notification, and a consumer is asynchronously notified about events rather than being blocked while waiting for a notification.

This decoupling of producers and consumers renders a publish-subscribe notification service more flexible and scalable than classical interaction schemes or other implementations of notification services. In the remainder of this work it is always assumed that the notification service employs a publish-subscribe middleware. From an application perspective the notification service acts as a black box, defined only by the semantics of its API. According to Mühl [MFP06] and Eugster [EFGK03], the basic operations a publish-subscribe system should provide are:

publish: To distribute a notification, the producer calls publish and passes the notification to the notification service, which distributes it to all subscribed consumers.

notify: This method is called by the notification service on the consuming node in order to notify the application about the arrival of a new notification.

subscribe: If a consumer has interest in a certain kind of notification, he calls subscribe with a filter describing the kind of notifications he wishes to receive.

unsubscribe: The symmetric operation to subscribe, which releases an existing subscription to a certain kind of notification.

advertise: A producer can advertise the kind of notifications he will be publishing in the future. This knowledge can be used to adjust the routing of certain notifications in the notification service.

Pietzuch takes up this basic operation set and defines a procedural and XML-RPC API in [PEKS07]. He distinguishes between a core, which consists of the above methods, and some extensions for advertisement management. The aim is to provide a common API for publish-subscribe systems, as Dabek proposed for KBR in [DZD+03]. In the following sections the different aspects and components of event-based systems are discussed in detail. The focus lies on the notification service and how its communication implementation can be realized by a publish-subscribe system. That means we will not discuss complex event processing or stream processing systems, but how notifications are routed and filtered, and which QoS properties are relevant on this level.

¹ The notification service is called event service by Eugster in [EFGK03].
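The basic operation set can be sketched as a minimal, purely local notification service. This is an illustrative toy (all class and method names are hypothetical): a real notification service disseminates notifications across a distributed middleware, whereas here delivery is a local loop over subscriptions.

```python
# Minimal in-process sketch of the publish-subscribe interface described
# above: publish, notify (via a callback), subscribe, unsubscribe,
# advertise. Purely local and illustrative.
from typing import Callable, Dict, Tuple, List

Notification = dict
Filter = Callable[[Notification], bool]
Notify = Callable[[Notification], None]

class NotificationService:
    def __init__(self) -> None:
        self._subs: Dict[int, Tuple[Filter, Notify]] = {}
        self._next_id = 0
        self.advertisements: List[str] = []   # could guide routing decisions

    def subscribe(self, flt: Filter, notify: Notify) -> int:
        self._next_id += 1
        self._subs[self._next_id] = (flt, notify)
        return self._next_id                  # handle used for unsubscribe

    def unsubscribe(self, sub_id: int) -> None:
        self._subs.pop(sub_id, None)

    def advertise(self, description: str) -> None:
        self.advertisements.append(description)

    def publish(self, n: Notification) -> None:
        for flt, notify in self._subs.values():
            if flt(n):                        # deliver matching notifications only
                notify(n)

svc = NotificationService()
received: List[Notification] = []
sid = svc.subscribe(lambda n: n.get("type") == "position", received.append)
svc.publish({"type": "position", "x": 1, "y": 35})
svc.publish({"type": "chat", "text": "hi"})
print(len(received))  # 1 — only the position notification was delivered
```

Note how the producer never names a consumer and the consumer never names a producer: both only interact with the service, which is the space decoupling described above.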
Nevertheless, some advanced topics like adaptability and security are briefly sketched. Stream processing systems (SPSs) enhance the capability of simple event processing systems like publish-subscribe systems by the transformation of streamed data via a set of stream operators. Based on that set, they define continuous queries that transform the input data into a result stream. These queries are typically formulated in languages similar to SQL. Complex event processing (CEP) enhances simple event processing by the capability to generate and process complex events. Complex events may additionally


represent or respect correlations between events. Those correlations are often of spatial, temporal or even spatio-temporal nature. Moreover, CEP systems often provide rule engines that allow deducing complex events based on rules that are applied to input events. Both classes, SPSs and CEP systems, have aspects in the realization of their notification service that go beyond the scope of this thesis. Therefore, we will focus on an overview of publish-subscribe systems and stick to aspects that are relevant for their conceptual understanding. The in-depth description of different algorithms is limited to those relevant for the proposed framework, contributing to Hypothesis 3.

5.1 Data Models

Each notification contains data of interest for the consumer. Due to the decoupling of producer and consumer, a common data model¹ for the messages is required in order to exchange information. Generally, the header and the payload of a notification can be distinguished [EN10]. A header contains system-inherent attributes like timestamps, source or destination, while the payload consists of attributes specific to the notification type. In a simple publish-subscribe system the payload is just unstructured data with no semantic value during dissemination, for example a character array with only a header defining the topic of the notification. Of course this limits the possible filtering mechanisms (cf. Section 5.2). In current research prototypes and commercial systems three types of data models are popular: tuples, structured records and semistructured records [MFP06].

Tuples

In data models that use tuples, notifications are an ordered set of attributes. An example would be (1,35,4,"Nuremberg") for a 3-dimensional position with an additional region attribute. In most tuple-based approaches, subscriptions are formulated as templates with wildcards. For example (1,35,*,*) would match the above notification. In order to match subscriptions not only by positions, but also by attribute names, additional schema information like (x,y,z,region) is required. Such an extension leads to structured records, if the schema information is transferred over the wire as part of the message.
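Tuple-based template matching is a simple positional comparison. A minimal sketch, using '*' as the wildcard symbol (the helper name and wildcard choice are illustrative):

```python
# Sketch of tuple-based template matching with wildcards, using the
# position example from the text.

WILDCARD = "*"

def matches(template, notification):
    """A template matches iff every non-wildcard position is equal."""
    return len(template) == len(notification) and all(
        t == WILDCARD or t == v for t, v in zip(template, notification))

print(matches((1, 35, WILDCARD, WILDCARD), (1, 35, 4, "Nuremberg")))  # True
print(matches((1, 36, WILDCARD, WILDCARD), (1, 35, 4, "Nuremberg")))  # False
```

The comparison is purely positional; this is exactly why, without external schema information, a subscriber cannot refer to attributes by name.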

¹ The data model is also called event model by Rozsnyai in [RSS07] or event definition by Etzion in [EN10].


However, a simple tuple-based model is more efficient than a model using structured records, at the cost of flexibility in the expression of subscriptions. In order to gain the same flexibility as structured records, the schema of each notification type must be known on all nodes in the system. A tuple-based data model can be found in JEDI [CDF98], where a notification is a tuple of strings. Bates et al. [BBMS98] employ an object-oriented design: they use classes to define the schema of a notification, and each notification is an instance of such a class.

Structured Records

Notifications based on structured records provide more flexibility, as they inherently contain a named schema for all attributes. They consist of a set of name/value pairs, where each name uniquely identifies an attribute. The notification about a position event would be formulated as (x=1, y=35, region="Nuremberg"). Many systems like SIENA [CRW01], Gryphon [BCSS99], REBECA [Mü02], and JMS [Ora02] follow this data model. Mühl argues in [MFP06] that at first glance there is not much difference between tuples and records. But records are more powerful because of their inherent schema. They allow for attributes which do not have to be part of a filter: e.g. x > 5 ∧ region = "Nuremberg" is a valid filter even if the y and z attributes are missing. This also affects extensibility, because new attributes may easily be added without affecting existing filters or requiring an update of the respective schema on all nodes in the system. Formally, a notification e using structured records is a nonempty set of attributes {a_1, ..., a_n}, where each a_i is a name/value pair (n_i, v_i) with name n_i and value v_i [MFP06]. A position notification is written as follows: {(x,1),(y,35),(region,Nuremberg)}.

Semistructured Records

Semistructured records do not adhere to a strict schema like structured records. However, they still contain certain markers or tags that provide some sort of semantic value.
Generally speaking, semistructured data often has a treelike structure or is self-describing, because the schema is part of the data itself [Bun97]. For example, a well-formed XML document can be used to define a semistructured record as a data model. This broadens the naming of attributes to paths, because attribute names can occur more than once in a given tree structure. Despite the interesting research challenges regarding publish-subscribe systems that use a semistructured data model, the size of messages in relation to the transported information is large, at least for XML-based models.


As a consequence, such data models use the available throughput inefficiently, which is too expensive in high-performance scenarios like MMVEs. Therefore, semistructured records are not considered a suitable data model in the remainder of this thesis.

5.2 Filter Mechanisms

The capability to express interest in a certain subset of notifications in an event-based system is realized via a filter mechanism. A consumer states his interest in the form of a filter predicate during the subscription process and only receives notifications that satisfy the filter predicate. Filters have a twofold benefit: on the one hand they reduce the number of notifications a consumer receives and has to process; on the other hand they may be exploited to optimize message routing and to conserve network resources. Formally¹, a filter is a predicate F(e) that takes a notification e as argument. It matches a notification e if F(e) returns true. The set M(F) of all matching notifications for a filter F is defined as {e | F(e) = true}. For simple filtering a matching algorithm is sufficient for implementation. But to optimize routing decisions and to reduce the required messages, additional operations on filters are needed. Overlapping, covering, and merging operations enable this optimization potential. Two filters F_1 and F_2 overlap if M(F_1) ∩ M(F_2) ≠ ∅. A filter F_1 covers F_2 if ∀e_i ∈ M(F_2): e_i ∈ M(F_1). Merging of filters F_1 and F_2 is defined as ∃F′: M(F′) = M(F_1) ∪ M(F_2). Current systems show a variety of filtering mechanisms, ranging from simple models that only provide different channels, inspired by message-oriented middleware, to sophisticated mechanisms like content-based filtering. In the remainder of this section the major classes of mechanisms are briefly described. Content-based filters are covered in more detail, as they are the dominating filter strategy in current systems.

5.2.1 Channels

Channels are the simplest filter model. They originate in message-oriented middleware systems and provide the basic abstraction for most messaging solutions [HW03]. Basically, a channel is used to bundle certain notifications [EFGK03].
Each producer that publishes on a channel reaches all consumers subscribed to the same channel. Channels may be distinguished by some form of unique identifier, e.g. a name or a number. Observed from a technical point of view, a channel can be seen as a logical instance of the event processing pipeline.

5.2.2 Topic-Based Filter

Topic-based [EFGK03] or subject-based filters [MFP06] have been used in many early commercial messaging solutions like TIBCO Rendezvous (http://www.tibco.de). Topic-based systems allow for hierarchical topic filtering based on a simple string comparison. For example, a position notification can be published under /Germany/Bavaria/Nuremberg to categorize the region the position lies in. Interested consumers can subscribe to /Germany/Bavaria/* to receive all notifications of positions in Bavaria. Obviously, the strict hierarchical structure of the topics limits the expressiveness of filter predicates, but matching filter predicates is very efficient. To cope with more complex requirements, other filter models that offer more flexibility have been suggested.

5.2.3 Content-Based Filter

Content-based filters [EFGK03, MFP06] provide a more flexible filter model. Such a model does not impose an external hierarchy to classify different notification types. It rather takes the content of a notification into account in order to fulfill different subscriptions, enhancing the expressiveness of the filter language to support a higher selectivity. For example, a filter expression for a position event could look like region = "Nuremberg" ∧ x > 5.

Filter expressions exist in a variety of formats, mainly depending on the chosen data model, their expressiveness, and performance considerations. According to Eugster [EFGK03], the major representations in current systems are strings, template objects, and executable code. String representations are the most common representation of filter expressions. They conform to a certain language grammar, e.g. SQL (standardized by the ISO/IEC JTC 1/SC 32 committee), XPath [BBC+10], or a proprietary language. Template objects define templates that may contain wildcards.
Notifications are then matched against these templates and must conform to them (apart from the wildcard positions) to pass the filter. Executable code filters are objects that represent predicates in such a way that they can be executed at runtime with a notification as parameter and return the result of the matching operation.
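The executable-code representation can be made concrete with a small sketch. The following Python fragment is illustrative only — the helper names and the dict-shaped notification model are assumptions, not taken from any system discussed here. It builds filter predicates F(e) as callables and evaluates the running example region = "Nuremberg" ∧ x > 5:

```python
# Sketch of executable-code content-based filters: a filter is a
# predicate F(e) over a structured record, modeled here as a dict.
# All names are hypothetical.

def make_attribute_filter(attr, op, const):
    """Build a simple filter (a single attribute constraint)."""
    ops = {
        "=": lambda v: v == const,
        "<": lambda v: v < const,
        ">": lambda v: v > const,
    }
    test = ops[op]
    # Notifications lacking the attribute never match.
    return lambda e: attr in e and test(e[attr])

def conjunction(*filters):
    """Compound (conjunctive) filter: all constraints must hold."""
    return lambda e: all(f(e) for f in filters)

# F(e) for the running example: region = "Nuremberg" AND x > 5
f = conjunction(
    make_attribute_filter("region", "=", "Nuremberg"),
    make_attribute_filter("x", ">", 5),
)

print(f({"region": "Nuremberg", "x": 6, "y": 1}))  # True
print(f({"region": "Erlangen", "x": 6}))           # False
```

A string or template representation would ultimately be compiled into comparable predicate objects before matching.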



A content-based filter expression is usually a boolean expression [Mü02]. A filter can consist of one or more predicates. A simple filter or constraint is a filter expression that only contains a single predicate. A constraint itself follows a certain structure and depends on the chosen data model. For structured records, a simple filter is an attribute filter like region = "Nuremberg", consisting of an attribute name, an operator, and a constant. Semistructured records require more complex constraints allowing path expressions (cf. Mühl [MFP06]). A compound filter is a combination of simple filters with the help of boolean operators. If the combination is only a conjunction of simple filters, it is called a conjunctive filter. This distinction is important as some content-based filter models only allow conjunctive filters, e.g. REBECA [PGS+10] or Hermes [PB02]. In [BH07], Bittner argues against this restriction and proposes a filter model that also allows general boolean expressions.

In [Bit08], he also provides a two-dimensional classification schema for content-based filter algorithms based on the index structures they maintain for efficient matching. It differentiates between predicate indexing and subscription indexing. For predicate indexing, two categories are identified:

No Predicate Indexing (NP): Algorithms without predicate indexing do not maintain any index data structures on predicates in order to determine a match of predicates on the attribute values of incoming notifications.

One-dimensional Predicate Indexing (OP): Approaches supporting OP construct indexes for predicates on an attribute level. That means each predicate that contains a certain attribute is included in the corresponding index for that attribute.
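A one-dimensional predicate index of the OP kind can be sketched as one hash table per attribute. The following Python fragment is an illustrative assumption (restricted to equality predicates; all names are hypothetical), not the data structure of any concrete system:

```python
from collections import defaultdict

# Sketch of one-dimensional predicate indexing (OP) for equality
# predicates: one hash table per attribute, keyed by the constant,
# holding the identifiers of all predicates with that constraint.

class EqualityPredicateIndex:
    def __init__(self):
        # attribute -> {constant -> set of predicate ids}
        self.index = defaultdict(lambda: defaultdict(set))

    def add(self, pred_id, attr, const):
        self.index[attr][const].add(pred_id)

    def match(self, notification):
        """Return the ids of all predicates fulfilled by the notification."""
        hits = set()
        for attr, value in notification.items():
            hits |= self.index[attr].get(value, set())
        return hits

idx = EqualityPredicateIndex()
idx.add("p1", "region", "Nuremberg")
idx.add("p2", "region", "Erlangen")
idx.add("p3", "x", 5)

print(sorted(idx.match({"region": "Nuremberg", "x": 5})))  # ['p1', 'p3']
```

Inequality predicates would need interval structures (e.g. sorted lists or trees) per attribute instead of plain hash tables.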
Regarding subscription indexing, two classes are distinguished:

Individual Subscription Indexing (IS): IS algorithms maintain individual index structures for each subscription with the aim to efficiently match against incoming notifications.

Shared Subscription Indexing (SS): In order to exploit similarities between subscriptions, SS approaches maintain shared index structures. All subscription filters are added to a common index structure, which compacts the data structures required for matching.

These categories result in four classes of content-based filter algorithms. Bittner [Bit08] discusses all four classes with their respective representatives. They differ in the range of their applicability because of large variations in their memory consumption and complexity. He concludes that, on the one hand, only basic filter algorithms for general boolean subscriptions exist and, on the other hand, conjunctive filter algorithms are tailored to different application scenarios.

Three exemplary filter algorithms are discussed in the following, one for each class except for OP-SS. According to Bittner, only one algorithm (cf. Jacobsen [LHJ05]) classifies as OP-SS. It shares the same deficiencies as the NP-SS algorithm by Campailla et al. [CCC+01]. In order to contribute to Hypothesis 3, only different behavior and performance characteristics are essential. As no OP-SS algorithm with fundamentally different characteristics is available, this class is omitted in this work.

NP-IS Algorithms

The so-called "brute force" [MFP06, Bit08] algorithm was introduced for ELVIN [SAB+00], a content-based event notification system. It is an example of an NP-IS algorithm: it maintains no predicate indices and only individual indices on the subscriptions that merely provide access to the different subscriptions. On the one hand, this algorithm does not impose any restrictions on the filter expressions; on the other hand, it also does not apply any optimization technique. It merely matches all subscription filters sequentially against the attributes of an incoming notification. It provides a kind of baseline against which all optimizations can be evaluated.

NP-SS Algorithms

For the class of NP-SS algorithms, three major algorithms were proposed. Gough and Smith [GS95] and Aguilera et al. [ASS+99] suggest tree-based matching algorithms, whereby the algorithm in [ASS+99] is an advancement of the algorithm proposed by Gough and Smith. The third algorithm is proposed by Campailla et al. [CCC+01] and uses an ordered binary decision diagram.
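Before turning to the shared-index approaches in detail, the brute-force NP-IS strategy described above can be summarized in a few lines: a sequential scan over all subscription filters. The following Python sketch is purely illustrative (dict-shaped notifications and all names are assumptions, not ELVIN's actual implementation):

```python
# Minimal sketch of the brute-force (NP-IS) strategy: no predicate
# index, just a sequential scan matching every subscription filter
# against the incoming notification.

subscriptions = {
    "s1": lambda e: e.get("x", 0) < 20 and e.get("y", 0) < 20,
    "s2": lambda e: e.get("y", 0) > 40 and e.get("region") == "Nuremberg",
}

def brute_force_match(notification):
    """Return the ids of all subscriptions whose filter matches."""
    return [sid for sid, f in subscriptions.items() if f(notification)]

print(brute_force_match({"x": 5, "y": 10, "region": "Erlangen"}))  # ['s1']
```

Its runtime grows linearly with the number of subscriptions, which is exactly what the indexed classes below try to avoid.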
Because Campailla et al. did not provide a method for inserting new subscriptions into the shared index, and because the algorithm shows an exponential memory consumption in the worst case [Bit08], we discuss the approach of Aguilera in more detail. The tree-based algorithm of Aguilera et al. [ASS+99] is limited to conjunctions of attribute filters. A tree is constructed to build a shared index for the subscriptions. Aguilera's approach is most efficient for equality predicates, because the decision tree can be traversed faster in this case. Generally, all operations are supported, but the size of the tree and its efficiency degrade to the algorithm of Gough and Smith if operations other than equality are indexed. Despite these disadvantages, it is suitable for certain application scenarios and is used in popular systems, e.g. Gryphon [BCSS99]. The tree itself has to be built in a preparation step in order to be able to match notifications. To illustrate the general applicability, figure 5.2 shows an exemplary tree with two conjunctive subscription filters F1 = {x < 20 ∧ y < 20} and F2 = {y > 40 ∧ region = Nuremberg} that do not only contain equality predicates.

Figure 5.2: Decision tree example with two filters

Generally speaking, inner nodes consist of an attribute name and an operation. Leaf nodes represent filters. Edges between nodes are constants, except for the "don't care" edge, denoted with a *. The combination of one node and an edge results in an attribute filter. Hence, a matching operation is a depth-first search for a path that terminates in a leaf.

OP-IS Algorithms

In the class of algorithms using one-dimensional predicate indexing with individual subscription indexes, three important algorithms can be named: the clustering algorithm by Fabret et al. [FJL+01], the counting algorithm originally suggested by Yan and Garcia-Molina [YGM94], and the algorithm for general boolean expressions by Bittner [BH07]. The first two are limited to conjunctive filter expressions. To give an example of another matching mechanism, one that supports general boolean expressions, the latter algorithm introduced by Bittner is detailed in the following.

Figure 5.3 gives an overview of how the algorithm for general boolean expressions operates. The algorithm can be divided into two steps: a preprocessing step in which all required index structures are generated, and a matching step in which the actual matching operation takes place.

Figure 5.3: Outline of the matching algorithm for general boolean expressions [Bit08]

In the preprocessing step, two major data structures are built. First, one-dimensional predicate indices are generated. These indices can, for example, be simple hash tables, one for each attribute/operator pair, with the constants as keys and the predicate identifiers as values. For a detailed discussion of the generation of predicate indices and a detailed description of the general boolean expression algorithm, refer to [Bit08]. Second, as a subscription may contain more than one predicate, the minimal number of predicates required for the fulfillment of each subscription is counted and saved in the minimum predicate count vector. Moreover, some additional structures for translation are generated: a predicate-subscription association table that maps each predicate to the set of subscriptions it is part of, and a subscription location table that maps unique identifiers to the actual subscription representations. Some more structures are required during the matching process, but they are generated on the fly and are specific to each matching operation.

The actual matching operation starts with predicate matching. The notification is matched against the predicate indices, which results in a fulfilled predicate vector. With the help of the predicate-subscription association table, this vector can be translated into the hit vector that contains the count of matched predicates per subscription. The hit vector is subsequently compared to the minimum predicate count vector. All subscriptions with a hit count greater than or equal to the minimum predicate count are candidate subscriptions. Each candidate subscription is then individually matched against the initial notification. According to Bittner [Bit08], the runtime and memory requirements are roughly equal to those of the counting algorithm for conjunctive filters, but are not restricted to this filter type.

Merging and Covering

The previous algorithms tackled the challenge of matching subscriptions against incoming notifications. Besides this central problem, additional steps can be taken that optimize routing decisions and thereby reduce matching operations. The basic idea is to reduce the subscription filter tables and/or advertisement filter tables that influence routing decisions. Two optimization operations can be performed: merging and covering tests of filters. On the one hand, if a filter is covered by another one, it can basically be ignored for routing decisions [Mü02]. Merging, on the other hand, is a more complex optimization. Like covering, it reduces the number of routing table entries. But merging can be performed either perfectly, if there is set-equality for the resulting filter, or imperfectly, if only a proper superset can be found [Bit08]. The implications and drawbacks of covering- and merging-based optimization, as well as of content-based filtering in general, are thoroughly discussed in recent literature. The interested reader is advised to refer to Bittner [Bit08] or Mühl [MFP06] for further information.

5.2.4 Type-Based Filter

Type-based filters [EFGK03, MFP06, Eug07] extend the expressiveness of the topic-based scheme by using the type of a notification instead of a topic. The idea is to provide a better integration into the application's programming model.
There is no requirement for artificial hierarchies or topics: the type is the topic. As a consequence, the content of a notification is represented by the state of an object of the corresponding type. In [Eug07], Eugster identifies essential characteristics for type-based publish-subscribe systems and therefore inherently for the realization of type-based filters. He states five requirements:

Encapsulation preservation: Notifications are instances of types, and their implementation details should not be used for system decisions.


Type safety: The system should ensure correct typing locally as well as remotely. Errors should be detected at compile time.

Application-defined notifications: Notification types are to be designed as a part of the application with minimal design constraints.

Open content filters: Filter expressions should be usable for routing and optimization decisions, not only for filtering purposes.

Event semantics: To ensure QoS requirements, basic semantics should be expressible.

Based on these requirements, a filter expression for a position notification object n would look like this (written as application code): n.getRegion() == "Nuremberg". To use such an expression for routing decisions, some open form like region == "Nuremberg" is required, in which the attribute's name is interpretable by the system. To achieve that, at least some form of introspection for a type is required in order to match against attribute names, like the Java Reflection API, which allows access to the members and methods of a class via their names at runtime. In addition, the same challenges and requirements as introduced for content-based filters apply, because matching against the members of a type is inherently content-based filtering on a certain data model. However, more complex types, e.g. with inheritance hierarchies, require even more complex solutions like language extensions, as proposed by Eugster with JavaPS in [Eug07].

5.2.5 Advanced Filter Concepts

In recent research, two popular advanced filter concepts emerged: spatial and concept-based filters. Both of them strive to enhance content-based filtering by adding semantic value. Spatial [CCR03] or location-based filters [EGH05] raise the spatial context of an event to a first-class property of the event-based system in order to optimize location-based matching and routing. Context-aware filter models [CMM09] generalize this concept to arbitrary contexts.

A generic approach to adding semantic value to filter mechanisms are concept-based filters [CABB04] (also called ontology-based filters [WJL04] or semantic-aware filters [PB12, WJLS04, PBJ03]). They employ a simple hierarchy [PBJ03] or an ontology [PB12, WJLS04] to model semantic relationships between the attributes of notifications, building a common vocabulary of synonyms or convertible attributes. These ontologies are used to enhance the matching operation: filters do not only match values based on attribute names, but
A generic approach to add semantic value to filter mechanisms are concept-based filters [CABB04] (also called ontology-based filters [WJL04] or semantic-aware filters [PB12, WJLS04, PBJ03]). They employ a simple hierarchy [PBJ03] or ontology [PB12, WJLS04] to model semantic relationships between attributes of notifications, building a common vocabulary of synonyms or convertible attributes. These ontologies are used to enhance the matching operation. Filter do not only match values based on attribute names, but 1

90

The Java Reflection API is part of the Java Language Specification and allows access to members and methods of a class via their names at runtime.

5.3 Routing

also consider semantic relationships between attributes. For example, if a subscription filters position events for region == Bavaria, a concept-based filter would also match notifications with Nuremberg as the region value, as long as the concept "Nuremberg lies in Bavaria" is part of the ontology. The semantic value added by these approaches can be used not only for filter decisions but also for general routing decisions, as Preuveneers [PB12] shows for energy-restricted mobile ad-hoc networks, for example.
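The ontology-assisted match from the example above can be sketched in a few lines. This toy Python fragment is an illustrative assumption (a flat `lies_in` relation rather than a full ontology language):

```python
# Sketch of a concept-based match: a toy ontology records "lies in"
# relationships, and the matcher accepts a notification whose region
# is (transitively) contained in the subscribed region.

lies_in = {"Nuremberg": "Bavaria", "Erlangen": "Bavaria", "Bavaria": "Germany"}

def region_matches(subscribed, actual):
    """True if `actual` equals or transitively lies in `subscribed`."""
    while actual is not None:
        if actual == subscribed:
            return True
        actual = lies_in.get(actual)  # climb one level in the hierarchy
    return False

print(region_matches("Bavaria", "Nuremberg"))  # True
print(region_matches("Bavaria", "Hamburg"))    # False
```

Real concept-based systems express such relations in ontology languages and may also handle synonyms and unit conversions.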

5.3 Routing

Routing poses one of the central challenges in the design of a notification service for a DEBS. The basic task of routing is to answer the question which notification has to be delivered to which node in the network in order to satisfy all subscriptions. On each node a routing decision has to be made: the node decides to which nodes a certain notification is forwarded. The decision is based on local knowledge, which consists of a routing table and the notification itself. A routing table contains at least node addresses, but can be enhanced with meta-information like subscriptions, attribute ranges, etc.

Martins [MD10] provides a survey on the last decade of routing algorithms with a special focus on content-based publish-subscribe. This survey is especially noteworthy as it considers not only proposals from the event-based systems community, but also from the networking community. As content-based publish-subscribe poses the most complex routing challenge, all other types of filter mechanisms can be reduced to content-based algorithms, and the routing algorithms discussed here are therefore also applicable to all other kinds of publish-subscribe systems. Martins distinguishes five categories of approaches to the routing problem:

Centralized matching and on-demand multicast (CM-Routing): These algorithms require publishing nodes to know all current subscriptions in the system. Based on this global knowledge, a suitable multicast tree is calculated for the dissemination of each notification, e.g. as proposed in [BCM+99]. The most naive example for this category is a client/server system with a star topology. Despite their different manifestations, all algorithms in this category show limited scalability, because the required global knowledge results in complex replication challenges and a high memory footprint.


Usage of a bounded number of multicast groups (BM-Routing): Subscriptions are clustered into a limited number of multicast groups. Algorithms in this class are distinguished by how the clusters are found. For an in-depth analysis of different cluster algorithms, refer to Riabov [RWY02].

Learning by the reverse path (RP-Routing): All algorithms in this category rely on one principle: they learn routing information from the reverse path. Conceptually, routing tables are built reversely, beginning from the subscriber towards the publisher. Hence, subscriptions are flooded throughout the network. Each node receiving a subscription gathers information about the link it was received from. Routing tables are built by associating links and subscriptions accordingly. Notifications are disseminated via the neighbors with matching subscriptions in the routing table.

Mapping of notification subspaces to key spaces (KS-Routing): Nodes are organized in structured overlay networks (cf. chapter 4) introducing an abstract key space. Subscriptions as well as notifications are mapped to this key space. A notification is routed to the node responsible for the notification's key. Ultimately, the notification reaches all nodes responsible for interested subscriptions.

Semantic neighborhood (SN-Routing): These algorithms employ fully meshed networks that allow direct connections among the nodes. The connections are made based on a semantic distance metric. For example, this metric can be a spatial distance in an MMVE, as used for the construction of VON [HCC06].
We subsume the first three categories under broker-based routing [MFP06], as this term describes the basic structure of the overlay substrate used. They are discussed in section 5.3.1. Some algorithms based on reverse path learning employ tree structures which are similar to the data structures of algorithms that use key space mapping. We categorize these algorithms under the term hierarchical and rendezvous-based routing [MFP06] and discuss them in section 5.3.2. Algorithms that exploit special semantic properties of the application domain are discussed in section 5.3.3, conforming to the semantic neighborhood category of Martins' classification.

5.3.1 Broker-Based Routing

Broker-based routing comprises all routing algorithms adhering to a common system model. Figure 5.4 illustrates the basic structure of a broker-based routing topology.


It consists of a set of brokers B. These brokers are interconnected by an acyclic graph and form a peer-to-peer topology, meaning all brokers are equal. The acyclic graph introduces a restriction that may result in a bottleneck inside the broker overlay; there has been some work on how to cope with cyclic broker overlays, where the basic idea is to calculate spanning trees for the broker network used for the distribution of subscriptions [MD10]. Nevertheless, broker-based routing is the fundamental routing mechanism for many research projects like Gryphon [BCSS99], Jedi [CDF98], Padres [JCL+10], REBECA [PGS+10], or Siena [Car98]. In addition to inter-broker connections, each broker may have links to publishers and/or subscribers. Publishers and subscribers do not take part in any routing tasks; they are merely producers or consumers of notifications. Of course, figure 5.4 only shows a logical overlay, and a physical node acting as a publisher may also subscribe to the same network.

Figure 5.4: Structure of broker-based routing topologies

The two central tasks for all broker-based architectures are the routing of notifications and the propagation of subscriptions. The latter defines how routing tables are populated. Hereby, in addition to the basic publish-subscribe API (publish, subscribe, unsubscribe, notify, advertise), two additional operations have to be defined: forward and admin [MFP06]. These two operations with their respective message types are sent and received between brokers. The forward operation routes notifications through the broker network according to the sending broker's routing table. The admin operation distributes incoming subscriptions in order to populate the routing tables throughout the network. Figure 5.5 depicts the routing operations in a broker-based system. The event type is a two-dimensional position event with additional region information.


Figure 5.5: Routing operations in broker-based routing topologies (example events e1 = {(x,6),(y,1),(region,Erlangen)} and e2 = {(x,2),(y,4),(region,Nuremberg)})

Two published events, e1 and e2, and a subscription s are depicted. The subscription s is added to the routing table of B1 and forwarded to B2 via an admin message. For each event received at B1, the routing table is consulted; in the case of e1, for example, no direct subscriber is interested, but somewhere in the network an interested subscriber exists, so the notification is forwarded to B2. This example illustrates the basic interaction of the different operations in a broker-based network.

With this routing framework in mind, the remainder of this chapter discusses basic routing mechanisms limited to RP-Routing and roughly sketches some optimizations. CM-Routing algorithms have had little impact on recent architectures, as they only show limited scalability, and are therefore omitted. For an in-depth discussion of RP-Routing algorithms in the context of broker-based routing and its different optimization approaches, the interested reader is referred to Mühl [Mü02, MFP06]. For further reading on BM-Routing, Riabov [RWY02] is recommended.

Flooding and Simple Routing

The most naive solution to routing in a broker-based network is flooding [MFP06] or event forwarding [Bit08]. Flooding algorithms, as their name suggests, flood notifications through the whole broker network. A broker that receives a notification from a local publisher via a publish operation forwards the notification to all neighboring brokers. Since the broker network adheres to an acyclic graph, no duplicate messages are processed by the brokers. The matching of filter expressions only takes place for local subscribers, right before calling a notify operation. Hence, subscriptions are also only maintained locally for all local subscribers. No content-based optimization is performed in order to reduce the number of messages or to shrink the routing tables. This implies that no admin operations are required in this scenario.

To prevent the flooding of notifications, a network can employ a simple routing [MFP06] or subscription forwarding [Bit08] algorithm. Here, every incoming subscription is not only processed by its local broker but also forwarded to all neighboring brokers, and so on. So basically, instead of flooding notifications, subscriptions are flooded throughout the broker network by the use of admin operations. Each broker adds a pair of the incoming subscription and the sending broker to its local routing table. Based on these entries, notifications are routed to those brokers with matching subscriptions. Unsubscribe messages have to be handled analogously: they are also flooded throughout the broker network in order to clean up the routing tables. In conclusion, this algorithm leads to global knowledge about all subscriptions on each broker. This is unnecessary for a correct routing decision and unfavorable regarding many performance metrics, e.g. routing table size and the number of messages required for subscribe and unsubscribe operations.

Table 5.1: Routing algorithms and their use-cases [MFP06]

Flooding: Easy to implement, subscriptions become effective immediately, but has worst-case notification forwarding overhead.
Simple: Significantly reduces notification forwarding overhead if subscriptions and clients are sparsely distributed. Routing table sizes grow linearly with the number of subscriptions. Every routing table is affected by a new or canceled subscription.
Identity-based: Reduces routing table sizes and filter forwarding overhead if the set of subscriptions contains a lot of identical entries; may degenerate to simple routing otherwise. Identity test must be efficiently computable.
Covering-based: Efficient for interval-like subscriptions. May degenerate to identity-based routing if subscriptions do not cover each other. Covering test must be efficiently computable.
Perfect Merging: Reduces routing table sizes if subscriptions can often be merged perfectly; may degenerate to covering-based routing if not. May increase the filter forwarding overhead.
Imperfect Merging: Allows users to trade accuracy against efficiency. Degenerates to flooding if too much imperfection is tolerated.
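The interplay of admin-based subscription forwarding and notification routing in simple routing can be sketched in a strongly simplified model. The following Python fragment is an illustrative assumption (filters as predicates over dicts, no unsubscribe handling, synchronous message delivery), not the implementation of any system named above:

```python
# Sketch of simple routing (subscription forwarding) in an acyclic
# broker network: subscriptions are flooded via admin messages, and
# notifications are forwarded only towards matching routing entries.

class Broker:
    def __init__(self, name):
        self.name = name
        self.neighbors = []   # neighboring brokers (acyclic graph)
        self.routing = []     # (filter, destination) pairs
        self.delivered = []   # notifications delivered to local subscribers

    def admin(self, fltr, sender):
        """Record an incoming subscription and flood it onwards."""
        self.routing.append((fltr, sender))
        for n in self.neighbors:
            if n is not sender:
                n.admin(fltr, self)

    def subscribe(self, fltr):
        """Local subscription: store it, then flood admin messages."""
        self.routing.append((fltr, self))
        for n in self.neighbors:
            n.admin(fltr, self)

    def publish(self, e, sender=None):
        """Forward a notification towards all matching routing entries."""
        targets = {dest for fltr, dest in self.routing if fltr(e)}
        if self in targets:
            self.delivered.append(e)   # notify local subscribers
        for dest in targets - {self}:
            if dest is not sender:
                dest.publish(e, self)

# Linear broker network B1 - B2; subscriber at B2, publisher at B1.
b1, b2 = Broker("B1"), Broker("B2")
b1.neighbors, b2.neighbors = [b2], [b1]

b2.subscribe(lambda e: e.get("region") == "Erlangen")
b1.publish({"x": 6, "y": 1, "region": "Erlangen"})
print(len(b2.delivered))  # 1
```

A non-matching event would not leave B1 at all, which is exactly the saving over pure flooding.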


Routing Optimizations

In order to cope with the limited scalability of the basic routing algorithms introduced above, many improvements have been suggested. Three categories of enhancements have been identified: the employment of advertisements, the compression of routing tables, and the clustering of subscriptions [MD10].

Advertisements, first introduced by Carzaniga et al. [CRW01], can be used to prevent the flooding of subscriptions throughout the broker network. Basically, a publisher advertises his intent to publish and gives information about the content he will be publishing. These advertisements are used to test for overlapping subscriptions before forwarding. The advantage is that subscriptions are not propagated to subnets where no matching notifications are published. In contrast, the propagation of advertisements and the update of subscriptions after new advertisements require time and produce overhead. It depends on the application scenario whether advertisements are a suitable optimization.

Another optimization approach is the management of routing tables. In order to reduce their size and eliminate some messages, identity-based, covering-based, or merging-based routing can be employed. They all have in common that they use the respective comparative operation to eliminate redundant routing table entries. Identity-based routing eliminates identical redundant entries. Covering-based routing performs covering tests on subscriptions in order to eliminate entries that are covered by others. Finally, merging-based routing tries to merge filters in the broker network. Mühl gives an overview of the respective use-cases for the different optimizations in [MFP06] (cf. table 5.1).

Clustering of subscriptions, i.e. BM-Routing, is another optimization technique besides the approaches already introduced. The basic idea is to aggregate subscriptions into a manageable number of multicast groups. Matching notifications are afterwards disseminated via a multicast mechanism, optimally IP multicast. Riabov [RWY02] surveys the major techniques used. Most of them originate in knowledge discovery, like k-means clustering, and build on the work done on Gryphon [BCSS99]. Clustering techniques do not strictly require broker-based network topologies, but may also employ tree-like structures as described in the following section.

5.3.2 Hierarchical and Rendezvous-Based Routing

Hierarchical and rendezvous-based routing algorithms assume another overlay topology. They do not organize brokers in a peer-to-peer fashion, but assume tree-like structures.


Figure 5.6: Tree-like overlay structure

Figure 5.6 shows such a tree structure. The difference to peer-to-peer-based broker networks is the existence of a root node of the tree. All notifications are routed stepwise towards the root node and spread downwards into each subtree on each level. Subscribe and unsubscribe messages are also only forwarded towards the root node. JEDI [CDF98], for example, employs such a routing algorithm.

Rendezvous-based algorithms employ a similar structure as hierarchical routing, but with the difference that the trees are constructed on a structured overlay network. One or more event types are grouped and assigned to a rendezvous node. This rendezvous node acts as a meeting point for the notifications and subscriptions. It is also the root of the dissemination tree and has a unique key in the overlay network. This key is calculated from the multicast group, for example by a hash function. Hence, notifications and subscriptions are only routed towards this key using the structured overlay. The routing algorithm ensures the distribution and the maintenance of the dissemination tree. Examples of systems that employ rendezvous-based routing are Hermes [PB02], Bayeux [ZZJ+01], and Scribe [RKCD01].

This structure of the routing overlay has the advantage that it avoids flooding of any kind. But if the fluctuation of subscriptions is very high, the maintenance of the tree structure can outweigh this benefit. Moreover, the rendezvous node can become a bottleneck, because all notifications are routed through it. This load can be mitigated by stepwise routing towards the root, as employed by JEDI [CDF98], but at the cost of latency. Scribe, in contrast, routes notifications directly to the rendezvous node.
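The key computation for rendezvous nodes can be sketched as follows. This Python fragment is purely illustrative — the key space size, the node identifiers, and the ring-distance rule are assumptions, not the actual scheme of Scribe or Hermes:

```python
import hashlib

# Sketch of rendezvous-based event grouping: the rendezvous node for
# an event type (multicast group) is the overlay node whose identifier
# is closest to the hash of the group's name on a key ring.

KEY_SPACE = 2 ** 16  # illustrative key-space size

def key_for(group):
    """Hash a group name into the overlay key space."""
    return int.from_bytes(hashlib.sha1(group.encode()).digest(), "big") % KEY_SPACE

def rendezvous_node(group, node_ids):
    """Pick the node whose id has the smallest ring distance to the key."""
    k = key_for(group)
    ring_dist = lambda n: min((n - k) % KEY_SPACE, (k - n) % KEY_SPACE)
    return min(node_ids, key=ring_dist)

nodes = [0, 16384, 32768, 49152]
print(rendezvous_node("position-events", nodes) in nodes)  # True
```

Notifications and subscriptions for the group are then routed towards this key, and the responsible node roots the dissemination tree.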


5 Event-based Systems

Routing optimizations that aim to reduce the number of notifications routed through the tree can be adapted directly from broker-based routing as introduced in section 5.3.1. For a thorough discussion of the required adaptations, the interested reader is referred to [MFP06]. To further improve scalability, the load on rendezvous or root nodes can be balanced. Martins [MD10] discusses different partition schemes, ranging from clustering as described in [RWY02] to the most extreme one, partitioning per publisher as in [TBF+03]. Moreover, tree-like routing structures reduce the size of the routing tables, because nodes only have to maintain subscriptions for their respective subtrees. Hence, the matching process in each node gains a speedup compared to peer-to-peer based structures. As a consequence, hybrid structures that use a peer-to-peer overlay to route between rendezvous/root nodes can be an interesting combination, as suggested by Carzaniga [Car98]. In conclusion, as long as the number of notifications exceeds the number of subscriptions by far, tree-based structures have advantages over peer-to-peer structures in many application scenarios [MD10].

5.3.3 Semantic Routing Concepts

In the previous sections the routing overlays were constructed merely by considering network topologies and exploiting simple similarities between subscriptions. Routing decisions did not consider any semantic relationships between subscriptions. However, as already discussed in section 5.2.5, semantic properties can provide additional optimization potential. If the relationships between subscriptions are known, overlay structures driven by application knowledge can be constructed. Martins [MD10] surveys algorithms based on a fully meshed network that employ affinity and distance functions. These functions calculate the semantic relationship between subscriptions in order to construct optimal multicast trees for the distribution of notifications.
The network community that deals with gaming infrastructures calls these mechanisms interest management [SZ99]. Boulanger [BKV06] and Liu [LBC12] give an overview of the different kinds of interest management mechanisms. An intuitive example are position notifications in a computer game: a dedicated multicast tree is created for each player, with all nearby players as subscribers. The structure is based on the distance between these players. The farther a player is away from the publisher, the deeper it sits in the tree and the longer a notification takes to reach it.
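Such a distance-based dissemination tree can be sketched as follows. This is an illustrative Python fragment with hypothetical names, not the construction of any particular system: it simply places closer players higher in the tree, bounded by a per-node fanout:

```python
import math

def build_position_tree(publisher, players, fanout=2):
    """Fill a dissemination tree level by level in order of distance to
    the publisher: the farther away a player is, the deeper it sits and
    the later it receives a position notification."""
    ordered = sorted((p for p in players if p is not publisher),
                     key=lambda p: math.dist(publisher["pos"], p["pos"]))
    tree = {}                         # parent name -> list of child names
    open_slots = [publisher["name"]]  # nodes that can still accept children
    for p in ordered:
        parent = open_slots[0]
        tree.setdefault(parent, []).append(p["name"])
        open_slots.append(p["name"])
        if len(tree[parent]) == fanout:
            open_slots.pop(0)
    return tree

players = [
    {"name": "A", "pos": (0, 0)},   # publisher
    {"name": "B", "pos": (1, 0)},
    {"name": "C", "pos": (2, 0)},
    {"name": "D", "pos": (3, 0)},
]
tree = build_position_tree(players[0], players)
```

With fanout 2, the two nearest players B and C become children of the publisher A, while the more distant D is forwarded to via B.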


In the context of MMVE architectures, specialized overlay structures have been suggested. For example, VON [HCC06], Nomad [RS07], and VoroGame [BAC09] create semantic overlay structures based on Voronoi diagrams that triangulate the space of the virtual world. Bharambe's Donnybrook [BDL+08] introduces a special affinity function based on the attention of players. Generally speaking, such semantic overlays employ interest management to calculate the semantic relationships between subscriptions, which lead to multicast trees based on application knowledge. Yahyavi [YK13] surveys such semantic overlay substrates with their respective interest management mechanisms, whilst Liu [LBC12] and Mauve [Mau00] focus more on the consistency of the applications built on top of the routing substrate. However, as these architectures are very domain-specific and not generally applicable, we will not discuss them any further; the interested reader is referred to the cited literature.

5.3.4 Application-Layer Multicast

In the previous sections we discussed many variants of the basic idea of how to build one-to-many dissemination structures, focusing on tree-like structures in section 5.3.2. Besides the event-based community, the network community has studied such tree-based overlay structures extensively in the context of multicasting. The most efficient way to multicast is to use IP-Multicast [DC90], which ensures that each message is sent exactly once over each physical link. However, multicast on the IP layer requires the support of the network infrastructure, especially of routers, and this support is not comprehensively available throughout the internet. Most internet service providers (ISPs) allow IP multicasting in their internal networks, but not across autonomous system (AS) borders. Hence, the internet is a group of IP-Multicast islands, interconnected by unicast links.
In order to cope with these shortcomings, application-layer multicast (ALM) has been developed. ALM builds multicast infrastructures on the application layer and forms an overlay using the underlying unicast links. The different approaches to ALM have been surveyed by Banerjee [BB02] with a focus on their computational complexity. The design choices for ALM protocols and their consequences are thoroughly classified by Hosseini et al. in [HASG07]. We follow this classification and sketch selected popular algorithms and their respective advantages.


ALM algorithms can be classified according to a variety of characteristics; a selection of the most important ones is discussed here1.

Application Domain

The application domain is crucial for the selection of the appropriate ALM algorithm. It defines the number of nodes participating in one group, the number of producing nodes, as well as the metrics that are the target of optimization. For example, a multicast tree that optimizes for latency has other requirements than a tree that favors throughput or scalability. In [HASG07] four different application domains are distinguished.

Media Streaming: Streaming applications usually have one producer that distributes the media stream and a large number of consumers. Throughput is more important than latency in this application domain. Examples are live streaming of videos or video-on-demand services.

Audio/Video Conferencing: This kind of application has more than one producer, as a conference usually has more than one participant. The group size, however, is small in most cases. The setting is interaction-driven; hence, latency and throughput are equally important in this scenario.

Generic Multicast Service: Algorithms in this category do not focus on a certain application domain but try to provide solutions applicable to most scenarios. They may not be the best solution in some application domains but aim to be equally good in all of them.

Reliable Data Dissemination: To distribute data, e.g. large files or databases, reliability is of the essence. Therefore, the only relevant metric for this application scenario is throughput.

Obviously, these application domains all have different requirements for a multicast algorithm. These requirements are reflected in the QoS metrics that measure their applicability; we will discuss QoS in more detail in section 5.5. Moreover, the optimization of algorithms for different application domains affects the design of the algorithms themselves.
Two of those design aspects are introduced in the following: group management and routing mechanism.

1 For the full range of characteristics, refer to [HASG07].


Group Management

Group management addresses the organization of a multicast tree. Hosseini [HASG07] identifies five aspects that distinguish algorithms regarding the management of their trees; we will discuss the three most important ones in the following. Group management encompasses the mechanism by which new members join the group, as well as the procedure for leaving. Moreover, the question of the structure of the overlay has to be answered. For simplicity, we omit the discussion of certain optimizations like tree refinement or support for IP-Multicast islands.


Figure 5.7: Mesh-first vs. tree-first group management

Mesh-first vs. Tree-first: Two different approaches exist to build dissemination topologies: tree-first and mesh-first (cf. figure 5.7). Mesh-first approaches explicitly build a mesh of all nodes that belong to the group. Based on this mesh, a routing algorithm builds a routing tree from the publisher to all subscribers, similar to some broker-based approaches, e.g. simple routing. Of course, a tree can only be as good as the underlying mesh. In contrast, tree-first algorithms explicitly construct the dissemination tree, as exemplified in figure 5.7. The tree has to be maintained constantly, e.g. rebalanced and recovered after node failures. In conclusion, mesh-first approaches are usually more robust and more suitable for multi-source applications, but at the cost of a higher control overhead.

Source-specific Tree vs. Shared Tree: The construction of the tree itself can follow one of two premises: source-specific or shared. Source-specific trees place the root of the tree at the node of the producer. Therefore, they are especially suitable for applications like media streaming. In contrast, shared trees aim for a cost-minimal shared tree of all nodes. This kind of tree is preferable for multi-source applications like video conferencing. Each approach addresses one of two conflicting goals: minimize the path length to an individual node, or minimize the average path length over all destinations.


Distributed vs. Centralized: Large-scale applications usually strive for distributed approaches; hence, the distributed management of groups suggests itself there. But small- to medium-scale applications may favor a centralized approach due to performance gains or simply because of reduced programming complexity. Therefore, both design rationales still have their respective fields of application.

Routing Mechanism

Fundamentally, a routing mechanism provides a heuristic solution to a problem from graph theory: for a given node topology (defined by group management decisions) and per-node constraints, a certain metric (e.g. throughput or delay) should be optimized. The existing solutions to this problem can be classified with respect to their optimization objectives. Hosseini [HASG07] distinguishes four classes of routing mechanisms: shortest path, minimum spanning tree, clustering structure, and peer-to-peer.

Shortest Path Tree: Shortest path trees (SPTs) minimize the path costs from the producer to each receiver. Their construction can be reduced to the single-source shortest path problem (cf. Cormen [CLRS09]). Costs can be expressed as an arbitrary QoS metric; however, delay, measured as RTT, is one of the most popular. Such trees are commonly used in ALM algorithms like SpreadIt [DBGM02] and Yoid [Fra00].

Minimum Spanning Tree: Minimum spanning trees (MSTs) minimize the overall cost of connecting all members of a given node topology. They are constructed based on algorithms for the all-pairs shortest path problem (cf. Cormen [CLRS09]). MSTs are used in ALM to connect all members at minimal cost, which is favorable for multi-source applications, and they are commonly found in centralized ALM approaches.

Clustering Structure: These routing mechanisms organize nodes in clusters in order to construct trees.
Algorithms like NICE [BBK02] organize the clusters themselves into hierarchical structures with “cluster heads” that communicate with the layers above on behalf of whole clusters. These structures favor faster join procedures and smaller control overhead over the construction of perfect trees.

Peer-to-peer: Peer-to-peer routing mechanisms are based on structured overlay networks (cf. section 4.2). They employ forward-path or reverse-path forwarding to construct the multicast tree on top of the overlay substrate. Popular examples are Scribe [RKCD01] and Bayeux [ZZJ+01], both discussed further in section 5.7.
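The SPT construction that shortest-path mechanisms rely on can be sketched with Dijkstra's algorithm over a delay-weighted overlay graph. The following Python fragment is an illustrative sketch; using measured RTTs as link weights is an assumption for the example, not a fixed choice:

```python
import heapq

def shortest_path_tree(graph, source):
    """Single-source shortest paths (Dijkstra) over a weighted overlay;
    returns each node's parent in the resulting SPT and its distance."""
    dist = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], parent[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return parent, dist

# Link weights are measured RTTs (ms) between overlay members.
mesh = {
    "src": {"a": 10, "b": 40},
    "a": {"b": 10, "c": 50},
    "b": {"c": 10},
    "c": {},
}
parent, delay = shortest_path_tree(mesh, "src")
```

Here the direct link src-b (40 ms) is bypassed via a (20 ms total), illustrating that the tree built on the mesh can outperform direct unicast links.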


Classification of Selected ALM Approaches

The discussed characteristics can be used to classify existing ALM algorithms. Hosseini [HASG07] exhaustively analyzes the approaches proposed over one decade. We limit the classification in this work to algorithms that are decentralized and widely discussed in the literature; for a full survey, the reader is referred to the original paper. Banerjee [BB02] additionally provides computational complexity estimations for these algorithms, depending on the number of nodes n and the maximum degree of the tree d. Table 5.2 shows an overview of the different algorithms and their characteristics. Recent research agrees on the advantages and disadvantages of the different topologies ALM algorithms assume. Excell [ETJ09] examined mesh against tree topologies on a theoretical level and confirms the findings of Hosseini [HASG07] and Banerjee [BB02]: tree topologies are efficient for distributing messages, while mesh topologies show a high variance in delays. In turn, tree topologies are sensitive to node failure, while mesh structures are robust against node churn. Tree structures are suited for scalable single-source distribution, while mesh topologies scale better for multi-source applications. In conclusion, the optimal routing algorithm is inherently application-dependent [ETJ09].


Algorithm           Routing        Group mgmt.  Tree              Application domain  Cost metric
Bayeux [ZZJ+01]     peer-to-peer   mesh-first   source-specific   media streaming     delay, throughput
Narada [RS02]       shortest-path  mesh-first   source-specific   conferencing        delay, throughput
NICE [BBK02]        cluster        mesh-first   source-specific   media streaming     delay
Scribe [RKCD01]     peer-to-peer   mesh-first   source-specific   generic             delay, throughput
SpreadIt [DBGM02]   shortest-path  tree-first   source-specific   media streaming     throughput, delay
Yoid [Fra00]        shortest-path  tree-first   shared            generic             throughput, delay

Table 5.2: Classification of selected ALM algorithms, following [HASG07, BB02]


5.4 Reliability

Reliability of publish-subscribe systems has been exhaustively surveyed by Mayer [MBCK12] and Esposito et al. [ECR13], who study existing publish-subscribe systems according to their respective reliability definition. Atwood [Atw04] and Popescu [PCEI07] perform similar analyses for ALM algorithms. Esposito defines a system as reliable if it adheres to four properties: agreement1, validity, integrity, and timeliness.

Agreement: If a non-faulty publisher publishes an event, it is either received by all subscribers eventually or by none.

Validity: If a non-faulty publisher publishes an event, it reaches at least one subscriber.

Integrity: Every notification of a non-faulty process is performed at most once, leading to duplicate-free communication.

Timeliness: Given a deadline, all non-faulty processes are notified before the deadline is exceeded.

These requirements touch the borders of the CAP-Theorem, which limits the capabilities a DEBS can guarantee (cf. chapter 6). The agreement property is essentially a consensus problem, which is not solvable in a distributed, fully asynchronous system with faulty processes and unreliable communication [AW98]. In contrast, if a partially synchronous system is assumed, consensus can be reached [DLS88]. Partial synchrony can be assumed for DEBS if an upper bound for communication costs exists but is not known a priori. That means the “all or nothing” property for reliable distributed systems has to be weakened in some cases, as we will discuss in the context of the CAP-Theorem.

Besides the definition of reliability, a fault model captures the possible error cases of a DEBS. It has to cope with notification omissions on a per-notification basis, interruptions like crashes of nodes or links on the process level, and network failures like bit flips or message tampering on the level of physical networks [ECR13].
For the evaluation of systems, the above faults can be simulated as follows: notification omissions as message losses, and interruptions and network failures as link and node crashes.
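As an illustration of how such an evaluation can check the properties, the following hypothetical Python sketch tests the agreement and integrity properties against recorded per-subscriber delivery logs (all names and the log format are invented for the example):

```python
def check_reliability(published, delivered, subscribers):
    """Evaluate the agreement and integrity properties over recorded
    per-subscriber delivery logs (delivered: subscriber -> event ids)."""
    report = {}
    for e in published:
        receivers = [s for s in subscribers if e in delivered.get(s, [])]
        report[e] = {
            # Agreement: the event reached all subscribers or none of them.
            "agreement": len(receivers) in (0, len(subscribers)),
            # Integrity: no subscriber was notified of the event twice.
            "integrity": all(delivered.get(s, []).count(e) <= 1
                             for s in subscribers),
        }
    return report

# Simulated run: a message loss prevented e2 from reaching subscriber s2.
logs = {"s1": ["e1", "e2"], "s2": ["e1"]}
report = check_reliability(["e1", "e2"], logs, ["s1", "s2"])
```

In this run, e1 satisfies both properties, while the partially delivered e2 violates agreement, which is exactly the "all or nothing" condition discussed above.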

1 The agreement and validity properties are liveness properties, because at a certain point in time they can still be fulfilled, meaning “something good can happen”. They are sometimes explicitly called liveness properties, e.g. in Zhang [ZMJ12] or Mühl [Mü02].


5.5 Quality-of-Service

Besides mere notification distribution with the goal to reliably satisfy subscribers, guarantees regarding the quality of the distribution may be given. For publish-subscribe systems, they have been informally discussed under the well-known term QoS by Behnel et al. [BFM06], who survey the relevant QoS metrics for publish-subscribe systems. QoS comes with a certain terminology [KCP+13]: a service quality model (SQM) defines the concrete quality categories with their quality attributes (also called parameters) and quality metrics. Each category (e.g. performance) contains a certain number of attributes (e.g. latency or throughput), and each attribute is measured by a certain metric (e.g. RTT). Hence, the implementation of QoS poses two challenges. On the one hand, an SQM must be defined, and consequently the subscription language has to be extended to express limits for certain attributes; these limits are called QoS profiles. This has been examined by Araujo and Hoffert: Araujo [AR02] suggests an extension of content-based subscription languages in order to express QoS guarantees, and Hoffert [HSG07] discusses modeling of QoS policies as an extension of the Data Distribution Service (DDS) [Obj12] specification. On the other hand, the architectures and algorithms have to be adapted to consider different QoS policies. Mahambre [MKB07] gives a taxonomy of existing publish-subscribe systems with a focus on their QoS awareness. In the following we briefly describe important1 quality attributes with their respective implementations for publish-subscribe.

5.5.1 Latency

Latency [MKB07, BFM06] as a quality attribute can be defined as the time a notification takes to reach one consumer. Latency profiles can be expressed as an upper bound for the latency of the path from the producer to the consumer.
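The terminology above — categories, attributes, metrics, and profiles as bounds — can be illustrated with a small, hypothetical data model; the attribute names and structure are examples, not a normative SQM:

```python
# Hypothetical service quality model (SQM): categories contain
# attributes, each measured by a certain metric.
service_quality_model = {
    "performance": {                                  # quality category
        "latency": {"metric": "RTT", "unit": "ms"},   # attribute + metric
        "throughput": {"metric": "notifications/s"},
    },
}

# A subscription extended by a QoS profile: an upper bound on latency.
subscription = {
    "filter": {"type": "player.position"},
    "qos_profile": {"latency": {"max": 150}},         # bound in ms
}

def profile_satisfied(profile, measured):
    """Check measured attribute values against the profile's upper bounds."""
    return all(measured.get(attr, float("inf")) <= bound["max"]
               for attr, bound in profile.items())

ok = profile_satisfied(subscription["qos_profile"], {"latency": 120})
```

A measured path latency of 120 ms satisfies this profile; a 200 ms path would not, and the notification service would have to choose a different route or reject the subscription.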
Such profiles have an impact on the routing architecture at a global level, and on the routing decisions and matching operations at a local level, as they limit the available time for local processing and cap the number of hops a path can have [BFM06]. Suitable optimizations to support

1 We restrict the discussion to those attributes relevant for this thesis. For an exhaustive introduction to QoS modeling, refer to Kritikos [KCP+13].


latency QoS profiles have been discussed in the context of general routing optimizations in section 5.3.1.

L_p = \sum_{l=1}^{n} D_l    (5.1)
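A minimal sketch of this path-latency metric, assuming the per-link delays D_l have already been measured:

```python
def path_latency(link_delays):
    """Path latency L_p: the sum of the link delays D_l over the n hops
    of a path (Equation 5.1)."""
    return sum(link_delays)

# Measured delays (ms) of the three links of a broker path.
Lp = path_latency([12.0, 30.0, 8.0])
```

A latency profile with an upper bound of, say, 100 ms would accept this 50 ms path.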

A suitable QoS metric to measure such a path latency Lp is based on the network delay Dl of a link l, introduced in Equation 4.5. Hence, the latency of a path with n hops can be defined as in Equation 5.1.

5.5.2 Throughput

Throughput [KCP+13] (also called bandwidth in [MKB07, BFM06]) is an attribute that describes how many notifications can be propagated over a certain path per time unit. It can be expressed either as a data rate (Mbit/s) or as a number of notifications per time unit. The throughput Tp of a path p is limited by the link with the smallest bandwidth along the path. Profiles for throughput can be implemented as an extension to advertisements and subscriptions [BFM06]: advertisement profiles define upper/lower bounds for the data rate of the streams they produce, and subscription extensions are defined analogously. If links are congested for longer periods, packets are dropped at the network level, which can lead to even heavier congestion if reliable protocols are employed in the notification service. Load shedding techniques, e.g. as proposed by Jerzak [JF06], can mitigate this problem at the cost of overall reliability at times of high load.

5.5.3 Delivery

A delivery guarantee [MKB07, BFM06] is a quality attribute that is tightly entangled with the reliability of the system (cf. section 5.4). Depending on the reliability the distributed system offers1, certain levels of delivery guarantees can be defined. At the lowest level, best effort delivery can be specified, which means no guarantee is given: duplicates may arrive, and notifications of an event may not arrive at all. At most once delivery means that at most one notification of a certain event is delivered, but it is still possible that notifications are lost. Reliable communication with no lost notifications is defined by at

1 The grade of reliability in this context means how many and which types of failures can be compensated, and to which degree.


least once delivery, but this guarantee still allows duplicate notifications. Exactly once delivery eliminates even this flaw and ensures that a notification neither gets lost nor is delivered more than once. As delivery guarantees depend on the overall reliability of the system, they are influenced by the capability of the routing algorithms to deal with link and node crashes (cf. chapter 4 on overlay networks and section 5.3 on routing), and they require reliable communication channels. Reliable communication can be ensured by a simple acknowledgement mechanism or by more sophisticated protocols like ARQ, gossiping, FEC, or LEC, surveyed in [ECR13]; we omit a description of the different protocols and refer the interested reader to [ECR13].

5.5.4 Order

The order of notifications has been studied in a variety of publications. In publish-subscribe systems, subscribers can express their need for ordered notifications. In the context of ALM, order has been surveyed by Defago et al. in [DSU04]. Zhang et al. [ZMJ12] apply those order algorithms to content-based publish-subscribe with broker-based routing. We only discuss order semantics in the form of a brief overview; for an exhaustive introduction, the reader is referred to the previously cited surveys. Total order, the strongest order property, builds on the definition of a reliable system (cf. section 5.4) and extends the four properties by a total order property that has to hold:

Definition 1 (Total order): If two subscribers p and q both deliver two notifications e1 and e2, then p delivers e1 before e2 if and only if q delivers e1 before e2 [DSU04].

If this property holds, a system is called totally ordered. As discussed in section 5.4, a reliable, fully asynchronous distributed system is impossible. Therefore, either safety1 or liveness has to be sacrificed for systems that cope with byzantine failures [ZMJ12], if total order is to be guaranteed.
A system favoring safety prefers to throw away notifications instead of delivering them potentially out of order. A system focusing on liveness delivers notifications even if it is possible that the order

1 Safety in this case means that the order property is safe and therefore not violated.


property is violated. Order properties that apply only to non-faulty processes are called non-uniform order properties. Hence, besides uniform and non-uniform total order, weaker order properties have been defined. These order properties do not hold uniformly, but only for parts of the distributed system. FIFO order [DSU04] defines order on a per-publisher basis. This property can be combined with the fundamental order property in Definition 1. It is defined as follows:

Definition 2 (FIFO order): If a non-faulty process publishes a notification e1 before e2, then any two subscribers p and q deliver e1 before e2.

If this criterion holds, all notifications are FIFO-ordered. Local order extends this property to a group of subscribers, regardless of the publisher, meaning that order must be ensured for all notifications addressed to the same group of subscribers.

Definition 3 (Local order): A local order property is satisfied if for any two notifications e1 and e2 and any two subscribers p and q with Dest(e1) = Dest(e2), p delivers e1 before e2 if and only if q delivers e1 before e2 [ZMJ12].

Causal order combines the FIFO and local order properties by the introduction of a transitive causality relationship [DSU04]. The causality of two notifications is defined by a “preceding” relation between their two publish operations. This introduces a partial order on the set of publish operations.

Definition 4 (Causal order): Causal order holds if, whenever the publication of a notification e1 causally precedes the publication of e2, no correct subscriber delivers e2 unless it has previously delivered e1 [DSU04].

Based on these order properties, different algorithms have been proposed in the context of ALM as well as in the context of publish-subscribe. Garcia-Molina [GMS91]


distinguishes between single-source ordering, multiple-source ordering, and multiple-group ordering. Single-source ordering is a special case of FIFO order with only one publisher, which is trivial to solve with sequence numbers. Multiple-source ordering equals a local order property; it can be extended to multiple-group ordering if the order is maintained even across multiple multicast groups. Algorithms that implement these order properties can be classified according to their synchronization mechanism, i.e. by the location where the sequence of notifications is determined: the sender, the destination nodes, or a dedicated sequencer node [DSU04].

The most common method of synchronization is the employment of sequencers. Fixed-sequencer algorithms elect a single node that ensures the order of notifications. In the absence of failures this node does not change and has the sole decision about the sequence in which notifications are delivered. The sequencer adds sequence data to all notifications of the multicast group. Depending on the method, three types of fixed-sequencer algorithms can be distinguished: unicast-broadcast (UB), unicast-unicast-broadcast (UUB), and broadcast-broadcast (BB). UB algorithms like [GMS91] send each notification to the sequencer node, which adds sequence data and forwards it to all subscribers. In UUB algorithms, like the Multicast Transport Protocol (MTP) [FM90], the publisher requests sequence data from the sequencer and distributes the notification with the sequence data itself. This reduces the load on the sequencer at the cost of an additional unicast message. BB algorithms require all publishers to distribute their notifications themselves; when the sequencer receives a notification, it distributes a second message containing only the sequence data.
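A unicast-broadcast (UB) fixed sequencer can be sketched in a few lines. This illustrative Python fragment abstracts the network away (subscribers are plain callbacks) and only shows how a single sequencer imposes one total order:

```python
import itertools

class FixedSequencer:
    """Unicast-broadcast (UB) fixed sequencer: every publisher sends its
    notification to the sequencer, which stamps a sequence number and
    forwards it to all subscribers, imposing one total order."""

    def __init__(self, subscribers):
        self.counter = itertools.count()
        self.subscribers = subscribers  # delivery callbacks

    def publish(self, notification):
        stamped = (next(self.counter), notification)
        for deliver in self.subscribers:
            deliver(stamped)

inbox_p, inbox_q = [], []
sequencer = FixedSequencer([inbox_p.append, inbox_q.append])
sequencer.publish("e1")
sequencer.publish("e2")
# Both subscribers deliver e1 before e2, satisfying Definition 1.
```

Since every notification passes through one node, the order is trivially total, which is precisely why the sequencer can become a throughput bottleneck.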
As an extension of fixed-sequencer algorithms, moving-sequencer algorithms allow for a group of sequencer nodes, motivated by load balancing. However, the sequencers have to be synchronized among themselves. This is done by a token that circulates between the sequencers and ensures that only one sequencer at a time issues sequence numbers for notifications. Privilege-based algorithms continue the idea of a token for synchronization, only that the publishers circulate the token instead of dedicated sequencer nodes. A node that owns the token is granted the privilege to publish. Although they are similar to moving-sequencer approaches, privilege-based algorithms require all publishers to know each other. Moreover, the correct token passing between the publishers is important for liveness and hence requires special attention.
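The token-passing idea behind privilege-based algorithms can be sketched as follows; this is an abstract illustration with invented names and without the failure handling that makes correct token passing hard in practice:

```python
from collections import deque

class TokenRing:
    """Privilege-based ordering: publishers circulate a token and only
    the current holder may publish; the token carries the next sequence
    number, so all notifications end up totally ordered."""

    def __init__(self, publishers):
        self.ring = deque(publishers)
        self.next_seq = 0

    def holder(self):
        return self.ring[0]

    def publish(self, publisher, notification):
        if publisher != self.holder():
            raise PermissionError(f"{publisher} does not hold the token")
        seq = self.next_seq
        self.next_seq += 1
        self.ring.rotate(-1)  # pass the token to the next publisher
        return (seq, notification)

ring = TokenRing(["p1", "p2"])
m1 = ring.publish("p1", "a")
m2 = ring.publish("p2", "b")
```

A publisher that does not hold the token must wait for it, which serializes all publish operations; losing the token therefore blocks the whole group, the liveness concern noted above.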


Communication history algorithms, like privilege-based algorithms, rely on the publishers to provide the sequencing data. However, communication history approaches rely on timestamps added by the publishers. On the subscribing nodes, arriving notifications are buffered as long as earlier messages may still arrive1. The buffered messages are either in a partial order, when they contain causal sequence data, or they just bear independent timestamps. In the first case, a predefined function transforms the partial order into a total order by sorting concurrent notifications. In the latter case, deterministic merge [KR05] is applied: a deterministic policy merges the different streams of notifications into one global sequence.

The previously discussed aspects can be applied to popular algorithms; their respective capabilities are depicted in table 5.3. We limit the classification to three algorithms that guarantee total order and are implemented in the prototype described in part IV. In addition to the aspects discussed previously, the classification contains the fault model and link model the algorithms adhere to. The fault model refers to the fault categories discussed in section 5.4. Only a few algorithms actually cope with all fault categories including byzantine errors; most of them are limited to interruptions on the process level. Some of the algorithms also require certain quality guarantees of the underlying links, denoted in the link model column. A reliable FIFO channel, for example, is a connection that ensures the order of messages and guarantees their eventual delivery. Guaranteeing order thus depends, on the one hand, on the reliability of the application and consequently the tolerated faults. On the other hand, the required order property influences the suitable algorithms. Moreover, the different implementation strategies suggest that each algorithm is tailored for certain application scenarios.
For example, a fixed sequencer algorithm may not be suited for a high-throughput application, while a communication history algorithm is.

5.5.5 Timeliness

Timeliness [ECR13], also called validity [BFM06], is a quality attribute that describes an interval used to specify how long a notification stays valid. After the interval is exceeded, the notification becomes invalid and may be dropped by the notification service. The interval itself can be specified either as a time span or as a number of messages. Notification drops

1 A threshold defines how long earlier messages may still arrive; if it is exceeded, an algorithm-specific protocol is invoked. Basically, it reflects the tradeoff between liveness and safety.


Algorithm                        Sync. mechanism        System model  Fault model    Link model     Order property           Group spanning
MTP [FM90]                       fixed sequencer        asynchronous  interruptions  none           non-uniform total order  single group
Garcia-Molina, Spauster [GMS91]  fixed sequencer        asynchronous  interruptions  FIFO           non-uniform total order  multiple groups
Deterministic merge [KR05]       communication history  asynchronous  none           reliable FIFO  non-uniform total order  multiple groups

Table 5.3: Classification of selected total order algorithms, following [DSU04]


because of exceeded validity intervals can help reduce unnecessary load in the system as well as in the consuming application.

5.5.6 Security

The security of notifications can be split into more than one quality attribute. Behnel [BFM06] distinguishes between confidentiality, authentication, and integrity. Confidentiality ensures that only trusted nodes or the intended consumers can read a notification. This is usually achieved by standard cryptography, as long as the brokers are trustworthy. If untrusted brokers have to be considered, content-based filtering poses a likely impossible challenge, because the content must be readable on an untrusted broker. Authentication deals with the validation of the identity of producers and consumers. Integrity, in turn, ensures the authenticity of a notification, e.g. by the use of digital signatures. Generally, security is an active research field, as current publications suggest. On an infrastructure level, Fiege [FZB+04] suggests the introduction of scopes that separate traffic and therefore create trusted, isolated, confidential dissemination channels; of course, the underlying infrastructure must be trustworthy. EventGuard [SLI11] is an infrastructure that tackles the authentication challenge with a key infrastructure and guards that authenticate operations such as publish and subscribe. It also ensures confidentiality and integrity by signing and encrypting every message. On the network level, Lagutin [LVZ+10] suggests Packet Level Authentication (PLA) in conjunction with traditional certificates to provide a secure publish-subscribe infrastructure.

5.6 Reconfiguration and Adaptability

Middleware solutions often intend to target as many application domains as possible. Therefore, many recent solutions provide some configurability in order to adapt the middleware to the requirements of a specific application. They often employ software development techniques like component models (cf. Irmert et al. [IFMW08]) or programming techniques like policy-based design or aspect-oriented programming in order to generate modules suitable for composition. Hence, systems can be categorized regarding their level of configurability. Schreiber [SMHA12] distinguishes between configurable and adaptive middleware. Configurable middleware provides some sort of parametrization, or modules that can be composed into a tailor-made middleware providing exactly the required characteristics.


This configuration takes place at design-time. Once the middleware is configured and deployed, a change in the configuration requires a complete redeployment of the system. Therefore, such solutions can employ compilers to produce the composed middleware library. A very popular example of such a configurable middleware is the boost libraries1. If a middleware configuration can be changed without a complete redeployment, it is called reconfigurable, e.g. GREEN [SBC05]. A reconfiguration can take place at runtime without noticeable interruption [IFMW08], or may require a restart of the application. In contrast, adaptive middleware does not require the intervention of a developer for reconfiguration. Such a middleware reacts to changes in the environment by itself and, for example, exchanges components for versions with a better fit for the current environment. In order to enable a middleware to be adaptive, constant monitoring of the environment is required, as well as a rule base which triggers adaptation processes. We will not discuss adaptive middleware in detail and focus on the different aspects of configurability, because adaptive approaches employ component models that introduce runtime overhead, which conflicts with Hypothesis 4 of minimizing the framework overhead. In the context of publish-subscribe systems, reconfigurability is often interpreted as dealing with faults in the distributed system, making the system more reliable, e.g. in Cugola [CFMP04] or Di Nitto [NDM12]. Such reconfigurations only add to reliability by dealing with faults and do not change the functional or non-functional parameters of the system. Hence, in this thesis, reconfigurability that only contributes to reliability does not qualify a system as reconfigurable. Besides the level of configurability, the kind and place of the configuration description set the different systems apart. Regarding the location, either a specification in the source code or an external configuration file is possible.
Regarding the kind of specification, either a concrete composition, as e.g. in GREEN [SBC05], or quality policies, as in ADAMANT [HMS10], are used for configuration. A concrete composition describes the features, structure, and dependencies of the different components that form the middleware, for example which routing algorithm is used in combination with which ordering and delivery mechanisms. Such compositions are very technical and require experts to write. If quality policies are used for specification, the developer defines the QoS guarantees that should be fulfilled by the middleware, and the decision for the best suitable configuration is made by the middleware itself.
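To make the contrast concrete, the following sketch shows the two specification styles side by side; the keys and component names are invented for illustration and are not taken from GREEN or ADAMANT.

```python
# (a) Concrete composition: the expert names components directly.
composition = {
    "routing": "covering_based",
    "order": "fifo",
    "delivery": "exactly_once",
    "overlay": "mesh_first",
}

# (b) Quality policy: the developer states QoS goals; the middleware
#     (or a deduction step) picks a matching composition itself.
qos_policy = {
    "latency_ms": {"max": 50},
    "order": "fifo",
    "reliability": "no_message_loss",
}

def validate_composition(comp: dict, available: dict) -> bool:
    """With a concrete composition, only validation is needed:
    every chosen component must exist for its aspect."""
    return all(comp.get(aspect) in options
               for aspect, options in available.items())

available = {
    "routing": {"simple", "covering_based", "flooding"},
    "order": {"none", "fifo", "total"},
    "delivery": {"at_most_once", "at_least_once", "exactly_once"},
    "overlay": {"mesh_first", "tree_first"},
}
assert validate_composition(composition, available)
```

The quality policy, in contrast, cannot be validated this way; it has to be translated into a composition first, which is exactly the deduction problem discussed next.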

1 The boost libraries (http://www.boost.org) provide a large platform library for C++. Most of the libraries are source-only, which means they are configured in-source and compiled into the application.


How the decision on the best suitable configuration is made further distinguishes existing approaches. If a concrete composition is specified, no decision has to be made; only a validation of the composition can be performed. But if quality profiles form the system specification, the configuration has to be deduced automatically. Such a deduction can be application-dependent, i.e. take a description of the target application into account, or application-independent. Moreover, such a deduction can be static or dynamic. A static deduction only takes predefined parameters into account (e.g. matching metadata between algorithms and the quality profile), while a dynamic deduction may calculate additional parameters to support the decision. For example, machine learning could be used to derive parameters dynamically for a certain scenario. Based on this overview, we will further discuss existing systems regarding their adaptability in section 5.7.
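A static deduction of this kind can be sketched as a simple metadata match between a quality profile and predefined algorithm descriptions; the algorithm names and attributes below are hypothetical.

```python
# Predefined metadata for the available algorithms (illustrative only).
ALGORITHMS = [
    {"name": "flooding",  "order": "none",  "scales": False},
    {"name": "covering",  "order": "none",  "scales": True},
    {"name": "seq_total", "order": "total", "scales": False},
]

def deduce(profile: dict):
    """Static deduction: return the first algorithm whose metadata
    satisfies the quality profile, or None if no match exists."""
    for algo in ALGORITHMS:
        if profile.get("order", "none") != algo["order"]:
            continue
        if profile.get("scalable", False) and not algo["scales"]:
            continue
        return algo["name"]
    return None

assert deduce({"order": "none", "scalable": True}) == "covering"
assert deduce({"order": "total"}) == "seq_total"
# No algorithm offers total order and scalability at once.
assert deduce({"order": "total", "scalable": True}) is None
```

A dynamic deduction would replace the fixed metadata with parameters computed for the concrete scenario, e.g. learned cost estimates per algorithm.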

5.7 Existing Publish-Subscribe Middleware

In the last two decades, a variety of prototypes, standards, and commercial products have been developed. This section classifies this variety of approaches according to the aspects of publish-subscribe systems we discussed in the previous sections. We omit a detailed description of each individual approach in favor of a more exhaustive taxonomy. In the author's opinion, the provided taxonomy gives enough insight into each approach that an individual discussion would be redundant; it is only required for an understanding of the inner workings of each algorithm. For this level of understanding, the reader is referred to the original papers cited in the tabular overview. The selection of the surveyed approaches is based on their popularity1 and is generally limited to distributed approaches. However, centralized approaches may be interesting for certain parts of their systems. Therefore, two exceptions are made: FAMOUSO [SF11] and GT [AGG09] are both centralized adaptable approaches, but as very few adaptable approaches exist, they are included here.

1 The popularity is determined by the number of citations of their original papers and the number of inclusions in surveys published in renowned journals.


Taxonomy of Capabilities

The taxonomy shown in Tables 5.4 to 5.9 is compiled from the literature. If a capability is not explicitly discussed in the literature, it is assumed that the system does not possess the respective capability; the corresponding cell is marked with “-”. In addition to the original papers, the following surveys, theses, and books have contributed to the taxonomy: Mühl et al. [MFP06], Wahl [Wah13], Esposito et al. [ECR13], Defago et al. [DSU04], Pietzuch [Pie04], Hosseini et al. [HASG07], Meyer [MBCK12], and Mahambre [MKB07]. The taxonomy is split into three parts. The first part deals with the basic system assumptions: the data model (cf. section 5.1), the filter model, which type of filter expressions are allowed, and the class of filter algorithm (cf. section 5.2.3). Moreover, the different routing aspects are classified (cf. section 5.3). In addition to the basic routing class (broker, rendezvous, or hierarchical), the supported optimizations are noted. Using the classification for ALM algorithms (cf. section 5.3.4), the topology of the overlay structure for the broker network is classified, as well as the type of dissemination tree that is constructed for the publication of notifications. This overview can be found in Tables 5.4 and 5.7. The second part, found in Tables 5.5 and 5.8, addresses QoS and reliability features of the different approaches. The QoS columns describe whether the single approaches employ mechanisms to support the optimization of the respective quality attribute. In the case of order, the given guarantee is also denoted. Regarding reliability, we adhere to the simple fault classification sketched in section 5.4 and additionally mention whether the approach supports preserving state, either by replication or by persistent storage of notifications. The third part, shown in Tables 5.6 and 5.9, classifies the adaptability of recent approaches as discussed in section 5.6.
We adhere to the characteristics introduced there. If information about the architecture is available, its modularity is mentioned, as most configurable approaches either employ a component architecture or modularize the code to support compiler-driven configuration. To conclude, it is noteworthy that adaptability, QoS, and reliability are sparsely addressed, mostly by specialized approaches, but not together in one system. REBECA is an exception; it has grown to cover many aspects and is the most advanced prototype regarding completeness of the surveyed features. Of course, the design space is huge, but the ability to examine tradeoffs and dependencies between the different features should pose a research goal.


[The cell contents of Table 5.4 could not be recovered from the extracted text. The table classifies JEDI [CDF98], Siena [CRW01], Gryphon [BCSS99], REBECA [Mü02], Hermes [PB02], Scribe [RKCD01], and Bayeux [ZZJ+01] along the capabilities data model, filter, expressions, matching algorithm, routing (broker-based, hierarchical, rendezvous), advertisements, covering, merging, overlay topology, and dissemination tree.]

Table 5.4: Taxonomy of filter and routing capabilities for publish-subscribe middleware – part 1


[The cell contents of Table 5.5 could not be recovered from the extracted text. The table classifies JEDI [CDF98], Siena [CRW01], Gryphon [BCSS99], REBECA [Mü02], Hermes [PB02], Scribe [RKCD01], and Bayeux [ZZJ+01] regarding QoS support (latency, throughput, delivery, order, timeliness, security) and reliability (interruptions, omissions, Byzantine faults, state preserving).]

Table 5.5: Taxonomy of QoS and reliability capabilities for publish-subscribe middleware – part 1

[The cell contents of Table 5.6 could not be recovered from the extracted text. The table classifies JEDI [CDF98], Siena [CRW01], Gryphon [BCSS99], REBECA [Mü02], Hermes [PB02], Scribe [RKCD01], and Bayeux [ZZJ+01] regarding adaptability (configurable, reconfigurable, adaptive), architecture (modularity), configurable aspects (algorithms, parameters, processing order), system specification (composition, QoS policies, location), and decision support (validation, deduction, application-dependence).]

Table 5.6: Taxonomy of adaptability capabilities for publish-subscribe middleware – part 1

[The cell contents of Table 5.7 could not be recovered from the extracted text. The table classifies IndiQoS [CAR05], GREEN [SBC05], GT [AGG09], XNET [CF04], PADRES [JCL+10], FAMOUSO [SF11], and ADAMANT [HMS10] along the capabilities data model, filter, expressions, matching algorithm, routing (broker-based, hierarchical, rendezvous), advertisements, covering, merging, overlay topology, and dissemination tree.]

Table 5.7: Taxonomy of filter and routing capabilities for publish-subscribe middleware – part 2

[The cell contents of Table 5.8 could not be recovered from the extracted text. The table classifies IndiQoS [CAR05], GREEN [SBC05], GT [AGG09], XNET [CF04], PADRES [JCL+10], FAMOUSO [SF11], and ADAMANT [HMS10] regarding QoS support (latency, throughput, delivery, order, timeliness, security) and reliability (interruptions, omissions, Byzantine faults, state preserving).]

Table 5.8: Taxonomy of QoS and reliability capabilities for publish-subscribe middleware – part 2

[The cell contents of Table 5.9 could not be recovered from the extracted text. The table classifies IndiQoS [CAR05], GREEN [SBC05], GT [AGG09], XNET [CF04], PADRES [JCL+10], FAMOUSO [SF11], and ADAMANT [HMS10] regarding adaptability (configurable, reconfigurable, adaptive), architecture (modularity, e.g. components, modular code, or template-meta programming), configurable aspects (algorithms, parameters, processing order), system specification (composition, QoS policies, location), and decision support (validation, deduction, application-dependence).]

Table 5.9: Taxonomy of adaptability capabilities for publish-subscribe middleware – part 2

6 | CAP Theorem

Brewer stated in his keynote in the year 2000 [Bre00] that a distributed system can only satisfy two of the three properties consistency, availability, and partition tolerance, but must accept some drawbacks in the third. This theorem was later proven in large part by Gilbert and Lynch [GL02]. Originally, the CAP theorem was introduced to describe the tradeoffs of web services employing a request/response protocol. This definition of a web service is abstract enough to be applied to distributed systems in general, including an MMVE. Hence, the discussion of CAP in the context of event-based systems is an adequate bird's-eye view to draw the fundamental boundaries such systems must adhere to. First, we clarify the notion of the three terms consistency, availability, and partition tolerance, before each part is discussed in depth. Gilbert [GL12] describes the three properties as follows:

Consistency Informally, consistency is the property that each client receives the correct response to each request. The implications of consistency depend on the type of service. Trivial services, for example a service that serves a constant like π, do not need any coordination and are therefore not affected by the limitations of CAP. Weakly consistent services involve some distributed coordination but still do not guarantee consistency among all replicas. As a result, they sacrifice consistency requirements in return for availability. Lenz discussed the resulting implications exhaustively in [Len97]. A distributed Web cache is an example of such a service. Atomic services guarantee atomic operations: a change by an operation is the transition from one consistent state into another. In a distributed environment, such complicated services either cannot be specified with atomic operations or require sophisticated coordination protocols, as provided, for example, by distributed databases.


Availability Availability is, informally speaking, the property that each request is eventually answered by a response [GL12]. Obviously, a fast response is favored over a slow response or no response at all. Lenz [Len97] and Gilbert [GL02] define availability more formally as the probability of receiving a response at a certain point in time. That definition distinguishes availability, as the success rate of requests, from the response time as two different aspects. Gilbert [GL02] argues that this definition is in some way weak, because it does not limit the time a response may take. However, in the context of the CAP theorem, this definition is sufficient, as availability can be seen as absolute, i.e. a system is available at 100%. One reason is that for real systems, receiving a late response is as bad as getting no response at all. But even the guarantee to receive an eventual response is impossible for some distributed systems in the context of the CAP theorem, e.g. partition-tolerant distributed commit protocols that cannot guarantee termination.

Partition Tolerance Partition tolerance is the property of systems that can cope with the assumption of unreliable communication. Even under heavy communication problems, like a network partition into two groups, partition-tolerant protocols can continue operation and are able to recover from the communication outage.

Applications with Real-Time Characteristics The CAP theorem is especially interesting in the context of distributed applications with interactive real-time characteristics, e.g. the scenario of MMVEs. Bouillot et al. and Liu et al. characterize such systems by the following properties [BGS04, LBC12]:

• Causality: the “happened before” relation, introduced by Lamport [Lam78] and refined by Raynal [Ray96].

• Concurrency: to be understood as in database terminology. Different processes perform concurrent operations, whilst data consistency must be ensured.


• Simultaneity: if two operations are received simultaneously1 by a process, these operations are to be received simultaneously on all processes.

• Instantaneity: the human perception of an operation is instant. That means no delay is perceivable by humans, despite the technical view, where a delay may exist.

Causality and concurrency are well-known properties in distributed applications and will be covered by the discussion of consistency in section 6.2. Simultaneity and instantaneity, however, are key properties for interactive media in order to simulate a realistic, continuous perception of the application. Human perception is limited: the eye resolves roughly 60 frames per second, and even a refresh rate of 24 frames per second is sufficient to create a continuous perception. These properties allow for some slack in terms of latency and thereby provide the boundaries, in terms of tolerable latency, for the applicability of different consistency models. The resulting tradeoff between consistency and availability is discussed in section 6.3.
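These perception thresholds translate directly into a latency budget for event delivery: an update must arrive within one frame interval to appear instantaneous, as the following small calculation illustrates.

```python
def frame_budget_ms(fps: float) -> float:
    """Latency budget per frame: one frame interval in milliseconds."""
    return 1000.0 / fps

# At the cinematic 24 fps threshold, roughly 42 ms are available per
# frame; at the 60 fps perception limit, only about 17 ms remain.
assert round(frame_budget_ms(24), 1) == 41.7
assert round(frame_budget_ms(60), 1) == 16.7
```

Any consistency protocol whose coordination overhead exceeds this budget is thus ruled out for event-types that must appear instantaneous.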

6.1 Partition Tolerance

In a distributed system, the communication channels are error-prone most of the time. Routers may fail, cables break, and packets may be lost due to load shedding on overused connections. Hence, in a wide area network (WAN) scenario, network partitions are a ubiquitous problem. Therefore, the decision for two out of three properties reduces to a tradeoff between the remaining two: consistency and availability. In contrast, the pre-CAP view on distributed systems, especially on traditional database systems where ACID properties matter, is that of a pure CA system, based on a data-center-centric viewpoint [Bre12, GL12]. In a data center, the probability of network partitions is very small. Therefore, such CA architectures worked for many decades and still do to the current day. However, due to globalization, cloud computing, and the decline of single-site architectures, one has to cope with partitions to a certain degree. Ultimately it comes

1 Because application time in real-time interactive applications only has a certain resolution, or is even discrete, two operations may happen simultaneously at the same point in application time. Therefore, to be exact, the definition of simultaneity heavily depends on the definition of time in the respective application.


down to the partition decision [Bre12]: in case of a partition, one can either cancel an operation and hamper availability, or continue work and risk inconsistencies. The decision is between consistency and availability. To cope with this situation, partitions must be managed actively: the partition must be detected and a special partition lifecycle must be passed. At the end of this lifecycle stands the recovery and compensation of potential errors the partition caused. During the lifecycle, operations may be restricted, again reducing availability, or information is logged to enable a recovery when the partition ends. Designing operations that are recoverable or can be compensated is not always easy, or even possible. Before we discuss the tradeoff between availability and consistency in detail, some words on consistency, different consistency models, and some protocols are required.
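The partition lifecycle described above can be sketched as a small state machine that logs operations during a partition and replays them on recovery; the naive last-write-wins replay below stands in for a real compensation strategy, and the class and method names are hypothetical.

```python
class PartitionAwareStore:
    """Sketch of Brewer's partition lifecycle: detect the partition,
    operate in a degraded mode while logging intent, then recover."""

    def __init__(self):
        self.state = {}
        self.partitioned = False
        self.log = []          # operations deferred during a partition

    def write(self, key, value):
        if self.partitioned:
            self.log.append((key, value))   # continue, risk inconsistency
        else:
            self.state[key] = value

    def enter_partition(self):
        self.partitioned = True

    def recover(self):
        """End of the lifecycle: replay the log and compensate."""
        for key, value in self.log:
            self.state[key] = value          # naive replay; real systems
        self.log.clear()                     # need conflict resolution
        self.partitioned = False

store = PartitionAwareStore()
store.write("hp", 100)
store.enter_partition()
store.write("hp", 90)
assert store.state["hp"] == 100   # update deferred, not yet visible
store.recover()
assert store.state["hp"] == 90    # replayed after the partition ends
```

Choosing to defer writes, as here, favors availability; an alternative implementation could reject writes during the partition and favor consistency instead.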

6.2 Consistency

As previously mentioned, consistency is a property a distributed system can guarantee in a more or less strict way, depending on the limits defined by the CAP theorem and the requirements of the application. A shared object may be accessed by different processes at the same time, each performing operations on this shared object. Consistency relates to the state of this object, how it is affected by the different operations, and in which order. This problem has been discussed in many contexts, all of which define a consistency model as the causal order of operations in different processes. The notion of process and operation, however, is context-specific. In distributed shared memory, operations are reads and writes to the memory [Tan95, Lam79, AW94]. In database systems, an operation is a transaction consisting of multiple statements that result in many reads and writes to memory or disks; many transactions are processed concurrently by different processes [BSW79]. Distributed systems define an operation as sending or receiving a message or event, whilst the processes are represented by different distributed nodes [Tan95, Lam78]. For interactive multimedia, Bouillot et al. [BGS04] discuss consistency models and provide a taxonomy categorizing them. Liu et al. provide a survey [LBC12] of different consistency models in the context of virtual worlds and refine the taxonomy of [BGS04], categorizing the different models and their applicability to virtual worlds.


In the following discussion of consistency models, this work follows the most generic terminology. Therefore, we speak of operations that are performed at different processes and have a request/reply correlation.

Consistency Models

Strict consistency [Tan95] or strong consistency [Vog09] is the strictest model of consistency a distributed system may guarantee. The origin lies, as the definition suggests, in distributed shared memory.

Definition 5 (Strict consistency): In a strictly consistent distributed system, any read to an object x returns the value stored by the most recent write operation on x. [Tan95]

This consistency model is the most intuitive, but also very hard to achieve. It is natively guaranteed on memory in non-distributed uniprocessor systems. Moreover, it implicitly assumes the existence of an absolute wall-clock time. If we apply this definition to our distributed scenario, it would require that an operation on node l and the corresponding state change be visible at all nodes L when the next operation is performed on any node. Such a guarantee is impossible in an asynchronous system [AW98], as it would require a wait-free solution to the consensus problem in an asynchronous distributed system. Therefore, some relaxed definitions of consistency requirements exist: Linearizability or atomic consistency [Mos93, HA90, AW94] is the widely accepted baseline against which consistency models are compared. Linearizability is defined as a sequential order of operations in which each request is followed by an immediate matching reply. Attiya [AW94] formally defines linearizability as follows:

Definition 6 (Linearizability): Given an execution σ, let ops(σ) be the sequence of request and reply operations, appearing in σ in real-time order, assuming request and response operations alternate on each process.
An execution σ is linearizable if there exists a legal sequence τ of operations such that τ is a permutation of ops(σ), for each process p, ops(σ)|p is equal to τ |p, and furthermore, whenever the reply for operation op1 precedes the request for operation op2 in ops(σ), then op1 precedes op2 in τ . [AW94]
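For small histories, Definition 6 can be checked by brute force: the sketch below, assuming a single read/write register, searches for a legal permutation of the operations that respects real-time precedence; the tuple encoding of operations is an assumption of this example.

```python
from itertools import permutations

# Each operation is (start, end, kind, value), with wall-clock
# invocation and response times; kind is "w" (write) or "r" (read).
def linearizable(history):
    n = len(history)
    for order in permutations(range(n)):
        # Real-time precedence: if op A completed before op B began,
        # A must come first in the chosen sequential order.
        ok = all(not (history[order[j]][1] < history[order[i]][0])
                 for i in range(n) for j in range(i + 1, n))
        if not ok:
            continue
        # Legality for a register: every read returns the latest write.
        value, legal = None, True
        for idx in order:
            _, _, kind, v = history[idx]
            if kind == "w":
                value = v
            elif value != v:
                legal = False
                break
        if legal:
            return True
    return False

# w(1) completes before the read begins, so the read must return 1.
assert linearizable([(0, 1, "w", 1), (2, 3, "r", 1)])
assert not linearizable([(0, 1, "w", 1), (2, 3, "r", 0)])
# A read overlapping w(2) may still legally observe the old value 1.
assert linearizable([(0, 1, "w", 1), (2, 5, "w", 2), (3, 4, "r", 1)])
```

The factorial search is of course only viable for tiny histories; deciding linearizability in general is NP-hard, which is precisely why it is a specification tool rather than a runtime check.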


Due to the fact that request/response pairs of one process are ordered as if they were atomic, linearizability is a local property. This means that if each object provides a linearizable history, a composition of such objects still yields a linearizable history. A legal sequence is a sequence that does not break the specified program order and considers causal dependencies between the operations, as Lamport [Lam79] defined for sequential consistency. Sequential consistency [Lam79, Tan95, Sez05, Mos93, HA90] relaxes the definition of linearizability by dropping the atomicity of corresponding asynchronous request/reply pairs. Lamport [Lam79] defines a sequentially consistent distributed system as a system that guarantees that the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. Formally, this leads to:

Definition 7 (Sequential consistency): An execution σ is sequentially consistent if there exists a legal sequence τ of operations such that τ is a permutation of ops(σ), and for each process p, ops(σ)|p is equal to τ |p. [AW94]

As one can see, this is a slight relaxation and allows interleaving request and response operations from different processes. Therefore, sequential consistency is not a local property. The only guarantees given are the order of each process's operations and a global order, which need not correspond to the real-time order of events. In the context of database systems and transactions, sequential consistency is known under the term serializability and is thoroughly discussed in [BSW79, Pap79, BHG86]. Weaker consistency criteria include causal consistency [HA90] and even weaker models that are omitted here. In the context of cloud computing, a certain class of weaker consistency criteria has come back to the center of attention.
Eventually consistent systems describe a group of consistency models for distributed systems which, according to Vogels [Vog09], subsume a certain informal notion of consistency:


Definition 8 (Eventual consistency): The storage system guarantees that if no new updates are made to an object, eventually all accesses will return the last updated value.

Another class of consistency models explicitly incorporates timeliness guarantees into the consistency guarantee. These models are called deadline-based consistency models [BGS04]. They inherently build a deadline for state changes into the consistency guarantee. That means they assume a partially asynchronous system, and as long as the system stays within the boundaries of the deadline, they can guarantee liveness and safety for consistency.
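A last-writer-wins register is a minimal illustration of Definition 8: replicas may diverge while updates are in flight, but once updates stop and the replicas exchange state, every read converges to the last-written value. The timestamp-based merge below is an assumption of this sketch, not a prescribed protocol.

```python
class LWWReplica:
    """Last-writer-wins register replica: keeps the value with the
    highest timestamp seen so far."""

    def __init__(self):
        self.value, self.stamp = None, -1

    def write(self, value, stamp):
        if stamp > self.stamp:
            self.value, self.stamp = value, stamp

    def merge(self, other):
        """Anti-entropy exchange: adopt the other replica's value
        if it is newer."""
        self.write(other.value, other.stamp)

a, b = LWWReplica(), LWWReplica()
a.write("red", stamp=1)      # concurrent updates on different replicas
b.write("blue", stamp=2)
assert a.value != b.value    # replicas diverge while updates are in flight
a.merge(b); b.merge(a)       # state exchange after updates stop
assert a.value == b.value == "blue"
```

During the divergence window, reads on different replicas return different values, which is exactly the availability-for-consistency trade that eventual consistency accepts.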

6.3 Consistency vs. Availability Tradeoffs

In an asynchronous system, the ultimate decision between the liveness and the safety of the assured consistency model can be reformulated as the informal tradeoff between the availability of the system and the consistency guarantee of the global state. According to Brewer [Bre12], this tradeoff between availability and consistency comes down to latency. If an operation takes an unusual amount of time1, two things can be done: the operation can either be canceled, hampering availability, or continued, risking a potential inconsistency. Brewer calls this the partition decision, as it potentially causes partitions of consistent states that later have to be recovered. So basically, the designer can choose availability, resulting in the maintenance of partitions and the challenge of merging them. Such eventually consistent systems are sometimes called BASE2. Or the designer can choose consistency over availability, remain in the ACID domain of database systems, and sacrifice response time. Intermediate solutions exist that only relax consistency in order to improve the availability of a system, as discussed in [Len97].

1 This incorporates faulty behavior such as interruptions, network failures, etc.
2 BASE stands for basically available, soft state, and eventually consistent, and subsumes the relaxations made on database systems in order to boost their scalability.


6.4 Discussion

If we apply the CAP theorem to our scenario and the world of publish-subscribe systems, the requirements of an MMVE have to be specified according to the rules defined by the theorem: its design must define the requirements regarding consistency, availability, and the resilience against network partitions. Choosing these requirements on a system-wide level, for all required event-types and the states the corresponding events manipulate, might not be the optimal granularity. This motivates, from a theoretical viewpoint, a middleware exploiting the individual semantics of different event-types. The part of the world state represented by an event-type could be positioned independently in the CAP triangle. It is obvious that a uniform treatment of all event-types gives away some optimization potential in comparison to a customized treatment of each event-type. If it is possible to reevaluate the limitations given by the CAP theorem at a finer granularity, in this case on the level of event-types and their affected parts of the application's state, it can provide powerful optimization possibilities in environments with a huge number of different event-types. For example, the relaxation of consistency requirements on chat events in an MMVE may provide a huge gain in latency, because stricter consistency models are more expensive in terms of latency [Bre12]. On the other hand, it is crucial that certain event-types, like trade events, guarantee at least sequential consistency. Of course, event-types with different CAP properties have to be free of inter-dependencies. If, for certain event-types and the associated parts of the world state, consistency is only required for a certain time interval for certain nodes, the concept of a “virtual primary copy” as introduced by Lenz [Len97] could be applied. This means that only those nodes join a so-called consistency isle for the time they require consistency.
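The per-event-type placement argued for here can be sketched as a profile table that maps each event-type to its own consistency and latency requirement; the event-type names, levels, and dissemination modes are illustrative only.

```python
# Each event-type carries its own CAP placement instead of one
# system-wide setting (all names and thresholds are hypothetical).
EVENT_PROFILES = {
    "chat":     {"consistency": "eventual",   "max_latency_ms": 500},
    "position": {"consistency": "eventual",   "max_latency_ms": 50},
    "trade":    {"consistency": "sequential", "max_latency_ms": 2000},
}

def dissemination_mode(event_type: str) -> str:
    """Relaxed event-types favor availability and low latency; strict
    ones favor consistency and accept coordination cost (cf. [Bre12])."""
    profile = EVENT_PROFILES[event_type]
    if profile["consistency"] == "sequential":
        return "coordinated"
    return "best_effort"

assert dissemination_mode("chat") == "best_effort"
assert dissemination_mode("position") == "best_effort"
assert dissemination_mode("trade") == "coordinated"
```

Such a table is the smallest conceivable form of the domain-specific configuration model developed in Part III, where the mapping from semantics to technical configuration is automated.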
As this thesis focuses on the design of a distributed publish-subscribe middleware, the proposed approach cannot guarantee any consistency model by itself. An event-dissemination system does not manage any object state on which it could guarantee consistency; it can only give guarantees on the dissemination of events. However, such systems may support the developer of distributed applications, for example by providing order guarantees for event dissemination (cf. section 5.5.4). Moreover, even certain guarantees on event dissemination are helpful to implement consistency protocols, as these require, for example, ordered communication (cf. Attiya [AW98]).


Part III

QoS-Aware Configuration of Publish-Subscribe Systems


“First, solve the problem. Then, write the code.” John Johnson

In this part, a methodology for QoS-aware configuration of distributed publish-subscribe systems is suggested. The methodology consists of a configurable framework, a configuration model that allows developers to define the semantics of event-types, and an automation workflow that interprets these semantics and derives the technical configuration for the framework. The discussion of this methodology is split into five chapters, each focusing on a different aspect of the approach. The first two chapters address some prerequisites. In chapter 7, some basic assumptions for the discussion of the methodology are made. With those assumptions in mind, the initial hypotheses are broken down into requirements for a reference architecture and implementation. To do so, the discussion of QoS-aware configuration of distributed publish-subscribe systems is split into three parts. Chapter 9 addresses the challenges of a design-time configurable publish-subscribe framework. These challenges were formulated in hypotheses 3 and 4; the corresponding requirements for the realization of such a framework are defined in section 7.2. Subsequently, based on a basic reference architecture for publish-subscribe (introduced in chapter 8), a design-time configurable publish-subscribe framework is discussed in chapter 9. Chapter 10 addresses the definition of a reference model for QoS-aware configuration, which is a possible solution to hypotheses 1 and 2. The corresponding requirements for such a model are defined in section 7.3. The fundamental idea of a multidimensional classification is illustrated for MMVEs in section 10.1, and the resulting generic configuration model is introduced in section 10.2. In chapter 11, both the model for QoS-aware configuration and the design-time configurable publish-subscribe framework are brought together to form a methodology for QoS-aware configuration of publish-subscribe systems.
In doing so, hypotheses 5 and 6 are addressed by a workflow for the automatic configuration of publish-subscribe systems; the corresponding requirements are defined in section 7.4. The discussion of the workflow, which realizes these requirements, begins in section 11.1 with a brief formalization of the underlying problem of mapping a semantic description of an event type to a technical configuration. The solution framework addressing this problem is introduced in section 11.2, before two possible workflows are concretized in section 11.3 and section 11.4.


With the introduction of those two workflows, a complete methodology for the QoS-aware configuration of distributed publish-subscribe systems emerges in the form of a reference architecture, which is implemented in the following part.


7 | Requirements and Limitations

Before we discuss the different aspects of the proposed approach, the initial hypotheses are refined into requirements for a possible solution. The identified requirements should help the reader to concretize the challenges that result from the individual hypotheses. In section 7.1, we begin with general assumptions that simplify some discussions and focus this work even further. The research goals of this thesis are threefold. First, the integration of existing approaches is addressed in the form of the challenge to design a design-time configurable framework for publish-subscribe systems. The second goal, the exploitation of event semantics, is addressed by a developer-friendly and domain-specific model for the description of event semantics. The third goal, which aims at the support of application developers, is addressed by a workflow that combines the other two contributions into a complete methodology for the QoS-aware configuration of publish-subscribe systems. In the following sections, the corresponding hypotheses are accordingly split up into requirements in order to define the required characteristics of a solution for the three top-level goals more precisely. Section 7.2 deals with the requirements for the design of a configurable framework for a distributed notification service. Section 7.3 addresses the requirements for the exploitation of event semantics in order to obtain a QoS-aware configuration description. Finally, section 7.4 brings together the technical and the application-driven perspectives by formulating the requirements for an automated decision component that deduces a technical framework configuration from an application-driven configuration description. Before the specification of the requirements, some initial simplifying assumptions and limitations are introduced. The goal of those assumptions is to simplify and focus the discussion on the relevant and novel aspects of the introduced approach.


7.1 Initial Assumptions

For the formal discussion of the proposed framework and the following approach to QoS-aware configuration, we assume some initial simplifications. Some of them will be replaced or dropped during the remainder of the thesis, but they allow for a more focused discussion. Only assumptions are made that may be dropped at the cost of complexity, without hampering the general applicability of the proposed approach.

Network Model
A network, as formally defined in section 8.2, is limited to an evenly distributed topology with a symmetric data-rate. That means the latency between each pair of nodes in the network is constant and the data-rate is the same in both directions of a link. Moreover, we only consider the overlay network of nodes that participate in the system; no underlying router or AS-level topology is considered. Hence, the result is a graph in which each link has two associated costs: latency and data-rate. This simplification introduces some imprecision for large-scale WAN networks compared to a complete AS- or router-level network topology. However, it does not hamper the correctness of the approach, because, for example, a simple LAN can be modeled precisely in this model. The current network model only simplifies the development of the required network simulator.

Fault Model
Basically, a reliable system (cf. section 5.4) is assumed. That means that no interruptions like link or node failures, no network errors like bit flips or message tampering, and no omissions like lost messages are considered in the following discussion. A graceful unsubscribe is, of course, not a failure and is discussed later. As a result of this assumption, security as well as recovery after a failure are prepared for but omitted from the discussion of this thesis. The reason is that failures have no influence on the understanding of the basic methodology for QoS-aware configuration itself and only add additional complexity. The configuration process happens at design-time and is therefore per se not hampered by reliability concerns. For the initial description of the methodology, this assumption allows for a focus on the configuration process and the required components.
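Returning to the network model above, the simplified topology can be captured as a small graph structure in which every link carries exactly the two symmetric costs latency and data-rate. The following sketch illustrates this; all names (OverlayModel, the node identifiers, the example link costs) are invented for illustration and are not part of the thesis prototype:

```python
# Sketch of the simplified network model: an overlay graph whose links
# carry two symmetric costs, latency and data-rate. Illustrative only.

class OverlayModel:
    def __init__(self):
        # frozenset({a, b}) -> (latency_ms, datarate_mbit); a frozenset key
        # makes the link symmetric, as required by the assumption above.
        self.links = {}

    def add_link(self, a, b, latency_ms, datarate_mbit):
        self.links[frozenset((a, b))] = (latency_ms, datarate_mbit)

    def latency(self, a, b):
        # Same value in both directions of the link.
        return self.links[frozenset((a, b))][0]

    def transfer_time_ms(self, a, b, message_bytes):
        latency_ms, datarate_mbit = self.links[frozenset((a, b))]
        # Latency plus the serialization time of the message on the link.
        return latency_ms + (message_bytes * 8) / (datarate_mbit * 1000)

net = OverlayModel()
net.add_link("n1", "n2", latency_ms=20, datarate_mbit=100)
print(net.transfer_time_ms("n1", "n2", message_bytes=1250))  # 20.1
```

A simple LAN, as mentioned above, is modeled precisely by such a graph; a WAN topology loses the underlying router-level detail.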


However, failures at run-time may influence the precision of previously made configuration decisions that did not consider failures. Therefore, the later introduced system model (cf. section 10.2.2) will incorporate exemplary error attributes to show how they can easily be added. Moreover, the discussion of the framework will also consider the necessary configuration possibilities, e.g. acknowledgements, in order to deal with reliability challenges. Even some of the implemented algorithms respect failures, as far as their original papers considered them (cf. the taxonomy in section 5.7). Nevertheless, the focus does not lie on the investigation of different failure scenarios and their impact on the methodology or on the precision of configuration decisions. Only where the general applicability or relevance of the methodology requires the consideration of failures are the corresponding design decisions discussed.

Data Model
Despite the fact that filtering, and therefore content-based publish-subscribe, will be modeled in the framework as well as implemented in the prototype in the form of structured records, the data model for the simulator is limited to unstructured data; the decision component is therefore limited to the optimization of channels. This simplification is owed to the complexity of a content-aware simulator. It would require realistic workloads that reflect the value distribution of all attributes of each event type. The usage of existing application traces is not an option for the proposed methodology, as traces are not available in all cases, especially not for the development of new applications. Therefore, a model has to be found that allows the generation of realistic workloads based on certain semantics of the application and the event types. The development of a developer-friendly model that allows the generation of such realistic, application-dependent workloads constitutes an ample and complex research challenge in itself, which is not the focus of this thesis. Ideas for an extension of the proposed approach to support a content-aware simulation model are discussed in further work (cf. section 19.2.1).

Optimization Problem
The automated decision about the configuration for one channel mathematically constitutes an optimization problem: in a multidimensional search space, a minimum or maximum for a certain parameter has to be found. If the framework supports more than one configuration (i.e. one per channel), the mathematical problem of finding a global optimum is far more complex. Each configuration for a certain channel depends on all possible configurations of the remaining channels. In combination with the simulation approach,


a solution in reasonable time is not possible without simplifications, because the whole multidimensional search space would have to be sampled for all combinations of channel configurations. For c channels and n possible configurations per channel, this theoretically results in n^c sample runs, one for each combination. Therefore, for this thesis it is assumed that the sum of independently optimized channels represents a reasonably good approximation of a global optimization, which reduces the required number of sample runs to one per channel configuration instead of one per combination.

Impact on Use Cases
The introduced simplifying assumptions have some implications for the use cases introduced in section 3.3. All use cases that use content-based subscriptions can only be considered for a manual configuration of the proposed design-time configurable framework. For the further discussion of the multidimensional classification and the automatic configuration workflow, only topic-based use cases are considered.
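The reduction achieved by the independence assumption above can be made concrete with a small calculation. The numbers below are purely illustrative, not measurements from the thesis:

```python
# Cost of sampling the joint search space versus independently optimized
# channels (illustrative numbers only).
channels = 4               # c
configs_per_channel = 10   # n

# Joint optimization: one sample run per combination of channel configs.
joint_runs = configs_per_channel ** channels

# Independence assumption: one run per channel/configuration pair.
independent_runs = channels * configs_per_channel

print(joint_runs, independent_runs)  # 10000 40
```

Even for these small values the combinatorial search space dwarfs the per-channel one, which is what makes the approximation attractive for a simulation-based workflow.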

7.2 Requirements for a Design-Time Configurable Framework

Hypothesis 3 suggests a set of requirements for the design of a framework that allows the integration of all major aspects of publish-subscribe. They result from software engineering principles for the design of modularized architectures and from the literature analysis in part II.

Req 3.1: Find a suitable modularization for the aspects of publish-subscribe systems introduced by the literature analysis.

Req 3.2: Define a model for the interaction of the different modules. The model defines valid compositions of modules and limits their exchangeability.

Req 3.3: Find a suitable description of a composition of the modules at design-time. This description is generated automatically. Hence, the configuration does not have to provide a comfortable syntax, but it should still be human-readable for debugging or the occasional manual configuration. Moreover, it should be able to express all possible configurations of the framework and must be robust enough to detect misconfigurations.

Req 3.4: Identify minimal and stable interfaces for each module, reflecting the different existing approaches for each aspect.


Req 3.5: Ensure that the interfaces and modules respect extensibility. It should be easy to add new algorithms to the existing framework.

Req 3.6: Allow more than one composition of modules in one system in order to support event-specific optimization. That means the configuration of the behavior should be specific to event types.

If we take hypothesis 4, which aims for minimal implementation overhead, into account, the set of requirements must be extended by a few design constraints. Obviously, the fulfillment of such requirements depends largely on the programming language used and is therefore discussed in part IV, which describes the proof-of-concept implementation.

Req 4.1: Minimize the overhead required for messages. Even though the middleware is configurable, the run-time overhead should be minimal. This applies to overhead caused by algorithms and especially to overhead caused by the configurability.

Req 4.2: Minimize the abstractions and the resulting indirections in order to optimize for processing speed.

Req 4.3: Find a suitable programming model for comfortable extensibility with respect to Req 4.1 and Req 4.2.

Req 4.4: Exploit compilers in addition to programming language capabilities to fulfill Req 4.1 and Req 4.2. This is possible because the configurability is limited to design-time.
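A composition description in the sense of Req 3.3 and Req 3.6 could, for instance, take the following shape: a machine-generated but human-readable mapping from event types to module choices, together with a minimal robustness check against misconfiguration. All module and event-type names here are hypothetical and only illustrate the idea:

```python
# Hypothetical composition description (Req 3.3/3.6): one module
# composition per event type, generated rather than hand-written.
composition = {
    "PositionUpdate": {"overlay": "pastry", "routing": "rendezvous",
                       "delivery": "best_effort"},
    "ChatMessage":    {"overlay": "ip", "routing": "broker",
                       "delivery": "acknowledged"},
}

# Minimal robustness check (Req 3.3): every chosen module must be known.
VALID = {
    "overlay":  {"pastry", "ip"},
    "routing":  {"rendezvous", "broker", "hierarchical"},
    "delivery": {"best_effort", "acknowledged"},
}

def validate(comp):
    for event_type, modules in comp.items():
        for aspect, choice in modules.items():
            if choice not in VALID.get(aspect, set()):
                raise ValueError(f"{event_type}: invalid {aspect} '{choice}'")
    return True

print(validate(composition))  # True
```

The per-event-type structure directly mirrors Req 3.6: each event type may select its own composition of modules.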

7.3 Requirements for a QoS-aware Configuration Description

The requirements for a QoS-aware configuration description can be derived from hypothesis 1, which aims at easing the configuration effort for developers by means of a classification. In order to describe such a classification of event types, some sort of notation is required. The classification can either be specified in a general-purpose language, or a domain-specific language (DSL) can be designed. As a developer-friendly notation is required, a DSL has some advantages, because it allows the semantics and syntax of the language to be defined exactly. The language can be designed with the sole purpose of being a specification language for event semantics. The language can, however, be based on an existing markup language like XML or YAML in order to speed up the development


process for parsing and validation. Some of the following requirements were introduced by Wahl [Wah13] in the context of the configuration language's design.

Req 1.1: Find a suitable meta-model for the classification of the event semantics of DEBS. This meta-model should allow deriving the required system attributes for simulations and automated decisions.

Req 1.2: A DSL has to be designed that uses the meta-model for classification. It must be possible to describe the semantic properties of an event type in terms of the defined meta-model.

Req 1.3: The DSL must support the description of the events' payload.

Req 1.4: QoS requirements should be expressible for event types. This includes the optimization targets for an automated decision on the configuration, as well as the direct specification of system attributes.

Req 1.5: Modules as required by Req 3.1 must be describable in terms of the DSL in order to specify the necessary information for the automated configuration process. This includes constraints like exclusions or interdependencies between modules, or their usage in the classification.

Hypothesis 2 enhances the classification idea by the provision of a domain-specific terminology in order to ease the configuration effort even further.

Req 2.1: The meta-model defined in Req 1.1 must support a domain-specific instantiation. Such an instantiation should enable domain experts to build classifications in the terminology of certain domains.

Req 2.2: The DSL required by Req 1.2 has to support the modeling of different domain-specific profiles that represent instantiations of the meta-model.

Req 2.3: For reusability, the DSL must support an independent description of network characteristics, as discussed in section 4.3, in order to be able to combine the network characteristics with domain profiles. The result is an application profile that describes the target network environment as well as a tailored classification in the domain's terminology.

In addition to these functional requirements, some non-functional characteristics should be considered during the design of such a configuration model.

Req NF.1: The description in the required DSL should be declarative. The logic of a configuration decision and the declaration of the configuration description should be separate. The reason is the well-known principle of separation of concerns.


Req NF.2: Independence of implementation requires that the configuration descriptions and the implementation of the decision component be separate. This ensures the reusability of the descriptions, even if the implementation of the decision component changes.

Req NF.3: Extensibility is an essential requirement for maintainable languages. It should be easily possible to extend the descriptions with new dimensions and system attributes.

Req NF.4: Reusability of descriptions poses another non-functional requirement. It must be possible to reuse written descriptions, or parts of them, in other contexts.
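To make Req 1.1 through Req 2.3 more tangible, an event-type description in such a DSL might look as follows. It is sketched here as a plain Python structure; the attribute, dimension, and profile names are invented for illustration and do not reproduce the actual language designed by Wahl [Wah13]:

```python
# Hypothetical event-type description in the spirit of the DSL requirements.
event_type = {
    "name": "PositionUpdate",
    "profile": "mmve",          # domain-specific instantiation (Req 2.1/2.2)
    "payload": {"x": "float", "y": "float", "z": "float"},  # Req 1.3
    "classification": {          # dimensions of the meta-model (Req 1.1/1.2)
        "frequency": "high",
        "relevance": "local",
    },
    "qos": {                     # Req 1.4: target and direct attributes
        "optimize": "latency",
        "max_latency_ms": 100,
    },
}

# Req 2.3: an independently described network, combined with the domain
# profile into an application profile.
network = {"name": "lan", "latency_ms": 5, "datarate_mbit": 1000}
application_profile = {"domain": event_type["profile"], "network": network}
print(application_profile["network"]["name"])  # lan
```

Note that the description is purely declarative (Req NF.1): it states semantics and requirements, but contains no decision logic.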

7.4 Requirements for an Automated Design-time Configuration

Hypothesis 5 postulates an automated configuration workflow for the design-time configurable framework. The goal is to find a workflow that allows the developer to configure the framework in a declarative and easy-to-use way by specifying the semantics of each event type. Based on this declarative description of the event semantics, an optimally configured library should be composed and compiled. Therefore, the workflow is based on the requirements suggested for the framework in section 7.2 and for a classification-based DSL in section 7.3. Together, the framework, the classification-based description, and the automated workflow form a methodology for the QoS-aware configuration of publish-subscribe systems. However, for the design of the workflow, a decision has to be made regarding the method by which a framework configuration is assessed with respect to its suitability for a certain event-type description. Basically, two methods are possible: on the one hand, an analytical or heuristic approach that requires internal knowledge; on the other hand, a measurement-based black-box approach. Both have their advantages and disadvantages, but a black-box approach certainly is more flexible and more easily extensible, because no knowledge of the behavior of the assessed framework configuration is required. Therefore, a measurement-based approach is taken to assess different framework configurations. As these measurements cannot be taken in the real world for sufficiently large scenarios, simulation seems to be the best choice. As a consequence, the following requirements can be identified. Some of them were defined by Wahl [Wah13] for the design of the decision component.


Req 5.1: An automated workflow for design-time configuration should be able to identify possible configurations of the framework that fulfill the description of an event's semantics. This step spans the search space for the optimal configuration.

Req 5.2: The search for the best-suited composition should be done based on measurements. These measurements are gathered by simulation, as real-world measurements of a distributed system would consume too many resources.

Req 5.3: The automated workflow should incorporate the deduction of all required simulation parameters based on the defined DSL.

Req 5.4: The workflow should integrate a black-box approach for the collection of measurements. The reason is the reusability and extensibility of the automated workflow, in contrast to a white-box approach. As the implementation is used as a black box, changes to the implementation of the configured library can be directly mirrored in the measurements.

Req 5.5: An automated decision for the best-suited composition should be possible. This step takes the QoS requirements and the measurements into account.

Req 5.6: All required configuration files for the compilation step should be generated automatically.

Req 5.7: The compilation of the automatically configured middleware must be performed. This enables the developer to use the middleware immediately after a completed configuration process. This automation eases the integration of the whole methodology into agile development processes like continuous integration.

Hypothesis 6 targets the performance of such a configuration workflow. A workflow based on simulations can be quite expensive, especially if a multidimensional search space must be sampled. Therefore, the sampling process should be performed only once, with as few simulations as possible, without introducing too large an error. As a consequence, some sort of interpolation or regression mechanism is required in order to generate usable models with only a few sampling points per dimension. Suitable mechanisms are either parametric or non-parametric regressors (cf. section 11.4.1). Parametric regressors assume some sort of fixed base function that is only calibrated by parameters to fit the measurements. Thus, with respect to extensibility, for parametric regression a base function would have to be found that covers all possible parameter growth behaviors. This is nearly impossible if future extensions are also considered. Therefore, the approach is limited to non-parametric regressors. The resulting requirements for the automated workflow are the following:


Req 6.1: The whole sampling process should be performed only once per domain. The naive approach that samples the search space for each configuration process is very time-consuming for some scenarios and should be optimized.

Req 6.2: A sampling mechanism should be found that samples as coarse-grained as possible. As each simulation takes time, fewer simulations reduce the time consumption of the sampling process.

Req 6.3: A suitable non-parametric regression mechanism must be found that is able to cope with coarse sampling rates without introducing an unreasonable error into the decision.
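A minimal non-parametric regressor in the spirit of Req 6.3 is sketched below: a simple inverse-distance-weighted estimate over a few coarse sample points. It assumes no fixed base function, which is exactly the property argued for above; the sample data is invented and the thesis itself leaves the concrete choice of regressor open until section 11.4.1:

```python
# Non-parametric regression sketch: inverse-distance weighting over
# coarsely sampled simulation results (illustrative data only).

def idw_predict(samples, x, power=2):
    """samples: list of (parameter_value, measured_value) pairs."""
    num = den = 0.0
    for xi, yi in samples:
        d = abs(x - xi)
        if d == 0:
            return yi  # exact sample point: return the measurement directly
        w = 1.0 / d ** power
        num += w * yi
        den += w
    return num / den

# Coarse samples: (publish rate, measured mean latency in ms)
samples = [(10, 12.0), (100, 15.0), (1000, 40.0)]
print(idw_predict(samples, 10))   # exact hit: 12.0
print(idw_predict(samples, 55))   # interpolated estimate between samples
```

Unlike a parametric fit, adding a new dimension or new sample points requires no change to the mechanism itself, which matches the extensibility argument made above.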


8 | Basic Reference Architecture

The notion of terms and abstractions used in this thesis is clarified in this chapter. To this end, a basic reference architecture for distributed publish-subscribe systems is introduced. The aim is to provide a simple model that does not restrict any aspect discussed in part II. Moreover, it aims to clarify and separate concepts that are a prerequisite for the modularization of a configurable publish-subscribe framework. It also lays the foundation and provides the terminology for the remaining discussion. In chapter 9, it will be extended to form a design-time configurable framework. The literature already suggests many formal models and architectures describing publish-subscribe systems, but they are all designed with their respective research goals in mind. For example, Virgillito [Vir03] proposes a basic model based on the classical safety and liveness properties of distributed systems. Mühl suggests in [Mü02, MFP06] a similar but more exhaustive formal model for the routing semantics of publish-subscribe systems. Bittner [Bit08] introduces a similar yet simpler formal reference model. Attiya [AW98] models networks for distributed systems with the focus on describing generic distributed computing. None of those models explicitly deals with configurability. In the following chapter, the different existing modeling viewpoints are adapted to the requirements of this thesis and provide a formal foundation for further discussion and enhancement that is kept as simple as possible. The architectural paradigm of our reference model is not fixed to client/server or peer-to-peer, because the application programmer should be free to decide whether a participating node has the role of a client, a server, or both. For this reason, we speak solely of nodes that send messages to other nodes. However, the term node has to be distinguished further with respect to the layer discussed. As the research goals impact different aspects, three abstraction layers are introduced, as shown in Figure 8.1:

Application Layer
The application layer represents the application domain. For example, in the case of MMVEs this would be the virtual world populated by thousands of avatars. Each avatar is located on one application node that manages

[Figure 8.1: Abstraction layers. The application layer comprises application nodes, event types, state and events, and offers publish(), deliver(), subscribe() and unsubscribe(). The notification service layer consists of multicast trees with tree nodes and offers route(), forward() and deliver() for messages. The overlay network layer consists of overlay networks with key-addressed overlay nodes.]

its state and game logic. Generally speaking, all application nodes together form the distributed application. For our model, it is assumed that an application uses event-based computing as its programming paradigm. That implies that an application uses notifications to communicate between application nodes in an asynchronous fashion. For one-to-many communication, the publish-subscribe paradigm is assumed. As defined in chapter 5, notifications represent events that occur in the application.

Notification Service Layer
The notification service layer provides the API of a publish-subscribe middleware and is used by the application to send notifications to subscribers. The model describes an abstract view of publish-subscribe. A dissemination process is modeled as a multicast tree. Each multicast tree defines the logical dissemination structure for at least one notification. A logical multicast tree is the only similarity between all routing concepts discussed in section 5.3. In a broker-based routing scheme, these trees are calculated dynamically on a per-notification basis using the routing tables. If a hierarchical or rendezvous-based routing scheme is employed, they take on a more static form and replace routing tables.


Overlay Network Layer
The overlay network layer abstracts from the physical network and provides a flexible and uniform interface. All multicast trees use the overlay network layer for the transport of messages, based on a common KBR API (cf. chapter 4). All instantiated multicast trees are thus mapped onto the overlay network, meaning each tree node has one associated overlay node. Each overlay node is addressed via a unique key, defining the routing address of the overlay node. The reference model supports different overlay networks, e.g. a Pastry [RD01] address space and a simple IP-based address space, each responsible for different tree nodes. No tree node can be part of two different overlay networks.

The proposed system model provides an abstract view of notification services. The reason for this three-layer abstraction lies in the flexibility and generality of the abstractions made. The overlay network layer has been introduced to narrow the physical network down to those nodes participating in the system and to make them uniformly addressable. It explicitly separates the management of the network topology from the management of the logical dissemination trees. This separation is often omitted, as for example in the model of Mühl [Mü02]. That makes it difficult to reason about the topology of network nodes and the dissemination trees as the two separate concepts they are. Even though Hermes employs an overlay network for routing and builds logical dissemination trees on top of it, Pietzuch did not separate those two concepts in [Pie04]. The omission of this separation is fine as long as no configurable approach is followed, because the overlay substrate is fixed in those cases. However, a conceptual distinction between the overlay network layer and the notification service layer is a required abstraction for a configurable integration framework. It allows for the integration of all existing structured overlay networks as well as routing paradigms for event-based systems, as far as they are discussed in part II. Moreover, it also allows for a free combination of their different implementations. As a consequence, this separation is a basic prerequisite for an integration framework as postulated by hypothesis 3. The next sections will discuss and illustrate the three abstraction layers in depth. Moreover, their respective interface functions are discussed in section 8.2 and section 8.3. First, the application layer will be formally described, then the overlay network layer, and finally the notification service layer. This order is owed to the fact that the notification service layer uses the definitions of the application layer and the overlay network layer.
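The separation of the two lower layers can be sketched as two minimal interfaces. The method names follow the terminology of Figure 8.1; the classes themselves and the exact signatures are illustrative, not the thesis framework's actual API:

```python
# Sketch of the layer separation: a key-addressed overlay transport and a
# publish-subscribe front-end built on top of it. Illustrative interfaces.
from abc import ABC, abstractmethod

class OverlayNetwork(ABC):
    """Overlay network layer: KBR-style, key-addressed message transport."""
    @abstractmethod
    def route(self, key, message): ...     # start dissemination towards key
    @abstractmethod
    def forward(self, key, message): ...   # callback on intermediate hops
    @abstractmethod
    def deliver(self, key, message): ...   # callback on the destination node

class NotificationService(ABC):
    """Notification service layer: publish-subscribe API for the application."""
    def __init__(self, overlay: OverlayNetwork):
        # Every multicast tree is mapped onto nodes of this overlay.
        self.overlay = overlay
    @abstractmethod
    def publish(self, event_type, payload): ...
    @abstractmethod
    def subscribe(self, event_type, callback): ...
    @abstractmethod
    def unsubscribe(self, event_type, callback): ...
```

Keeping the overlay behind its own interface is what allows different overlay substrates (Pastry, plain IP) to be exchanged underneath the same notification service, as argued above.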


8.1 Application Layer

The application layer represents the application domain. A classical middleware provides services for this layer. We assume an event-based system as an application. For our purposes, we define the relevant aspects of an application in Definition 9. An application has a state which changes over time. In event-based systems, each state change is represented by the occurrence of an event. A notification is communicated between application nodes in a publish-subscribe fashion to distribute the occurrence of an event. For the remainder of this thesis, the terms event and notification are used synonymously, because the difference is not relevant for the further discussion. Each event has a type that can be characterized by a semantic description. This description may consist only of a type name for topic-based publish-subscribe, or of a schema for content-based publish-subscribe. It may even contain meta-data, e.g. the QoS guarantees an event type requires. Depending on the notification middleware used, the description may be evaluated at run-time [AR02] or at design-time, as discussed in the remainder of this part. A publish-subscribe system only provides a notification service for application nodes. Hence, it is not necessary to model a complete stateful application. However, some domain information in the form of relevant application knowledge is required to describe certain application characteristics and is therefore considered in our model. We formally define an application using such a system in the following way:

Definition 9:
(1) An application A is defined as a tuple A := (U, L^App), with U := {τ_1, τ_2, ..., τ_k} being the set of k event-types and L^App := {l^App_1, l^App_2, ..., l^App_n} the set of n application nodes.
(2) E^t := {e_1, e_2, ..., e_i} is the set of i events e that occurred in an application A until application-time t.
(3) Each application node l^App has a state^t(l^App) at application-time t after all events e ∈ E^t have been processed. This state may be seen as application knowledge.
(4) An event e has an origin(e), a type(e), a header(e), a timestamp(e) and a payload(e).
(5) An event-type τ moreover defines a schema schema(τ) that describes the set of used attributes {a_1, ..., a_n}.
(6) An event-type τ ∈ U defines a set E^{τ,t} ⊆ E^t fulfilling the following condition: ∀e ∈ E^{τ,t} : type(e) = τ.
(7) Each application must define a time-stepping function incrementing application time t. A time-step may be triggered by events or by wall-clock time, depending on the application.

Definition 9 introduces time t as an important dimension for DEBS. The time-stepping function discretizes time, which is common to some applications. For example, MMVEs often employ discrete-event system simulation [BCNN01, Fuj90] as their concept for world simulation. However, this definition does not aim to restrict any applications; it merely gives an easy-to-understand notion of time. Real-time applications, for example, may use the system time as provided by the operating system as their application time and define their time-stepping function appropriately. So the notion of a time-step, as well as of time itself, is highly application-dependent. Time and the application's notion of it play a significant role when talking about order or delivery guarantees and should therefore be modeled here. For example, Cayuga [DGH+ 06] is a publish-subscribe system with a very strict time-stepping model. For the remainder of the formal discussion in this thesis, if not stated otherwise, all definitions are meant at a certain point in time t. Hence, from now on, the index that indicates t is generally omitted for clarity.
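Definition 9 translates almost directly into a small data model. The following sketch mirrors the definition's fields; the concrete event type and values are invented for illustration:

```python
# Sketch of Definition 9: events, event types, and the set E^{τ,t}.
from dataclasses import dataclass, field

@dataclass
class EventType:
    name: str
    schema: dict  # attribute name -> attribute type, cf. Def. 9 (5)

@dataclass
class Event:
    origin: str       # Def. 9 (4): origin(e)
    type: EventType   # type(e)
    timestamp: int    # timestamp(e), in application time t
    header: dict = field(default_factory=dict)
    payload: dict = field(default_factory=dict)

pos = EventType("PositionUpdate", {"x": "float", "y": "float"})
e = Event(origin="node-1", type=pos, timestamp=42, payload={"x": 1.0, "y": 2.0})

# E^{τ,t}: all events of type τ that occurred up to time t, cf. Def. 9 (6)
events = [e]
e_tau = [ev for ev in events if ev.type is pos and ev.timestamp <= 42]
print(len(e_tau))  # 1
```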

8.2 Overlay Network Layer

The lowest layer of the reference architecture is the overlay network layer. It abstracts from the physical network and provides a flexible and uniform communication interface. All multicast trees use the overlay network layer for the transport of messages. This abstraction is necessary as the application may run as a peer-to-peer application or as a classic client/server application without any DHT-based routing substrate. Based on [AW98], an overlay network may be formally modeled as follows:

Definition 10:
(1) An overlay network is a directed graph G = (L^Net, O).
(2) L^Net := {l^Net_1, l^Net_2, ..., l^Net_n} is the set of n overlay nodes in the graph.
(3) O := {(l^Net_i, l^Net_j) | l^Net_i, l^Net_j ∈ L^Net} is the set of directed edges from l^Net_i to l^Net_j.
(4) M is the set of all messages sent over an overlay network G.
(5) An edge o = (l^Net_i, l^Net_j) between the overlay nodes l^Net_i and l^Net_j defines two sets: an input buffer in_o[l^Net_j] and an output buffer out_o[l^Net_i], each containing messages m ∈ M.
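Definition 10's buffered edges can be exercised with a toy implementation, anticipating the route, send and deliver functions formalized below. The two-node topology, the key, and the class names are invented; the single-hop routing is a deliberate simplification:

```python
# Toy realization of Definition 10: overlay nodes connected by directed
# edges, each edge with an output buffer at its source node and an input
# buffer at its target node. Topology and keys are illustrative.
from collections import defaultdict, deque

class Overlay:
    def __init__(self, edges, key_map):
        self.edges = set(edges)        # O: directed (src, dst) pairs
        self.key_map = key_map         # mapkey: key -> overlay node
        self.out = defaultdict(deque)  # out_o at the edge's source
        self.inp = defaultdict(deque)  # in_o at the edge's target
        self.delivered = []            # D: delivered messages

    def route(self, node, msg, dest_key):
        # Simplified single-hop routing: put the message into the output
        # buffer of the edge leading to the destination node.
        dst = self.key_map[dest_key]
        self.out[(node, dst)].append((msg, dest_key))

    def send(self, edge):
        # Move a message from the edge's output buffer to its input buffer.
        self.inp[edge].append(self.out[edge].popleft())

    def deliver(self, node, edge):
        msg, dest_key = self.inp[edge].popleft()
        assert self.key_map[dest_key] == node  # only the destination delivers
        self.delivered.append(msg)

ov = Overlay(edges={("n1", "n2")}, key_map={"k2": "n2"})
ov.route("n1", "hello", "k2")  # dissemination: route, send, deliver
ov.send(("n1", "n2"))
ov.deliver("n2", ("n1", "n2"))
print(ov.delivered)  # ['hello']
```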

In order to be able to uniformly address overlay nodes independently of the actual underlying physical network, we incorporate the KBR API introduced in [DZD+ 03] as the interface provided by the overlay network layer (cf. section 4.2). Each overlay node is identified via a unique key, defining the routing address of the overlay node.

[Figure 8.2: Reference model of the overlay network layer. A message m is routed at node l^Net_1 (route), appended to the output buffer out_o[l^Net_1], sent over the edge o (send) into the input buffer in_o[l^Net_2], and finally forwarded (forward) or delivered (deliver) at node l^Net_2.]

Figure 8.2 shows how the overlay network is modeled according to Definition 10. The following functions are defined to formally represent its behavior:

Definition 11:
(1) A message's destination is described by a unique key k using dest : M → K, k ∈ K, with K being the available key-space. Each key maps to exactly one overlay node: mapkey : K → L^Net. keys : L^Net → K returns the keys associated with one overlay node.
(2) A message terminates if it is delivered and consumed by the application, depicted by the set D containing all delivered messages.

The interface and dissemination behavior of each overlay node can be defined using four functions, as formalized in Definition 12. Route is actively used to start the dissemination



of a message m, whilst deliver and forward are callback functions used to deliver a message to the application. Forward is called on intermediate hops, while deliver is only called on the destination overlay node.

Definition 12:
(1) route[l^Net_i](m): initially determines the routing and adds a message m to the appropriate buffer out_o[l^Net_i].
(2) deliver[l^Net_i](m): removes a message m from in_o[l^Net_i] and delivers it (adds it to D), if mapkey(dest(m)) = l^Net_i, o ∈ O.
(3) forward[l^Net_i](m): removes a message m from in_o[l^Net_i] and adds it to the appropriate out_o'[l^Net_i] for o, o' ∈ O.
(4) send[o](m): sends a message m over an edge o = (l^Net_i, l^Net_j) ∈ O. It removes the message from out_o[l^Net_i] and adds it to in_o[l^Net_j].
(5) A dissemination of message m with q hops is a sequence of function calls: Θ_q(m) := route[l^Net_0](m), send[o_0](m), forward[l^Net_1](m), send[o_1](m), ..., forward[l^Net_q](m), deliver[l^Net_q](m).
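A minimal executable model of these buffer operations, with a plain next-hop table standing in for the KBR substrate, can look as follows. All names here are choices made for illustration, not the thesis API.

```python
# Toy model of Definition 12: overlay nodes connected by directed edges,
# each edge carrying an output buffer at its source and an input buffer
# at its destination. The set D of delivered messages is `delivered`.

class Overlay:
    def __init__(self):
        self.nodes = {}          # key -> node id (the mapkey function)
        self.edges = {}          # (src, dst) -> {"out": [], "in": []}
        self.next_hop = {}       # (node, dest key) -> next node (routing table)
        self.delivered = []      # the set D of delivered messages

    def add_edge(self, src, dst):
        self.edges[(src, dst)] = {"out": [], "in": []}

    def route(self, node, msg):
        # route[l](m): pick the outgoing edge toward the destination key
        dst = self.next_hop[(node, msg["dest"])]
        self.edges[(node, dst)]["out"].append(msg)
        self.send((node, dst), msg)

    def send(self, edge, msg):
        # send[o](m): move m from out_o[src] to in_o[dst]
        self.edges[edge]["out"].remove(msg)
        self.edges[edge]["in"].append(msg)
        self._receive(edge[1], edge, msg)

    def _receive(self, node, edge, msg):
        self.edges[edge]["in"].remove(msg)
        if self.nodes[msg["dest"]] == node:
            self.delivered.append(msg)       # deliver[l](m): add to D
        else:
            self.route(node, msg)            # forward[l](m): to the next hop

# Three-node chain n0 -> n1 -> n2; key "k2" maps to node n2.
ov = Overlay()
ov.nodes["k2"] = "n2"
ov.add_edge("n0", "n1"); ov.add_edge("n1", "n2")
ov.next_hop[("n0", "k2")] = "n1"; ov.next_hop[("n1", "k2")] = "n2"
ov.route("n0", {"dest": "k2", "payload": "hello"})
print(ov.delivered)  # the message reached n2 via one forward hop
```

The call chain route, send, forward, send, deliver produced by this toy is exactly a dissemination sequence Θ_1(m) in the sense of Definition 12 (5).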

8.3 Notification Service Layer

Between the overlay network layer and the application layer, the notification service layer provides the front-end of the event dissemination middleware used by the application. Different event-types are mapped onto a key-based routing overlay network. An event-type is represented by one multicast tree in the publish-subscribe layer. The type of routing mechanism determines how this mapping is performed. The multicast tree can either be calculated dynamically based on the routing tables, if a broker-based approach to routing is chosen, or it represents the routing structure by itself, as is the case in hierarchical or rendezvous-based approaches. In any case, the multicast tree determines the next routing hops during the dissemination of an event and is applicable to peer-to-peer as well as IP-based network topologies. We define a multicast tree as follows:

Definition 13:
(1) A multicast tree T := (s, R, L^Tree, O_T) is defined by its root s ∈ L^Tree, the set of



receiving nodes R ⊆ L^Tree, all participating tree-nodes L^Tree ⊆ L^Net, and a set of links O_T ⊆ O between them.
(2) An event e is mapped to a message m by msg : E → M. Therefore the following property must hold: ∀m ∈ M^t : ∃e ∈ E^t | msg(e) = m.
(3) A filter f defines a predicate F(e) that determines the subset E^{f,t} of E^t which fulfills the predicate. A function ftype(f) returns the event-type τ for a filter f.

On each channel we define four functions, providing the interface for the application layer. For simplicity, we omit advertising in this API.

Definition 14:
(1) publish_{l^Net}(e): publishes an event e on tree T^τ for τ = type(e).
(2) subscribe_{l^Net}(f): adds an overlay node l^Net to the set of subscribers R_{T^{ftype(f)}} of tree T^{ftype(f)} and registers a filter f.
(3) unsubscribe_{l^Net}(f): removes a node l^Net for filter f from R_{T^{ftype(f)}} of tree T^{ftype(f)}.
(4) deliver_{l^Net,T}(e): a callback function consuming the event in the application, terminating dissemination by adding it to the set of delivered events D.
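The four operations of Definition 14 can be illustrated with a toy, single-process stand-in (no overlay, one tree per event-type, filters as plain predicates). All names are illustrative, not the prototype's API.

```python
# Toy sketch of the notification-service API from Definition 14.

class Tree:
    def __init__(self, event_type):
        self.event_type = event_type
        self.subscribers = {}          # node -> filter predicate F(e)

class NotificationService:
    def __init__(self):
        self.trees = {}                # event-type -> multicast tree T^tau
        self.delivered = []            # the set D of delivered events

    def subscribe(self, node, event_type, predicate):
        tree = self.trees.setdefault(event_type, Tree(event_type))
        tree.subscribers[node] = predicate

    def unsubscribe(self, node, event_type):
        self.trees[event_type].subscribers.pop(node, None)

    def publish(self, event):
        tree = self.trees.get(event["type"])
        if tree is None:
            return
        for node, pred in tree.subscribers.items():
            if pred(event):
                self.deliver(node, event)

    def deliver(self, node, event):
        # callback consuming the event: terminate dissemination, add to D
        self.delivered.append((node, event))

ns = NotificationService()
ns.subscribe("s1", "position", lambda e: e["y"] > 5)
ns.publish({"type": "position", "x": 10, "y": 20})
ns.publish({"type": "position", "x": 1, "y": 1})
print(ns.delivered)   # only the first event matches the registered filter
```

The actual middleware disseminates over the multicast trees of Definition 13 instead of a local loop, but the interface contract is the same.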

8.4 Summary

In this chapter, a basic reference architecture for distributed publish-subscribe has been introduced. The architecture is based on existing formalisms, but adapted to form the foundation for an extension that allows design-time configuration. For the required flexibility, a three-tier architecture is suggested that abstracts the physical network to a KBR-enabled overlay network on the bottom layer. A notification service layer provides a further abstraction to a multicast-based publish-subscribe interface. The top layer describes the notion of an application in the context of further discussions. With this basic reference architecture at hand, a precise discussion of the proposed extensions is possible, in order to form a reference architecture for a methodology realizing QoS-aware configuration of publish-subscribe systems.


9 | A Design-Time Configurable Publish-Subscribe Framework

The previously discussed basic reference architecture provides the terminology and basic abstractions required to discuss a design-time configurable framework. The goal of this framework is configurability at design-time as defined by the requirements in section 7.2. This implies that configuration takes place before the resulting middleware is compiled and deployed. Therefore, no state migration or other issues rooted in run-time adaptability must be considered. Figure 9.1 sketches the basic components of the proposed framework: the notification service layer exposes publish(), subscribe(), unsubscribe() and deliver(), and consists of channels, each configured by strategies and managing one or more trees.

Figure 9.1: Basic components of the notification service framework

Two important aspects were modeled in order to enable configurability; both aim for flexibility and extensibility. On the one hand, channels were introduced as a fundamental abstraction. A channel is the basic building block and manages one or more logical multicast trees. Trees manage the dissemination structure for events, and as it is possible to partition the domains of event attributes, more than one tree is allowed per channel. On the other hand, the capabilities of each tree, as well as tree-spanning capabilities of the channel itself, are configured by strategies, inspired by the Strategy pattern [GHJV94].



For example, the number of trees a channel has, and how events are partitioned among those trees, is defined by partition strategies. In the following, we discuss the channel abstraction in section 9.1, cover all identified strategies in section 9.2, and describe the processing model that defines the interaction of the different strategies in section 9.3.

9.1 Channels

A publish-subscribe system consists of one or more channels. Each channel has a certain behavior, defined by a composition of strategies. This abstraction(1) enables the framework to provide channels with completely different semantics in terms of the strategies available. In this framework, the configuration Y of a channel c is a combination of strategies y^i, one for each strategy type Y^j. Ideally, the configuration of a channel is automatically derived from a description of the corresponding event type (cf. chapter 11). Generally, the definition of a channel based on n strategy types Y^1, ..., Y^n can be detailed as follows:

Definition 15:
(1) A channel c := (U_c, T, Y) is defined by one or more multicast trees T^i ∈ T, an n-tuple of strategies Y := (y^1, ..., y^n) ∈ Y^1 × ... × Y^n, called its configuration, and a set of event-types U_c ⊆ U. C is the set of all channels c.
(2) A strategy type Y consists of one or more strategies y. y is the set of all strategies y, regardless of their type; Y is the set of all strategy types Y.
(3) A channel c can transport events of all event-types τ^i ∈ U_c with the same composition of strategies Y.
(4) A header of an event e contains custom parts for each strategy y^i, accessible by header_{y^i}(e).

In addition, based on Definition 13, channels are characterized by a set of multicast trees and a set of event-types. Each event type a channel can transport must adhere

(1) Hermes [PB02] follows a similar abstraction, but without the configurability. They built a hybrid type/content-based system with different types that define groups on which content-based filtering is allowed.



to the same combination of strategies, as these strategies define the capabilities and behavior of the channel.
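The idea of a channel as an n-tuple of interchangeable strategies (Definition 15) can be sketched in the spirit of the Strategy pattern. This is a minimal sketch assuming hypothetical strategy names; only one configuration slot (the partition strategy) is shown.

```python
# Two interchangeable partition strategies for the same channel abstraction.

class SingleTreePartition:
    def get_trees(self, event, trees):
        return trees                     # all events share one tree

class HashPartition:
    def __init__(self, attr, n):
        self.attr, self.n = attr, n
    def get_trees(self, event, trees):
        # cluster events onto a bounded number of trees by one attribute
        return [trees[hash(event[self.attr]) % self.n]]

class Channel:
    def __init__(self, event_types, trees, partition):
        self.event_types = event_types   # U_c
        self.trees = trees               # T
        self.partition = partition       # one slot of the configuration Y

    def trees_for(self, event):
        # which trees a published event would be disseminated on
        assert event["type"] in self.event_types
        return self.partition.get_trees(event, self.trees)

# Same channel abstraction, two different configurations:
c1 = Channel({"chat"}, ["t0"], SingleTreePartition())
c2 = Channel({"position"}, ["t0", "t1", "t2"], HashPartition("region", 3))
print(c1.trees_for({"type": "chat", "text": "hi"}))                  # ['t0']
print(len(c2.trees_for({"type": "position", "region": "Nuremberg"})))  # 1
```

A full configuration would fill the remaining slots (routing, filter, rendezvous, delivery, order, timeliness) the same way, which is what the strategy types of section 9.2 formalize.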

9.2 Strategies

Each channel can be individually tailored to specific requirements. The limit of this configurability is given by the strategy types available and their interaction model. Each strategy type defines a group of strategies which are implemented by existing algorithms. Strategies adhere to the same interface, which is specified per strategy type. They may only differ in their non-functional properties and, to a certain limited degree, in their functional properties, as long as they adhere to the specification of the strategy type. For example, the degree of guaranteed order may differ between order strategies, not only their performance in a certain application scenario. Currently, seven strategy types have been identified in this framework in order to configure a channel. The partitioning into these concrete strategy types is motivated by the analysis of the different aspects discussed in the literature (cf. part II) and therefore driven by implementation and optimization possibilities. We omit security-related aspects due to the limitations introduced in section 7.1. New strategy types can easily be added, if the corresponding semantics fit into the existing interaction model that is discussed in section 9.3.

Figure 9.2: Strategies and their responsibility during dissemination

The currently identified strategy types are shown in figure 9.2. They have the following responsibilities:

Routing: Routing is the logic of event dissemination and creates the dissemination structure for the events (cf. section 5.3). An implementing strategy must be able



to formulate this logic as a multicast tree, as this is the fundamental processing constraint all strategies agree on within this framework.

Filter: Filter strategies permit attaching a filter predicate to subscriptions and ensure that only matching events are delivered to the application. To optimize routing decisions, predicates may be aggregated upwards in the dissemination structure to filter messages as early as possible. How this is done depends on the routing and filter strategy, as discussed in section 5.2.

Partition: Partition strategies split the dissemination of a channel's events into two or more trees according to disjoint partition predicates over the domain of one or more attributes. This strategy type realizes BM-Routing (cf. section 5.3) by allowing events to be clustered into a bounded number of multicast trees.

Rendezvous: Rendezvous strategies provide the mechanism by which root nodes for a dissemination are determined (cf. section 5.3.2). The simplest method, for example, is a single IP address.

Delivery: Delivery strategies define the reliability of the message delivery, e.g. by the use of acknowledgements. Moreover, they determine whether multiple copies of messages are allowed. We discussed the implications in section 5.5.3.

Order: Order strategies define the order guarantees that can be given on a sequence of events, as discussed in section 5.5.4.

Timeliness: Timeliness strategies discard invalid events based on a time predicate and therefore decrease the amount of messages disseminated (cf. section 5.5.5).

Cutting strategies in the above manner reflects the separation of the different aspects of publish-subscribe systems. Routing is the basic strategy type all others "plug" into, because it defines the dissemination structure for events: all routing algorithms currently used for publish-subscribe employ an essentially tree-based dissemination structure, either constructing a minimum spanning tree or source-specific trees on a certain overlay substrate.
Some routing algorithms additionally partition the problem into more than one tree, which requires the partition strategy type to properly reflect their capabilities. For example, clustering algorithms (BM-Routing) require the partition strategy to construct the clusters. The rendezvous strategy type is constructed with respect to the redundancy of the root node in some algorithms. Together, these three strategy types cover all aspects of routing as discussed in section 5.3. To exemplify the expressiveness of this modularization, the two most different routing configurations, broker-based routing and ALM multicast, are illustrated in section 9.4.



The filter strategy type enables content-based channels. The design of the strategy type allows the integration of all approaches discussed in section 5.2: one algorithm from each class of filter algorithms has been implemented as a proof of concept, which suggests that the design also holds for future approaches. In order to allow content-based routing in this model, the filter and the routing strategy type may have some dependencies, resulting in compatibility constraints between filter and routing strategies. However, the automated configuration workflow described later respects such incompatibilities between certain strategies and excludes the corresponding configurations. The remaining strategy types reflect the different QoS aspects discussed in section 5.5 and rely on the routing abstraction of a tree structure. There exists a dependency between order and delivery strategies, as some order strategies may require a reliable event dissemination to function properly. However, two aspects of publish-subscribe have not been considered in the suggested strategies: security and persistency. These two aspects may be relevant to build fully reliable systems, but are not in the scope of this thesis, as already stated in chapter 7. Nevertheless, for all introduced strategy types a variety of strategies was implemented as part of the proof of concept to support hypothesis 3, stating that all aspects of publish-subscribe can be integrated in the proposed framework. The currently implemented strategies are described in section 13.4. Even though the strategy types are not fully orthogonal and therefore not completely recombinable, this fine-granular approach has one major benefit: it avoids redundant code. Each feature that may be required by more than one strategy is extracted into its own strategy type. The alternative approach to modularization would be to cut coarse-grained strategy types that are freely combinable, but may introduce code redundancy.
The tradeoff is thus dependencies and constraints between strategies versus code redundancy. As code redundancy introduces software-engineering issues regarding maintenance, this framework favors dependencies and constraints between strategies, especially as the configuration process is automated in chapter 11. In the following, we discuss the semantics of the seven strategy types in detail.

9.2.1 Routing

Adhering to the basic system model introduced in chapter 8, routing strategies must provide a multicast dissemination tree with a root node that may be the publisher itself or a broker node that acts as a root on behalf of the publisher. If we recall the two basic routing concepts introduced in section 5.3, we distinguish between broker topologies and



hierarchical structures, as shown in figure 9.3.

Figure 9.3: Broker-based vs. hierarchical topologies

Despite the differences in their structure,

they both rely on multicast trees to disseminate events. Broker-based topologies define logical trees depending on routing tables on each node, while hierarchical topologies organize the nodes directly into a tree structure. In other words, a routing strategy that implements a broker-based system uses an overlay like Pastry as the overlay network layer and defines logical multicast trees like Scribe does. Therefore, a multicast tree is the common structure that abstracts from the implementation of a routing strategy. Routing strategies define the logical structure of a channel's multicast trees T. On each tree-node, the strategy has to decide to which tree-nodes a message must be routed. The following aspects are defined by each routing strategy, building on Definition 13:

Definition 16:
(1) Which tree-node l^Tree ∈ L^Tree becomes root s of the tree T.
(2) Which links o ∈ O are in O_T, and therefore the height of the tree T.
(3) Which application knowledge state^t(l^App) is used for the decision on O_T.

Therefore, the routing strategies define the dissemination sequence on the overlay (Θ) required to disseminate an event to all receiving nodes in R. Each routing strategy has to provide a simple interface in order to fit into the interaction model:

Definition 17:
(1) processHeader_{l^Tree}(header_routing(e)): reads relevant information from events, updates routing tables and data structures required for individual strategies, and updates the routing header.
(2) getTargetNodes_{l^Tree}(header_routing(e)) : E → 2^{L^Tree}: determines the set of target nodes L_target that are part of the tree and the next hop for event e.
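Definition 17's two-function interface might look as follows in a toy, table-driven routing strategy. This is an illustrative sketch, not one of the prototype strategies; header fields and names are assumptions made here.

```python
# Sketch of the routing-strategy interface from Definition 17.

class TableRoutingStrategy:
    """Broker-style routing: each tree-node keeps a set of child links."""
    def __init__(self):
        self.children = {}                 # tree-node -> set of child nodes

    def process_header(self, node, header):
        # e.g. learn a new child announced in the routing header
        if header.get("join"):
            self.children.setdefault(node, set()).add(header["join"])

    def get_target_nodes(self, node, header):
        # next hops for the event: all children of the current tree-node
        return self.children.get(node, set())

r = TableRoutingStrategy()
r.process_header("root", {"join": "b1"})
r.process_header("root", {"join": "b2"})
r.process_header("b1", {"join": "s1"})
print(sorted(r.get_target_nodes("root", {})))  # ['b1', 'b2']
```

Repeated application of get_target_nodes along the tree yields exactly the dissemination sequence Θ that Definition 16 requires the strategy to define.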



This definition makes it possible to express all routing mechanisms discussed in section 5.3. The only difference is that the required features for each of them are split among the routing, filter, partition and rendezvous strategy types as well as the overlay network. Because many strategies contribute to realizing the routing behavior of existing systems, we discuss exemplary configurations in section 9.4 after the discussion of all strategies.

9.2.2 Filter

Filter strategies enable the usage of filters f^i, defined by their corresponding filter predicates F(e)^i, with the aim to restrict the delivery of unwanted events. Filter predicates are registered during the subscription process and exist in different flavors regarding their expressiveness. As discussed in section 5.2, they range from topic-based to content-based filter mechanisms. An exemplary tree with a content-based filter strategy is shown in figure 9.4.

Figure 9.4: Example for a tree with filtered attributes (schema(τ) = {x, y, region}; event e = {(x, 10), (y, 20), (region, Nuremberg)}; subscriber filters such as y > 5 and x > 5, aggregated to y > 5 ∨ x > 5 at the broker)

Each node has to maintain a filter table F_filter that contains a set of filters for each child node (l^Tree_j, {f^1, ..., f^i}). Based on this table, a filter strategy decides locally which events have to be filtered, locally delivered, or forwarded to certain child nodes. This decision is basically a refinement of the list of target nodes offered by the routing strategy. Therefore, Definition 18 specifies filterTargetNodes.

Definition 18:
(1) processHeader_{l^Tree}(header_filter(e)): processes relevant header information in order to update the local filter table.
(2) filterTargetNodes_{l^Tree}(e, L_target) : E × 2^{L^Tree} → 2^{L^Tree}: reduces the number of target nodes an event has to be forwarded to.
(3) filterDelivery_{l^Tree}(payload(e)): filters events scheduled for local delivery.
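A minimal sketch of this interface, with the filter table mapping child nodes to plain Python predicates, could look as follows. All names are illustrative assumptions, not the prototype's API.

```python
# Sketch of the filter-strategy interface from Definition 18.

class FilterStrategy:
    def __init__(self):
        self.filter_table = {}             # child node -> list of predicates

    def process_header(self, header):
        # a subscription header registers a filter for a child node
        if "subscribe" in header:
            child, pred = header["subscribe"]
            self.filter_table.setdefault(child, []).append(pred)

    def filter_target_nodes(self, event, targets):
        # refine the routing strategy's target list: forward only to
        # children with at least one matching filter
        return {c for c in targets
                if any(f(event) for f in self.filter_table.get(c, []))}

fs = FilterStrategy()
fs.process_header({"subscribe": ("s1", lambda e: e["y"] > 5)})
fs.process_header({"subscribe": ("s2", lambda e: e["x"] > 5)})
event = {"x": 10, "y": 2}
print(fs.filter_target_nodes(event, {"s1", "s2"}))  # {'s2'}
```

Note that the set returned is always a subset of the input targets, matching the signature E × 2^{L^Tree} → 2^{L^Tree}.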



Depending on the filter strategy, filters may be propagated up the multicast tree in order to optimize filtering. Moreover, the filter strategy can reduce the size of filter tables by merging or covering tests, as discussed in section 5.2.

9.2.3 Partition

A partition strategy defines how events are split over the multicast trees T^i ∈ T_c of a channel c. Therefore, it maintains a set of disjoint filters, called partition table F_partition = {f^1, ..., f^i}, that clusters the whole set of events of a channel c so that ⋃_{j ∈ F_partition} E^j = E^c.

How the clusters are calculated is specific to the strategy's implementation. Riabov discussed different ideas which can be applied to this framework in [RWY02]. A rather simple example is a static partition table that is defined at design-time and compiled into the middleware. In contrast, a grid-based cluster algorithm [RWY02] maintains a dynamic partition table at runtime which clusters subscriptions. We distinguish these two flavors of partition strategies and call them static and dynamic partition strategies. Strategies that contain an immutable partition table are static partition strategies, while dynamic partition strategies employ algorithms that cluster subscriptions and calculate the clusters dynamically. Two basic operations are performed by partition strategies: they decide which trees are affected by a subscription via an overlapping test (cf. section 5.2), and they decide on which tree a publication is disseminated. The latter decision is just an evaluation of the partition predicates: each matching predicate identifies a tree for publication. The resulting API for partition strategies is rather simple:

Definition 19:
getTrees(e) : E → T : calculates the list of affected trees for a given event.
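A static partition strategy in the sense above can be sketched as a design-time table of disjoint predicates. The class name and the example predicates are illustrative, not part of the prototype.

```python
# Sketch of a static partition strategy (Definition 19): a partition table
# of disjoint predicates, fixed at design-time, maps each event to a tree.

class StaticPartitionStrategy:
    def __init__(self, partition_table):
        # partition_table: list of (predicate, tree) pairs; the predicates
        # are assumed to be disjoint and to cover the channel's event space
        self.partition_table = partition_table

    def get_trees(self, event):
        # every matching partition predicate identifies a tree for publication
        return [tree for pred, tree in self.partition_table if pred(event)]

# Partition the domain of attribute x into two disjoint ranges.
p = StaticPartitionStrategy([
    (lambda e: e["x"] <= 5, "tree0"),
    (lambda e: e["x"] > 5, "tree1"),
])
print(p.get_trees({"x": 3}))   # ['tree0']
print(p.get_trees({"x": 10}))  # ['tree1']
```

A dynamic partition strategy would replace the fixed table by a clustering algorithm that recomputes the predicates from the current subscriptions at runtime.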

9.2.4 Rendezvous

Rendezvous strategies decide how the rendezvous process is implemented. In order to join a channel, a node must adhere to a join procedure. This procedure is coordinated by a rendezvous node. The outcome of this procedure is a node that acts as a root node for the construction of each multicast tree a channel requires. The simplest strategy is a list of nodes that are assigned in a round-robin fashion. Another possible rendezvous strategy



is the calculation of the root node based on a certain hash function, if a DHT-based overlay is used. Therefore, a rather simple API is suggested:

Definition 20:
getRoot(): returns a node that acts as a root for the construction of a multicast tree.

9.2.5 Order

The order strategy type specifies the order guarantee of a channel. With the introduction of multicast trees as the general routing paradigm, event ordering on a channel is basically answered by the research on order for ALM. Therefore, we may use algorithms as surveyed in section 5.5.4, following the three different order properties defined on multicast trees: total order, causal order and FIFO order. While total order ensures order at the receiver of events, FIFO and causal order ensure the sender's order. To relax the required order properties, especially in faulty environments, one may parametrize order strategies with a maximum waiting time and define a behavior for events that arrive out of order or not at all. This parametrization reflects the discussion on reliability in section 5.4. The waiting time specifies the upper bound in order to obtain a partially asynchronous system, while the decision between out-of-order delivery and dropping models the decision between safety and liveness of the order guarantee.

9.2.5 Order The order strategy type specifies the order guarantee of a channel. By the introduction of multicast trees as the general routing paradigm, event ordering on a channel is basically answered by the research on order for ALM. Therefore, we may use algorithms as surveyed in section 5.5.4, following the three different order properties defined on multicast trees: total order, causal order and FIFO order. While total order ensures order at the receiver of events, FIFO and causal order ensure the sender’s order. To loosen up the required order properties, especially in faulty environments, one may parametrize order strategies with a maximum waiting time and define a behavior after waiting for events to arrive out of order or not at all. This parametrization reflects the discussion on reliability in section 5.4. The waiting time specifies the upper bound in order to receive a partial asynchronous system, while the decision between out of order delivery and dropping models the decision between safety or liveliness of the order guarantee. Order strategies must be able to buffer messages in order to delay their delivery if they arrive out of order. Such a buffer is a set E buf f er,T that contains events, specific to each tree T . Moreover, each strategy introduces some sort of sequencing attribute aseq that contains the sequence number of an event. This may be a timestamp or a simple number, depending on the strategy. A function seq(e) returns this attribute. An event may be delivered to the application if seq(e) is exactly one step larger than seq(e − 1). This is easy for a discrete Lamport clock, but more difficult if timestamps are used. Solutions to this decision are discussed in surveys like [DSU04]. Classes of such algorithms are discussed in section 5.5.4. The following API can be defined to implement such order strategies: Definition 21: (1) processHeaderlT ree (headerorder (e)): processes header information and updates



private data structures.
(2) receive_{l^Tree}(e): tries to deliver an event e; if it is not in sequence, it is added to the message buffer E^{buffer,T}.
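The buffering behavior of receive can be sketched for the simplest case, FIFO order over a discrete sequence counter. This is a minimal illustrative sketch; names are assumptions.

```python
# Sketch of an order strategy per Definition 21: a buffer E^{buffer,T}
# delays out-of-sequence events until their predecessors have arrived.

class FifoOrderStrategy:
    def __init__(self):
        self.expected = 0        # next sequence number to deliver
        self.buffer = {}         # E^{buffer,T}: seq(e) -> event
        self.delivered = []

    def receive(self, event):
        self.buffer[event["seq"]] = event
        # deliver as long as the next expected event is available
        while self.expected in self.buffer:
            self.delivered.append(self.buffer.pop(self.expected))
            self.expected += 1

o = FifoOrderStrategy()
o.receive({"seq": 1, "p": "b"})   # out of order: stays buffered
o.receive({"seq": 0, "p": "a"})   # releases both seq 0 and seq 1
print([e["p"] for e in o.delivered])  # ['a', 'b']
```

A relaxed variant as described above would additionally start a timer per buffered gap and, on expiry, either deliver out of order (liveness) or drop the late event (safety).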

9.2.6 Delivery

Delivery strategies implement mechanisms ensuring the delivery of events on a channel. For example, a simple ACK mechanism represents such a strategy. Delivery strategies therefore provide the following QoS guarantee: ∀e ∈ E^c : ∀r ∈ R^c : ∃Θ_k(msg(e)) | l^Net_k = r. That means that for every event e on the channel c there exists a dissemination sequence Θ_k(msg(e)) ending in each receiver r of channel c. Of course, such a guarantee can be costly in terms of latency and bandwidth consumption.

9.2.7 Timeliness

A timeliness strategy works like a temporal filter, using the event's timestamp timestamp(e) and the node's application time t to evaluate the predicate Valid, allowing delivery only if Valid^t(e). Such a timestamp can be based on a logical or a physical clock and depends on the implementation of the framework. Timeliness restrictions on events allow for optimizations by discarding invalid events as early as possible during the dissemination process. This reduces the number of messages.
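A timeliness predicate of this kind can be sketched in a few lines; the maximum-age parameter and all names are illustrative assumptions.

```python
# Sketch of a timeliness strategy: a temporal filter discarding events
# whose validity window has expired relative to application time t.

def valid(event, t, max_age):
    # Valid^t(e): the event is still within its validity window
    return t - event["timestamp"] <= max_age

events = [{"id": 1, "timestamp": 10}, {"id": 2, "timestamp": 95}]
t = 100
fresh = [e for e in events if valid(e, t, max_age=10)]
print([e["id"] for e in fresh])  # → [2]
```

Applying this check on every intermediate hop, not only at the subscriber, is what realizes the early-discard optimization mentioned above.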

9.3 Interaction Model

The interaction model, in Fischer [FHL11] also called the processing model, defines the interaction between the different levels of abstraction, namely channels and trees, and the strategy types that configure their behavior. Described in terms of the Strategy pattern [GHJV94], the interaction model defines the context for the strategies. Informally speaking, strategies describe how a certain task is done, while the interaction model defines when and where it is done.

Message Types

This model distinguishes five fundamental message types. Each of them represents a workflow required for publish-subscribe. They mirror the API of the system with the



addition of control messages. Figure 9.5 shows the flow of the different message types in the tree. Control messages are used for strategy-specific communication and therefore their flow is dependent on the configuration.

Figure 9.5: Message type responsibilities

Even in this figure, the message flow of the subscription process may vary according to the routing strategy; here, the workflow of SpreadIt is depicted. Basically, all messages travel towards the root node. Their exact responsibilities can be described as follows:

Publish: Publish messages represent the publication of an event and travel to the root of the multicast tree. Depending on the routing strategy, they may travel directly to the root, as shown in figure 9.5, or they may travel the whole tree upwards to the root, as defined for hierarchical routing. If the publisher is the root of the tree, e.g. when broker-based routing is employed, no publish message is sent and notify messages are generated directly.

Notify: Notify messages travel from the root node to all interested subscribers. They represent the notification of the event and are duplicated depending on the number of children a node has.

Subscribe: Subscribe messages express an interest in notifications, constrained by a filter expression. Such a request can only be accepted or denied by the routing strategy. If a subscribe message is denied by a node, a better candidate node for a subscription may be returned by the routing strategy. Such a return message must be a control message, because other strategies, like the filter strategy, react to subscribe messages and update their data structures. Figure 9.5 shows such a redirection. If a subscription is accepted, the routing tables, filter tables etc. are updated.



Unsubscribe: Unsubscribe messages represent the inverse operation to subscriptions. They travel to the parent node of the subscriber and deregister the interest of the node. Updates of routing tables, filter tables etc. are strategy-specific and are spread as control messages.

Control: Control messages carry strategy-dependent information and are, for example, used to distribute routing table updates because of a leaving node. Every strategy is able to employ control messages for information exchange, and different implementations require different amounts of information exchange. Therefore, the control overhead of a configured framework depends on the composition of the strategies.

Table 9.1 shows the dependencies between publish-subscribe operations, respectively their message types, the strategies, and the underlying KBR methods. Each publish-subscribe message type is disseminated using some of the KBR methods discussed in section 4.2. The mapping is trivial, except that notify and control messages do not explicitly require route, because they are the result of the deliver processing. Moreover, publish and notify messages are not processed during forward operations, because they do not influence strategies on intermediate nodes. All other operations may be influenced by forward calls. For example, Scribe is a routing strategy that heavily relies on the manipulation of routing tables on intermediate nodes during subscriptions and unsubscriptions.

Table 9.1: Matrix for strategy application in the interaction model, based on [FHL11]

The strategy types themselves configure the behavior of processing in KBR methods, depending on the publish-subscribe message type. Their exact roles in the dissemination



of the different message types is discussed in the following paragraphs. We begin with abstract processes that define the workflows for the interaction of strategies. They are split into routing, forward and delivery. All processing steps are illustrated using UML activity diagrams. We omit a strict formal specification of the interaction model, because, to the author’s believe, UML diagrams are easier to follow, without loosing too much precision. We distinguish two types of activities: fixed activities and hook activities. Fixed activities are actions that are performed by one predefined strategy and are not easily changed or extended, because they fundamentally control the process. Hook activities, however, are lists of strategy hooks that can easily be extended if a new strategy type is added. Strategy hooks may depend on the success of each other, but may not exchange data directly. The list of strategies plugging into each hook activity depends on the processed message type. A coarse overview of the association of strategies and message types is shown in table 9.1. For a concrete implementation of this abstract interaction model, limited to the strategy types introduced in section 9.2, refer to the discussion on the prototype in part IV. Routing Routing of a message occurs if an API operation like publish, subscribe or unsubscribe is called. A message has to be generated, initialized, and handed over to the overlay network. Figure 9.6 illustrates this generic process. As the API methods are invoked on channels, the affected trees must be identified. This is the responsibility of the partition strategy. For each tree the dissemination workflow, shown in figure 9.7, must be applied. Tree [more trees]

Dissemination

Channel Identify trees [no more Tree]

Figure 9.6: Abstract routing process
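The distinction between fixed and hook activities can be sketched in code. The following is an illustrative sketch, not the framework's actual API: hook lists are keyed by message type, may veto further processing by returning False, and do not exchange data directly.

```python
from collections import defaultdict

class HookActivity:
    """An extensible list of strategy hooks, keyed by message type."""
    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, message_type, hook):
        self._hooks[message_type].append(hook)

    def run(self, message):
        # Later hooks may depend on the success of earlier ones, but hooks
        # do not exchange data directly; a hook returning False aborts.
        for hook in self._hooks[message.type]:
            if hook(message) is False:
                return False
        return True

class Message:
    def __init__(self, msg_type, header=None):
        self.type = msg_type
        self.header = header or {}

# A pre-target hook of a hypothetical timeliness strategy sets a default TTL.
pre_target = HookActivity()
pre_target.register("PUBLISH", lambda m: m.header.setdefault("ttl", 8))
```

A fixed activity would call `pre_target.run(msg)` at a predefined point of the process, while a newly added strategy type only has to `register` additional hooks.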

The dissemination process initializes the header of a message and routes it to one or more targets. First, strategies may register pre-target hooks to configure message headers that apply to all targets. Afterwards, the targets are identified; this is typically done by the routing strategy, based on its routing table. For each target, target-filter hooks are executed, allowing for the application of filters, for example by the filter or timeliness strategy type. All matched messages are then manipulated by the post-filter hooks, where strategies may apply target-specific header configurations. This is the final step before the message is handed over to the overlay network.

Figure 9.7: Abstract dissemination process

9 A Design-Time Configurable Publish-Subscribe Framework
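The dissemination workflow can be sketched as a loop over the identified targets. All names below are illustrative assumptions, not the framework's real interfaces:

```python
# Pre-target hooks configure the common header, the routing strategy
# identifies targets, target-filter hooks may drop a target, and
# post-filter hooks apply per-target headers before handing over.
def disseminate(message, routing_strategy, pre_target_hooks,
                target_filter_hooks, post_filter_hooks, overlay):
    for hook in pre_target_hooks:          # headers common to all targets
        hook(message)
    for target in routing_strategy.identify_targets(message):
        if not all(f(message, target) for f in target_filter_hooks):
            continue                       # filtered out for this target
        for hook in post_filter_hooks:     # target-specific headers
            hook(message, target)
        overlay.route(message, target)

class StaticRouting:
    def __init__(self, table): self.table = table
    def identify_targets(self, message):
        return self.table.get(message["channel"], [])

class RecordingOverlay:
    def __init__(self): self.sent = []
    def route(self, message, target): self.sent.append(target)

overlay = RecordingOverlay()
disseminate({"channel": "chat", "ttl": None},
            StaticRouting({"chat": ["n1", "n2", "n3"]}),
            pre_target_hooks=[lambda m: m.__setitem__("ttl", 8)],
            target_filter_hooks=[lambda m, t: t != "n2"],  # e.g. a filter strategy
            post_filter_hooks=[lambda m, t: None],
            overlay=overlay)
assert overlay.sent == ["n1", "n3"]
```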

Forwarding

The forwarding process is invoked on intermediate hops during routing, if supported by the overlay network and specified by the KBR API. Figure 9.8 shows the associated workflow. When a message is received during forwarding, strategies may register forward hooks to update their internal data structures, for example to optimize routing paths or to enhance filter tables. Afterwards, a destination check is performed, typically by the routing strategy, which decides whether the current node is the final destination and the routing should be ended prematurely. If this is the case, the forward finalize hooks are executed; such hooks may, for example, acknowledge the delivery or update data structures of the order strategy. If the current node is not the final node, the message is handed back to the overlay network for further processing.

Figure 9.8: Abstract forwarding process

Delivery

The delivery process is performed on the destination node of a routed message. It may result in a delivery of the message to the local application process or in the generation of a new, related message that is disseminated further. Figure 9.9 illustrates this process.

Figure 9.9: Abstract delivery process

The process begins with a message that arrives through the overlay network. First, the pre-elimination hooks are applied; the hooks registered here are executed for every arriving message, regardless of whether it will be eliminated in the next step or not. The duplicate elimination itself is usually performed by the delivery strategy, if configured appropriately. If the message is accepted, the post-elimination hooks are executed; strategies register here if they are sensitive to duplicate messages. These two hooks represent the processing of incoming messages. The next step, the destination check, determines whether further messages have to be routed as a reaction. This check is typically implemented by the routing strategy type. If the current node is not the final node, the dissemination process is applied, as shown in figure 9.7. The last step is the local delivery check, also performed by the routing strategy. If a local delivery is required, the corresponding process is applied, as defined in figure 9.10.

Figure 9.10: Abstract local delivery process

The local delivery process executes the local-filter hooks, ensuring that only valid messages are delivered. Usually, timeliness and filter strategy checks are registered here. If the filters match, the post-local-filter hooks are applied before the message is delivered; these typically include order or delivery strategy hooks.
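The delivery steps up to the local delivery can be sketched as follows. The id-based duplicate elimination is only one possible policy, and all names are illustrative, not the framework's actual API:

```python
# Pre-elimination hooks see every message; the delivery strategy eliminates
# duplicates; only accepted messages reach post-elimination hooks and the
# local filter checks before being handed to the application.
class DeliveryStrategy:
    """Duplicate elimination based on message ids (one possible policy)."""
    def __init__(self): self.seen = set()
    def accept(self, msg):
        if msg["id"] in self.seen:
            return False
        self.seen.add(msg["id"])
        return True

def deliver(msg, delivery, pre_hooks, post_hooks, local_filters, app):
    for h in pre_hooks:                    # run for every arriving message
        h(msg)
    if not delivery.accept(msg):           # duplicate elimination
        return False
    for h in post_hooks:                   # only for accepted messages
        h(msg)
    if all(f(msg) for f in local_filters): # e.g. timeliness, filter strategy
        app.append(msg)
        return True
    return False

app, delivery, seen_count = [], DeliveryStrategy(), [0]
args = dict(delivery=delivery,
            pre_hooks=[lambda m: seen_count.__setitem__(0, seen_count[0] + 1)],
            post_hooks=[], local_filters=[lambda m: m["valid"]], app=app)
assert deliver({"id": 1, "valid": True}, **args) is True
assert deliver({"id": 1, "valid": True}, **args) is False  # duplicate eliminated
assert seen_count[0] == 2   # pre-elimination hooks ran for both messages
assert len(app) == 1
```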

9.4 Configuration

In the previous sections, we discussed the static and dynamic design of a design-time configurable framework for a notification service. We now briefly discuss the manual configuration of the framework, using the scenario of MMVEs. The workflow, however, can be applied generally, not only to the development of games.

Figure 9.11: Coarse configuration workflow (design-time: specification of application semantics, configuration of the middleware, compilation; production: deployment, usage)

Figure 9.11 shows the coarse workflow a developer would follow in order to use a design-time configurable framework. The specification of application semantics is, in general, the game design; for the configuration of a notification service, however, only the event semantics are relevant. Once this specification has been conducted, informally or formally, a configuration has to be written. A configuration is a composition of strategies, one of each strategy type. This includes the parametrization of each strategy, as well as the definition of each event type's schema. After this composition has been declared, the configuration may be used to compile the middleware and to deploy it afterwards.

The benefits of a design-time configuration workflow lie in the production phase: errors can be detected during compilation, if the framework is properly implemented. Moreover, the compiler can be employed to optimize the product. The drawback of such configurable frameworks, however, lies in the complexity of finding an optimal configuration. This can be done by analyzing the respective publications for each available strategy and choosing the best-fitting one, but it requires significant expert knowledge. The developer not only has to compare the QoS requirements of each single event type with the guarantees that are given by the strategies, he also has to estimate the performance characteristics of each strategy alone, as well as the overall performance of a certain composition. In the following, we illustrate the complexity on the one hand, but also the flexibility of the proposed framework on the other hand, using two contrasting configurations.

Example Configurations

In section 5.7, a taxonomy of existing systems was suggested. In this taxonomy, broker-based architectures like REBECA, Hermes, or Gryphon were described. These architectures pose a harsh contrast to another category in the same taxonomy: rendezvous-based systems like Scribe, Bayeux, or classical ALM systems like SpreadIt. Therefore, we pick these two contrary configurations to discuss the flexibility of the framework. An exhaustive discussion can be found in the evaluation in part V.

Broker-based Routing with Content-based Filters

Systems that employ broker-based routing and content-based filters are common among publish-subscribe systems. Hermes, for example, employs an enhanced Pastry as its overlay network and defines channels using multicast trees that work like Scribe. Moreover, it employs a content-based filter scheme. This is easily configurable in the terms of the proposed framework: Pastry is configured as the overlay network, Scribe is used as the routing strategy, and content-based filtering is configured by using a strategy that is able to process conjunctive filters. No partition strategy is required, because partitions are not supported by Hermes. The rendezvous strategy calculates the root of the tree from the topic of the channel, just as in Hermes.

REBECA, on the other hand, employs a static overlay network and defines common routing tables on each node. The overlay is an acyclic graph, making it easy to maintain routing tables. The root of the multicast tree for each event type is the publishing node.
This fact can easily be represented by a rendezvous strategy. Because the underlying overlay is static, the mapping between nodes and tree roots can also be a static map. To build the trees, a routing strategy that uses one routing table for all trees can be configured, for example by calculating the minimum spanning tree over the overlay nodes. In addition, a content-based filter strategy is configured that uses conjunctive filters and some optimizations like covering and merging. Other strategies that configure QoS guarantees may be configured as the developer sees fit, even if they are not part of the original system's capabilities.
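To make the contrast concrete, the two compositions can be written down as declarative configurations. The strategy and parameter names below are invented for illustration; they are not the prototype's actual configuration syntax:

```python
# Hypothetical declarative compositions for the two discussed systems.
hermes_like = {
    "overlay":    {"strategy": "pastry"},
    "routing":    {"strategy": "scribe"},
    "rendezvous": {"strategy": "hash-of-topic"},   # tree root from channel topic
    "filter":     {"strategy": "content-based", "filters": "conjunctive"},
    "partition":  None,                            # not supported by Hermes
}

rebeca_like = {
    "overlay":    {"strategy": "static-acyclic-graph"},
    "routing":    {"strategy": "shared-table", "tree": "minimum-spanning-tree"},
    "rendezvous": {"strategy": "static-map"},      # root = publishing node
    "filter":     {"strategy": "content-based", "filters": "conjunctive",
                   "optimizations": ["covering", "merging"]},
    "partition":  None,
}

def validate(config, required=("overlay", "routing", "rendezvous")):
    """A composition must name a strategy for every mandatory type."""
    return all(config.get(t) for t in required)

assert validate(hermes_like) and validate(rebeca_like)
```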


ALM-based Systems

ALM-based systems provide only dissemination for groups of event types. We discussed some ALM algorithms in section 5.3.4. In this case, a channel can be configured for each group, matching exactly the requirements of that group. To configure such a system, for example a simple tree-first SpreadIt system, the overlay network is just a TCP or UDP network with a SpreadIt routing strategy and without any filter strategy. One may also configure a classical client-server system with an appropriate routing strategy.

9.5 Summary

In the previous chapter, we discussed a reference architecture that answers the question how the different aspects of publish-subscribe systems, as they have been introduced in part II, can be cut into configurable strategies. The suggested framework has shown that it is possible to find a suitable fine-grained modularization to integrate the aspects of existing notification services, as required by hypothesis 3. The interaction model described how the processing of messages works for the introduced strategy types. In section 9.4, we briefly discussed configuration and informally exemplified two existing systems in terms of the introduced framework. More advantages and disadvantages, as well as possible extensions, will be discussed in part V during the functional evaluation.

However, the configuration space that is spanned by seven strategy types with their huge number of potential implementations poses a difficult task: how can a suitable configuration be found for a certain application scenario? This is especially challenging as some of the strategies may have dependencies on other strategies or exclude them. It is obvious that a manual decision requires experts who know the inner workings of each available strategy as well as the effects of their different compositions. This challenge is addressed in the next two chapters: we first discuss a configuration model that is motivated by application semantics in chapter 10. With such a configuration model at hand, a gap forms between the semantic description on the application level and the composition of strategies on a technical level. An automated workflow that closes this gap by deducing a suitable composition from a semantic description will be discussed in chapter 11.


10 | QoS-Aware Configuration

When we talk about QoS-aware configuration, an application-driven viewpoint for the specification of compositions is assumed. In contrast to the configuration on a technical level, as required for the framework introduced in chapter 9, the application developer with his demands stands in the center of attention. Especially since such a configuration usually requires expert knowledge, a more user-friendly solution should be achieved that covers the most typical configurations. If special requirements must be considered, the workflow should still allow an expert to influence and fine-tune certain configuration parameters.

If we recall the configuration workflow in figure 9.11, the specification of application semantics is what is formalized in this chapter. Such a formalized description of application semantics is required to support the automatic derivation of the middleware configuration. In this chapter, we discuss a configuration model that is designed for easy usage by non-experts, driven by the language of the application domain. In section 10.1, we discuss event semantics and QoS in a domain's context, namely MMVEs. This informal discussion aims to illustrate and exemplify how event semantics may be exploited to generate a QoS-aware configuration. In section 10.2, a model is suggested that enables the design of a domain-specific language for the QoS-aware configuration of notification middleware.

10.1 Event Semantics

Most developers follow a project- or domain-specific terminology and describe their requirements in the terms common to their domain. This often leads to the challenge of translating between two terminologies, i.e., the developer's and the middleware's terminology.


In event-based systems, we subsume this challenge under the question of how to grasp the developer's notion of event semantics, as the requirements for such systems are mostly motivated by the semantics of the processed events. In [FDI+10] and [FL10a], the author exemplifies the exploitation of event semantics for configuration in the context of MMVEs. We will follow this work and informally discuss event semantics for MMVEs to extend the use cases suggested in section 3.3.

Events used in MMVEs feature many exploitable characteristics, like a spatial context in which the event is valid. Such context information has already been exploited for semantic routing concepts (cf. section 5.3.3). Another relevant aspect to consider is the frequency of an event type. Position updates, for example, have a high frequency, and only the most recent event is valid, as it invalidates all previous events. In contrast, the pickup of an item is a unique event which occurs only from time to time [FDI+10]. Aspects that have been discussed in the context of QoS (cf. section 5.5) also pose semantic properties a developer has to think about when designing an application.

In the current literature, semantics are represented either in the form of ontologies, if the relationships between entities should be expressed (cf. section 5.2.5), or by simple parameter-value pairs, as discussed in the context of QoS (cf. section 5.5). Both forms of expression serve different purposes: ontologies should grasp complex dependencies between terms in order to deduce more meaning from filter expressions. Nevertheless, the corresponding filter expressions are still formulated as parameter-value pairs; they are only interpreted with the help of ontologies. Therefore, we stick to the current consensus in research and formulate event semantics as parameter-value pairs.

Design-time vs. Run-time Semantics

Event semantics, in the form of QoS guarantees or additional information that describes an event type further, can be provided at two points in the development workflow: at design-time or, after deployment, at run-time. At design-time, the semantic annotations provide additional information for the configuration process and may be exploited for a better configuration decision, i.e., to choose a more suitable algorithm for a described context. At run-time, such annotations help, for example, routing algorithms to prioritize certain messages in order to meet QoS requirements that are issued per subscription.

In this work, we limit the discussion to the exploitation of semantics at design-time, which does not mean that the suggested approach cannot be extended to support run-time semantic extensions. Both aspects are orthogonal to each other, as the design-time


annotations lead to the decision which strategies are chosen for the configuration, whereas the run-time annotations influence the behavior of a strategy at run-time.

If we take a look at the relevant semantics of MMVEs, including QoS, one can notice many facets that can be seen as dimensions. In the context of data quality and QoS, the multidimensionality of quality aspects is well known (cf. Kritikos [KCP+13]). Based on previous work of the author [FDI+10], this multidimensionality is exemplified for MMVEs in the following section.

10.1.1 Dimensions

The semantics of each event type encompass, e.g., order requirements, the area-of-interest (AoI), the relationship to other events, or the context in which this event is valid. All these different semantic properties should be structured for easy usage, which is not the case if mere parameter-value pairs are used. In order to describe event types in an intuitive, user-friendly way, a classification seems a suitable choice. Hence, a multidimensional classification with orthogonal dimensions has been outlined in [FL10a] in order to model independent aspects of the event semantics. Disjoint characteristic classes should be found for each dimension, at best in a terminology a developer can relate to. The terminology is highly domain-specific, which should be respected by the resulting classification. An event type has to be assigned to one class along each dimension, which allows for the deduction of the associated parameter values for that class. Based on this scheme, the class of an event type is defined as the sum of its characteristics along each dimension. The power of such a multidimensional class space is that it enables the configuration of each event type along each dimension with a different strategy, in order to gain a better configuration of the system.
The dimensions proposed below are derived from the analysis of MMVEs and may not exhaustively address all possible semantic properties events can adopt, but they cover those relevant for the automatic derivation of a composition of strategies, as discussed in chapter 11.

10.1.1.1 Context

Each event in an MMVE has a certain context in which it is relevant. We already discussed this fact for the spatial context under the term AoI in section 5.3.3. Additionally, the context may be social or defined by certain attention metrics, as in Donnybrook [BDL+08]. In general, the context of an event in an MMVE reduces its recipients to a


certain subset. For the context dimension, the following exemplary classes could be defined, according to [FDI+10]:

single-target: Obviously, an event with only one recipient may either be delivered directly, which is not an option for a notification middleware, or be distributed over a channel with the appropriate filters, so that the routing does not generate unnecessary transmissions. An example is a private chat message between two players or the event "avatar A gives avatar B item X".

multi-target: Events with a multi-target context have a defined set of explicit recipients, for example a chat message to a group of participants who have a certain relationship in the virtual world, or an event whose recipients are deduced by certain metrics, for example all avatars which are targetable by one's cross-hair in a certain time interval.

spatial: An event with a spatial context is only relevant to a subset of recipients limited by spatial constraints. This class is a special case of the multi-target class; the distinction is necessary due to the specially optimized delivery a spatial context allows. An example of an event with a spatial context is the pickup of a flower in the virtual environment: only clients in visual range need to be notified of such an event. Zolotorevsky [ZER09] discusses spatial contexts more generally and defines different types of spatial contexts: fixed location, reference-object-based space, or boundary function. In these terms, the AoI of a moving avatar would pose a reference-object-based space, while a flower has a fixed-location AoI.

global: An event which is broadcast to all clients of the virtual world, without any restriction, has a global context. These events are distributed without any optimization.

These classes influence the values of the QoS attributes that characterize an event-based application. For example, the single-target class implies many potential publishers and many subscribers, but a high selectivity for subscriptions.

The frequency of occurrence is not modeled in these classes, but it could be added, e.g., as a velocity parameter, or it could be explicitly modeled as an attribute of the event type.

10.1.1.2 Synchronization

Some event types have certain temporal or causal interdependencies and therefore require synchronization. For example, a position update may have no synchronization


requirements. Due to its high update rate, one out-of-order event only results in a small glitch of the avatar, which is not nice to look at, but certainly tolerable for most MMVEs. For this dimension, certain levels of synchronization have already been discussed in the context of order guarantees in section 5.5.4.

Other event types do not need synchronization either. For example, chat messages do not need an ensured order, as their order is not essential to the operation of the chat service; therefore, no order is an essential class. Event types like "open chest" and "pickup content" model a race for resources, in which only one player, the fastest, should be able to win. In this case, the actions of any two different users must be processed on all nodes in a globally fixed order, which leads to the requirement for total order. Other order guarantees like causal or FIFO order may also be sufficient, if the overhead for total order is too expensive. So, basically, the classes for order represent the different guarantees that may be given. An additional property is how late events are handled: they may be thrown away to ensure the safety of the order guarantee, delivered anyway to ensure liveliness, or compensated, if the application supports it. A generic classification based on these properties will be discussed later, during the discussion of a model formalizing the multidimensional classification (cf. section 10.2.1).

10.1.1.3 Validity

Whilst synchronization describes the order of events, validity is strictly limited to one event and models a temporal context for which the event is valid. Consider, for example, position events with a high velocity: a short validity can be specified, because a delayed event can be discarded, as it is probably already out of order and replaced by a newer one.

We distinguish three basic characteristics of validity:

interval duration: Events of this type may be valid for a certain time interval, independent of their sending or receive time. For example, a certain action triggers an effect for three minutes. With interval validity, only one event is needed.

progress: The type of clock that is used to progress time for the duration interval can be either time-triggered or event-triggered.

unlimited: Unlimited validity means that this event type has a permanent impact on the virtual world. Events of this type must not be lost and therefore have to be delivered.
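A validity characteristic of the interval kind, with wall-clock progress, might be sketched as a time-dependent predicate (names and representation are illustrative assumptions):

```python
# Validity as a predicate over event time and current time.
def interval_validity(duration_s):
    def is_valid(event_time_s, now_s):
        return now_s - event_time_s <= duration_s
    return is_valid

def unlimited_validity():
    # Permanent impact on the virtual world: never discarded.
    return lambda event_time_s, now_s: True

pos_valid = interval_validity(duration_s=0.5)  # short validity for position updates
assert pos_valid(event_time_s=10.0, now_s=10.3) is True
assert pos_valid(event_time_s=10.0, now_s=11.0) is False
assert unlimited_validity()(0.0, 1e9) is True
```

An event-triggered progress would replace the wall-clock difference by a logical clock, i.e., a counter of events seen since the event was published.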


10.1.1.4 Security

A secure event is not tampered with and represents the initial event. Especially in distributed MMVE architectures, it is important to at least detect cheating clients. Prevention would be better, but in most cases it is too expensive to guarantee cheat-free operation. Nevertheless, security as a relevant dimension is not detailed further in this work, because we assume non-malicious nodes in the distributed system, as introduced in chapter 7. However, the extension of the following model and automation workflow with security strategies and the corresponding dimension will be sketched as future work in section 19.1.2.

10.1.2 Examples of Event Semantics

Based on the informal analysis of MMVEs and the identified dimensions with their classes, we take a look at our initial scenario and the use cases defined in section 3.3. We extend the use cases by a classification according to the informal notion of the dimensions:

Movement: Events have a spatial context, no synchronization, and a short time-based interval validity.

Target action: Events have a spatial context, total throw-away synchronization, and unlimited validity.

Chat: Events have a multi-target context, no synchronization, and unlimited validity.

Match coordination: Events have a global context, total deliver-anyway synchronization, and unlimited validity.

These examples show the variety of optimization potential each event type has and indicate the adequateness of a multidimensional model. In the following, we formalize this informal notion of event semantics into a model for the design of a DSL for QoS-aware configuration.
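Expressed as the parameter-value pairs argued for above, the four use-case classifications might look as follows. The representation is an illustrative assumption, not the concrete syntax of the configuration model:

```python
# Dimension and class names are taken from the text; the dict encoding
# is invented for illustration.
USE_CASES = {
    "movement":           {"context": "spatial",      "synchronization": "none",
                           "validity": "interval"},   # short, time-based
    "target_action":      {"context": "spatial",      "synchronization": "total/throw-away",
                           "validity": "unlimited"},
    "chat":               {"context": "multi-target", "synchronization": "none",
                           "validity": "unlimited"},
    "match_coordination": {"context": "global",       "synchronization": "total/deliver-anyway",
                           "validity": "unlimited"},
}

DIMENSIONS = ("context", "synchronization", "validity")
# Each event type is assigned exactly one class per dimension.
assert all(set(c) == set(DIMENSIONS) for c in USE_CASES.values())
```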

10.2 Configuration Model

A configuration model defines the constraints and artifacts required for the description of an application and its automated configuration. Moreover, the relevant information about each strategy that is required for the configuration decision should be expressible in this model.


The challenge addressed by this model is to define a DSL for configuration that can be tailored to an arbitrary domain, providing the terminology this domain requires, without changing the basic system model. Figure 10.1 illustrates the different phases of

Figure 10.1: Overview of the configuration model (middleware development: strategy descriptions, written by the middleware developer; domain modeling: domain profile and network profile, written by a domain expert; application development: event type descriptions, written by the application developer)

a configuration, which is reminiscent of a model-driven engineering process [Sch06]. For each phase, a different skill is relevant for the specification of the respective artifacts. This makes it easy to split the responsibilities for each task among specialized developers.

In the first phase, during middleware development, a strategy description is written for each strategy. These descriptions include the type of the strategy, its configuration parameters, and its dependencies on other strategies. For the specification of those descriptions, a middleware developer who knows the inner workings of the framework is best suited.

The second phase¹ takes place once for each domain the middleware should be used in. A domain expert writes the network profile, which contains the relevant attributes describing the network the application will later be deployed on. For example, most MMVEs are deployed on one data center for each continent, which leads to an internet topology, among other network characteristics, in the network profile (cf. section 4.3). In addition, the multidimensional classification for the event types must be specified in the terminology of the domain. This leads to the domain profile, which contains the classes with their mapping to the attributes of the underlying system model.

¹ It is possible during this phase that some information has to flow back to the strategy descriptions. This is the case if some special parameters are required to annotate strategies for a certain domain.


The third phase is the application development itself and produces a description of each event type, including its schema, its classification according to the domain model, and explicit QoS requirements. This description process can be performed in the terminology of the application's domain and therefore requires neither internal knowledge of the middleware nor the initial effort of learning a generic classification.

In the following sections, we will discuss each of the artifacts and the general meta-model behind the different descriptions: the multidimensional application classification.

10.2.1 Multidimensional Application Classification

The multidimensional application classification poses the core of this configuration model and realizes Requirement 1.1. Figure 10.2 shows the abstractions introduced by the classification. Starting from a system model, which we will discuss in section 10.2.2, parameters are the basic building blocks of the classification. Each class is defined by one or more parameters. Parameters are instances of attributes defined in the system model and can therefore be used for direct measurement during the automated workflow. Parameters can also be derived from existing parameters by transformation rules. A dimension itself consists of disjoint classes. In order to be able to map dimensions to different strategy types, mapping rules are defined for each dimension; they define under which constraints a dimension is mapped to which subset of the strategies a strategy type allows.

Figure 10.2: Abstraction of the multidimensional application classification (entities: Parameter, Attribute, Dimension, Class, Selection Rule, m-dimensional Classification)

This flexible meta-model provides the entities to model domain-specific classifications for an easy configuration of event types. Extending the basic reference model on the


application layer, a multidimensional event-type classification Dτ is an extension of the definition of an event type τ ∈ U:

Definition 22:
(1) An m-dimensional classification of an event type τ is defined by an m-tuple: Dτ ∈ D1 × … × Dm.
(2) A Dimension D is defined as a set of disjoint classes classD. In a classification Dτ of an event type τ, each dimension is represented by exactly one class.
(3) A Dimension D moreover defines a set of rules RD for the selection of a strategy type or of specific strategies as a mapping target.
(4) A Dimension D is described by a set of parameters ΛD.
(5) A class classD within a dimension D is specified as an n-tuple of the parameters that define the dimension D: classD ∈ λ1 × … × λn with λi ∈ ΛD and |ΛD| = n.

Definition 22 allows for an extensible multidimensional classification model. Each event type is characterized by assigning it to a set of classes. Each class in a certain dimension may be described by a k-tuple, with k being the number of parameters needed to define that class. An event type is described by exactly one class in each dimension. Additionally, each dimension defines rules for the mapping to a certain strategy type. This association between dimensions and strategies is required for the configuration workflow described in chapter 11.

Definition 23:
(1) The multidimensional classification of an application A is called application classification DA and is the set of the application's event-type classifications. It is defined by: DA := ⋃τi∈U Dτi.
(2) An instance of a multidimensional classification model is referred to as Ψ. Each instance defines a specific number of dimensions, the concrete parameters ΛD for each dimension, and the number of classes classD for each dimension.
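Definitions 22 and 23 can be sketched operationally: each event type is assigned exactly one class per dimension, and the application classification collects all event-type classifications. The class and dimension names below are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dimension:
    name: str
    classes: frozenset  # the disjoint classes of this dimension

def classify(event_type, assignment, dimensions):
    """Build D_tau: exactly one valid class per dimension (Definition 22)."""
    for dim in dimensions:
        if assignment.get(dim.name) not in dim.classes:
            raise ValueError(f"{event_type}: no valid class for dimension {dim.name}")
    return (event_type, tuple(assignment[d.name] for d in dimensions))

context  = Dimension("context", frozenset({"single-target", "multi-target",
                                           "spatial", "global"}))
sync     = Dimension("synchronization", frozenset({"none", "total"}))
validity = Dimension("validity", frozenset({"interval", "unlimited"}))
dims = (context, sync, validity)

# D_A: the set of the application's event-type classifications (Definition 23)
D_A = {
    classify("movement", {"context": "spatial", "synchronization": "none",
                          "validity": "interval"}, dims),
    classify("chat", {"context": "multi-target", "synchronization": "none",
                      "validity": "unlimited"}, dims),
}
```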


Definition 23 describes the notion of an application classification. It also defines the concept of instances to provide multiple classifications. This aspect allows the classification of event semantics in a domain-specific way. Such an instance is exemplified in the following section.

Generic Instantiation of the Classification

Based on the informal discussion for MMVEs, a multidimensional classification for generic purposes can be instantiated. With respect to the discussion and the introduced limitations of this work, the classification schema for further discussion consists of three dimensions: Context, Synchronization, and Validity. They are, in the author's opinion, the most important ones and should suffice to illustrate the expressiveness of the meta-model. Subsequently, this initial set of dimensions is briefly described, following the informal description of [FL10a, FDI+10].

Context

The context dimension Dcontext defines the set of context classes. A class in the context dimension describes the characteristics of a certain event type. For an event type τ, it might be necessary to deliver each event to all subscribers, i.e., at a large scale; for another event type τ', it might be sufficient to deliver an event only to the nearest neighbors, i.e., at a small scale. Moreover, the context defines the number of events sent, i.e., the velocity of the event type. The final characteristic is the ratio between subscribers and publishers, i.e., the cardinality of the channel. A one-to-many channel, for example, can employ a different routing strategy than a many-to-many channel.

Validity

We define validity Dvalidity as a time-dependent predicate that decides whether an event is valid at a given time or not. This defines a kind of temporal context. As a result, only valid messages are received by the application. An example is a predicate that discards events older than 5 min; in terms of Zolotorevsky [ZER09], this would be a sliding interval.
Hence, validity is defined by two parameters: progress and duration. Duration models the time-dependent predicate, and progress defines the notion of time. Either a wall-clock time or a logical time, triggered by events, is feasible.

Synchronization
The synchronization dimension Dsync describes the degree of order a certain event type guarantees, as well as the behavior for late messages. We assume that a total order of all events of a particular event type can be defined if a behavior for failure cases is annotated, i.e. a decision between liveness and safety. Relaxations can also be specified by allowing weaker order criteria (cf. section 5.5.4). To simplify the definition for the application developer, we define classes that rely on three parameters: guarantee, synchrony, and outlier handling. The guarantee parameter defines the order criterion. Synchrony is the time an event may take for arrival before it is considered an outlier¹. Outlier handling addresses how late messages are handled: they can be dropped, delivered out of order, or compensated.

Validity and context could be modeled as one dimension, as together they represent a spatio-temporal context (cf. [MBE10, ZER09]). However, we decided to separate these two aspects due to the classes and parameters introduced. This is a deliberate design choice that, the author believes, eases classification. The generic instance Ψgeneric of the multidimensional classification is formally described as follows:

∀Dτi ∈ DA : Dτi ∈ Dcontext × Dvalidity × Dsync

with the following symbols for the different classes:

classDcontext ∈ Dcontext, classDsync ∈ Dsync, classDvalidity ∈ Dvalidity.

We elaborate the details of the introduced semantic class spaces in section 10.2.3. To be able to do so, we first have to discuss the system model that defines the abstract view on DEBS. Each class is modeled in terms of parameters that are either instances of, or can be mapped to, system attributes that concretize the characteristics of the application in a technical notion. Before the specific class spaces are discussed, the system model that defines the system attributes is introduced.

¹ The synchrony parameter transforms the consensus problem from an asynchronous to a partially synchronous system model (cf. section 5.4) and therefore into a solvable problem.
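To make the generic instance tangible, it can be written down as plain data. The following sketch is purely illustrative; the dimension and class names are chosen here and are not prescribed by the framework:

```python
# Hypothetical sketch: each dimension is a finite set of class names, and a
# classification of an event type picks one class per dimension, i.e. an
# element of the cross product D_context x D_validity x D_sync.
from itertools import product

DIMENSIONS = {
    "context": {"small-scale", "large-scale"},
    "validity": {"none", "time-short", "event-short", "unlimited"},
    "sync": {"none", "fifo", "total-order"},
}

def is_valid_classification(classification: dict) -> bool:
    """Check that exactly one known class is chosen per dimension."""
    return (classification.keys() == DIMENSIONS.keys()
            and all(classification[d] in DIMENSIONS[d] for d in DIMENSIONS))

# The whole class space spanned by the three dimensions:
CLASS_SPACE = list(product(*DIMENSIONS.values()))

# An example event type, classified along all three dimensions:
position_update = {"context": "small-scale", "validity": "time-short", "sync": "none"}
```

A classification is thus just one point in a small, enumerable space, which is what makes an automated mapping to strategy combinations feasible later on.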



10.2.2 System Model
The system model, in the context of simulations also known as the simulation model, is the abstract view on a DEBS that defines the possible parameters for the simulation of such systems. We call these simulation parameters system attributes Λsystem to distinguish them from parameters that are part of a multidimensional classification. The system model is based on the reference model introduced in chapter 8.

The network consists of nodes that have input and output buffers and are interconnected by links (cf. section 8.2). The characteristics of nodes and links are described by attributes in terms of the discussed network characteristics (cf. section 4.3). These are called network attributes Λnet. Based on these attributes, a simulated network can be created with each node running an instance of the middleware.

Attribute               Description
Network attributes
  Number of Nodes       The overall number of network nodes.
  Delay                 The constant delay between two nodes.
  Jitter                The percentage of variation in delay.
  Drop Chance           Chance that a message gets lost.
  Header Size           Size of the network header.
  Upstream Rate         Data-rate of a node's upstream.
  Downstream Rate       Data-rate of a node's downstream.
  Queue Size            Size of a node's input and output buffers.
System attributes
  Publishers            Number of publishing nodes.
  Subscribers           Number of subscribing nodes.
  Payload               Size of the event's payload.
  Event Frequency       Number of events published per second per publisher.
  Node Fluctuation      Chance that a node leaves gracefully.
System metrics
  Order Percentage      Percentage of correctly ordered messages.
  Duplicate Percentage  Percentage of duplicate events.
  Path Latency          Duration a message takes to reach one subscriber.
  Overall Latency       Duration a message takes to reach all subscribers.
  Control Overhead      Percentage of control messages.
  Message Loss          Amount of lost messages, e.g. due to data-rate shortage.

Table 10.1: Important system attributes and metrics
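To make the role of the network attributes in table 10.1 concrete, the following sketch samples the delivery of a single message over one link. The function name, units (milliseconds, bytes, bits per second), and values are illustrative assumptions, not taken from the thesis' simulator:

```python
import random

def transmit(payload_bytes, delay_ms, jitter, drop_chance, header_bytes,
             upstream_bps, rng):
    """Return the simulated one-hop latency in ms, or None if the message is dropped."""
    if rng.random() < drop_chance:
        return None  # message lost (Drop Chance)
    size_bits = (payload_bytes + header_bytes) * 8
    serialization_ms = size_bits / upstream_bps * 1000.0  # limited by Upstream Rate
    # Jitter varies the constant Delay by +/- the given percentage.
    jittered_delay = delay_ms * (1 + rng.uniform(-jitter, jitter))
    return serialization_ms + jittered_delay

rng = random.Random(42)
latency = transmit(payload_bytes=256, delay_ms=50, jitter=0.1, drop_chance=0.0,
                   header_bytes=40, upstream_bps=1_000_000, rng=rng)
```

With a 50 ms delay and 10 % jitter, the resulting latency always falls between roughly 47 ms and 58 ms; setting the drop chance to 1.0 models a lossy link that delivers nothing.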



The remainder of the identified attributes define the characteristics of a publish-subscribe application¹, also based on the reference model. They configure the behavior of one or more arbitrary event channels distributing events over the middleware. This parametrization is used for the measurement of system metrics Ωsystem that represent QoS requirements and may be used for the decision about an optimized composition. For the purposes spanned by the requirements and limitations in chapter 7, table 10.1 shows the system attributes and metrics that form the system model. Network attributes define the characteristics of the nodes and their interconnecting links. We assume evenly distributed nodes, which results in equal characteristics for all links between nodes. The application attributes model unstructured records, as already discussed as a limitation of this thesis.

For each simulation performed, the attributes have to be set to values derived from the multidimensional classification or from explicit parameter definitions. This attribute combination, with its corresponding values for the metrics, provides the foundation for the decision about the optimal configuration. This process is detailed in chapter 11.

10.2.3 Class Spaces
The class along a certain dimension classifies an event type and implicitly defines parameters that can be used to derive configuration decisions. In the following, we discuss the three class spaces of the generic application classification. For each class space, the used parameters as well as their corresponding system attributes are described.

Context class space
A possible generic classification for the context dimension that is as independent as possible from a certain domain uses three parameters that span the class space: velocity, scale, and cardinality, as Definition 24 specifies. Velocity is a scalar value and defines the number of events one publisher generates per second. Classes can be chosen at any granularity in the domain of the scalar. Depending on velocity, data-rate, and message size, scale answers the question of how many nodes can be served at the given velocity. Classes are distinguished by the percentage of these maximum servable nodes.

¹ For the identification of these application attributes, a literature analysis on QoS for publish-subscribe and a case study of MMVE in chapter 3, respectively their event semantics in section 10.1, were performed in the context of a thesis that implemented a corresponding simulator [Bon13].



Cardinality models the ratio between publishers and subscribers, leading to classes like one-to-many, few-to-many, or many-to-many.

Definition 24:
(1) A context class classDcontext := (velocity, scale, cardinality) is defined by three properties:
(2) Velocity is a numerical value that defines the messages per second.
(3) Scale is defined as the number of nodes a channel is designed for. It is derived from parameters and calculated by: scale := weight ∗ upstream/(velocity ∗ messagesize).
(4) Cardinality defines the ratio between publishers and subscribers: cardinality := publisher/subscriber.

Table 10.2 shows an exemplary application-independent classification for the context parameters with their mapping to system attributes. This example chooses a relatively coarse discretization of the parameters; a finer granularity is also possible. The decision about the granularity is a tradeoff between user-friendliness and expressiveness. The possible classes are the permutations of the parameter values. Velocity addresses the frequency of the events' occurrences, scale the number of potential subscribers, and cardinality the ratio between publishers and subscribers. The selectivity of subscriptions, as informally introduced in section 10.1.1.1, is omitted. It would require a content-based filter model, which was excluded from the discussion about configurability (cf. chapter 7).

Velocity            Scale                                           Cardinality
low     5 msg/sec   small   0.1 ∗ upstream/(velocity∗messagesize)   one-to-many    1/scale
medium  15 msg/sec  medium  0.5 ∗ upstream/(velocity∗messagesize)   few-to-many    (0.3∗scale)/scale
high    30 msg/sec  high    1.0 ∗ upstream/(velocity∗messagesize)   many-to-many   1

Table 10.2: Classification of the context dimension
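The derived parameters of Definition 24 can be computed directly. A small sketch, assuming upstream in bytes per second and message size in bytes (the units are not fixed by the definition):

```python
def scale(weight, upstream, velocity, message_size):
    """Number of nodes a channel is designed for (Definition 24):
    weight * upstream / (velocity * messagesize)."""
    return weight * upstream / (velocity * message_size)

def cardinality(publishers, subscribers):
    """Ratio between publishers and subscribers (Definition 24)."""
    return publishers / subscribers

# 'medium' scale class (weight 0.5) at 'low' velocity (5 msg/sec),
# 100-byte events, and a 1 MB/s upstream:
nodes = scale(weight=0.5, upstream=1_000_000, velocity=5, message_size=100)
```

With these example values the channel can serve 1000 nodes; halving the upstream or doubling the message size halves the servable scale, which is exactly the tradeoff the scale classes in table 10.2 discretize.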

Validity class space
The validity dimension spans a class space describing the temporal context of events. Zolotorevsky [ZER09] defines two types of temporal contexts: event interval and sliding interval. These types may be modeled as a predicate, e.g. the sliding interval Validt(e) := timestamp(e) > t − 5, which defines all events valid at time t that are younger than the last 5 time-steps of application time. All events not covered by the predicate are invalid and may therefore be discarded during the dissemination process. This leads to the following definition:

Definition 25:
(1) Each validity class classDvalidity := (progress, duration) is defined by two properties: progress and duration.
(2) Duration defines a validity predicate Validt(e) := t − timestamp(e) < threshold which must hold for e at application-time t for the application to receive the event.
(3) Progress defines the type of clock used for time progression: a wall-clock, defining a sliding interval, or an event-triggered logical clock, defining an event interval.

Table 10.3 shows an exemplary classification and the respective mapping on system attributes. The resulting classes are all valid parameter combinations. None defines no validity check at all. Unlimited validity guarantees the delivery of all events. Time or event classes exist in combination with all duration parameter values, like time-long or event-short. Depending on the progress chosen, the duration is defined either in milliseconds or in discrete time-steps.

Progress                         Duration
none                             short      50000 ms / 2 steps
time    wall-clock time          medium     100000 ms / 5 steps
event   discrete logical time    long       300000 ms / 10 steps
                                 unlimited  guaranteed delivery

Table 10.3: Classification of the validity dimension
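Both progress types of Definition 25 reduce to the same predicate shape over different clocks. A minimal sketch; the helper names and the millisecond/step units are assumptions for illustration:

```python
def valid_wallclock(event_ts_ms, now_ms, threshold_ms):
    """Sliding interval: an event is valid while its age stays below the threshold."""
    return now_ms - event_ts_ms < threshold_ms

def valid_logical(event_step, current_step, threshold_steps):
    """Event interval: the same predicate over discrete, event-triggered time-steps."""
    return current_step - event_step < threshold_steps

# 'time-short' class (50000 ms) from table 10.3, applied to one event:
fresh = valid_wallclock(event_ts_ms=0, now_ms=40_000, threshold_ms=50_000)
stale = valid_wallclock(event_ts_ms=0, now_ms=60_000, threshold_ms=50_000)
```

A broker evaluating such a predicate can drop invalid events anywhere along the dissemination path, saving bandwidth for event types whose information ages quickly.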

Synchronization class space
Synchronization requirements are specified by three parameters: guarantee, synchrony, and outlier handling. Definition 26 spans a three-dimensional class space, describing synchronization capabilities in a generic way.



Definition 26:
(1) Each synchronization class classDsync is defined by a 3-tuple: classDsync := (guarantee, synchrony, outlierhandling).
(2) Guarantee defines the order guarantee that is given, for example total order, FIFO order, or none at all (cf. section 5.5.4).
(3) Synchrony specifies the time in milliseconds after which an event is considered an outlier.
(4) Outlierhandling defines how outlier events are handled.

Based on this definition, a strategy can be chosen that implements the specified order guarantee with a maximum waiting time; events that exceed the waiting time are either delivered anyway or thrown away. Compensation means the application will be notified that the effects of certain events must be reverted and reapplied when an earlier event arrives, as suggested e.g. by Jefferson in [Jef85]. Obviously, this behavior is not always, and sometimes only to a certain extent, possible. For an application that cannot revert the effects of events, compensating classes may not be chosen.

Table 10.4 shows an exemplary class space for the order dimension. The guarantee parameter is mapped to a strategy selection rule, only allowing order strategies that implement the corresponding guarantee. The synchrony parameter is defined as a scalar in milliseconds with a granularity of three classes. Outlier handling is mapped heterogeneously: compensate requires a certain implementation effort and must be considered when designing an algorithm; throw away and deliver anyway, however, are mapped to strategy parameters configuring the strategy's behavior.

Guarantee                          Synchrony           Outlier handling
total order   strategy selection   low     50000 ms    throw away      strategy parameter
FIFO order    strategy selection   medium  100000 ms   deliver anyway  strategy parameter
causal order  strategy selection   high    300000 ms   compensate      strategy selection
local order   strategy selection
none          -

Table 10.4: Classification of the synchronization dimension
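The outlier-handling parameter can be sketched as a small dispatch on arrival. All names here are hypothetical; in particular, "compensate" can only notify the application, since reverting effects (in the spirit of [Jef85]) happens outside the middleware:

```python
def handle_arrival(delivery_delay_ms, synchrony_ms, outlier_handling):
    """Return the action taken for one arriving event.

    Events within the synchrony window are delivered in order; late events
    ('outliers') are treated according to the outlier-handling class.
    """
    if delivery_delay_ms <= synchrony_ms:
        return "deliver-in-order"
    if outlier_handling == "throw away":
        return "drop"
    if outlier_handling == "deliver anyway":
        return "deliver-out-of-order"
    if outlier_handling == "compensate":
        return "notify-application"  # application reverts and reapplies effects
    raise ValueError(outlier_handling)
```

The first two late-arrival cases correspond to strategy parameters in table 10.4, while the third corresponds to selecting a strategy that supports compensation at all.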



Summary
The discussion of the three dimensions gives an insight into the possibilities a classification generally allows, and also provides a generic classification scheme used for the remainder of this thesis. The reason a generic scheme is favored over a domain-specific scheme for MMVEs is that it enables the discussion of a variety of application scenarios not limited to an MMVE. Based on the semantic classification DA, we now have an exact definition of the semantics of each event type in an application. Next, we discuss the different artifacts that have to be generated for a complete configuration.

10.2.4 Artifacts
The artifacts required to fully describe an application A consist of one domain profile, one network profile, and one or more event type descriptions and strategy descriptions. Such an application description ΓA represents all information necessary to automatically deduce the technical configuration of the introduced framework. It can be formally defined as:

ΓA := (Γdomain, Γnet, ∪y∈Y Γy, ∪τ∈U Γτ)

Examples for each single artifact can be found in the corresponding section of part IV, which discusses the proof-of-concept implementation. In the following, we discuss each artifact, beginning with strategy descriptions.

10.2.4.1 Strategy Descriptions
For each strategy y implemented in the scope of the framework, a strategy description Γy has to be written by the middleware developer. The description specifies the information required for the usage of the strategy during the automated configuration workflow. Γy := (Y, Yrequire, yexclude, classname, Λconf) is defined as a tuple that contains the following information:

Strategy type Y: This element defines the strategy type the described strategy is implementing.

Set of requirements Yrequire: This set contains all strategy types that are required for the operation of this strategy.



Set of exclusions yexclude: This set defines all strategies that are incompatible with this strategy.

Name of the implementing class classname: This name points to the implementing class.

Set of configuration parameters Λconf: Defines a set of configuration parameters that may be used to configure the strategy. These strategy parameters may be static or can pose variables that are targets for the mapping of classification parameters, as done, for example, for the definition of the synchronization dimension.

Of course, the sets that define the requirements, exclusions, and parameters may be empty. This information is sufficient for the description of strategy types and poses the additional effort a developer has to expend in order to make a strategy available for automated configuration decisions.

10.2.4.2 Domain Profiles
Domain profiles are the artifact that collects domain-specific information. Separating domain-specific information from the actual description of event types has the advantage that domain knowledge is easily reusable. Domain profiles contain an instantiation of the multidimensional classification for the described domain. This enables the formulation of the classification in the terminology of the target domain. Moreover, parameters may be defined that further abstract from system attributes; in the domain model they are called terms. In addition to a classification of event types, they allow the explicit specification of parameters in the terminology of the modeled domain. Normally the usage of the classification is sufficient, but if the last bit of precision is required, the possibility to explicitly define terms is a huge gain. The last part of a domain profile are QoS requirements. They provide the possibility to define upper and lower bounds for different QoS metrics that are the default for the modeled domain.
Formally, a domain profile Γdomain := (Λterms, Ψdomain, Ωlimit) is a tuple that contains a set of terms Λterms, the instance of a multidimensional classification Ψdomain, and a set of default limits of QoS metrics Ωlimit. Each parameter, be it part of Λterms or part of Ψdomain, must contain a mapping rule to deduce a system attribute or another parameter from it. This constraint ensures that all parameters are finally deducible to system attributes that can be used for simulation.
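The artifacts defined so far can be sketched as plain records. The field names mirror the tuples above; everything else (types, defaults) is an assumption for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class StrategyDescription:  # Γy
    strategy_type: str                                  # Y
    requires: set = field(default_factory=set)          # Y_require
    excludes: set = field(default_factory=set)          # y_exclude
    classname: str = ""                                 # implementing class
    config_params: dict = field(default_factory=dict)   # Λ_conf

@dataclass
class DomainProfile:  # Γ_domain
    terms: dict            # Λ_terms: domain-specific parameters with mapping rules
    classification: dict   # Ψ_domain: instantiated multidimensional classification
    qos_limits: dict       # Ω_limit: default bounds for QoS metrics

# A hypothetical routing strategy that cannot be combined with tree routing:
gossip = StrategyDescription(strategy_type="routing", classname="GossipRouting",
                             excludes={"TreeRouting"})
```

Keeping the strategy metadata declarative like this is what lets a later workflow step enumerate and prune strategy combinations without inspecting the implementations themselves.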



10.2.4.3 Network Profiles
Network profiles (Γnet) describe the properties of the underlying physical network. We adhere to the system model introduced in section 10.2.2. Network profiles contain a set of parameters Λnet that define the characteristics of the network. The limitations in chapter 7 constrain the discussed networks to evenly distributed networks. As a result, each node, link, and queue in the network has the same system attributes. That means each λi ∈ Λnet describes all nodes and edges, respectively, in the network. Therefore the network profile is rather simple. Extensions that would consider different topologies, as discussed in section 4.3.1, and their implications are sketched as further work.

10.2.4.4 Event Type Descriptions
The event type description Γτ is the artifact that contains all relevant information about a single event type. It depends on the domain profile and the network profile to fully describe an application. It is defined as a tuple Γτ := (schema(τ), Dτ, Λτ, Ωτ): the event type description for the event type τ contains its schema, describing the attributes used. Moreover, the classification Dτ of the event type has to be noted in the description. It classifies the event type according to the multidimensional classification Ψdomain defined in Γdomain. These elements sufficiently describe an event type. Optionally, for the fine-tuning of requirements and semantics, Λτ is a set of parameters that refines the set of parameters defined on the domain level, i.e. Λterms, in an event-type-specific notion. The same refinement is possible for QoS requirements with a set of specific QoS targets defined in Ωτ.

10.3 Summary
In the previous sections, we discussed a configuration model that enables the specification of event semantics in a developer-friendly way. The developer is able to specify requirements in a domain-specific way, which promises to speed up the development process. To enable this speed-up, a multidimensional classification was introduced as a meta-model that can be instantiated in a domain-specific way in order to create a terminology tailored to the peculiarities of a certain domain. Moreover, different artifacts were introduced to reflect the different roles that take part in the configuration process. This reference model can be used to define a concrete DSL, which is described in the context of the proof-of-concept implementation in chapter 15.

Moreover, in the context of the methodology for QoS-aware configuration of publish-subscribe systems, the configuration model provides the means for a developer-friendly description of the event semantics on the application level. That leaves one part of the methodology open for discussion: the translation between the semantic description and a strategy combination. The resulting challenges, as well as two workflows that solve these challenges in the form of automated configuration, are discussed in the following chapter.


11 | Automating Design-Time Configuration

The automation process for design-time configuration, discussed in this chapter, allows not only automating the configuration process, but also deriving the decision from simulations rather than from simple heuristics, as in most existing configurable approaches (cf. section 5.7). We begin the discussion with the problem statement that results from the gap between the domain-specific application description and the technically motivated configuration of the middleware. After the problem is formulated, the basic idea for the solution is sketched in section 11.2. This idea is concretized by the introduction of two possible workflows that realize it. The development of both workflows was supported by the work done in Wahl [Wah13]. One workflow represents the naive approach that performs simulations as required. The other workflow tries to minimize the required simulations and decouples the simulation effort from the actual decision process.

11.1 Problem Statement
The configuration of a channel is a combination of strategies, one for each strategy type. Ideally, the configuration of a channel is automatically derived from the description of the corresponding event type. The description Γτ of an event type τ contains all relevant information. Hence, the basic problem is to find an optimal mapping from the event type description to a combination of strategies. The schema of the event type, schema(τ), can be directly used for the generation of the corresponding configuration code; it is not further required for the configuration decision. Dτ, the classification of an event type, is defined for the generic instantiation Ψgeneric as a triple (Dcontext, Dvalidity, Dsync). Such a classification limits the appropriate strategies yi for each strategy type Yj based on the selection rules RDk for each dimension Dk. A strategy for a particular strategy type is typically only influenced by one dimension, but might generally be influenced by multiple dimensions. Moreover, a particular strategy for strategy type a might conflict with a strategy of type b. Consequently, it is not enough to define an independent mapping function for each dimension. Instead, we need a mapping function that determines the best possible combination of strategies for a given description. To assess the quality of a certain mapping to a combination of strategies, the combination of strategies Y must score minimally or maximally for each QoS requirement ωi ∈ Ωτ, depending on the definition of ωi.

Definition 27:
(1) For an m-dimensional classification and n strategy types, the combination of n strategies Y is determined by a mapping function map : D1 × D2 × . . . × Dm → Y1 × Y2 × . . . × Yn.
(2) map delivers a valid combination of strategies Y for which all ωj ∈ Ωτ are satisfied.

Definition 27 specifies the optimization problem to be solved when deciding for an optimized channel configuration. In the next section, we briefly discuss a possible approach for the configuration process to solve the problem and exploit the potential provided by the multidimensional classification.
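Conceptually, map amounts to a search over the cross product of per-type strategy candidates, pruned by compatibility and QoS constraints. The following brute-force sketch only illustrates the shape of the problem (all names and the toy strategies are hypothetical); it is not the simulation-based decision process developed later:

```python
from itertools import product

def map_classification(valid_per_type, excludes, satisfies):
    """Return the strategy combinations that are internally compatible and
    satisfy all QoS requirements.

    valid_per_type: list of candidate strategy lists, one per strategy type.
    excludes: dict strategy -> set of incompatible strategies.
    satisfies: predicate on a combination, modeling 'all omega_j hold'.
    """
    result = []
    for combo in product(*valid_per_type):
        compatible = all(other not in excludes.get(y, set())
                         for y in combo for other in combo)
        if compatible and satisfies(combo):
            result.append(combo)
    return result

combos = map_classification(
    valid_per_type=[["tree", "gossip"], ["fifo", "total"]],
    excludes={"gossip": {"total"}},   # assume gossip cannot give total order
    satisfies=lambda c: "fifo" in c or "total" in c,
)
```

Even this toy shows why conflicts matter: the pair ("gossip", "total") is pruned although both strategies are individually valid for their types.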

11.2 Basic Solution Framework
The problem stated in Definition 27 can be addressed in many ways. The mapping function can employ heuristics or analytical models to deduce an optimized solution. The problem with heuristics or analytical models is their definition: heuristics can, for example, be defined as a set of rules, while analytical models require the specification of cost formulas that model the middleware's behavior. Both approaches have in common that they offer limited extensibility and only reflect a model of the middleware implementation, as they view the middleware as a white box with knowledge about its inner workings. In order to gain better extensibility and be more robust to changes in the middleware, this approach to automation sees the middleware as a black box. This view, however, requires the use of measurements to describe the behavior of the observed system. As real-world deployment is far too expensive for a decent usability of the approach, simulation is the tool of choice.

[Process application description → identify simulation parameters → run simulations in parallel → find optimal configuration → configure library]

Figure 11.1: Basic solution workflow

Figure 11.1 illustrates the basic workflow for a simulation-based approach to solve the problem statement. First, the application description has to be processed in order to generate valid configurations that pose candidates for an optimized configuration. Second, all relevant system attributes must be identified in order to parametrize the simulations. After these two preparatory steps, simulations can be executed in parallel. Obviously, this step is the most time-consuming part of the approach and is therefore the target for further optimization, even beyond parallelization (cf. section 11.4). Each simulation instance measures all metrics for exactly one point in the hypercube that is spanned by the system attributes. The measurements are afterwards used for the decision: the QoS requirements score each configuration candidate, i.e. strategy combination, by comparing the measured metrics with the imposed limits of the QoS requirements.

Simulation Model
The simulation model employed for the execution of the simulations is strongly based on the system model suggested in section 10.2.2. The system attributes define the parameters for the simulation, while the metrics pose the measurements performed. As shown in figure 11.2, the simulation model consists of nodes with incoming and outgoing queues. Node attributes describe the behavior of the nodes, like the size of the queues. These nodes are interconnected by links that are described by link attributes, like delay, jitter, etc.

[Nodes with input and output queues, interconnected by links; each node runs an application]

Figure 11.2: Structure of the simulation model

Finally, the application is modeled as required by the system model and defined by the application attributes. Considering the simplifying assumptions in section 7.1, the simulation model assumes an evenly distributed network, resulting in constant network and node attributes. It is assumed for the simulation model that node resources like CPU and memory are not a limiting factor for the middleware; only the network introduces bounds for the decision about configurations. This is a simplification resulting from the initial assumptions for the optimization problem, namely that each channel is optimized independently. In the introduced scenario of MMVEs, that means a maximum of about 30 messages per second per client is processed. This results in a workload that current industry-strength messaging middleware like ZeroMQ¹ handles without noticeable processing overhead (cf. Dworak [DESS11]). As a consequence, a simple simulation model that only considers the system attributes listed in table 10.1 is sufficient for the automated decision process. In the remainder of this chapter, this basic solution framework is concretized by two workflows.

¹ http://www.zeromq.com

11.3 Naive Workflow
The first described workflow does not apply any optimizations that could speed up the decision process. Required simulation data is generated exactly on demand, i.e. during the configuration process; therefore it is called the naive workflow. This workflow is suitable for projects where relatively few event types have to be configured and the requirements stay stable during the development process of the application. In the naive workflow, each change in the requirements leads to new simulation runs.

[Artifacts (domain profile, network profile, strategy descriptions, event type descriptions) feed into: identify candidate configurations → deduce system attributes → execute required simulations → select QoS-optimal configuration → generate custom middleware library → middleware]

Figure 11.3: Overview of the naive workflow

Figure 11.3 illustrates the different steps of the naive workflow at a coarse granularity. First, the required input parameters for the simulation have to be generated; this consists of the identification of candidate configurations and the deduction of system attributes. The simulations, parametrized with the deduced system attributes, are then executed for each candidate configuration. After all relevant simulations have been run, the candidate is chosen that provided the best results in the simulations regarding the QoS requirements defined in the event type description or domain profile. With the best-suited candidate configuration identified, the custom middleware is configured and compiled. After the compilation process completes successfully, the library is ready to use. In the following, each step in the workflow is detailed individually.

Identification of Candidate Configurations
The identification of valid candidate configurations Ycandidate := {Y1, . . . , Yn}, consisting of n valid strategy combinations, must be performed for each event type τ individually.



[Initialize parameters → identify valid strategies → deduce candidate configurations → eliminate invalid candidates]

Figure 11.4: Candidate identification process

The identification process is illustrated in figure 11.4. Starting from the classification Dτ of the event type, first all event-type-specific parameters in Λτ and domain-wide terms Λterms are initialized to their appropriate values. The values are deduced starting from the classification for each dimension in Dτ. Based on the initialized parameters, all selection rules in RDm are applied for all m dimensions. If a rule matches a strategy, it is added to the set of valid strategies yvalid. This set is grouped by strategy type Yi into sets of valid strategies for each strategy type, yvalid,Yi. These grouped valid strategies are then combined into the uncleaned candidate set Ycombinations = yvalid,Y1 × . . . × yvalid,Yn for the n strategy types. If any yvalid,Yi is empty, a single placeholder strategy is used. The final step removes the combinations that violate constraints defined in the exclusion set yexclude or do not honor required dependencies on other strategy types defined in Yrequire. This step is done for all strategy descriptions Γy. The final set is formally defined as follows:

Ycandidate := Ycombinations \ {Y ∈ Ycombinations | ∃Γy : ∃yi ∈ Y : yi ∈ yexclude ∨ violates(yi, Yrequire)}

The function violates(y, Yrequire) hereby checks whether a strategy y violates any requirement in Yrequire and returns true if a requirement is not met.

System Attribute Deduction
Besides the identification of candidate configurations, the simulations require values for a certain set of system attributes in order to perform the simulations for each candidate configuration. For each artifact that is part of the application description (the network profile, the domain profile, and the event type description), a deduction process is performed in order to fill all required system attributes. Figure 11.5 shows the dependencies between the different artifacts and the deduction of system attributes.
However, since the description allows overwriting parameters in different artifacts, a prioritized deduction process is required to determine all required system attributes [Wah13]:



[System attributes are deduced from the domain profile, the network profile, and the event-type description]

Figure 11.5: Location of parameters for the deduction of system attributes [Wah13]
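The deduction from the sources located in figure 11.5 is essentially a prioritized merge, with later sources overwriting earlier ones and defaults filling the remaining gaps. A minimal sketch, assuming dict-based artifacts (the representation is an assumption, not the thesis' implementation):

```python
def deduce_system_attributes(required, network, classification, event_params,
                             term_values, defaults):
    """Merge the parameter sources in priority order (later overwrites earlier)
    and fail if a required system attribute remains unset."""
    attrs = {}
    for source in (network, classification, event_params, term_values):
        attrs.update(source)
    for name in required:
        attrs.setdefault(name, defaults.get(name))
    missing = [n for n in required if attrs.get(n) is None]
    if missing:
        raise ValueError(f"configuration canceled, unset attributes: {missing}")
    return attrs

attrs = deduce_system_attributes(
    required=["delay", "payload", "publishers"],
    network={"delay": 50},
    classification={"publishers": 10, "payload": 100},
    event_params={"payload": 256},   # explicit developer value wins
    term_values={},
    defaults={"publishers": 1},
)
```

Here the explicitly assigned payload (256) overrides the value derived from the classification (100), mirroring the priority assumption of the deduction process.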

1. All system attributes Λsystem are initialized with the values of the parameters in Λnetwork defined in the network profile Γnet . 2. Λsystem is extended by the values of the properties, defined in Λclass by the multidimensional classification instance Ψdomain that is part of the domain profile Γdomain . 3. In the next step event-type specific parameters from Λτ are added to Λsystem . They may overwrite existing values. The deduction process assumes that explicitly assigned values by the developer have a higher priority than those that form the classification. 4. To complete the event-type specific deduction, the set of domain-specific terms Λterms is consulted and all there defined terms that were used in event-type description Γτ are transformed and overwrite the values in Λsystem . 5. In the final step, all required system attributes that still have no value are assigned with default values defined in Λterms . If the deduction process finishes and not all required system attributes are set, the configuration process is canceled. Otherwise it is continued. Execution of Simulations With both, the candidate configurations Y candidate and the system attributes Λsystem , the simulations can be executed. For each candidate configuration Y ∈ Y candidate a simulation is executed using the attributes defined in Λsystem . The naive configuration workflow is blocked until the last candidate has been measured. The time consumption of the proof-of-concept simulator implementation is evaluated in section 17.8. However, it is expected that the simulations for larger scale may take at least hours to complete. This expectation makes the naive workflow only suitable


11 Automating Design-Time Configuration

for non-agile development workflows, where application descriptions do not change very often.

Selection of the QoS-optimal Configuration

If all simulations completed successfully, an optimal candidate configuration Yoptimal can be selected with respect to the event type's QoS requirements Ωτ. The process is performed in two steps [Wah13]: In the first step, all candidate configurations are selected that fulfill the limits defined by the metrics in Ωτ. The resulting set is called Ysuitable and defined as follows:

Ysuitable = {Yi ∈ Ycandidate | ∀ωj ∈ Ωτ : fulfills(Yi, ωj)}

In the second step, the suitable candidate configurations are filtered to select the combination Yoptimal that shows the optimal QoS characteristics with respect to the requirements Ωτ imposed for a certain event type τ:

Yoptimal = {Yi ∈ Ysuitable | ∀ωj ∈ Ωτ : optimized(Yi, ωj)}

For metrics that correlate positively with QoS characteristics, optimized(Y, ω) for the metric ω is defined as follows:

optimized(Yk, ω) ⇔ ∀Yl ∈ Ysuitable : ωYk = max(ωYk, ωYl)

Metrics that correlate negatively are defined analogously, except that they use min(ωYk, ωYl) to determine the optimum. If the set Yoptimal contains more than one element, the decision is made by the priority of the requirements in Ωτ: the first requirement defined in the event-type description Γτ has the highest priority.

Generation of a Custom Middleware Library

Finally, after the optimal configuration has been identified, the generation of the custom middleware takes place. First, the optimal combination of strategies for each event type is translated into configuration files for the middleware. These consist of the schema for all event types as well as the configuration for each channel. The configuration of each channel consists of the strategy combination and the definition of configuration parameters Λconf for each strategy. After all configuration files have been generated,



they are combined with the source code of the library and compiled. The result is a custom-built middleware library.
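The two-step selection described above can be sketched in a few lines of C++. The types below (candidate names, metric vectors, and the Requirement struct) are hypothetical simplifications of the formal definitions; the tie-breaking by requirement priority is realized as a lexicographic comparison in priority order.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical types: a candidate holds one measured value per QoS metric,
// indexed in requirement priority order.
struct Candidate {
    std::string name;
    std::vector<double> metrics;
};

// A requirement: a limit check plus the direction of optimization.
struct Requirement {
    std::function<bool(double)> fulfills;  // e.g. latency <= 100 ms
    bool higherIsBetter;                   // positive vs. negative correlation
};

// Step 1: keep only candidates that fulfill every requirement (Y_suitable).
// Step 2: pick the optimum; ties are broken by requirement priority, i.e.
// the first requirement dominates. Assumes at least one suitable candidate.
Candidate selectOptimal(const std::vector<Candidate>& candidates,
                        const std::vector<Requirement>& reqs) {
    std::vector<Candidate> suitable;
    for (const auto& c : candidates) {
        bool ok = true;
        for (std::size_t j = 0; j < reqs.size(); ++j)
            ok = ok && reqs[j].fulfills(c.metrics[j]);
        if (ok) suitable.push_back(c);
    }
    // Lexicographic "is better than" comparison in priority order.
    auto better = [&](const Candidate& a, const Candidate& b) {
        for (std::size_t j = 0; j < reqs.size(); ++j) {
            double x = a.metrics[j], y = b.metrics[j];
            if (x == y) continue;
            return reqs[j].higherIsBetter ? x > y : x < y;
        }
        return false;
    };
    return *std::min_element(suitable.begin(), suitable.end(), better);
}
```

The comparator treats "better" as "less", so std::min_element returns the best suitable candidate.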

11.4 Optimized Workflow

The optimized workflow shifts the effort for the elicitation of the required simulation data from configuration time to a separate workflow. This enables a fast configuration process at the cost of a full space-filling experiment for each domain the middleware is adapted to. Hence, this workflow is suited for large agile projects, where the application description changes often and requires frequent reconfigurations. We discuss the foundations of space-filling experiments and the resulting meta-models in section 11.4.1. The process for generating the meta-models required for the optimized workflow is illustrated in section 11.4.2.

Figure 11.6: Overview of the optimized workflow

We begin with the discussion of the configuration process itself, as depicted in figure 11.6. Compared to the naive workflow, figure 11.6 shows an additional input: meta-models. These pre-measured meta-models replace the execution of simulations during the decision process; the simulation step of the naive workflow is thus represented by the evaluation of meta-models. For the most part, all other steps stay the same.



Initial Steps

In the initial steps, the artifacts of the application description are processed to derive two sets: the set of candidate configurations Ycandidate and the system attributes Λsystem. Both sets are determined in a similar way as discussed for the naive workflow in section 11.3. The only difference is that for Λsystem value ranges are determined instead of single values, in order to span the parameter space for the generation of meta-models.

Evaluation of Suitable Meta-Models

The key difference of the optimized workflow lies in the evaluation of pre-computed meta-models instead of the repeated execution of simulations for each configuration process. The addressed problem is the same as for the naive workflow: the candidate configurations Ycandidate have to be scored according to the QoS requirements Ωτ and Ωlimits. To be able to do so, all metrics defined in those two sets need measurements for all candidate configurations in Ycandidate. Therefore, a selection of meta-models is performed, and only the models for the relevant metrics and candidate configurations are considered. The determined system attributes are passed to each selected meta-model, which returns a prediction for the modeled QoS metric. This prediction is used in place of the exact measurement of the naive workflow.

Remaining Configuration Steps

In the remaining steps, the optimal configuration Yoptimal is determined from Ycandidate, based on the predictions from the appropriate meta-models. Afterwards the custom library is configured, compiled, and ready to use. Both steps stay the same as for the naive workflow.

11.4.1 Space-Filling Experiments and Meta-Modeling

Before we discuss the generation of the meta-models, a short excursus on space-filling experiments and meta-modeling is given. The idea to separate the simulation from the actual configuration process introduces a new challenge.
Now a whole domain has to be simulated in order to cover all possible configuration candidates and system attribute values. This requirement spans a high-dimensional hypercube that has to be sampled. Mathematically, a system can be described as a function with s input parameters (x1, …, xs) ∈ I from an input space I and an output y, as illustrated in figure 11.7. If the parameter space is reasonably large, the exhaustive experimental evaluation of all



parameter combinations is not feasible. Therefore, in many cases the goal is to find a sufficiently accurate approximate model ŷ = g(x⃗) that describes the system but runs faster [FLS10]. Such models are called meta-models, as they are the model of a model; in our case they provide a model to approximate the system model. They often are regression or interpolation methods such as linear models, neural networks, Gaussian processes, etc.


Figure 11.7: Mathematical representation of physical systems and their meta-modeling [FLS10]
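The relationship depicted in figure 11.7 can be illustrated with a toy example: an expensive system function f is replaced by a cheap surrogate g fitted to a few samples. The one-dimensional least-squares line below is merely a stand-in for the more sophisticated meta-models discussed in this section, not one of the models actually used in the thesis.

```cpp
#include <cstddef>
#include <vector>

// Toy "system": in practice, evaluating f would mean running a simulation.
double f(double x) { return 3.0 * x + 1.0; }

// Meta-model g: a least-squares line fitted to sampled (x, f(x)) pairs.
struct LinearModel {
    double slope = 0.0, intercept = 0.0;
    void fit(const std::vector<double>& xs, const std::vector<double>& ys) {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        std::size_t n = xs.size();
        for (std::size_t i = 0; i < n; ++i) {
            sx += xs[i]; sy += ys[i];
            sxx += xs[i] * xs[i]; sxy += xs[i] * ys[i];
        }
        slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        intercept = (sy - slope * sx) / n;
    }
    // Cheap prediction replacing further evaluations of f.
    double predict(double x) const { return slope * x + intercept; }
};
```

Once fitted on a handful of samples of f, predict() answers arbitrary queries without running the expensive system again.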

Such meta-models are employed for a variety of tasks: among others, they can be used for preliminary studies and visualization or for prediction and optimization. According to Fang et al. [FLS10], the creation of an experiment using meta-models can be split into two phases: design and modeling.

Design of Computer Experiments

During the design phase, the goal is to find a set of n points in the input space I such that the approximate model can be optimally constructed. The basic approach is to uniformly scatter points in I. Such a design is called a space-filling design or uniform design. The following methods are popular for finding a set of uniformly scattered points.

Random Sampling: The easiest, but still reasonable, way is to choose random values within the codomain of each parameter xi. Of course, the non-deterministic way of choosing the parameter values can lead to areas in I that are only sparsely sampled or not sampled at all. This may reduce the overall quality of a meta-model in the sparsely sampled areas.

Systematic Sampling: A deterministic way to determine samples from an input space I is to systematically select the samples for each parameter according to a formula. Most of these systematic methods pick samples at certain intervals for each parameter xi, traversing its whole codomain.

Latin Hypercube Sampling: Latin Hypercube Sampling (LHS), originally introduced by McKay et al. [MBC79], is a sampling method that combines systematic and random sampling. The codomain of each parameter xi ∈ I is systematically



divided into equally probable intervals. Then a sample point is selected randomly from each interval. The number of experiments results from the combination of these sample points.

Modeling of Computer Experiments

The second phase captures the modeling aspects of computer experiments. This includes the identification of the so-called hyper-parameters that best instantiate the meta-model. To be able to assess the quality of a certain hyper-parameter combination, validation methods are employed during this phase. A possible method for the optimization of hyper-parameters is grid search [BB12]. In an exhaustive grid search, every combination of hyper-parameters is tested by a validation method such as cross-validation. The optimal combination is then used for further analyses. This brute-force approach can be quite expensive. Bergstra et al. [BB12], however, suggest a randomized grid search, where a fixed number of combinations is chosen randomly and assessed by cross-validation. The authors argue that this method yields results comparable to an exhaustive search, but is considerably faster.

Selected Meta-Models

Many different meta-models have been introduced in the literature. They range from simple linear interpolation models to sophisticated stochastic methods like Gaussian processes. In the following, we briefly sketch the different methods that will be applied in the remainder of the thesis.

Decision Trees: Decision trees [Qui86] are a method that can be employed for regression and classification problems. They build a tree in which each node models a decision for one parameter. The quality of a decision is assessed based on the expected information gain with respect to the input data. Due to their structure, they are especially good estimators if the function to approximate contains jumps.

Ensemble Methods: Ensemble methods combine more than one meta-model to enhance the quality of the overall prediction.
Extremely randomized trees [GEW06] are one such method, employing a forest of randomized decision trees.

Gaussian Processes: Gaussian processes, originally discussed for machine learning by Rasmussen in [RW06], are a regression method for the prediction of multidimensional functions. A Gaussian process is a special form of a stochastic process1

1 A stochastic process describes a collection of random variables that are indexed over a totally ordered set, which usually represents time.



that generalizes the Gaussian distribution: it ensures that each random variable is normally distributed. A Gaussian process can be seen as a generalization of a multivariate Gaussian distribution to infinite dimensions [RW06]. The advantage of using such a non-parametric meta-model as an estimator is that each estimation has an attached probability. This allows for reasoning about the quality of the estimation with respect to the input values and their variance. It may also be used to generate further input data if the quality of the estimation is unsatisfactory. Rasmussen also argues that Gaussian processes are flexible regarding their hyper-parameters; therefore, they may be adapted for the estimation of a variety of functions. Functions with one hard jump, however, can be difficult to estimate with Gaussian processes.

For a detailed discussion of the introduced design and modeling methods as well as the individual meta-models, the interested reader is referred to Fang et al. [FLS10] for general information on computer experiments and to Rasmussen [RW06] for an in-depth discussion of Gaussian processes. Geurts et al. [GEW06] describe extremely randomized trees, while Quinlan [Qui86] gives an introduction to decision trees.

11.4.2 Generate Meta-Models

The interesting part of the optimized workflow is the generation of meta-models that can be calculated with the least computational effort while still providing a sufficiently exact approximation of the middleware's behavior. The described workflow allows the integration of arbitrary meta-models, but we limit the discussion to the methods introduced in section 11.4.1. Figure 11.8 shows the required steps for the generation of meta-models to be used for configuration. First, the input parameters required to train the model have to be determined. This step is performed similarly to the naive workflow and consists of the generation of candidate configurations and the deduction of system attributes.
The only difference is that all combinations of strategies matching the selection rules of the domain profile are considered, and that for each system attribute the whole codomain has to be measured. After the input parameter space has been spanned, it is sampled using a systematic sampling method to reduce the number of required experiments. These experiments are then performed the same way as for the naive workflow. Finally, the meta-models have to be parametrized and selected.
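The sampling step can be sketched with Latin hypercube sampling as described in section 11.4.1. The sketch below works on the unit hypercube; mapping each dimension to the actual attribute bounds is omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Latin hypercube sampling: for each of the `dims` parameters, split [0, 1)
// into `n` equally probable intervals, draw one point per interval, and
// shuffle the interval order so the per-parameter picks combine randomly.
std::vector<std::vector<double>> latinHypercube(std::size_t n, std::size_t dims,
                                                std::mt19937& rng) {
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::vector<std::vector<double>> samples(n, std::vector<double>(dims));
    for (std::size_t d = 0; d < dims; ++d) {
        std::vector<std::size_t> perm(n);
        for (std::size_t i = 0; i < n; ++i) perm[i] = i;
        std::shuffle(perm.begin(), perm.end(), rng);
        for (std::size_t i = 0; i < n; ++i)
            samples[i][d] = (perm[i] + unit(rng)) / n;  // one point per stratum
    }
    return samples;
}
```

Each column of the result hits every one of the n strata exactly once, which is the defining property of a Latin hypercube design.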



Figure 11.8: Meta-model generation

Deduce the Input Parameter Space

The input parameter space for the computer experiment is deduced in two steps. First, all candidate configurations Ycandidate are identified based on all selection rules RDm for all dimensions m, as defined in the domain profile Γdomain. The exact process is analogous to the naive workflow and is therefore not formulated in detail again. These candidate configurations span one discrete dimension of the parameter space. Second, the system attributes have to be identified. They are deduced from the parameters defined for the network, Λnetwork, and the parameters defined for the domain, Λterms. The difference is that, for each parameter, the whole codomain is considered to span the input parameter space. Each attribute introduces another dimension in the parameter space. The upper and lower bounds of each parameter are defined by the developer who writes the domain profile.

Sample Input Parameters

This step corresponds to the design phase of computer experiments. It is obvious that a full sampling of the parameter space, spanned by the system model defined in section 10.2.2,



would require an unreasonable amount of time1. Therefore, techniques for sparse sampling must be employed, as discussed in section 11.4.1. The tradeoff between the sparseness of the sampling and the introduced error is examined during the evaluation in section 17.7.

Parametrization and Selection of Meta-Models

The modeling step for computer experiments is the logical next step. It starts with a randomized grid search to identify suitable hyper-parameters for the different meta-models. The quality of the parametrized meta-models is assessed by cross-validation using well-known error metrics such as the mean squared error.
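A minimal sketch of the randomized search suggested by Bergstra et al. [BB12]: instead of walking the full grid, a fixed number of random hyper-parameter combinations is drawn and scored. The HyperParams struct and the score function are placeholders for real hyper-parameters and a real cross-validation error.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <random>

// Hypothetical hyper-parameter combination for some meta-model.
struct HyperParams { double a; int b; };

// Randomized search: draw `trials` random combinations and keep the one
// with the best (lowest) validation score, e.g. a cross-validated MSE.
HyperParams randomizedSearch(
    std::size_t trials,
    const std::function<double(const HyperParams&)>& cvScore,
    std::mt19937& rng) {
    std::uniform_real_distribution<double> da(0.0, 1.0);
    std::uniform_int_distribution<int> db(1, 10);
    HyperParams best{0.0, 1};
    double bestScore = std::numeric_limits<double>::infinity();
    for (std::size_t t = 0; t < trials; ++t) {
        HyperParams p{da(rng), db(rng)};
        double score = cvScore(p);  // one cross-validation run per draw
        if (score < bestScore) { bestScore = score; best = p; }
    }
    return best;
}
```

The number of trials bounds the validation cost up front, which is exactly the advantage over an exhaustive grid.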

11.5 Summary

This chapter discussed the missing link between the configurable publish-subscribe framework and the QoS-aware configuration model. To link those two parts and form a complete methodology, a basic solution framework for the automated generation of a configured framework was introduced in section 11.2. The suggested solution framework is based on the idea of black-box simulations and employs parallel simulations. Based on this solution framework, two concrete workflows were described: a naive workflow that generates the required measurements for the configuration decision on demand, and an optimized workflow that uses space-filling experiments and meta-models (cf. section 11.4.1) for the sampling of whole application domains and the reduction of required measurements. The naive workflow is discussed in section 11.3, while the optimized workflow is described in section 11.4. With the discussion of the automatic configuration workflow, this part provides the reference architecture for a complete methodology for the QoS-aware configuration of distributed publish-subscribe systems. In the following part, this reference architecture is prototypically implemented. This prototype serves as a proof of concept and will be evaluated in part V.

1 The time consumption is quantified during the evaluation in section 17.8.


Part IV M2etis: Prototypic Implementation


12 | M2etis: Architecture

“Controlling complexity is the essence of computer programming.”
Brian Kernighan

This part describes the proof of concept that implements the previously described methodology and design for the QoS-aware configuration of publish-subscribe systems. It is structured along the different components of the resulting prototype, called Massive Multiuser Event Integration System (M²etis)1. First, the overall architecture of M²etis is introduced and each component is briefly described. Figure 12.1 shows the corresponding overview. The main components of the prototype are the configurator, the simulator, the library itself, the application description, and the strategy repository. Each of these components consists of different artifacts, modules, or layers and implements parts of the models introduced in part III. The configurator component consists of the M²etis Quality-of-Service-aware Semantics Modeling Language (MATINEE) and the M²etis Adaptive System Configurator (MAESTRO), as well as a number of parametrized meta-models. MATINEE implements the model for QoS-aware configuration suggested in chapter 10. This language defines the syntax and artifacts required for an application description and for the specification of strategies' capabilities in the strategy descriptions. MAESTRO is the component that controls the whole configuration workflow, as introduced in chapter 11. Its responsibilities start with parsing MATINEE artifacts in order to interpret the application description and finish with triggering the compilation of the M²etis library. During the process it employs all other components shown in figure 12.1: the simulator, the strategy repository,

1 The source code of M²etis is available under the Apache 2 license at https://code.google.com/p/m2etis/.



Figure 12.1: Architectural overview of M²etis

and of course the library itself. The result is a custom M²etis instance that may be used by an application. The library itself is structured into three layers, reflecting the basic reference model defined in chapter 8. Additionally, a number of strategies are implemented, which are part of the strategy repository. Each strategy in the repository has a corresponding strategy description. The configuration of a certain M²etis instance is described in a configuration artifact, which contains the configured strategy combinations for each channel. The simulator component uses compiled M²etis instances for simulation and consists of two parts: the application simulator and the network simulator. Both are configured by an artifact that defines the system attributes for the simulation. The network simulator provides a simulated network for the M²etis instance, while the application simulator feeds M²etis with events that approximate the workload of the target application. In the following sections, each of these components is discussed in more detail in order to give the interested reader insight into the inner workings of the prototype.


13 | M2etis: Library

The M²etis library was implemented with the help of several student theses. The core of the library was first suggested by Held in [Hel10], but has been refined and extended since. A number of follow-up theses contributed to the currently available strategies: Pehlivanov implemented the available routing strategies in [Peh12], Vallery the concept for filter strategies in [Val13], and Baer extended the library with several order strategies in [Bae13]. The library is implemented in C++ using the C++11 standard1 and employs some of the newly introduced language features, such as lambda expressions. Moreover, one of the core concepts used is policy-based design [Ale01], which allows, informally speaking, the implementation of a design-time strategy pattern that results in minimal overhead at run time. This pattern is used in a variety of modern C++ libraries to provide extensible and flexible architectures with a range of configurable features that only marginally hamper run-time performance.
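The essence of policy-based design can be shown in a few lines; the policy classes below are purely illustrative and not taken from the M²etis code base.

```cpp
#include <string>

// Minimal policy-based design: the routing behavior is chosen at compile
// time via a template parameter, so no virtual dispatch happens at run
// time. Both policies expose the same static interface.
struct DirectRouting {
    static std::string route(const std::string& msg) { return "direct:" + msg; }
};
struct TreeRouting {
    static std::string route(const std::string& msg) { return "tree:" + msg; }
};

template <typename RoutingPolicy>
class Channel {
public:
    std::string publish(const std::string& msg) {
        return RoutingPolicy::route(msg);  // resolved by the compiler
    }
};
```

Channel<DirectRouting> and Channel<TreeRouting> are distinct types generated at compile time, which is exactly what makes the strategy exchangeable at design time without run-time cost.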


Figure 13.1: Namespaces and interfaces of the M²etis library

The coarse layer architecture of M²etis is depicted in figure 13.1. The layered reference model, introduced in chapter 8, can be found in the m2etis::net and m2etis::pubsub

1 C++11 is standardized in ISO/IEC 14882:2011.



layers. The former abstracts from the physical network and provides a KBR API in the form of the NetworkController interface. The network abstraction further introduces different wrappers in the m2etis::wrapper namespace, each of which implements a different overlay network. This abstraction was introduced to provide a uniform interface for all overlay networks. The m2etis::pubsub namespace realizes the notification service layer, using the KBR API to provide a publish-subscribe API for applications. It implements the design-time configurable framework as described in chapter 9. The m2etis::message namespace encapsulates the different message data types used by the different layers. Finally, m2etis::util contains helper data structures such as a high-precision clock and a queue implementation, as well as other commonly used data structures. Before we detail the two layers of the library, some general design decisions, namely the thread model, the message structure, and the configuration method, are explained.

Thread Model


Figure 13.2: Thread model of the M²etis library

M²etis uses a simple scheduler that coordinates the internal threads. Figure 13.2 illustrates the different threads involved and their responsibilities; the respective processing actions are detailed in section 13.3. In order to allow deployment of the library in multithreaded environments, all events received from the application are buffered in a routing queue and processed by a routing thread until the message is handed over to the operating system for sending. Incoming messages are processed by connection threads (one per connection) and put into a delivery queue. This queue is processed by the delivery thread, which takes over


processing until a message is delivered to the application or forwarded by again handing it over to the operating system.

Message Structure


Figure 13.3: Logical structure of a message

The structure of messages in M²etis, implemented in m2etis::message, depends on the configuration of a channel, because each strategy needs its own header information. Figure 13.3 shows the logical structure of such a message during processing. It consists of a network header required by the overlay network, containing source and destination as well as overlay-related information; a number of strategy headers, depending on the configured strategies; and the payload, which is defined by the event type the message represents. Which strategy headers are part of a certain message is derived from the channel configuration at compile time, essentially instantiating a new message type via template parameters. The result is a fixed header size, known at compile time, which significantly simplifies and speeds up serialization. To reduce the size on the wire, redundant information, e.g. source or destination addresses, is stripped during serialization. This makes the message concept of the prototype both flexible and resource-conserving, without losing the required configurability.

Configuration Method

The configuration of the M²etis library is performed by instantiating channels with the intended strategies as template parameters, as sketched in listing 13.1. The depicted example configures a channel with the SpreadIt routing strategy and the simulation network adapter. Moreover, it makes the channel available by registering it at a factory. It is notable that all this happens at compile time by exploiting the compiler's template



engine for code generation. For further information on the used programming technique, the interested reader is referred to Alexandrescu’s Modern C++ Design [Ale01].

// Channel Configuration:
typedef Channel<ChannelType<
        SpreaditRouting<net::NetworkType<net::OMNET>>,
        NullFilter<SimulationEventType, net::NetworkType<net::OMNET>>,
        NullOrder<net::NetworkType<net::OMNET>>,
        NullDeliver<net::NetworkType<net::OMNET>>,
        NullPersistence,
        NullValidity,
        NullPartition<SimulationEventType>>,
    net::NetworkType<net::OMNET>,
    SimulationEventType> ExampleType;

// Register Channel at the Factory:
template<> struct ChannelT<Example> {
    ChannelT(/* ... */) {
        map.push_back(new ExampleType(Example, /* ... */));
    }
};

Listing 13.1: Excerpt of a channel’s configuration

Complex configurations obviously generate an enormous number of lines of code, which can be error-prone and should be simplified. Normally, a number of macros would be employed to ease writing the template code; however, because in the proposed methodology this code is solely generated by the configurator, no such syntactic sugar has been defined yet. Nevertheless, it is merely a routine piece of work to do so.
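Such syntactic sugar could, for instance, look as follows. The macro and the stand-in strategy templates below are hypothetical; they merely demonstrate how the boilerplate of listing 13.1 could be hidden behind a single line.

```cpp
// Stand-in templates mimicking the shape of the real strategy interfaces;
// everything here is hypothetical and only demonstrates the macro idea.
template <typename Net> struct SpreaditRouting {};
template <typename Event, typename Net> struct NullFilter {};
template <typename Net> struct NullOrder {};
struct Udp {};
struct Position {};

template <typename Routing, typename Filter, typename Order,
          typename Net, typename Event>
struct Channel {
    int strategies() const { return 3; }  // just something observable
};

// Hypothetical convenience macro hiding the repetitive null strategies.
#define SIMPLE_CHANNEL(Name, Routing, Net, Event)      \
    typedef Channel<Routing<Net>,                      \
                    NullFilter<Event, Net>,            \
                    NullOrder<Net>, Net, Event> Name;

SIMPLE_CHANNEL(PositionChannel, SpreaditRouting, Udp, Position)
```

The macro expands to the same typedef a developer would otherwise write by hand, so it adds no run-time cost, only readability.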

13.1 Overlay Network Layer

The overlay network layer unifies the different possible overlay networks behind a common KBR API. Figure 13.4 depicts the involved namespaces and major classes. The m2etis::net namespace contains all required classes for the unified API. The central role is fulfilled by the NetworkController, which provides all relevant KBR methods:




Figure 13.4: Simplified network abstraction layer of the M²etis library

route, deliver, and forward, as discussed in chapter 4. Additionally, a factory is provided for the easy instantiation of networks. The template parameter NetworkType defines the key type used by the network. For example, a simple TCP overlay network would require IPv4 or IPv6 addresses as keys, while a Pastry overlay would require SHA-1 hashes for identification. Currently, the prototype supports IPv4 and SHA-1 keys. The networks themselves reside in the m2etis::wrapper namespace and implement the NetworkInterface. In order to deliver messages upwards, they use the NetworkCallbackInterface, which is implemented by the NetworkController. This describes the basic implementation of the network abstraction in the M²etis library. Currently supported overlay networks are UDP, TCP1, and an adapter for simulated overlay networks, as provided by the OverSim simulation model [BHK07]. Hence, structured overlay networks like Pastry are only available in the simulator, due to the lack of a mature C++ library that implements them.

1 The UDP and TCP implementations are not overlay networks in a stricter sense; they only provide the IP address space for the unified KBR API.
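The shape of such a KBR-style interface can be sketched as follows. The method names mirror the description above (route, deliver, forward), but the Key type, the interfaces, and the loopback network are simplified assumptions for illustration, not the actual M²etis headers.

```cpp
#include <cstdint>
#include <string>

// Illustrative key type; the prototype supports IPv4 and SHA-1 keys.
using Key = std::uint64_t;

struct NetworkCallbackInterface {
    virtual ~NetworkCallbackInterface() {}
    // Called when a message arrives at the node responsible for `key`.
    virtual void deliver(Key key, const std::string& msg) = 0;
    // Called at each intermediate hop; may rewrite the next hop or drop.
    virtual bool forward(Key key, std::string& msg, Key& nextHop) = 0;
};

struct NetworkInterface {
    virtual ~NetworkInterface() {}
    // Route a message towards the node responsible for `key`.
    virtual void route(Key key, const std::string& msg) = 0;
};

// Trivial single-node "overlay": every key is local, so routing delivers
// immediately upwards through the callback.
class LoopbackNetwork : public NetworkInterface {
public:
    explicit LoopbackNetwork(NetworkCallbackInterface& cb) : cb_(cb) {}
    void route(Key key, const std::string& msg) override {
        cb_.deliver(key, msg);
    }
private:
    NetworkCallbackInterface& cb_;
};

// Example callback that records what was delivered.
struct Recorder : NetworkCallbackInterface {
    Key lastKey = 0;
    std::string lastMsg;
    void deliver(Key key, const std::string& msg) override {
        lastKey = key;
        lastMsg = msg;
    }
    bool forward(Key, std::string&, Key&) override { return true; }
};
```

Any overlay wrapper (UDP, TCP, or a simulated network) can sit behind the same pair of interfaces, which is what makes the KBR abstraction uniform.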



13.2 Notification Service Layer

The notification service layer, implemented in the m2etis::pubsub namespace, realizes the design-time configurable publish-subscribe framework as suggested in chapter 9.


Figure 13.5: Simplified notification service layer of the M²etis library

It provides the API for the application. Figure 13.5 depicts the most important classes of the notification service. The PubSubSystem provides the entry point and API of the library and contains the instances of the configured channels. Each channel consists of one or more Tree instances. Additionally, some helper structures are noteworthy: a MessageBuffer holds the messages during their manipulation by each strategy, and timed actions are coordinated by a Scheduler that also manages the different threads. The most important aspects are the different strategy types, each of which is represented by a template parameter. How the classes that implement the different template parameters work together in the prototype is discussed in the following section.
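The general interaction pattern of such a notification-service API can be illustrated with a minimal channel; the names are hypothetical, and the sketch deliberately omits typing, channel configuration, and distribution.

```cpp
#include <functional>
#include <string>
#include <vector>

// Minimal publish-subscribe channel: subscribers register callbacks,
// publishing notifies every current subscriber in registration order.
class SimpleChannel {
public:
    using Callback = std::function<void(const std::string&)>;
    void subscribe(Callback cb) { subscribers_.push_back(std::move(cb)); }
    void publish(const std::string& event) {
        for (const auto& cb : subscribers_) cb(event);
    }
private:
    std::vector<Callback> subscribers_;
};
```

In the real library, the channel additionally carries the configured strategy combination and dispatches over the overlay network instead of calling local callbacks directly.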

13.3 Processing Model

The processing model describes the fixed parts of the behavior of the Channel and Tree classes. It realizes the interaction model suggested in section 9.3 and is based on previous work published in [FHL11]. The strategy types can only customize within the



boundaries of this “behavior skeleton”, which defines where and when something is done, but not what exactly; that is the responsibility of each strategy. Three different processing workflows have to be discussed, as illustrated in figure 13.2: routing, delivery, and forwarding. Depending on the message type (cf. the matrix in table 9.1), each of these workflows behaves differently. In the following, we discuss the different processing workflows for each message type. Each strategy type with its interface call is denoted as an action in the following diagrams.

Publish and Notify

A publish message is triggered by the publication of an event on a channel. Figure 13.6 shows the processing of publish messages and the influence of the different strategies. On the level of the channel, the partition strategy decides which trees are affected by the publication. If more than one tree is affected, the remainder of the process is performed for each tree. The header of the message is initialized by the


Figure 13.6: Routing publish messages

routing and timeliness strategy. Afterwards the routing strategy determines the target nodes for the message. Usually this is the root of the tree if no replication is involved. For each target, a check if the message is still valid is performed and if that is the case the header is completed with target specific information from the order and delivery strategy. Finally the message is handed over to the overlay network for transport. On the receiving node, the delivery process of a publish or notify message is illustrated in figure 13.7. First the routing and timeliness strategy update their internal data structures before the delivery strategy decides the acceptance of the message. This acceptance step eliminates duplicate messages if the strategy is configured accordingly. Accepted messages update the internal data structures of the order strategy. It is possible that the message has to be disseminated further, especially if it is a notify message. The decision is performed by the routing strategy.
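The fixed call sequence for routing a publish message can be sketched as a template method over the strategy policies. All names below are invented stand-ins, not the actual M²etis interfaces; each dummy strategy simply records which hook was invoked.

```cpp
// Behavior skeleton for routing a publish message: the framework fixes the
// call order, the strategies fix the semantics of each step (hypothetical
// names throughout).
#include <string>
#include <vector>

struct Message { std::vector<std::string> trace; };

struct OnePartition {                          // default: a single tree
    std::vector<int> getTrees(const Message &) { return {0}; }
};
struct RootRouting {                           // route everything to the root
    void processHeader(Message &m) { m.trace.push_back("routing:init"); }
    std::vector<int> getTargetNodes(int) { return {1}; }  // node 1 = root
};
struct WallClockTimeliness {
    void processHeader(Message &m) { m.trace.push_back("timeliness:init"); }
    bool isValid(const Message &) { return true; }        // nothing expires here
};
struct SeqOrder {
    void processHeader(Message &m) { m.trace.push_back("order:target"); }
};
struct AckDelivery {
    void processHeader(Message &m) { m.trace.push_back("delivery:target"); }
};

template <typename Partition, typename Routing, typename Timeliness,
          typename Order, typename Delivery>
struct PublishSkeleton {
    Partition partition; Routing routing; Timeliness timeliness;
    Order order; Delivery delivery;
    std::vector<int> sent;                     // targets handed to the overlay

    void publish(Message &m) {
        for (int tree : partition.getTrees(m)) {
            routing.processHeader(m);          // initialize the header ...
            timeliness.processHeader(m);       // ... for this tree
            for (int target : routing.getTargetNodes(tree)) {
                if (!timeliness.isValid(m)) continue;  // drop stale messages
                order.processHeader(m);        // target-specific updates
                delivery.processHeader(m);
                sent.push_back(target);        // hand over to the overlay
            }
        }
    }
};
```

Swapping any of the five template arguments changes what each step does, but never the order of the steps.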


13 M2etis: Library

If further dissemination is necessary, the routing strategy identifies the relevant target nodes, which are then filtered by the filter strategy according to its filter table. For each target, the validity is checked by the timeliness strategy in order to eliminate outdated messages. All valid messages are then updated with target-specific information by the order and delivery strategy before they are handed over to the overlay network as publish or notify messages.

Figure 13.7: Delivery of publish and notify messages

If no further dissemination is required or all targets have been served, the routing strategy determines whether local delivery is necessary. If the local node is subscribed, the validity is checked by the timeliness strategy and the local filter table is queried for a matching filter. If both checks are passed, the order strategy receives the message to ensure order guarantees. If the message is in the correct sequence, it is delivered to the application; if not, it is buffered until the sequence is correct, or the liveliness criterion is applied and it is dropped or delivered out of order. These last decisions on ordering are not depicted in figure 13.7, because they are strategy dependent and therefore not part of the generic workflow.

Notify messages represent notifications about publications. Therefore, they are generated during the processing of the delivery of publish messages, as shown in figure 13.7, and travel down the dissemination tree. Their delivery process is similar to the process for publish messages. All differences in the handling of the two message types are strategy dependent and have no influence on the abstract process.

Subscribe, Unsubscribe and Control

Subscribe messages and their inverse, unsubscribe messages, propagate the interest in certain notifications. Figure 13.8 shows how the proposed framework handles the generation of such messages and how they are disseminated. Subscription operations



Figure 13.8: Routing of subscribe and unsubscribe messages

are performed for a certain channel. Therefore, the partition strategy identifies one or more affected trees. For each tree, the following procedure is conducted. First, the message is initialized for all potential targets; this step is influenced by the routing and filter strategy. Second, the affected target nodes are identified by the routing strategy. Third, for each target, specific initializations are made by the order and delivery strategy. Finally, the subscribe or unsubscribe message is handed over to the overlay network.

On the receiving end, figure 13.9 describes the corresponding process. The process is identical for subscribe, unsubscribe, and control messages; the only difference is the additional distinction between a successful and an unsuccessful subscription. Unsubscribe and control messages always follow the successful path. When a message arrives, the delivery strategy eliminates potential duplicates by an acceptance check. Afterwards, all relevant data structures are updated depending on the message type. The routing strategy then checks whether further dissemination is required. If so, it identifies the target nodes in a next step and filters them; for each target that is found valid by the timeliness strategy, the target-specific headers are updated and the message is handed over to the network.

Finally, the process for forwarding subscribe and unsubscribe messages is depicted in figure 13.10. It consists of update operations for the data structures of the affected strategies.


Figure 13.9: Delivery of subscribe, unsubscribe and control messages



Figure 13.10: Forwarding subscribe and unsubscribe messages

Again, if the message is a subscribe message, a check for a successful subscription is introduced. After the updates have taken place, the routing strategy determines whether the message is forwarded further or terminated. Together, these processes provide the workflow within whose boundaries the strategies unfold their different behaviors.
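As a concrete illustration of strategy-specific behavior inside this skeleton, the in-sequence check that an order strategy performs before local delivery (deliver immediately if the message is next in sequence, otherwise buffer it) could look like the following sketch; the class and member names are hypothetical.

```cpp
// Hypothetical order strategy: messages are delivered to the application
// only in sequence-number order; out-of-order arrivals are buffered.
#include <map>
#include <vector>

struct SequencedOrder {
    long next = 0;                       // next sequence number to deliver
    std::map<long, int> buffer;          // out-of-order messages by seq. no
    std::vector<int> delivered;          // payloads handed to the application

    void receive(long seq, int payload) {
        buffer[seq] = payload;
        // Flush every message that is now in sequence.
        while (buffer.count(next)) {
            delivered.push_back(buffer[next]);
            buffer.erase(next++);
        }
    }
};
```

A liveliness criterion, as mentioned above, would additionally bound how long an entry may stay in the buffer before it is dropped or delivered out of order.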

13.4 Available Strategies

The current state of the prototype includes a set of different strategies that are briefly described in the following. As long as they are merely implementations of published algorithms that fit into the framework without any modifications, the reader is referred to the original papers for a detailed description of the algorithms. For the partition and rendezvous strategy types, no special algorithms besides the default, i.e., one partition and one rendezvous node, were implemented; they are therefore omitted from the following overview.

Routing Strategies

Five routing strategies are currently available. A strategy for a client/server configuration called DirectRouting forms a star topology around the root node. Two different ALM strategies are available: SpreadIt [DBGM02] and Scribe [RKCD01]. SpreadIt is implemented in two different variants. The first one, realizing the original paper, sends publish messages directly to the root node and is called SpreadItRouting. The second one implements hierarchical routing as discussed in section 5.3.2 by routing publish messages upwards through the tree; it is called HierarchicalSpreadIt. Scribe implements a rendezvous-based routing mechanism and relies on Pastry as the underlying overlay substrate; the strategy is called ScribeRouting. The fifth routing strategy is called DirectRouting and produces a topology where each publisher is the root node for its


events and all subscribers are directly connected to it. That means that if all participating nodes publish events, a fully meshed network emerges.

Filter Strategies

Three different filter strategies are currently implemented. A simple brute-force algorithm [MFP06] called BruteForceFilter supports all boolean filter expressions. A filter algorithm based on decision trees, restricted to equality predicates, is called DecisionTreeFilter and follows Aguilera [ASS+ 99]. Finally, a filter for general boolean expressions that is more sophisticated than the brute-force algorithm is implemented; it is called GeneralBooleanExpressionsFilter and follows Bittner [Bit08].

Order Strategies

For ensuring order, three different strategies are available in the proof of concept. The first is called MTPOrder and implements the Multicast Transport Protocol specified in RFC 1301 [FM90]. The second is based on the algorithm of Garcia-Molina and Spauster suggested in [GMS91] and is called GMSOrder. Deterministic Merge, as suggested by [KR05], is the third available algorithm; the corresponding strategy is called DetMergeOrder. For an overview of the capabilities of those algorithms, refer to section 5.5.4.

Timeliness Strategies

To filter outdated messages, currently only one strategy is available: TimeTimeliness. It allows the definition of a wall-clock-based validity interval. All messages older than this interval are dropped.

Delivery Strategies

The delivery of messages can be ensured using two different strategies: the AckDeliver strategy uses acknowledgements for each message, while the NackDeliver strategy uses a negative acknowledgement mechanism as defined in RFC 4077¹. These two strategies implement the most basic algorithms to ensure the delivery of messages. More advanced mechanisms are currently not available, but their integration is merely routine work.
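To illustrate the trade-off behind the BruteForceFilter mentioned above, the following sketch (with invented names and a toy event type) evaluates every registered predicate against each incoming event: arbitrary boolean expressions are supported, at the cost of a linear scan over all subscriptions.

```cpp
// Brute-force content filtering in the spirit of the BruteForceFilter:
// every subscription predicate is tested against every event.
#include <functional>
#include <utility>
#include <vector>

struct Event { int x; int y; };                 // toy event schema
using Predicate = std::function<bool(const Event &)>;

struct BruteForceFilter {
    std::vector<std::pair<int, Predicate>> subs;  // (subscriber id, filter)

    void subscribe(int id, Predicate p) { subs.emplace_back(id, std::move(p)); }

    // Linear scan: cost grows with the number of subscriptions, but any
    // boolean expression representable as a predicate is supported.
    std::vector<int> match(const Event &e) const {
        std::vector<int> ids;
        for (const auto &s : subs)
            if (s.second(e)) ids.push_back(s.first);
        return ids;
    }
};
```

Decision-tree or general-boolean-expression filters trade this generality and simplicity for faster matching on large subscription sets.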

¹ Available at http://tools.ietf.org/html/rfc4077


14 | M2etis: Simulator

The M²etis simulator is based on the discrete-event simulation framework OMNeT++¹ and was implemented by Bonrath in the scope of his thesis [Bon13]. It employs OverSim [BHK07], a simulation framework for overlay networks. OverSim includes a variety of popular overlay networks, like Pastry, Chord, or CAN. Together, OMNeT++ and OverSim provide a discrete-event simulation framework including a model of the network stack up to structured overlay networks. Another helpful feature of discrete-event simulation in general, and of OMNeT++ in particular, is its determinism, which allows for repeatable simulations with identical results².

Figure 14.1: Layers of the simulator architecture (TupleFeeder, M²etis instance, OverSim, OMNeT++)

In figure 14.1 the coarse architecture of the simulator is sketched. The M²etis instance connects directly to the KBR API provided by OverSim. This allows all network attributes to be specified directly in the simulator configuration. An application's workload is generated by the TupleFeeder component. It feeds the M²etis instance with events that are then processed by M²etis.

¹ OMNeT++ is an extensible, modular, component-based C++ simulation library and framework, primarily for building network simulators. (from http://www.omnetpp.org)
² OMNeT++ allows the random seed to be specified if non-deterministic behavior is wanted.


A simulated distributed application consists of n TupleFeeders and M²etis instances, deployed on a network simulated by OverSim. Despite the simple architecture of such a simulator, the integration of a real-world library into a discrete-event simulation has some pitfalls. For example, the thread model, or more precisely the scheduler, of M²etis is not compatible with a discrete-event simulation. Therefore, the M²etis scheduler has been adapted to support discrete time-steps for simulation; otherwise, the simulation results could be skewed depending on the load on the simulation machine. For an exhaustive discussion of all challenges and pitfalls that result from the integration of a real-world application into OMNeT++, the interested reader is referred to Mayer [MG08]. The network, the TupleFeeder, and the overall application can be parametrized according to the system attributes suggested in section 10.2.2. Such parametrized simulations generate all required QoS metrics for the configuration workflow.
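The adaptation mentioned above — replacing wall-clock time with simulator-driven virtual time — can be sketched as follows; the names are hypothetical and the real M²etis scheduler is more involved.

```cpp
// A scheduler driven by virtual time: the simulation advances the clock
// explicitly and then drains all tasks that became due, so results are
// independent of the load on the simulation machine.
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct SimClock {
    std::uint64_t now_us = 0;                    // virtual time in microseconds
    void advance(std::uint64_t us) { now_us += us; }
};

struct Scheduler {
    using Task = std::pair<std::uint64_t, std::function<void()>>;
    struct Later {                               // min-heap on due time
        bool operator()(const Task &a, const Task &b) const { return a.first > b.first; }
    };

    SimClock &clock;
    std::priority_queue<Task, std::vector<Task>, Later> tasks;

    explicit Scheduler(SimClock &c) : clock(c) {}

    void runIn(std::uint64_t delay_us, std::function<void()> f) {
        tasks.push({clock.now_us + delay_us, std::move(f)});
    }

    // Called after every virtual-time advance instead of from a real thread.
    void tick() {
        while (!tasks.empty() && tasks.top().first <= clock.now_us) {
            auto f = tasks.top().second;
            tasks.pop();
            f();
        }
    }
};
```

Outside the simulator, the same scheduler interface can be backed by wall-clock time and real threads, which keeps the library code itself unchanged.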


15 | M2etis: Configurator

With the M²etis library and the simulator at hand, one component is still missing to complete the prototype: the configurator. As shown in the architectural overview (cf. figure 12.1), the configurator consists of two parts: MATINEE, the DSL for the description of event semantics, and MAESTRO, the component that controls the whole workflow. Both parts were prototypically implemented during Wahl's thesis [Wah13]. In the following sections, a coarse overview is given of MATINEE as well as of the basic architecture of MAESTRO. Moreover, the use cases introduced in section 3.3 are used to exemplify a MATINEE specification. For a comprehensive specification of the MATINEE language, refer to [WFL14].

15.1 MATINEE: M2etis QoS-aware Semantics Modeling Language

MATINEE is the realization of the model suggested during the discussion of QoS-aware configuration in chapter 10. The language is a DSL based on YAML Ain't Markup Language (YAML)¹. It provides a simpler and easier-to-read syntax than, e.g., XML. Moreover, it can be easily parsed and transformed into Python data structures, which is an advantage for the implementation of MAESTRO. The basic elements of MATINEE are attribute-value pairs and lists. Both elements can be hierarchically arranged and form the basis of the language. For conciseness we omit a complete syntax definition, which can be found in [WFL14], and rely on examples to briefly explain MATINEE. All artifacts are exemplified and discussed in the following.

¹ http://www.yaml.org


Strategy Description

The artifact for the description of strategies consists of four different sections. The first one is the metadata description. It contains all relevant data that describes the artifact itself. This metadata section must be specified for all artifacts, not only for strategy descriptions.

    --- !Strategy

    Metadata Description:
      Identifier: MTPOrder
      Description: Protocol to guarantee atomic multicasts.
      Author: Andreas M. Wahl
      Date: 08-15-2013
      Version: 1.0

    Classification:
      - Type: Order

    Compatibility:
      Requirements:
        - Delivery
      Exclusions: []

    Configuration:
      Information Class: "m2etis::message::MTPOrderInfo"
      Parameters:
        - "m2etis::net::NetworkType<m2etis::net::OMNET>"
        - 1000000
        - "m2etis::pubsub::order::LateDeliver::DROP"

    ...

Listing 15.1: Exemplary strategy description [Wah13]

Listing 15.1 shows an exemplary strategy description. Besides the metadata section, it contains the classification, which specifies the strategy type the strategy is associated with. If any exclusions or dependencies exist, they are defined in the compatibility section. The configuration section defines all details needed to generate the appropriate code for the configuration of the M²etis library.


Domain Profile

The domain profile is the artifact that describes the multidimensional classification instance for a certain domain. Hence, all dimensions have to be defined with their transformation rules to deduce system attributes for simulation.

    Parameters:
      - Scale:
          Values: [small, medium, large]
          Default: medium
          Transformations:
            - When: True
              Transform:
                - small: 0.1*(Upstream/(Velocity*PacketSize))
                - medium: 0.5*(Upstream/(Velocity*PacketSize))
                - large: 1.0*(Upstream/(Velocity*PacketSize))
      #...

    Strategy Selection:
      - When: True
        Select: Type is "Routing"

    Classes:
      - Global announcement:
          - Scale: large
          - Velocity: low
          - Cardinality: one-to-many

    #...

Listing 15.2: Parts of the definition of the context dimension [Wah13]

Listing 15.2 shows parts of the context dimension's definition as formally suggested in section 10.2.3. The parameters section defines parameters with their codomains. The transformation rules define how possible discrete values are mapped to numerical ones; the example shows the definition for the scale parameter. The strategy selection section defines which strategy type, and under which circumstances, is a candidate for configuration. Finally, the classes section defines the possible classes that may be used for the classification of event-types. The whole domain profile is exemplified in listing 15.3. It contains the parameter definitions section that specifies parameters with the codomain ranges that are typical for


the domain. It is also possible to define deduction rules that allow parameters to be deduced from formulas containing other parameters. The second section in the example contains the dimension definitions. They contain the definition of the multidimensional classification schema, as exemplified in listing 15.2. Finally, the quality of service requirements section contains the default optimization targets in the order of their relevance. It is possible to specify allowed ranges that are considered during the estimation process.

    --- !Application

    # ...

    Parameter Definitions:
      - PublisherSubscribers:
          Type: int
          Minimum: 1
          Maximum: 1024
          Default: 64
          Deduction via:
            - ceil(PI*(VisualRange**2)*PlayerDensity)
      - PlayerDensity:
          Type: float
          Default: 0.5
          Deduction via: []
      #...

    Dimension Definitions:
      #...

    Quality of Service Requirements:
      - Average Latency:
          Minimum: 0
          Maximum: 300
      #...

Listing 15.3: Parts of an application profile [Wah13]
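The deduction rule for PublisherSubscribers in listing 15.3 is a plain arithmetic formula; as a worked example (with purely illustrative input values), it can be evaluated as:

```cpp
// PublisherSubscribers = ceil(PI * VisualRange^2 * PlayerDensity):
// the number of players inside a circular visual range, rounded up.
#include <cmath>

int publisherSubscribers(double visualRange, double playerDensity) {
    const double kPi = 3.14159265358979323846;
    return static_cast<int>(std::ceil(kPi * visualRange * visualRange * playerDensity));
}
```

For instance, a visual range of 10 units with the default player density of 0.5 yields ceil(pi * 100 * 0.5) = 158 expected publishers/subscribers.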

Network Profile

The network profile only contains a list of parameters that characterize the network; they are used for the generation of the simulation configurations. Listing 15.4 shows a simple


example for a DSL network with an asymmetric data rate. The currently allowed parameters reflect the system model as suggested in section 10.2.2.

    --- !Network

    # ...

    Parameters:
      - Hop-to-Hop Latency: 30 ms
      - Downstream: 6 Mbps
      - Upstream: 1 Mbps

    # ...

Listing 15.4: Exemplary network profile [Wah13]

Event-Type Description

The most interesting artifact for each developer is probably the event-type description. Each of these artifacts describes one event-type based on a certain domain profile and network profile.

    --- !EventType

    # ...

    Payload:
      - EntityID:
          Type: int
      - PositionX:
          Type: float
          Minimum: 0.0
          Maximum: 5000.0
          Step: 0.1
      - PositionY:
          Type: float
          Minimum: 0.0
          Maximum: 5000.0
          Step: 0.1
      - Region:
          Type: string
          Size: 8

    Classification:
      - Context: Large-scale movement-like
      - Synchronization: None
      - Validity: Short-lived timer

    Parameters:
      Scalars: []
      Presets: []

    Quality of Service Requirements:
      - Average Latency:
          Minimum: 0
          Maximum: 300
          Correlation: negative
      - Average Loss:
          Minimum: 0.0
          Maximum: 0.5
          Correlation: negative
    # ...

Listing 15.5: Exemplary event-type description for the movement use case [Wah13]

Listing 15.5 shows such an exemplary description for a movement event-type. The payload section defines the schema of the event-type. In addition to the type and the name of each attribute, the codomain as well as the precision can be specified. Currently, this feature is not yet used; however, it can be used to generate content-aware workloads if the capabilities of MAESTRO are extended to consider the content of events. The classification section contains the classification of the event-type according to the multidimensional classification schema as specified in the domain profile. Optionally, some explicit parameter definitions can be added in the parameters section. This can be used to fine-tune the resulting simulation parameters. Finally, the quality of service requirements define the QoS metrics that should be considered for the decision about the optimal configuration.


15.2 MAESTRO: M2etis Adaptive System Configurator


Figure 15.1: Architecture of MAESTRO [Wah13]

The MAESTRO component coordinates the configuration process. It uses the artifacts written in MATINEE to perform simulations and deduce the final configuration for the M²etis library, according to the two workflows suggested in chapter 11. MAESTRO is written in Python and uses MongoDB¹ as its persistence backend. In figure 15.1 the architecture of MAESTRO is depicted. It shows a three-layered design. The interaction layer consists of the command-line interface and the analyzer/visualizer module. It provides all means of user interaction: the user can perform analyses and trigger configurations. The command-line interface provides the necessary commands for an easy usage of the whole methodology. It also enables the integration into automated workflows like continuous integration. The analyzer and visualizer module generates reports of performed measurements as well as interactive visualizations of measurement data.

¹ MongoDB (from “humongous”) is an open-source document database, and the leading NoSQL database. (http://www.mongodb.org)


The core layer contains the core modules of MAESTRO; this is where the configuration logic resides. It consists of the meta-modeler, the configurator, the measurement coordinator, and the code generator. The meta-modeler is responsible for the management of the different meta-models. It uses the implementations provided by the scikit-learn¹ library for the different meta-models. The responsibilities of this module cover the whole lifecycle of the computer experiments: from parametrization over estimation to the persistence of results, all meta-model-related tasks are managed by the meta-modeler. The measurement coordinator is responsible for the production of simulation results. It coordinates a number of computing nodes and manages the whole simulation lifecycle, from the deployment of the simulation instances to the collection of the results. The code generator generates all needed configuration files, from simulator configurations to the final configuration of the M²etis library. Finally, the configurator module itself contains the logic of the whole workflow and controls each step, using all other modules according to their respective responsibilities.

The infrastructure layer provides services required by most of the other modules. The artifact manager is able to parse and persist all MATINEE artifacts, while the persistency manager encapsulates the database used for the storage of parametrized meta-models and measurements. It also provides the basic I/O operations for the artifact manager.

¹ scikit-learn is a machine learning library for Python. (http://scikit-learn.org)


Part V Evaluation


“You can’t control what you can’t measure.” Tom DeMarco, Software Engineer and Author

With a prototype at hand, the first step of the validation of the initial hypotheses is finished: the construction of a proof of concept shows that the reference architecture can be implemented. In chapter 16 we discuss the level of fulfillment of each requirement defined in chapter 7 with respect to the implemented prototype. Moreover, the prototype is classified according to the taxonomy introduced in section 5.7 in order to ease the comparison of its capabilities with existing approaches. Even though this discussion contributes to the validation of the hypotheses, measurements are required to further validate the individual hypotheses. This quantitative evaluation begins in section 17.2 with a basic examination of the resource consumption of M²etis in order to argue about the applicability of the approach in the intended application domains. Hypothesis 5 states a limitation of the precision of the methodology, defined by the description of the application. As the introduced methodology relies on a simulation model, the limitations introduced by the simulation model are quantitatively examined in section 17.3. In section 17.4 the scalability of selected configurations is measured by simulation. These measurements contribute to the validation of a correct implementation of the different strategies, as at least the magnitude of the different metrics can be compared to the original papers. Moreover, they give a notion of the influence of different parameters and strategies on the scalability of M²etis. Hypothesis 4 stated that limiting configurability to design time can reduce the run-time overhead compared to run-time configurable approaches. This hypothesis is quantitatively validated in section 17.6 by a comparison of important performance metrics between M²etis and other existing configurable approaches.
The quality of the decisions made by the automated workflows is quantified in section 17.7, validating hypothesis 6, which states that an application domain can be sampled once and used for multiple decisions without unreasonable error. The error introduced by the employed sparse sampling method is quantified, as is the time required for decision making based on simulations; this time consumption is essentially the limit of the applicability of the automated workflow. Finally, in chapter 18 this part concludes with a discussion of the degree to which this thesis was able to validate the initial hypotheses.


16 | Comparison of Capabilities

The reference architecture and its proof-of-concept implementation provide a variety of capabilities, which are compared to other existing approaches and validated against the requirements. Thus, this chapter is split into two sections. First, we discuss to which degree the different requirements that were introduced for this work in chapter 7 have been fulfilled. Afterwards, the taxonomy for publish-subscribe approaches suggested in part II is applied to the implemented prototype to reason about the capabilities QoS-aware configurable systems can provide.

16.1 Validation of Requirements

In this section the requirements introduced in chapter 7 are briefly compared with the suggested solution. The level of fulfillment is described, and further work is identified that either eliminates some of the initial assumptions or extends the capabilities of the prototype.

Requirements for the Design-time Configurable Framework

We begin with the requirements resulting from Hypotheses 3 and 4 that were addressed by the design of the framework for design-time configurable publish-subscribe systems. Req 3.1 requires a suitable modularization that mirrors the aspects discussed in current literature. A strategy-based modularization was found and described in section 9.2; the corresponding prototypical implementation of this modularization is discussed in chapter 13. The modularization allows for configurations ranging from simple client/server systems to fully distributed applications, as well as all major publish-subscribe paradigms through the usage of different filter strategies. It even allows for different paradigms in one system by the introduction of the channel concept described in section 9.1, which fulfills Req 3.6. However, the current framework does not support any semantic routing or filter


concepts as described in section 5.2.5 and section 5.3.3. Moreover, the important aspect of security was excluded from this work and should be considered for a production-ready framework. For all other identified aspects, different strategies have been implemented (cf. section 13.4). The implemented selection of strategies was chosen based on the discussion in part II: each aspect of publish-subscribe systems discussed there closes with a classification of existing algorithms, and following this classification, one algorithm of each class was implemented as a proof of concept in order to validate hypothesis 3 regarding the integration of all major aspects of publish-subscribe. A corresponding interaction model, as required by Req 3.2, is suggested in section 9.3. Both aspects of the framework are designed for extensibility, as required by Req 3.5, and therefore promise to be suitable for the integration of future approaches. The interaction model employs a hook system for easy extension by new strategy types, and strategy types themselves are merely new template parameters in the current implementation. The description of a composition is template metaprogramming code instantiating the correct type of a channel, which fulfills Req 3.3. Req 3.4 is hard to validate, as interfaces cannot be designed to hold forever, but the currently implemented strategies were chosen to cover a large variety of different algorithms in order to distill interfaces that are as stable as possible. The decision for design-time configuration allows for the generation of custom headers, described by template parameters, during the compilation process, as proposed by hypothesis 4. This addresses Req 4.1, which aims for minimal message overhead. Req 4.2 requires a minimal overhead resulting from abstractions; this forms a tradeoff between extensibility and processing speed.
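The compile-time header generation mentioned above can be illustrated with a small sketch: each configured strategy contributes its header fields as a type, so a configuration that omits a strategy never pays its per-message cost. The types below are invented for illustration, not the actual M²etis header layout.

```cpp
// Compile-time composition of message headers from strategy-specific parts:
// the header type is the product of exactly the configured parts, so unused
// fields never exist on the wire or in memory.
#include <cstdint>
#include <tuple>

struct OrderHeader      { std::uint32_t seq; };         // only with ordering
struct TimelinessHeader { std::uint64_t deadline_us; }; // only with timeliness

template <typename... Parts>
struct MessageHeader {
    std::tuple<Parts...> parts;  // the configured strategy fields, nothing more
};

using OrderedHeader   = MessageHeader<OrderHeader, TimelinessHeader>;
using UnorderedHeader = MessageHeader<TimelinessHeader>;
```

On common platforms the unordered variant is strictly smaller than the ordered one, which is the point of Req 4.1: the message overhead is determined by the chosen configuration, not by the union of all supported features.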
The chosen approach, which uses C++ policy-based design and simple template metaprogramming, exploits the optimization potential of the compiler (Req 4.4) in order to minimize the impact of the extensibility required by Req 4.3. We will quantify the results in chapter 17 as far as currently possible.

Requirements for QoS-aware Configuration

A developer-friendly, application-specific way of describing event semantics was postulated by Hypotheses 1 and 2. The resulting requirements are addressed in chapter 10, which introduces a multidimensional classification scheme that can be transformed into parameters suitable for the final configuration decision. This multidimensional classification forms the meta-model for classification required by Req 1.1. The meta-model allows for domain-specific instantiations as required by Req 2.1. A generic instantiation was introduced in section 10.2.1, as well as an informal discussion of a domain-specific model for MMVEs in section 10.1. This meta-model is used for the design of a declarative DSL (Req


NF.1), formally described in chapter 10 and further detailed in chapter 15. This DSL fulfills Req 1.2 - Req 1.5, providing a full description of all aspects required for QoS-aware configuration. Moreover, it allows the independent description of domain profiles (Req 2.2) and network profiles (Req 2.3) for the reusability required by Req NF.2 and Req NF.4. However, the network model is rather simple, because it does not reflect network topologies with their different aspects as discussed in section 4.3 about network characteristics. The extensibility of the language demanded by Req NF.3 is addressed by the different profiles that allow the specification of arbitrary parameters and of instances of the meta-model that can then be used to classify event types.

Requirements for the Automated Configuration Workflow

The automated workflow for the configuration with its steps is described in chapter 11; its prototypical implementation is sketched in chapter 15. Req 5.1 requires the automated identification of all possible configurations and is fulfilled by the workflow step that identifies candidate configurations. Req 5.3, demanding the automated deduction of simulation parameters, is addressed by the simulation model and the deduction of system attributes during the workflow. However, the realism and expressiveness of the measurements are limited by the simulation model and the precision of the simulator. The prototype uses OMNeT++, a well-known network simulator. Therefore, it is assumed that the simulator itself approximates a network with sufficient precision and does not require further validation. Moreover, all configurations are measured with the same simulator; hence, a comparative analysis is still significant, even if the absolute values are not precise. We quantitatively discuss the simulations in section 17.4. The simulation model itself currently does not cover the whole range of configuration capabilities provided by the framework.
The model does not cover structured records, as it would require some sort of workload model to simulate the value distribution of the different events of an application. The usage of simulations for the acquisition of measurements for the configuration decision is required by Req 5.2 and realized in the workflow by the execution of parallel simulations. Req 5.4 requires a black box approach for the simulator which is satisfied by its design that uses the actual implemented library, as illustrated in chapter 14. The optimal candidate configuration is identified based on the QoS requirements defined in the event type description and the gained measurements during the simulations, as required by Req 5.5. The quality of the decision for the optimized workflow is quantified in section 17.7. However, only a local optimum for each channel is calculated

239

16 Comparison of Capabilities

by the described workflow. The initial assumption that this locally optimizing workflow approximates the global optimum sufficiently, is not further verified. The automated optimization of parameter values that tune the different strategies is also currently not covered by the workflow. Both challenges are part of further work. The final generation of the configuration files for the middleware (Req 5.6) and the compilation of the custom configured library (Req 5.7) are sketched in chapter 15 about the prototypic implementation of the configuration component.
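The per-channel selection step can be sketched as follows; the data structures and the function name are illustrative assumptions, not the actual API of the configuration component:

```cpp
#include <cassert>
#include <limits>
#include <string>
#include <vector>

// Hypothetical sketch of the per-channel selection: among all candidate
// configurations measured by simulation, keep those satisfying the channel's
// QoS requirements and pick the one with the lowest average path latency.
struct CandidateResult {
    std::string config;   // e.g. "Scribe+AckDeliver"
    double avgLatencyMs;  // measured during simulation
    double eventLossPct;  // measured during simulation
};

struct QosRequirements {
    double maxLatencyMs;
    double maxLossPct;
};

// Returns the best configuration, or "" if no candidate fulfills the requirements.
std::string selectLocalOptimum(const std::vector<CandidateResult>& candidates,
                               const QosRequirements& qos) {
    std::string best;
    double bestLatency = std::numeric_limits<double>::max();
    for (const auto& c : candidates) {
        if (c.avgLatencyMs > qos.maxLatencyMs || c.eventLossPct > qos.maxLossPct)
            continue;  // violates the channel's QoS requirements
        if (c.avgLatencyMs < bestLatency) {
            bestLatency = c.avgLatencyMs;
            best = c.config;
        }
    }
    return best;  // local optimum for this channel only
}
```

With such a per-channel decision, a configuration that violates the latency or loss limits is never chosen, but interactions between channels remain unconsidered, which is exactly the local-optimum limitation noted above.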

16.2 Classification of M2etis

In section 5.7, a taxonomy for DEBS has been introduced. A variety of existing approaches was classified according to the different aspects that have been discussed in the context of the state of the art (part II). In the following, this taxonomy is applied to the proposed approach in general and especially to the implemented proof of concept M2etis. Each aspect is discussed separately.

Data Model

M2etis incorporates structured records. As template-based programming and C++ are used for the implementation, event types are templates that provide introspection methods for attribute access by name. With this realization, the advantage of tuple-based models, namely a small footprint on the wire, is combined with the expressiveness of structured records.

Filter

For the realization of filters, the prototype employs a common structure for expressions. Internally, they are represented as expression trees that can be constructed at run-time. For performance reasons, and to exploit the optimization capabilities of compilers even more, expression templates¹ could be an alternative implementation for a filter representation. Due to the configurability, topic- and content-based filter models as well as a combination of both are currently supported by M2etis.
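The combination of a tuple-sized wire footprint with by-name attribute access can be illustrated with a simplified, non-template sketch (the type and the accessor are hypothetical, not the M2etis implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative sketch: a structured-record event type that keeps the wire
// footprint of a plain tuple while offering attribute access by name.
struct PositionEvent {
    std::int32_t x;
    std::int32_t y;

    // Introspection method: resolve an attribute by its name at run-time.
    // A real implementation would generate such accessors from the event
    // type description instead of writing them by hand.
    std::int32_t get(const std::string& name) const {
        if (name == "x") return x;
        if (name == "y") return y;
        return 0;  // unknown attribute; real code would signal an error
    }
};

// Only the raw attribute values contribute to the event's size, so the
// footprint stays as small as a tuple-based model.
static_assert(sizeof(PositionEvent) == 2 * sizeof(std::int32_t),
              "no per-event naming overhead");
```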

¹ Expression templates heavily use templates to compose an expression tree at compile-time. Each instance of such a template is a single type at run-time. For further information refer to [Vel95].
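A run-time expression tree for a conjunctive content-based filter, as described above, might look as follows (all names are illustrative; the prototype's actual filter classes differ):

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Minimal sketch of a run-time expression tree: each node either tests one
// attribute or combines two subtrees with a boolean operator.
template <typename Event>
struct FilterNode {
    virtual ~FilterNode() = default;
    virtual bool match(const Event& e) const = 0;
};

template <typename Event>
struct Predicate : FilterNode<Event> {
    std::function<bool(const Event&)> test;
    explicit Predicate(std::function<bool(const Event&)> t) : test(std::move(t)) {}
    bool match(const Event& e) const override { return test(e); }
};

template <typename Event>
struct And : FilterNode<Event> {
    std::unique_ptr<FilterNode<Event>> left, right;
    And(std::unique_ptr<FilterNode<Event>> l, std::unique_ptr<FilterNode<Event>> r)
        : left(std::move(l)), right(std::move(r)) {}
    bool match(const Event& e) const override {
        return left->match(e) && right->match(e);  // conjunctive filter
    }
};

// Example event and the filter "damage > 10 AND target == 42", built at run-time.
struct DamageEvent { int damage; int target; };

inline bool matchesExample(const DamageEvent& e) {
    And<DamageEvent> filter(
        std::make_unique<Predicate<DamageEvent>>(
            [](const DamageEvent& ev) { return ev.damage > 10; }),
        std::make_unique<Predicate<DamageEvent>>(
            [](const DamageEvent& ev) { return ev.target == 42; }));
    return filter.match(e);
}
```

An expression-template variant would encode the same tree shape in the type system, letting the compiler inline and fold the whole predicate.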


Expressions: The configurable approach allows for flexible filter expressions that depend on the filter strategy's capabilities. The prototype currently supports both conjunctive and general boolean expressions.

Matching algorithm: As a consequence of the different expressions supported, many different matching algorithms may be implemented as filter strategies. Currently, M2etis supports one algorithm for each class of filter approaches (NP-IS, NP-SS, and OP-IS), as discussed in section 5.2.3.

With this variety of filter strategies, the proof of concept strongly suggests that current filter approaches can be integrated into the proposed framework, and it already provides more filter capabilities than the existing systems surveyed. The only configurable system with a comparable range of filter capabilities is GREEN.

Routing

The routing capabilities of M2etis with its current strategy repository provide hierarchical and rendezvous-based routing mechanisms.

Optimizations: The current prototype does not support any routing optimization like covering, merging, or advertisements, but extending the existing routing strategies to support optimizations is merely implementation effort and integrates seamlessly into the interaction model.

Overlay topology: M2etis supports mesh-first, peer-to-peer, as well as tree-first topologies. If a tree-first topology is employed, the overlay network layer is configured just as a TCP or UDP network. For peer-to-peer topologies, currently Pastry is supported in simulations.

Tree: The currently implemented routing strategies construct source-specific trees. In this category, M2etis is not as versatile as, for example, REBECA, which had the implementation of different routing mechanisms as one of its main goals. M2etis only provides enough different routing algorithms to show that different classes of routing mechanisms integrate easily into the framework, as exemplified in section 9.4 while discussing the configuration of the framework.

Quality of Service

QoS in publish-subscribe has mainly been discussed at run-time, but rarely at design-time to fuel configuration decisions. Theoretically, M2etis supports both, but the current state of the implementation focuses on the design-time phase of QoS. No implemented strategy explicitly supports QoS policies that influence, e.g., message flow at run-time. Hence, the following characterization is about the support of design-time QoS.

Latency: Latency as a QoS metric that is measured during the configuration process and serves as an optimization target is supported by the approach and the current prototype. Adaptive routing algorithms that optimize routes according to latency requirements can be integrated into the proposed approach, but are currently not implemented.

Throughput: Throughput is another QoS metric that is considered during the configuration process, but not at run-time. Analogous to latency, strategies that respect throughput requirements could be implemented, but are not currently available.

Delivery: The prototype supports simple ACK and NACK strategies to guarantee the delivery of messages. These two implementations cover the two basic mechanisms. More sophisticated protocols can easily be added to the strategy repository.

Order: M2etis currently supports three different order strategies that all provide total order with respect to the underlying fault model. These strategies cover multiple- and single-group ordering as well as different synchronization mechanisms. Hence, the framework's design suggests an extensible architecture that allows an easy integration of further order mechanisms.

Timeliness: Timeliness is supported in the form of wall-clock-driven validity of events. Other forms can easily be integrated as validity strategies.

Security: Security is currently not supported by M2etis, but could be integrated into the multidimensional model as well as into the framework as another strategy type.

In the category of QoS, the proposed approach considers a large variety of policies. Compared to approaches that employ QoS policies for configuration, it is the most complete set of QoS support available.
ADAMANT lacks support for timeliness, and FAMOUSO cannot provide delivery and order guarantees. GT covers the whole set, but is a centralized solution.

Reliability

In chapter 7, we assumed a reliable system as the foundation of this work. However, the implemented prototype supports interruptions and omissions, which is comparable to most existing approaches. Only Hermes additionally preserves state across errors. None of the surveyed systems cope with Byzantine errors. Even though M2etis did not put the focus on reliability, it provides the basic robustness common among existing research prototypes.

Adaptability and Architecture

The focus of this work was to provide a configurable system. Therefore, neither reconfiguration nor adaptiveness is supported. The system design itself and the usage of policy-based design are targeted towards configuration. Reconfiguration and adaptivity can only be added as an extension as long as they do not affect multiple strategies. With template metaprogramming, M2etis provides a highly effective configuration method which, besides this work, is only employed by FAMOUSO, a centralized solution. All other configurable approaches employ a component-based architecture that adds to the overhead generated by the adaptability.

System Specification

M2etis provides two possibilities to specify the system behavior: either a composition can be specified, describing the concrete combination of modules, or a QoS-aware method can be used. The former is the method preferred by most configurable approaches. Only GT, FAMOUSO, and ADAMANT provide a QoS-aware configuration method that allows the specification of QoS requirements. All approaches besides ADAMANT require a configuration in the source code. M2etis configures the composition in source, but uses a DSL for QoS-aware configuration.

Decision Support

The decision support of M2etis allows the deduction of valid configurations. The requirements can be specified in an application-dependent way and are translated to application-independent parameters. The deduction process employs simulation-based optimization. Similar application-dependent configurations are only supported by the ADAMANT project, which employs neural networks to deduce the composition of the system. All other approaches merely support the validation of configurations.
Only GT employs a basic rule-based deduction mechanism to map QoS requirements to the final composition.
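The template-metaprogramming configuration method referred to in this section follows the policy-based design idiom; a minimal sketch with hypothetical strategy names, not the M2etis API:

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Policy-based design sketch: a channel is composed at compile-time from
// strategy types passed as template parameters. All names are illustrative.
struct DirectRouting { static std::string name() { return "DirectRouting"; } };
struct ScribeRouting { static std::string name() { return "ScribeRouting"; } };
struct NoDelivery    { static std::string name() { return "NoDelivery"; } };
struct AckDelivery   { static std::string name() { return "AckDelivery"; } };

template <typename RoutingPolicy, typename DeliveryPolicy>
class Channel : private RoutingPolicy, private DeliveryPolicy {
public:
    // The composed behavior is fixed at compile-time; the compiler can inline
    // across policy boundaries, avoiding the indirection overhead of
    // component-based (run-time pluggable) architectures.
    static std::string description() {
        return RoutingPolicy::name() + "+" + DeliveryPolicy::name();
    }
};

// Every distinct combination is a distinct type, generated only if used.
static_assert(!std::is_same<Channel<DirectRouting, NoDelivery>,
                            Channel<ScribeRouting, AckDelivery>>::value,
              "each configuration is its own type");
```

Because every strategy combination is resolved at compile-time, unused combinations never appear in the binary, while each used combination is compiled as its own type.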

Discussion

Besides the ADAMANT project, no existing approach employs QoS-aware, application-specific configuration in combination with a deduction process that goes beyond simple rules. In contrast to this work, the ADAMANT project targets mobile platforms and employs neural networks for its configuration and adaptation process. A framework that uses a strategy-based design implemented by template metaprogramming can, to the author's knowledge, only be found in the FAMOUSO project, which was first published after this work and is designed for embedded systems. To conclude, M2etis enhances the available capabilities regarding QoS-aware configuration and shows that the usage of strategy-based design for configurability is feasible.

17 | Quantitative Evaluation

The comparison of capabilities in the previous chapter gives an insight into the features of the proposed approach as well as its current limitations. Some of the points discussed are underpinned by a quantitative evaluation in the following. The simulations for the quantitative evaluation were performed in large part within the scope of Wahl's thesis [Wah13]. Hence, some measurements and explanations have been taken from the aforementioned thesis. According to Hanemann et al. [HLMS06], four aspects are relevant to characterize the performance of distributed systems: availability, loss and error, delay, and throughput. We will primarily discuss delay and throughput, as these two metrics are the most relevant for large-scale, highly responsive systems such as MMVEs. However, where necessary, availability and error metrics are also considered. First, the evaluation setup used for all measurements is introduced. Afterwards, the M2etis simulator is employed to perform performance measurements for selected configurations of M2etis, derived from the use cases introduced in section 3.3. Additionally, a small real-world measurement is taken to classify the prototype among existing messaging solutions. The section's goal is to give the reader a brief insight into the scalability of the proposed approach. Second, the quality of the automated workflow is quantified. Both the time required and the error introduced by the regression methods are considered.

17.1 Evaluation Setup

All measurements for the quantitative evaluation have been performed on a small computing cluster of at most 24 nodes. Each node consists of a quad-core CPU and eight gigabytes of RAM. The CPUs were a mixture of Intel Core 2 Quad Q9450 (2.66 GHz per core), Intel Xeon E5335 (2.00 GHz per core), and Intel Xeon X5450 (3.00 GHz per core). The operating system on all processing nodes was OpenSUSE 12.2 (64 bit) and the deployed kernel version was 3.4.47-2.38. During the simulations, four instances of the simulator ran on each processing node. They were coordinated by a central node via SSH, and the nodes were used exclusively for the measurements. The deployed simulator was the prototype, as described in chapter 14. Real-world performance evaluations were performed in the same testbed.

17.2 Resource Consumption of M2etis

We begin the quantitative evaluation of the prototype with a basic analysis of the resource consumption of M2etis. For approaches that employ heavily templated C++ code, binary sizes may get out of hand very fast, because each instantiation of a template poses another type for the language and is initially treated independently by the compiler. Therefore, in the following, the size required for binaries is examined under different circumstances. First, the size of compiled M2etis binaries¹ with one configured channel is addressed. Figure 17.1 depicts the size for the five currently available routing strategies (DirectRouting, DirectBroadcastRouting, HierarchicalSpreadItRouting, ScribeRouting, and SpreadItRouting). No additional strategies were configured.

[Bar chart: library size in MByte, roughly between 5.1 and 5.24 MByte, for the five routing strategies]

Figure 17.1: Size of the library with different routing strategies

This size can be seen as a baseline, because the measured configurations are minimal ones that do not offer any QoS guarantees. The rather large size of at least five MByte results from the extensive usage of templated C++ code (own and Boost library code) that is in part instantiated multiple times and therefore has a huge impact on the size of the library.

¹ The library was compiled with the gcc 4.7 compiler and the -O3 optimization option.

[Bar chart: additional size in kByte for the individual strategies (BruteForceFilter, DecisionTreeFilter, GeneralBooleanExpressionsFilter, DetMergeOrder, GMSOrder, MTPOrder, AckDeliver, NackDeliver, TimeValidity), ranging from roughly 8.8 kByte to roughly 300 kByte]

Figure 17.2: Size of additional strategies

The additional size required for the different strategies that are currently available is shown in figure 17.2. The sizes were calculated as follows: First, a library was configured with a DirectRouting strategy and the respective strategy to measure. Second, the size of a pure DirectRouting configuration was subtracted from the measured combined size. However, the actual size required for each strategy may vary, because, depending on the configuration, the compiler may be able to optimize the generated code more or less, which may result in a deviation from the depicted values. Nevertheless, the numbers give a notion of the magnitude of the size that strategies add to the binary.

The influence of multiple channels on the size of the library is depicted in figure 17.3. The size increases linearly with the number of channels configured. The corresponding measurements were taken with a configuration consisting of the DirectRouting and AckDeliver strategies. The size of 70 MByte also marks the absolute maximum of the binary that is currently possible, as it would contain all permutations of the strategies currently implemented, whether they are reasonable or not. However, for the hardware that is typically required for MMVEs, the discussed size requirements are no problem at all. On resource-constrained embedded systems, the usage of M2etis in its current implementation requires some size optimization. This even raises the question whether an adaptation of the current implementation to meet the strict memory constraints of mobile devices is possible at all. The answer to this question requires some further work, and a possible roadmap is suggested in section 19.3.

[Line chart: library size in MByte (0 to 80) over the number of channels (1 to 100), showing linear growth]

Figure 17.3: Influence of the number of channels on the library's size
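Assuming the measurements of figures 17.1 and 17.3, the linear growth can be captured by a simple back-of-the-envelope model; the constants are read off the figures and are rough approximations, not exact measurements:

```cpp
#include <cassert>

// Back-of-the-envelope model of the measured linear growth:
// size(n) = base + n * perChannel. With a base of roughly 5 MByte and
// roughly 70 MByte at 100 channels, each additional channel contributes
// about 0.65 MByte of instantiated template code.
inline double librarySizeMByte(int channels) {
    const double baseMByte = 5.0;        // minimal configuration (figure 17.1)
    const double perChannelMByte = 0.65; // slope derived from figure 17.3
    return baseMByte + channels * perChannelMByte;
}
```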

Besides the size of the binary, run-time resource consumption is also an interesting topic to evaluate. To give a basic notion of the resources the current prototype requires for operation, a small scenario has been measured. The setup consists of one to three nodes. Each node acts as a publisher as well as a subscriber and publishes with a frequency of 2000 Hz. This high event frequency was chosen to examine the system under high load.

Metric                          1 Node    2 Nodes   3 Nodes
Outgoing messages (msg/sec)      1,730     1,740     1,740
Incoming messages (msg/sec)      1,730     3,480     4,150
Memory consumption (KByte)       5,316     5,648     5,916
CPU utilization (%)                  8        27        30
Average path latency (µsec)        150       300         -

Figure 17.4: Resource consumption of M2etis

Figure 17.4 depicts the corresponding results. The following metrics were measured: outgoing messages, incoming messages, memory consumption, CPU utilization, and the average path latency. Outgoing and incoming messages describe the number of messages that were actually sent and received by the application. Despite the target of 2000 messages per second, only 1740 messages were sent because of scheduler effects. On the receiving end, it is noticeable that for one and two nodes the incoming messages correspond exactly to the outgoing ones, so no messages were lost. However, the measurement with three nodes shows a huge portion of lost messages. This fact is also reflected by the missing latency measurement: the latency in this setup exploded and exceeded any usable range. The reason for the breakdown of those two metrics is the saturated data rate of the link between the nodes. The memory and CPU utilization increase moderately with the number of messages that have to be processed. In the two-node scenario, the rendezvous node has to process about 6000 messages per second at a promising CPU utilization of 27%. In conclusion, the measurements look promising for a prototype that has not yet undergone any profiling for performance optimization.

17.3 Limitations of the Simulation Model

In section 11.2, we discussed that current high-performance messaging middleware is not bound by CPU or RAM resources, but only by network bandwidth for reasonable event frequencies. That means the overhead in terms of latency and throughput is negligible. For example, ZeroMQ¹ is a middleware that proves that it is possible to build systems with minimal constant processing overhead (cf. [DESS11]). This assumption leads to the proposed simulation model that does not consider a node's resources besides the available data rate. In this section, the limitations of this assumption are discussed in order to quantify the precision of the current simulation model. For this discussion, samples from simulations are compared to corresponding real-world measurements. The compared configuration consists only of a direct routing strategy, and the setup was one publishing node with a variable number of subscribers. To reflect a LAN scenario, the link latency for the simulations was 180 µsec. Figure 17.5 shows the results for two values of the message frequency and two different numbers of subscribers. The measured metric is the average path latency.

¹ http://zeromq.org

[Bar chart: average path latency in µsec for M2etis measurements vs. simulations, with 3 and 12 subscribers at 10 Hz and 25 Hz each (values between roughly 340 and 1046 µsec)]

Figure 17.5: Simulations vs. real-world measurements

The simulations as well as the measurements of M2etis are invariant to the event frequency, as long as it does not exceed the bandwidth (cf. figure 17.4). However, a certain overhead between M2etis and the simulation is noticeable. This overhead can be explained by the processing time M2etis requires to distribute the messages. Moreover, the overhead increases with a growing number of subscribers. As a consequence, the precision of the simulation model seems to decrease with a growing number of subscribers: it depends on the number of copies of a message a node has to distribute. Nevertheless, the overhead is about one millisecond for twelve subscribers and one publisher with one link between them in a Gbit LAN testbed. In a WAN scenario, where link latencies are at least a factor of ten larger, this small processing overhead causes less imprecision of the simulation. The imprecision is reduced even further the larger the scale of the simulation, because more and more nodes are involved in the distribution and the load on a single node decreases. However, in order to obtain a simulation model whose error is invariant to the simulated scenario, the resource consumption of nodes should be incorporated. This topic is discussed as part of further work in section 19.2.3.
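The argument about the decreasing relative error in WAN scenarios can be made concrete with a small calculation; the function is an illustrative sketch, not part of the simulator:

```cpp
#include <cassert>

// Rough model of the simulation model's relative error: the un-modeled
// per-node processing overhead (about 1 ms for twelve subscribers in the
// LAN testbed) is set in relation to the total path latency of the scenario.
inline double relativeErrorPct(double processingOverheadMs, double linkLatencyMs) {
    return 100.0 * processingOverheadMs / (linkLatencyMs + processingOverheadMs);
}
```

With the roughly one millisecond of un-modeled processing overhead, the relative error drops from over 80% at a 180 µsec LAN link to a few percent at the 30 ms WAN delay assumed in table 17.1.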

17.4 Simulated Scalability of Selected Configurations

In the following section, selected configurations that approximate the use cases introduced in section 3.3 are simulated. We distinguish two simulated scenarios: a one-to-many scenario that reflects the match coordination use case, and a many-to-many scenario that approximates the use cases movement, target action, and chat. Generally speaking, these two cases define two different load characteristics. In the first case, one node distributes events to a large number of subscribers, a perfect scenario for multicast trees. The second scenario copes with nodes that are all publishers and subscribers at the same time, which puts stress on the root node of a multicast tree, resulting in reduced scalability. Three metrics are discussed for these scenarios: average path latency, average event loss, and number of received events. These metrics cover the most important aspects of high-performance messaging middleware systems.

System Attribute Values

Before we discuss each result, the value ranges of the different system attributes that were sampled during the simulations are specified in the following, and apply unless stated otherwise. With respect to the limited precision of the simulations and the MMVE use case, a WAN network is assumed. Table 17.1 shows the exact values used during the simulations. Besides the value ranges for the number of nodes and the event frequency, the fixed values for network-specific attributes and simulation parameters are shown. The resulting system attribute space consists of 4080 measurements. The simulation parameters define the overall wall-clock time simulated, the initialization periods until the first messages are sent, and the limits for the latency and event loss metrics. If those limits are exceeded, the simulation is canceled in order to reduce the required simulation time. The limits are set to values that reflect the boundaries for fluid gameplay in a typical MMVE.

System attribute                         Value range
Number of nodes                          2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30
Event frequency (Hz)                     1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30
Payload (Byte)                           8, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256
Delay (ms)                               30
Jitter (%)                               10
Upstream rate (Mbit/s)                   10
Downstream rate (Mbit/s)                 10
Header size (Byte)                       48
Drop chance (%)                          0
Queue size (MiB)                         10
Simulated time (s)                       60
Waiting time until first subscribe (s)   5
Waiting time until first publish (s)     2
Limit for max. latency (ms)              301
Limit for event loss (%)                 50.1

Table 17.1: Values for each system attribute used in simulations, following [Wah13]
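The size of the sampled attribute space can be verified by enumerating the varied dimensions of table 17.1; only nodes, event frequency, and payload are varied, all other attributes are fixed:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Enumerate the Cartesian product of the three varied system attributes
// from table 17.1 and count the resulting measurement points.
inline std::size_t measurementPoints() {
    std::vector<int> nodes, frequencies, payloads;
    for (int n = 2; n <= 30; n += 2) nodes.push_back(n);        // 15 values
    frequencies.push_back(1);                                   // 1 Hz
    for (int f = 2; f <= 30; f += 2) frequencies.push_back(f);  // 16 values total
    payloads.push_back(8);                                      // 8 Byte
    for (int p = 16; p <= 256; p += 16) payloads.push_back(p);  // 17 values total
    return nodes.size() * frequencies.size() * payloads.size();
}
```

The product 15 x 16 x 17 reproduces the 4080 measurements stated in the text.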


17.4.1 One-to-many distribution

[Six 3D surface plots of average path latency [s] and average event loss [%] over event frequency [Hz] and number of nodes:
(a) Central server - avg. path latency; (b) Central server - avg. event loss; (c) Scribe - avg. path latency; (d) Scribe - avg. event loss; (e) SpreadIt - avg. path latency; (f) SpreadIt - avg. event loss]

Figure 17.6: Performance for one-to-many distribution


The simulated one-to-many scenario reflects use cases where one publisher distributes events to an increasing number of subscribers, like the match coordination discussed in section 3.3. The plots in figure 17.6 illustrate the behavior of three different routing strategies in terms of average path latency and average event loss for a payload of 256 Byte and a symmetric data rate of 2 Mbit/s. The central server strategy saturates the link of the root node very fast, as all subscribers are linked directly to the root node. This behavior can be noticed by the fast-increasing average path latency in figure 17.6(a) and the corresponding jump of the event loss in figure 17.6(b). Scribe, which uses a DHT-based overlay, shows a stair-like latency increase depending on the number of nodes, shown in figure 17.6(c). This behavior reflects the number of hops required for routing the subscription messages in the DHT, which determines the height of the tree. The tree-like structure of Scribe prevents the saturation of the root's link, as illustrated by figure 17.6(d). SpreadIt, also a tree-based multicast algorithm, shows the expected logarithmic behavior (cf. figure 17.6(e)). However, the latency increases relatively fast because the number of allowed children for each node was limited to two, which is too small for the used data rate. The corresponding throughput metric, depicted in figure 17.7, shows the cumulative number of events received during the simulation, averaged over the participating nodes. The saturation of the data rate can be observed in figure 17.7(a). Apart from data-rate saturation, the received events show linear growth with the event frequency, which is exactly the expected behavior. The logarithmic increase of the metric along the number of nodes is caused by the one node that only publishes events and its impact on the average.
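The latency shapes discussed above follow from the depth of the respective dissemination structures. A sketch under idealized assumptions (single-hop central server, the binary SpreadIt tree as configured here, and Pastry's default parameter b = 4 for Scribe); these are textbook formulas, not measurements:

```cpp
#include <cassert>
#include <cmath>

// A central server delivers every event in one hop.
inline int centralServerHops(int /*subscribers*/) { return 1; }

// A SpreadIt tree limited to two children per node grows logarithmically
// in the number of subscribers: height = ceil(log2(subscribers + 1)).
inline int spreaditHeight(int subscribers) {
    return static_cast<int>(std::ceil(std::log2(subscribers + 1.0)));
}

// Pastry routes in O(log_{2^b} N) hops; with b = 4 the base is 16, which
// produces the stair-like steps observed for Scribe.
inline int scribeExpectedHops(int nodes) {
    return static_cast<int>(
        std::ceil(std::log(static_cast<double>(nodes)) / std::log(16.0)));
}
```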
To conclude, the performance of the examined algorithms is exactly as expected according to their respective original papers.

17.4.2 Many-to-many distribution

The many-to-many distribution scenario mirrors use cases where all participating nodes act as publishers as well as subscribers. This is the case for the movement, target action, and chat use cases introduced in section 3.3. Obviously, the overall number of processed messages increases, because every publisher sends with the respective event frequency. Therefore, we take a look at measurements performed with a 10 Mbit/s data rate and 256 Byte payload. Figure 17.8 shows the average path latency and the average event loss for three different routing strategies.


[Three 3D surface plots of the number of received events over event frequency [Hz] and number of nodes:
(a) Central server - number of received events; (b) SpreadIt - number of received events; (c) Scribe - number of received events]

Figure 17.7: Throughput for one-to-many distribution

The central server strategy runs out of data rate relatively fast, depending on the event frequency. As long as the data rate of the root's link is sufficient, the latency is relatively stable at the round-trip time of a message, which is exactly the expected behavior, shown in figure 17.8(a): an event is sent to the server and gets distributed. Beyond the saturation point, the behavior is unpredictable and the event loss increases drastically, as illustrated in figure 17.8(b). The direct broadcast strategy circumvents the bottleneck of a central server, as each publisher distributes its events directly to all subscribers. The resulting average path latencies are much more promising and lie around the latency of one hop, as shown in figure 17.8(c). The small fluctuations in the graph are due to queue effects of the prototype. Figure 17.8(d) illustrates that the same 10 Mbit/s are totally sufficient for this routing strategy.
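The saturation behavior can be explained by a rough per-node message budget; the counts below are idealized and ignore protocol overhead:

```cpp
#include <cassert>

// With n nodes each publishing at frequency f, a central server root
// receives n*f events and forwards each one to the other n - 1 nodes,
// so its load grows quadratically with n.
inline int centralServerRootLoad(int nodes, int frequencyHz) {
    return nodes * frequencyHz * (1 + (nodes - 1));
}

// Under direct broadcast, every node sends f events to n - 1 peers and
// receives (n - 1)*f events, spreading the load evenly over all nodes.
inline int directBroadcastPerNodeLoad(int nodes, int frequencyHz) {
    return frequencyHz * (nodes - 1) * 2;
}
```

The quadratic root load explains why the central server saturates first, while direct broadcast keeps each node's load linear in the number of peers.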


[Six 3D surface plots of average path latency [s] and average event loss [%] over event frequency [Hz] and number of nodes:
(a) Central server - avg. path latency; (b) Central server - avg. event loss; (c) Direct broadcast - avg. path latency; (d) Direct broadcast - avg. event loss; (e) SpreadIt - avg. path latency; (f) SpreadIt - avg. event loss]

Figure 17.8: Performance for many-to-many distribution [Wah13]


In contrast, SpreadIt shows a higher average path latency in figure 17.8(e), but also no event loss (cf. figure 17.8(f)). The relatively high latency can be explained by a suboptimal parameter configuration of SpreadIt: again, the strategy allows only two children per tree node, which pushes the height of the multicast tree beyond what is necessary.

[Three 3D surface plots of the number of received events over event frequency [Hz] and number of nodes:
(a) Central server - number of received events; (b) SpreadIt - number of received events; (c) Direct broadcast - number of received events]

Figure 17.9: Throughput for many-to-many distribution

The corresponding throughput metrics for the many-to-many scenario are shown in figure 17.9. They show the expected growth: the number of received events grows linearly with the event frequency and the number of nodes, because each node also acts as a publisher. The data-rate saturation can again be observed in figure 17.9(a). Both measured scenarios, one-to-many and many-to-many, have shown that the behavior of the different routing algorithms largely depends on the chosen system attributes. One can reason about the influence of each parameter and try to build analytical models for each dimension. However, the number of dimensions is high and their interdependencies are not always obvious, which makes a strong case for non-parametric models. Hence, these simulations not only validate the correct implementation of the different routing strategies, but also underpin the assumption that non-parametric estimation is suitable to build models of the framework's behavior.

17.4.3 Impact of Strategies on QoS Metrics

It can be expected that strategy types influence the results of different QoS metrics. The extent of the influence may depend on the chosen scenario and system attributes. In the following, the influence of the order strategy type is illustrated to give a notion of the possible impact of strategies.

[Figure: plots over the number of nodes (up to 80) for SpreadIt with No Order, DetMerge Order, GMS Order, and MTP Order. Panels: (a) latency impact (average path latency [s]); (b) event loss impact (average event loss [%]); (c) throughput impact (number of received events).]

Figure 17.10: Impact of order strategies on SpreadIt routing (one-to-many)

The impact of different order strategies for the routing strategy SpreadIt, an ALM routing algorithm, is exemplified in figure 17.10. The figure shows the impact on three important metrics: average path latency in figure 17.10(a), average event loss


in figure 17.10(b), and the number of received events in figure 17.10(c). The shape of the QoS metric is always defined by the routing strategy, at least for all currently identified strategies; in this case a logarithmic behavior, as is typical for efficient routing algorithms. The different order strategies, however, shift the graph, as they introduce a certain overhead. Generally speaking, observations show that other strategy types behave the same way: they also only shift or scale the shape of the graph. In this case, MTP Order requires a noticeable overhead, indicating that the fixed sequencer is not the root node of the tree, resulting in an overhead of one round-trip time to obtain a token. Moreover, MTP Order also saturates the network at a scale of 60 nodes, as indicated by the loss and throughput metrics. GMS Order, which also employs a fixed sequencer, chose a more suitable node as sequencer, resulting in nearly no overhead. The overhead of DetMerge Order, which employs a communication history mechanism, depends largely on the chosen parameters of the algorithm. In this case, the tolerable latency of the algorithm was 70 ms, which is obviously too short for the scale of the scenario, i.e. the tree constructed by SpreadIt gets too large to allow delivery of an event within 70 ms. Therefore, in this case the order guarantee would be violated by the DetMerge algorithm. As a result, a correct decision for the best performing strategy would additionally require the number of out-of-order messages. This example shows that the decision problem requires a variety of QoS metrics to find an optimal configuration. On the other hand, it validates the fundamental assumption that different QoS requirements have an impact on the performance of the system and that this impact is largely dependent on the scenario.
However, it also shows that choosing the right parameters for the configuration of a single strategy can have an impact on the decision and should be further explored (cf. section 19.2.2).

17.5 Required Effort for Configuration

The required effort to configure a system varies significantly among different approaches. In this section, the effort required to configure M2etis using the MATINEE language is compared to similar approaches that are able to specify QoS requirements as configuration. Referring to the taxonomy in section 5.7, three currently existing approaches support QoS-aware configuration: ADAMANT [HMS09], FAMOUSO [SF11], and GT [AGG09]. In the context of the ADAMANT project, a DSL for QoS-aware configuration, called DQML, was suggested in [HSG07, HSG08]. It provides a model-driven approach to specify


QoS policies for the different entities defined in the DDS standard. The FAMOUSO prototype allows specifying the QoS policies as a C++ template parameter, which is then compiled into the configured system. The GT project allows configuring some QoS policies in code, but provides no clear developer-friendly model. However, all three approaches require technical expertise for configuration, because all QoS policies require numerical limits for the different metrics. The MATINEE language, in contrast, abstracts the numerical specification in a domain profile, and the final configuration is only a classification of the different event-types. This separation of concerns between the domain expert and the application developer cannot be found in any existing approach.

[Figure: bar chart of the configuration effort in lines of code for FAMOUSO, DQML, m2etis, and MATINEE.]

Figure 17.11: Lines of code required for configuration

In figure 17.11, the lines of code (LoC) required for DQML, FAMOUSO, and MATINEE are depicted. GT was excluded, because the lines of code necessary for configuration cannot be separated from the overall initialization code. The values for FAMOUSO and ADAMANT are only approximate. The value for ADAMANT is based on the publications on DQML [HSG07, HSG08], because no source code is available. The values for FAMOUSO are derived from an analysis of the available source code. The scenario examined is a simple client-server application with 25 relevant parameters. FAMOUSO as well as the M2etis library require the most lines of code, because the overhead of the C++ template code is a significant disadvantage compared to DSLs, as the values for MATINEE and DQML suggest. A manual configuration of M2etis for one channel requires about 83 LoC, the most of all compared approaches. But as the configuration of the M2etis library is usually embedded in a workflow that


generates the necessary code based on a MATINEE configuration, the large number of required LoC is usually not a problem and was not a target for optimization. DQML offers a GUI to model the QoS requirements, which makes a comparison in lines of code difficult. However, the GUI translates each parameter assignment into one LoC of a configuration file (cf. [HSG08]). Hence, the lines of code for DQML depend on the number of parameters specified. In the related paper, Hoffert argues that a simple client/server scenario requires about 25 parameters, which is the value adopted for figure 17.11. However, even compared to model-driven approaches, the classification approach of MATINEE has a twofold advantage. On the one hand, the classification of parameters into dimensions allows the domain expert to control the complexity of the model an application developer has to deal with. On the other hand, classes are easier to understand than mere QoS parameters, as long as they are in the terminology of the developer. Thus, the required lines of code for MATINEE depend on the number of dimensions the classification offers. For the model instantiation introduced in section 10.2.1, this simple scenario requires 4 LoC for the classification and 3 LoC for the optimization target. Nevertheless, this easy configuration for the application developer comes at the cost of complexity in the definition of a domain model. But the larger effort for the definition of a domain model amortizes quickly if more than one application is developed. This comparison of the required configuration effort shows promising results regarding the developer-friendliness of MATINEE, as postulated by hypothesis 1.

17.6 Performance Impact of Configurability

In section 5.6, we discussed the different types of adaptability. Three types of systems were identified: configurable, reconfigurable, and adaptive systems, whereby adaptive systems are a special case of reconfigurable systems, as they only automate the reconfiguration process. Hypothesis 4 stated that the limitation to design-time configuration can significantly reduce the run-time overhead. In the following sections we quantify and compare the run-time overhead of the prototype with different existing types of systems. The few existing configurable approaches, as introduced in section 5.7, have only sparsely been evaluated. GT [AGG09], a centralized middleware implemented in C#, does not provide any measurements for evaluation. FAMOUSO [SF11], which, similar to M2etis, uses C++ template metaprogramming for the implementation of the library, also did not perform any measurements to evaluate the approach. Only the GREEN [SBC05] and the ADAMANT [HMS09, HSG10] projects provide measurements that quantify the capabilities of their respective prototypes. Both approaches, GREEN and ADAMANT, are reconfigurable and therefore suitable to study the performance impact of design-time versus run-time configuration. In section 17.6.1, M2etis is compared with the performance of run-time configurable approaches. Section 17.6.2 compares M2etis with different production-ready middleware solutions that do not offer configurability, in order to further quantify the performance impact of a design-time configurable approach.

17.6.1 Design-Time Configuration vs. Run-time Configuration

GREEN [SBC05] employs a component-based architecture that is reconfigurable at runtime. In [SBC05] the possible throughput of GREEN is measured using a simple client/server setup with one sender and one receiver. In this setup, GREEN reaches an average throughput of 313.97 messages per second. M2etis currently processes an average of 8801 messages per second for the exact same setup in our testbed. However, our testbed offers faster Ethernet connectivity (100 Mbit vs. 1 Gbit), and GREEN was measured on an embedded system with a 206 MHz ARM processor and 64 MBytes of RAM, which makes it difficult to compare the results. Unfortunately, the source code of GREEN is not available to repeat the measurements in our testbed. Nevertheless, for a throughput of 2000 messages per second in the above setup, M2etis has a memory footprint of 5.316 KB and a CPU consumption of 7.5% on the Intel Core 2 Quad Q9450 machines. This resource consumption suggests that even with the limited resources used during the GREEN evaluation, M2etis would probably show better throughput characteristics than GREEN.
In [HSG10], Hoffert evaluates the ADAMANT system in a scenario that can be recreated with the available resources. The evaluated scenario used nodes with a 3 GHz 64-bit Xeon processor and 2 GB of RAM in a 1 Gbit switched LAN, which is similar to the hardware in our testbed. The varied parameters were message loss, event frequency, and the number of subscribers. The ADAMANT project modified their network code to produce a 5% message loss. The M2etis prototype was altered accordingly to also drop 5% of messages in the UDPWrapper. For the event frequency, 10 Hz and 25 Hz were measured, each with 3 and 15 subscribers. Each measurement consisted of 20,000 messages with 12 bytes of payload on a reliable channel. ADAMANT used its NACKcast


protocol, which implements a negative acknowledgement mechanism. M2etis used a normal acknowledgement algorithm (AckDeliver strategy). Both timeouts were set to 1 ms. In each measurement all 20,000 messages reached their destination.

[Figure: bar chart of the average path latency (μsec), with standard deviations, for ADAMANT and M2etis at 10 Hz and 25 Hz event frequency, with 3 and 15 subscribers, and with 0% and 5% message loss.]

Figure 17.12: Comparison of M2etis and ADAMANT - average path latency

Figure 17.12 shows the average path latency for the different parameters. For each latency the standard deviation is given, as each measurement was taken at least three times. Unfortunately, no result for ADAMANT was available for the 25 Hz, 15 subscriber case. M2etis was measured twice: once with 0% message loss, and once with 5% message loss. The comparison shows the latency impact generated by a reliable channel. However, the results show a clear performance advantage of M2etis. As both test environments were as similar as possible (limited by the description in the paper), the results can be interpreted as a clear advantage of design-time configuration compared to the run-time adaptable approach ADAMANT employs. Some imprecision could be introduced by the acknowledgement algorithm used. However, the magnitude of the difference suggests a significant processing overhead caused by the library's architecture. Both run-time adaptable systems, GREEN and ADAMANT, show a certain overhead in terms of latency compared to M2etis. This suggests that the limitation to design-time configuration is a suitable method to reduce run-time overhead while maintaining


configurability, as initially stated by hypothesis 4. That leaves the comparison of design-time configuration against non-configurable solutions, in order to quantify the remaining overhead introduced by configurability.

17.6.2 Design-Time Configuration vs. Non-Configurable Solutions

Dworak [DESS11] performed an evaluation of existing production-ready messaging middleware solutions in 2011. The goal was to select a suitable high-performance solution for deployment at CERN. The evaluated middleware solutions were RDA¹, Ice², Thrift³, ZeroMQ⁴, YAMI4⁵, and Qpid⁶. None of them offers any configurability. They were deployed on the CERN internal network, which is switched with 1 Gbit. The environment was recreated in our testbed, as far as its parameters were known, to provide comparable results. Dworak performed two tests, one testing the throughput of the system and one the multicasting capabilities in terms of latency:

Test 1: Measure the processed messages per second with a payload of 4 bytes. The setup is strictly client/server with one publisher and one subscriber.

Test 2: Publish 400 messages of 4 bytes each to up to 10 subscribers and measure the latency.

Figure 17.13 shows the message throughput for the different middleware solutions. The results for all systems besides M2etis were taken from the original paper [DESS11] and extended by the measurements for M2etis. M2etis reached an average of 8801 messages per second for the simple setup in test 1 and therefore ranks right behind Thrift, the messaging solution originally developed at Facebook. This result underpins the validity of hypothesis 4, stating that limiting configurability to design-time minimizes the introduced overhead. The configurability of M2etis does not seem to hamper the throughput of the system compared to non-configurable industry-strength systems.

1 RDA stands for Remote Device Access and is the currently employed middleware at CERN
2 http://www.zeroc.com
3 http://thrift.apache.org
4 http://www.zeromq.org
5 http://www.inspirel.com/yami4
6 http://www.amqp.org


msg/sec: RDA 7100 | Ice 7500 | Thrift 9000 | ZeroMQ 8000 | YAMI4 3750 | Qpid 3500 | m2etis 8801

Figure 17.13: Throughput comparison for test 1 [msg/sec]

The results for test 2 were not as promising. Figure 17.14¹ shows that M2etis scales linearly with the number of subscribers, but takes much more overall time, i.e. about three times the time of IceStorm, for the delivery of the 400 messages. ZeroMQ performed the fastest delivery; its nearly constant delivery latency can be explained by the use of message batching, which aggregates events to reduce the number of messages disseminated. The performance of M2etis can be explained by the use of unoptimized queuing data structures that use locking. Moreover, an off-the-shelf network library² is employed that does not exploit OS-specific optimizations for low-latency sockets and message processing. In conclusion, the reduction of processing latency in M2etis requires some work in order to be competitive. However, the nearly constant latency of ZeroMQ suggests that the simple system model holds for certain systems, i.e. resource consumption can be neglected for the processing on each node. Hence, for a precise system model, further work considering resource consumption on nodes could be necessary (cf. section 19.2.3).
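The message batching that explains ZeroMQ's nearly constant latency can be sketched as follows (illustrative only, not ZeroMQ's actual implementation):

```python
class EventBatcher:
    """Aggregate events and flush them as one wire message once a
    threshold is reached, trading per-message overhead for queueing delay."""

    def __init__(self, send, max_batch=16):
        self.send = send          # callable that puts one message on the wire
        self.max_batch = max_batch
        self.buffer = []

    def publish(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(list(self.buffer))   # one message carries the whole batch
            self.buffer.clear()

# Demo with test 2's workload: 400 small events hit the wire in batches.
wire_messages = []
batcher = EventBatcher(wire_messages.append, max_batch=16)
for event in range(400):
    batcher.publish(event)
batcher.flush()
```

With a batch size of 16, the 400 events of test 2 would result in only 25 wire messages, which illustrates why batching can keep the per-subscriber latency nearly constant.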

1 The graphs for the systems besides M2etis were redrawn from the original paper [DESS11] and lack a certain precision, because the raw data was not available. However, the precision is sufficient for a comparison.
2 M2etis uses boost::asio, which is applicable to almost all communication problems at the cost of performance in low-latency situations.


[Figure: log-scale plot of the cumulative average path latency [ms] over the number of subscribers (1 to 10) for m2etis, ZeroMQ, YAMI4, IceStorm, Qpid, and RDA.]

Figure 17.14: Cumulative latency comparison for test 2

17.7 Quality of configuration decisions

The quality of configuration decisions for the naive workflow depends only on the realism of the simulation. The realism depends largely on the system attributes chosen and how well they reflect the respective scenario. The decision about the attribute values is up to the developer and for that reason cannot be validated. However, the system model itself could introduce a skew if it does not reflect the behavior of the system to a sufficient degree. Due to the lack of access to a large enough computing cluster to validate the system model, the simulation results are assumed as the ground truth against which we validate the optimized workflow. The optimized workflow uses the meta-models and the model selection method implemented in the machine learning library scikit-learn¹. In the following, we compare three meta-models for regression: Gaussian processes, decision trees, and the ExtraTree ensemble. For the selection of the hyper-parameters that instantiate the meta-models, random grid search, as suggested by Bergstra et al. [BB12], is applied. The value ranges considered for the grid search of the hyper-parameters were chosen according to the

1 Scikit-learn (http://scikit-learn.org) is a machine learning library that implements common methods for classification, regression and clustering problems as well as dimensionality reduction, model selection and preprocessing steps like normalization and feature extraction.


recommendations in the documentation of scikit-learn. During the grid search, 50 randomly selected value combinations are compared using cross-validation in order to select the optimal hyper-parameter combination. The measured data set consisted of 162 measured points (with a validation set of 512 points) in three dimensions (number of nodes, event rate, and payload). As error metrics we examine the mean squared error (MSE) as well as the minimal (min. SE) and maximal squared error (max. SE). In addition, the coefficient of determination (R²) is denoted.
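The model-selection step can be illustrated without the actual measurement data. The thesis uses scikit-learn for this; the following dependency-free sketch reproduces the principle of random hyper-parameter search scored by k-fold cross-validation, with synthetic samples and a simple k-nearest-neighbour regressor standing in for the measured points and the real meta-models (all names and values are illustrative):

```python
import math
import random

def knn_predict(train, x, k):
    """k-nearest-neighbour regression over (features, target) samples."""
    nearest = sorted(train,
                     key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))[:k]
    return sum(target for _, target in nearest) / k

def cv_mse(samples, k, folds=5):
    """Mean squared error of the k-NN meta-model under k-fold cross-validation."""
    squared_error = 0.0
    for fold in range(folds):
        test = samples[fold::folds]
        train = [s for i, s in enumerate(samples) if i % folds != fold]
        squared_error += sum((knn_predict(train, x, k) - t) ** 2 for x, t in test)
    return squared_error / len(samples)

random.seed(42)
# Synthetic stand-in for the measured points: latency grows with log(nodes).
samples = [((nodes, rate),
            0.01 * math.log(nodes) + 1e-4 * rate + random.gauss(0, 1e-4))
           for nodes in range(2, 30) for rate in (5, 10, 20, 30)]

# Random search: draw 50 hyper-parameter candidates, keep the CV winner.
candidates = [random.randint(1, 20) for _ in range(50)]
best_k = min(candidates, key=lambda k: cv_mse(samples, k))
```

The same pattern, with the hyper-parameter draws replaced by scikit-learn's value grids and the k-NN model by the actual meta-models, is what the optimized workflow performs.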

[Figure: average path latency [s] over the number of nodes; measured data compared to Gaussian process, decision tree, and ExtraTree ensemble regression. Panels: (a) SpreadIt; (b) direct broadcast.]

Figure 17.15: Regression for different routing strategies [Wah13]

Figure 17.15 shows the three regression methods fitting the measurements for SpreadIt routing (figure 17.15(a)) and direct broadcast routing (figure 17.15(b)). For clarity, the plots show only one of the three measured parameter dimensions; the other two dimensions are fixed at 20 Hz event rate and 256 byte payload. The respective error metrics can be found in table 17.2. In both cases the Gaussian processes deliver a very good fit. They have a very high R² score and the lowest MSE. As a result, for this slice of the multidimensional space, the Gaussian process would be the meta-model of choice. The decision tree is obviously not a very exact meta-model, because of the steps induced by the tree levels, even though its MSE does not look that bad.

Routing strategy | Meta-model | MSE | min. SE | max. SE | R²
SpreadIt | Gaussian process | 2.9170361868e-06 | 1.12663672723e-13 | 2.23437641524e-05 | 0.997183848479
SpreadIt | Decision tree | 3.47864307644e-05 | 3.29105203428e-14 | 1.19934596859e-04 | 0.966416645648
SpreadIt | ExtraTree ensemble | 8.2813309367e-06 | 2.31124283983e-14 | 6.59819670571e-05 | 0.995177759864
Direct broadcast | Gaussian process | 4.75765537866e-07 | 3.45689947694e-16 | 5.76759352055e-06 | 0.978431884317
Direct broadcast | Decision tree | 1.76265322788e-06 | 2.59748825517e-10 | 9.84122393319e-06 | 0.920092764812
Direct broadcast | ExtraTree ensemble | 1.1617677725e-06 | 2.42695820052e-13 | 1.13281909423e-05 | 0.947333003927

Table 17.2: Quality of meta-models for the average path latency
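The error metrics reported in table 17.2 follow their standard definitions and can be computed from predictions and held-out measurements as in this straightforward sketch:

```python
def regression_metrics(y_true, y_pred):
    """MSE, min./max. squared error, and coefficient of determination R^2."""
    se = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)   # total sum of squares
    return {
        "mse": sum(se) / len(se),
        "min_se": min(se),
        "max_se": max(se),
        "r2": 1.0 - sum(se) / ss_tot,
    }
```

A perfect fit yields an MSE of zero and R² of one; R² drops toward zero as the model explains no more variance than the mean of the measurements.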


However, the overall precision of the regression is promising, which underpins hypothesis 6, stating that a domain can be sampled without introducing unreasonable error.

Influence of Sparse Sampling


This section examines how far the training set of the meta-models can be reduced without introducing too large an error. A central server routing strategy in a scenario where the data rate is not sufficient serves as the example for this examination. When the links get saturated, events pile up in the queues, resulting in unpredictable behavior, because messages are dropped when the queues are full. Fitting graphs with such random spikes is much more difficult and requires more sample points than monotone behavior, as discussed above.

[Figure: average path latency [s] over the number of nodes; measured data compared to the three meta-models. Panels: (a) dense sampling; (b) sparse sampling.]

Figure 17.16: Central server - average path latency [Wah13]

In figures 17.16 and 17.17 we compare two three-dimensional data sets (again number of nodes, event rate, and payload), a densely and a sparsely sampled training set, for two metrics: the average path latency and the average event loss. The dense training set consisted of 3568 data points, the sparsely sampled one of 162. The respective graphs show a much larger error than in the previous case. Especially in the sparsely sampled case, the spike is not fitted appropriately by any meta-model. However, the spike indicates a saturated data rate, and therefore a scale larger than about 13 nodes is not possible in this scenario. Therefore, the error before this saturation point is more relevant. The ensemble method seems to provide the best fit for the average path latency in figure 17.16.


[Figure: average event loss [%] over the number of nodes; measured data compared to the three meta-models. Panels: (a) dense sampling; (b) sparse sampling.]

Figure 17.17: Central server - average event loss [Wah13]

Figure 17.17 plots the corresponding average event loss metric for the above scenario, again in the same two sampling densities. The graph shows a large jump at the saturation point. For this metric, the reduced training set has a large impact on the precision of the meta-models. Hence, it can be important to identify such jumps and perform a dense sampling around these areas. Nevertheless, the configuration decision is only marginally influenced by this. As long as the data rate is sufficient, the graphs are monotone and can easily be approximated. Only when the saturation point is reached do the measurements get jumpy and can hardly be estimated with a sparsely sampled parameter space, but any configuration beyond this point is useless. Therefore, the parameter space can be cropped at this point or iteratively sampled in order to increase the precision of estimations. For a non-saturated link, the error depending on the size of the training set is illustrated for the three meta-models in figure 17.18. The full data set consists of 4080 measurements; 512 points are selected for the cross-validation, leaving a set of 3568 samples for training. The employed ratio of this full set is depicted along the x-axis and the mean absolute error on the y-axis. It is noticeable that for this case the error does not improve significantly with a training set larger than about 20 percent of the full set. The analyzed routing algorithm was SpreadIt, which shows a logarithmic behavior. This indicates that, as long as no link gets saturated, a set of more than 713 measurements for a three-dimensional model does not improve the error significantly for logarithmic behavior.
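The iterative sampling around such jumps can be sketched as follows: measure on a coarse grid, then insert midpoints wherever two neighbouring measurements differ by more than a threshold (an illustrative sketch; `measure` stands in for running one simulation):

```python
def refine_samples(grid, measure, jump_threshold, max_rounds=5):
    """Adaptively densify a 1-D sample grid around jumps in a metric."""
    values = {x: measure(x) for x in grid}
    for _ in range(max_rounds):
        points = sorted(values)
        # Midpoints of neighbour pairs whose values jump by more than allowed.
        midpoints = [(a + b) / 2 for a, b in zip(points, points[1:])
                     if abs(values[a] - values[b]) > jump_threshold]
        if not midpoints:
            break
        for x in midpoints:
            values[x] = measure(x)
    return sorted(values), values

# Demo: an event-loss metric that jumps at a saturation point of 13 nodes.
points, values = refine_samples(range(0, 31, 5),
                                lambda x: 0.0 if x < 13 else 50.0,
                                jump_threshold=10)
```

New sample points cluster around the saturation jump, while the monotone regions keep the coarse grid, which is exactly the behavior motivated above.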


17.8 Expenditure of Time for Configuration Automation 0.007

0.0012

0.006

0.0010 Mean Absolute Error

Mean Absolute Error

0.005 0.004 0.003 0.002

0.0006 0.0004 0.0002

0.001 0.0000.0

0.0008

0.2

0.4 0.6 Number of Training Samples

0.8

1.0

0.00000.0

(a) Decision tree

0.2

0.4 0.6 Number of Training Samples

0.8

1.0

(b) ExtraTree ensemble

0.00010

Mean Absolute Error

0.00008 0.00006 0.00004 0.00002 0.000000.0

0.2

0.4 0.6 Number of Training Samples

0.8

1.0

(c) Gaussian process

Figure 17.18: Mean absolute error depending on the ratio used from the full training set (SpreadIt routing)

17.8 Expenditure of Time for Configuration Automation

The time consumption for the generation of simulation data is a huge drawback of the decision process. In order to quantify the time consumption of the current prototype, and to support the claim that a sampling of the hypercube as sparse as possible is feasible, the simulation effort is analytically formulated and underpinned by measurements.

Analytical Simulation Effort

The number of simulations required for the sampling of a complete domain depends on a variety of parameters: the number of sampling points s_{λ_i} for the i-th system attribute


λ_i ∈ Λ_system, and the number of candidate configurations |Y_candidate|. Therefore, the total number of required simulations is defined by:

\[ \sum_{i=0}^{|\Lambda_{system}|} s_{\lambda_i} \cdot |Y_{candidate}| \]

That means that for each system attribute, all candidate configurations must be measured at all sample points. For the measurement of the naive workflow, the number of required simulations depends on the number of candidate configurations |Y_candidate| and the number of event-types |U|. The number of required simulations is therefore defined by:

\[ \sum_{i=0}^{|U|} |Y_{candidate}| \]

For each event-type, all candidate configurations are measured with the correct system attribute values, which are fixed in this case. In both cases the problem worsens with the number of available strategies |y|; the corresponding computational complexity is O(|y|!). In the first case, the complexity of the simulation model, i.e. the number of system attributes and the sampling rate, increases the number of simulations by an additional factor.
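Both counts translate directly into code; the following sketch uses illustrative values, not the thesis's actual parameter space:

```python
def simulations_full_domain(samples_per_attribute, n_candidates):
    """Sampling a complete domain: every candidate configuration is
    simulated at every sample point of every system attribute."""
    return sum(samples_per_attribute) * n_candidates

def simulations_naive_workflow(n_event_types, n_candidates):
    """Naive workflow: system attributes are fixed, so every candidate
    is simulated once per event-type."""
    return n_event_types * n_candidates
```

With three attributes sampled at 30, 7, and 5 points and 24 candidate configurations, sampling the full domain requires (30 + 7 + 5) · 24 = 1008 simulations, while a naive run with 4 event-types only requires 4 · 24 = 96.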

[Figure: surface plot of the simulation time [s] over event frequency [Hz] and number of nodes.]

Figure 17.19: Simulation duration for direct broadcast [Wah13]

In order to give a notion of the magnitude of the time a simulation takes, figure 17.19 shows the simulation time for an exemplary simulation set. The examined configuration is direct broadcast routing without any other strategies. The network parameters are 1 Gbit/s and a constant latency of 30 ms per link; the payload size is 256 bytes. Even for this small area of measurement, an increasing simulation time can be observed. The variation in the simulation times can be explained by the load on the simulation nodes, because the simulation cluster was not exclusively available. Nevertheless, the graph indicates a correlation between the number of messages simulated, the number of simulated nodes, and the simulation time. This observation is backed by the measurements taken in Bonrath's thesis [Bon13], which indicate that the simulation time increases exponentially with the number of nodes and linearly with the number of sent messages.

Speedup by Parallelization

In section 11.2, the basic solution framework already introduced parallelization of the simulations. Especially in the case of the optimized workflow, parallelization is inevitable to achieve a tolerable time consumption for the measurements. Following Amdahl's Law [Amd67], the maximal speedup scales linearly, which is assumed here, because each simulation is totally independent of all other simulations. Hence, even the effort for merging the results, i.e. the sequential part of the computation, is neglected in this approximation, as it is only a database operation writing the measured values.

[Figure: required simulation time t for n simulations on p processing nodes; the speedup is bounded by the number of simulations n and by the longest single simulation max(t_0, ..., t_n).]

Figure 17.20: Speedup of simulation duration [Wah13]

Figure 17.20 illustrates the maximal speedup gained by parallelization. The graph shows the required simulation time t if n simulations are performed on p processing nodes. The theoretical speedup is limited by the number of simulations n and by the simulation with the longest simulation time, max(t_0, ..., t_n). The second bound is especially interesting for the naive workflow, as it indicates that even if only a few simulations must be performed, the time consumption can still be high if only one simulation is complex enough. However, the real speedup gained by parallelization stays below the maximum due to scheduling problems of the current simulation manager that result in idling nodes. For the measurements taken for the meta-model used in section 17.4.2, roughly 4080 simulations were necessary, which took about 12 hours on 76 simulation nodes [Wah13].


18 | Discussion

In the previous part, the requirements as well as the initial hypotheses were addressed with respect to the initial simplifying assumptions. The aim was to validate them for all three parts of the methodology: the framework, the QoS-aware configuration, and the automated workflow. The validation was performed in two parts: an argumentative comparison of the capabilities provided by the reference architecture and its implementation M2etis with similar existing approaches, and, as far as possible within the narrow scope of this thesis, quantitative experiments that further underpin the hypotheses.

First, the general resource consumption of the prototype was quantified. It shows promising scalability, even if further size optimization is required for usage in embedded environments. The limitations and precision of the suggested simulation model were evaluated in section 17.3. The examined small-scale LAN scenario showed a relevant impact of the nodes’ resources, which is not considered in the simulation model. With increasing scale and for WAN scenarios, this processing overhead on the order of microseconds has less and less impact. Therefore, the proposed simulation model seems to have a reasonable precision for decisions about MMVEs or similar scenarios. Nevertheless, a simulation model with varying precision, depending on the scenario, should be further refined to be generally applicable (cf. section 19.2.3).

The results of some selected configurations and their simulation results were discussed in section 17.4. These plots gave an insight into the different characteristics of strategies and how they behave with regard to different parameters. They validated the general assumption that different strategies are best suited for different scenarios. The measurements also illustrated some of the most influential parameters for the three most important metrics (latency, throughput, and event loss).
The benefits of QoS-aware configuration were quantified in terms of the required lines of code. The comparison to similar approaches showed that the classification approach, as proposed in hypothesis 1, significantly reduces the configuration effort. However, the production boost for developers that is introduced by the domain-specific terminology and the classification remains to be quantified. It would require field studies in development studios to fully validate this hypothesis. Nevertheless, the analysis of the mere coding effort suggests a positive outcome.

Hypothesis 4 states that a limitation to design-time configuration significantly reduces the performance overhead at run-time compared to run-time adaptable approaches. Therefore, in section 17.6, the performance impact of configurability was quantified. The prototype was compared to run-time configurable as well as non-configurable production-ready approaches. The results validated the hypothesis. The run-time configurable approaches showed significant additional overhead compared to M2etis. In contrast, as expected, non-configurable approaches still have advantages compared to design-time configuration. However, with some profiling effort, it is likely that M2etis can be optimized to be competitive with non-configurable solutions.

Finally, the quality of the automated workflow was examined in terms of the error introduced by the optimized workflow and the time consumption of the simulations. The result looks promising, as the usage of meta-models approximates the behavior of DEBS to a sufficient degree, without the need for a fully sampled parameter space. The possible reduction in our measurements is around 80 percent. This finding can significantly reduce simulation times. However, the impact of the simplified system model needs quantification in order to give absolute predictions about the performance of a system. For absolute values a precise model is needed that considers, for example, network topologies and the resource consumption of nodes. But a more realistic system model also increases simulation times. Therefore, for a purely relative decision about which strategy performs better, the employed simple model is probably sufficient.
To conclude, the previous part underpinned most parts of the initial hypotheses by discussion or, as far as possible, by a quantitative evaluation. It has shown that design-time configuration is viable for high-performance scenarios and that an automated QoS-aware configuration workflow can be designed without introducing unreasonable error at low sampling rates.


Part VI Epilogue


19 | Further Work

During the evaluation some open issues were identified. In this chapter the most interesting and promising challenges are briefly sketched as candidates for further work.

19.1 Framework Enhancements

The suggested framework with its corresponding simulator provides a testbed for a variety of current research topics. The easily extensible architecture allows for the integration of research algorithms. The advantage of this testbed is that the implementation effort is only required once, for both the simulation and the deployable library. In the following, some interesting topics are briefly sketched.

19.1.1 Semantic-Aware Filter and Routing

In section 5.2.5 advanced filter concepts were introduced. Such concepts employ spatial or ontology-based filter models. They could be integrated into the existing framework to broaden the features of the strategy repository and allow for an even larger field of application of M2etis. In conjunction, semantic routing concepts are also of interest, as already discussed in section 5.3.3. With both semantic routing and filtering, the possibility of a more sophisticated specification of event semantics opens up, one that may consider semantic relationships between different domain-specific terms based on domain ontologies.

19.1.2 Security Aspects

An aspect of the framework not addressed in this work is security. Especially in the context of more and more online devices in our everyday life, security aspects of communication infrastructures are an important topic. Therefore, the extension of the


proposed framework by security-related strategy types is one of the imminent challenges to address. As discussed in section 5.5.6, some work has been done on the security of DEBS which could easily be integrated into the framework. However, security requires more care about exploit protection and protocol design than the prototype currently offers.

19.1.3 Software Development

During the development of the prototype, the development process itself offered some interesting challenges. Due to the combinatorial explosion of possibly valid configurations, which worsened with each added strategy, we faced problems with testability and with keeping code quality high. Weyuker [Wey98] discusses this problem in the context of component-based software, where similar problems occur: each component must be tested in each new environment in which it is used. With a strategy-based design, each valid composition must be tested in each environment. With a growing strategy repository, automated testing methods are required that ensure a high code quality for all configurations. Continuous integration approaches, which were employed during the development, mitigate the problem, but do not change the fact that testing gets more and more expensive in terms of time and resource consumption with each strategy added.
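As a concrete illustration of the spatial filter models mentioned in section 19.1.1, a subscriber in an MMVE could express interest in a circular area and have events matched against it. The following sketch is hypothetical — the type names are illustrative and not part of the actual M2etis strategy interfaces:

```cpp
#include <cmath>

// Minimal event position payload, assumed for illustration only.
struct Position {
    double x, y;
};

// A sketched spatial filter strategy: an event is relevant for a
// subscriber iff its position lies within the subscriber's circular
// area of interest.
class CircleOfInterest {
public:
    CircleOfInterest(Position center, double radius)
        : center_(center), radius_(radius) {}

    bool matches(const Position& eventPos) const {
        double dx = eventPos.x - center_.x;
        double dy = eventPos.y - center_.y;
        return std::sqrt(dx * dx + dy * dy) <= radius_;
    }

private:
    Position center_;
    double radius_;
};
```

Integrated as a filter strategy, such a predicate would let the broker network drop events outside a player's area of interest before delivery; semantic routing could additionally use the same predicate to prune forwarding paths.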

19.2 Methodology Enhancements

Besides the enhancement of the framework, the methodology for automated configuration decisions is also subject to further work. The limitations introduced for this thesis provide the major challenges to be addressed.

19.2.1 Content-Aware Decision Model

The current system model is limited to unstructured data. It is not possible to simulate workloads that consider the content of different event types. However, many discussed publish-subscribe systems provide a content-based data model and offer corresponding filter mechanisms. Therefore, the enhancement of the proposed methodology for content-based systems poses an interesting challenge. Because the framework already supports content-based filtering and even partitions, only the system model and the decision process have to be enhanced to generate content-aware workloads for the simulations. Such an extension could involve, for example, an


agent-based approach, statistical data generation, or even traces of existing applications. Which of those possible approaches yields the most realistic workload model is the core of this challenge.

19.2.2 Parameter Optimization and Adaptiveness

During the quantitative evaluation, suboptimal configuration parameters of strategies were identified. Currently, configuration parameters are set during the deduction of system attributes and are not subject to optimization. In order to tune them, manual simulations have to be performed to identify the optimal configuration of each strategy. However, their potential influence on the performance of certain strategies cannot be neglected. Therefore, part of further work should be the integration of parameter tuning into the automated configuration workflow. The challenge is the curse of dimensionality that comes with configuration parameter tuning: each parameter introduces another optimization dimension that has to be sampled for a domain-specific meta-model. Heuristics may therefore be necessary to reduce the number of simulation runs. Another possible solution is to introduce adaptive strategies that adapt to different environments by tuning their configuration parameters.

19.2.3 Refinement of the System Model

Besides the previously discussed extensions to the system model, its refinement is also subject to further work. The currently employed network model is rather simple and could be made more sophisticated by introducing network topologies, as sketched in section 4.3.1. This would enable the simulations to consider unevenly distributed nodes with different link characteristics. Moreover, a model for resource consumption, such as CPU and memory, and its impact on QoS metrics poses a necessary extension, especially for domains like embedded systems. Considering resources on nodes is also the first step towards examining global optimization decisions that do not treat every channel independently.
Each of those refinements must be carefully examined regarding its influence on configuration decisions and the simulation time. The author expects a tradeoff between the time consumption for simulations and the precision of the system model. An adaptive precision for domain meta-models that lets the developer choose between effort and precision will probably be necessary. Another idea to cope with the enormous number of simulations is to sample the space on demand, leading to a mixture of the naive and the optimized workflow.

19.3 Mobile Platforms

The current prototype is implemented in C++ and is therefore suitable for a port to mobile platforms. An evaluation of the proposed methodology for mobile platforms promises interesting results, as wireless sensor networks were already discussed as a motivation for design-time configuration in section 1.1.3. However, the optimization demands introduced by small embedded systems differ from the optimization targets of large-scale distributed systems like MMVEs. They are mainly motivated by the resource shortage on small nodes. This lack of resources leads to entirely new optimization targets like energy consumption or library size. Especially for small sensor networks that are never reconfigured, the proposed methodology could facilitate the development process of such resource-constrained large-scale mobile networks.


20 | Conclusion

High-performance internet-scale applications like MMVEs are a major source of revenue for many companies in many different lines of business. However, the complexity of developing such large-scale applications stands in contrast to the huge potential profit. The resulting risk leaves only a small number of companies with the expertise and recklessness to start the development of internet-scale applications. This thesis addresses the complexity introduced by the communication demands of such applications. Their scale requires optimized communication infrastructures that go beyond simple network programming and hence require experts for their development. The goal was to provide a methodology that reduces or even eliminates the need for such networking experts in order to ease the development of a communication layer for distributed applications. Existing configurable approaches either introduce significant overhead, which is not tolerable in large-scale environments, or they require technical knowledge to be configured. The resulting three major challenges were formulated in six hypotheses. The suggested methodology that addresses exactly those challenges consists of three parts: a framework for a design-time configurable publish-subscribe system, a model for a QoS-aware configuration language, and the corresponding workflow for the automation of the configuration itself.

The first challenge was the design of a generally applicable framework for distributed publish-subscribe systems that does not introduce significant overhead. The proposed solution, which uses strategies instead of components, can be realized by policy-based design in C++, which has been validated by a proof-of-concept implementation and some measurements quantifying the overhead of design-time configuration. This poses a novel approach for the design of distributed publish-subscribe systems, as the taxonomy of similar systems in section 5.7 shows.
The second challenge was the design of an easy-to-use language for developers, whereby the need for network experts is rendered redundant. The application of a multidimensional classification simplifies the expression of event semantics significantly. In conjunction with the possibility of a domain-specific instantiation of the classification, this approach provides a powerful and easy-to-use way to specify event semantics. A developer using this language needs neither to learn a new terminology nor to be an expert in distributed systems in order to configure a publish-subscribe system.

The third challenge was the automation of the workflow that maps the domain-specific description of event semantics to a technical configuration. To cope with that challenge, a generally applicable workflow was suggested that employs black-box measurements of the framework in order to decide on the best suitable configuration. Due to the time expenditure required for such measurements, an optimized workflow that decouples measurements and decision making was suggested. This decoupled, measurement-based solution provides a novel approach for the deduction of configuration decisions in the field of DEBS.

In conclusion, all parts of a methodology for QoS-aware configuration of distributed publish-subscribe systems were suggested in the form of a generally applicable reference architecture and validated by a prototypic implementation. Furthermore, this prototypic implementation was evaluated by computer experiments and real-world measurements within the limits set by the scope of this thesis. Although not all challenges that occurred during the design of the described methodology could be addressed, the goals set by the initial hypotheses were reached and validated during the evaluation as far as possible. In summary, this thesis has illustrated the feasibility of QoS-aware configuration at design-time as a facilitation of the development process for distributed event-based systems.


Bibliography

[A-T09] A-Team Group. Market Data Platforms - Infrastructure to Handle Your Data Challenges. Technical report, NYSE Technologies, New York, New York, USA, 2009.

[AECM08] Eike Falk Anderson, Steffen Engel, Peter Comninos, and Leigh McLoughlin. The case for research in game engine architecture. In Proceedings of the 2008 Conference on Future Play: Research, Play, Share - Future Play 08, pages 228–231. ACM Press, 2008.

[AGG09] Brian De Alwis, Carl Gutwin, and Saul Greenberg. Combining Power and Simplicity in a Groupware Toolkit. Technical report, University of Calgary, Calgary, 2009.

[Ale01]

Andrei Alexandrescu. Modern C++ Design. The C++ In-Depth Series. Addison-Wesley, 2001.

[Amd67]

Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference - AFIPS ’67 (Spring), pages 483–485. ACM Press, April 1967.

[APS08]

Leigh Achterbosch, Robyn Pierce, and Gregory Simmons. Massively multiplayer online role-playing games. Computers in Entertainment, 5(4):1, March 2008.

[AR02]

Filipe Araujo and Luis Rodrigues. On QoS-aware publish-subscribe. In Proceedings of the 22nd International Conference on Distributed Computing Systems, Workshops, pages 511–515. IEEE Computer Society, 2002.

[ASS+99]

Marcos K. Aguilera, Robert E. Strom, Daniel C. Sturman, Mark Astley, and Tushar D. Chandra. Matching events in a content-based subscription system. In Proceedings of the eighteenth annual ACM symposium on Principles of distributed computing - PODC ’99, pages 53–61. ACM Press, May 1999.

[ASSC02] Ian F. Akyildiz, Weilian Su, Yogesh Sankarasubramaniam, and Erdal Cayirci. A survey on sensor networks. IEEE Communications Magazine, 40(8):102–114, 2002.

[Atw04]

William J. Atwood. A classification of reliable multicast protocols. IEEE Network, 18(3):24–34, 2004.

[AW94]

Hagit Attiya and Jennifer Welch. Sequential consistency versus linearizability. ACM Transactions on Computer Systems (TOCS), 12(2):91–122, 1994.

[AW98]

Hagit Attiya and Jennifer Welch. Distributed Computing - Fundamentals, Simulations and Advanced Topics. McGraw-Hill Publishing Company, 1998.

[BAC09]

Eliya Buyukkaya, Maha Abdallah, and Romain Cavagna. VoroGame: A Hybrid P2P Architecture for Massively Multiplayer Games. In Proceedings of the 6th IEEE Consumer Communications and Networking Conference, pages 1–5. IEEE Computer Society, January 2009.

[Bae13]

Michael Baer. Vergleich und Implementierung von Synchronisationsstrategien zur Event Verarbeitung im Kontext von MMVEs. Bachelor thesis, Friedrich-Alexander University, Erlangen, 2013.

[BB02]

Suman Banerjee and Bobby Bhattacharjee. A comparative study of application layer multicast protocols. Technical report, Department of Computer Science, University of Maryland, College Park, 2002.

[BB12]

James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281–305, 2012.

[BBC+10]

Anders Berglund, Scott Boag, Don Chamberlin, Mary F. Fernández, Michael Kay, Jonathan Robie, and Jérôme Siméon. XML Path Language (XPath) 2.0 (Second Edition). Technical report, W3C, 2010.

[BBK02]

Suman Banerjee, Bobby Bhattacharjee, and Christopher Kommareddy. Scalable application layer multicast. In Proceedings of the 2002 conference on Applications, technologies, architectures, and protocols for computer communications - SIGCOMM ’02, pages 205–217. ACM Press, August 2002.

[BBMS98] John Bates, Jean Bacon, Ken Moody, and Mark Spiteri. Using events for the scalable federation of heterogeneous components. In Proceedings of the 8th ACM SIGOPS European workshop on Support for composing distributed applications - EW 8, pages 58–65. ACM Press, September 1998.

[BBPQ12]

Roberto Baldoni, Silvia Bonomi, Marco Platania, and Leonardo Querzoni. Dynamic Message Ordering for Topic-Based Publish/Subscribe Systems. In Proceedings of the 26th International Parallel and Distributed Processing Symposium, pages 909–920. IEEE Computer Society, May 2012.

[BCM+99]

Guruduth Banavar, Tushar Chandra, Bodhi Mukherjee, Jay Nagarajarao, Robert E. Strom, and Daniel C. Sturman. An Efficient Multicast Protocol for Content-Based Publish-Subscribe Systems. In Proceedings of the 19th IEEE International Conference on Distributed Computing Systems - ICDCS ’99, pages 1–9. IEEE Computer Society, May 1999.

[BCNN01]

Jerry Banks, John S. Carson, Barry L. Nelson, and David M. Nicol. Discrete-event System Simulation. International series in industrial and systems engineering. Prentice-Hall, Inc., 3rd edition, 2001.

[BCSS99]

Guruduth Banavar, Tushar Deepak Chandra, Robert E. Strom, and Daniel C. Sturman. A Case for Message Oriented Middleware. In Proceedings of the 13th International Symposium on Distributed Computing, pages 1–18. Springer, September 1999.

[BDL+08]

Ashwin Bharambe, J.R. Douceur, J.R. Lorch, Thomas Moscibroda, Jeffrey Pang, Srinivasan Seshan, and Xinyu Zhuang. Donnybrook: Enabling Large-Scale, High-Speed, Peer-to-Peer Games. ACM SIGCOMM Computer Communication Review, 38(4):389–400, 2008.

[BFH04]

I. Buck, T. Foley, and D. Horn. Brook for GPUs: stream computing on graphics hardware. ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH 2004, 23(3):777–786, 2004.

[BFM06]

Stefan Behnel, Ludger Fiege, and Gero Mühl. On Quality-of-Service and Publish-Subscribe. In Proceedings of the 26th IEEE International Conference on Distributed Computing Systems Workshops (ICDCSW’06), pages 20–25. IEEE Computer Society, July 2006.

[BGS04] Nicolas Bouillot and Eric Gressier-Soudan. Consistency models for distributed interactive multimedia applications. ACM SIGOPS Operating Systems Review, 38(4):20–32, October 2004.

[BH07]

Sven Bittner and Annika Hinze. The arbitrary Boolean publish/subscribe model. In Proceedings of the 2007 inaugural international conference on Distributed event-based systems - DEBS ’07, pages 226–237. ACM Press, June 2007.

[BHG86]

Philip A Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency control and recovery in database systems. Addison-Wesley, July 1986.

[BHK07]

Ingmar Baumgart, Bernhard Heep, and Stephan Krause. OverSim: A flexible overlay network simulation framework. In Proceedings of the 2007 IEEE Global Internet Symposium, pages 79–84. IEEE Computer Society, May 2007.

[BHSW07]

Rainer Baumann, Simon Heimlicher, Mario Strasser, and Andreas Weibel. A survey on routing metrics. Technical report, Computer Engineering and Networks Laboratory ETH-Zentrum, Zürich, Switzerland, 2007.

[Bit08]

Sven Bittner. General Boolean Expressions in Publish-Subscribe Systems. Phd thesis, University of Waikato, Hamilton, New Zealand, 2008.

[BKV06]

Jean-Sébastien Boulanger, Jörg Kienzle, and Clark Verbrugge. Comparing interest management algorithms for massively multiplayer games. In Proceedings of the 5th ACM SIGCOMM workshop on Network and system support for games - NetGames ’06, pages 1–12. ACM Press, 2006.

[Bon13]

Daniel Bonrath. Entwurf und Implementierung eines diskreten eventbasierten Simulators für die i6m2etis Middleware. Bachelor thesis, Friedrich-Alexander Universität, Erlangen, Germany, 2013.

[BPS06]

Ashwin Bharambe, Jeffrey Pang, and Srinivasan Seshan. Colyseus: A Distributed Architecture for Online Multiplayer Games. In Proceedings of the 3rd Symposium on Networked Systems Design & Implementation (NSDI), pages 155–168, Berkeley, CA, USA, 2006. USENIX Association.


[Bre00]

Eric Brewer. Towards Robust Distributed Systems. Keynote at the annual ACM Symposium on Principles of distributed computing - PODC ’00, 2000.

[Bre12]

Eric Brewer. CAP Twelve Years Later: How the Rules Have Changed. Computer, 45(2):23–29, 2012.

[BRS02]

Ashwin Bharambe, Sanjay Rao, and Srinivasan Seshan. Mercury: a scalable publish-subscribe system for internet games. In Proceedings of the 1st workshop on Network and system support for games - NetGames ’02, pages 3–9. ACM Press, 2002.

[BSBA02]

S. Bhola, R. Strom, S. Bagchi, and J. Auerbach. Exactly-once delivery in a content-based publish-subscribe system. In Proceedings of the 2002 International Conference on Dependable Systems and Networks - DSN 2002, pages 7–16. IEEE Computer Society, 2002.

[BSW79]

Philip A. Bernstein, D.W. Shipman, and W.S. Wong. Formal Aspects of Serializability in Database Concurrency Control. IEEE Transactions on Software Engineering, SE-5(3):203–216, May 1979.

[Bun97]

Peter Buneman. Semistructured data. In Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems - PODS ’97, pages 117–121. ACM Press, May 1997.

[CABB04]

M. Cilia, M. Antollini, C. Bornhoevd, and A. Buchmann. Dealing with heterogeneous data in pub/sub systems: the concept-based approach. In Proceedings of the 26th International Conference on Software Engineering - W18L Workshop "International Workshop on Distributed Event-based Systems (DEBS 2004), pages 26–31. IET Digital Library, 2004.

[Car98]

Antonio Carzaniga. Architectures for an Event Notification Service Scalable to Wide-area Networks. Phd thesis, Politecnico di Milano, Milano, Italy, 1998.

[CAR05]

Nuno Carvalho, Filipe Araujo, and Luis Rodrigues. Scalable QoS-based event routing in publish-subscribe systems. In Proceedings of the Fourth IEEE International Symposium on Network Computing and Applications (NCA ’05), pages 101–108. IEEE Computer Society, 2005.


[CCC+01]

A. Campailla, S. Chaki, E. Clarke, S. Jha, and H. Veith. Efficient filtering in publish-subscribe systems using binary decision diagrams. In Proceedings of the 23rd International Conference on Software Engineering - ICSE 2001, pages 443–452. IEEE Computer Society, 2001.

[CCR03]

Xiaoyan Chen, Ying Chen, and Fangyan Rao. An efficient spatial publish/subscribe system for intelligent location-based services. In Proceedings of the 2nd international workshop on Distributed event-based systems - DEBS ’03, pages 1–6. ACM Press, 2003.

[CDF98]

Gianpaolo Cugola, E. Di Nitto, and A. Fuggetta. Exploiting an event-based infrastructure to develop complex distributed systems. In Proceedings of the 20th International Conference on Software Engineering, pages 261–270. IEEE Computer Society, 1998.

[CDZ97]

K.L. Calvert, M.B. Doar, and E.W. Zegura. Modeling Internet topology. IEEE Communications Magazine, 35(6):160–163, June 1997.

[CF04]

R. Chand and Pascal Felber. XNET: A Reliable Content-Based Publish/Subscribe System. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, pages 264–273. IEEE Computer Society, 2004.

[CFMP04]

Gianpaolo Cugola, Davide Frey, Amy L. Murphy, and Gian Pietro Picco. Minimizing the reconfiguration overhead in content-based publish-subscribe. In Proceedings of the 2004 ACM symposium on Applied computing - SAC ’04, pages 1134–1140. ACM Press, 2004.

[CH03]

Reuven Cohen and Shlomo Havlin. Scale-Free Networks Are Ultrasmall. Physical Review Letters, 90(5):1–4, February 2003.

[CH10]

Reuven Cohen and Shlomo Havlin. Complex Networks: Structure, Robustness and Function. Cambridge University Press, 2010.

[CHHL05]

Kuan-Ta Chen, Polly Huang, Chun-Ying Huang, and Chin-Laung Lei. Game traffic analysis. In Proceedings of the international workshop on Network and operating systems support for digital audio and video - NOSSDAV ’05, pages 19–24. ACM Press, June 2005.


[Cla11]

Colin Clark. Behind the numbers. https://exchanges.nyx.com/cclark/improving-speed-and-transparency-market-data, 2011. Blog, Accessed: 2012-10-30.

[CLBF06]

Dave Clark, Bill Lehr, Steve Bauer, and Peyman Faratin. Overlay Networks and the Future of the Internet. Communication & Strategies, 63:1–21, 2006.

[CLRS09]

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. The MIT Press, 2009.

[CMM09]

Gianpaolo Cugola, Alessandro Margara, and Matteo Migliavacca. Contextaware publish-subscribe: Model, implementation, and evaluation. In Proceedings of the 2009 IEEE Symposium on Computers and Communications, pages 875–881. IEEE Computer Society, July 2009.

[Coh03]

Bram Cohen. Incentives build robustness in BitTorrent. In Proceedings of the 2003 Workshop on Economics of Peer-to-Peer systems, volume 6, pages 68–72, 2003.

[CPSL05]

J. Crowcroft, M. Pias, R. Sharma, and S. Lim. A survey and comparison of peer-to-peer overlay network schemes. IEEE Communications Surveys & Tutorials, 7(2):72–93, January 2005.

[CQ06]

Angelo Corsaro and Leonardo Querzoni. Quality of service in publish/subscribe middleware. In Roberto Baldoni, Giovanni Cortese, Fabrizio Davide, and Angelo Melpignano, editors, Global Data Management, Emerging Communication: Studies in New Technologies and Practices in Communication, pages 79–97. IOS Press, 2006.

[Cro10]

Rob Crossley. Study: Average Dev costs as high as $28M. http://www.develop-online.net/news/33625/Study-Average-dev-cost-as-high-as-28m, 2010. Online Magazine, Accessed: 2012-08-29.

[CRW01]

Antonio Carzaniga, David S. Rosenblum, and Alexander L. Wolf. Design and evaluation of a wide-area event notification service. ACM Transactions on Computer Systems, 19(3):332–383, August 2001.


[CV01]

A. Casimiro and P. Verissimo. Using the timely computing base for dependable QoS adaptation. In Proceedings of the 20th IEEE Symposium on Reliable Distributed Systems, pages 208–217. IEEE Computer Society, 2001.

[CW05]

F. Chang and J. Walpole. A traffic characterization of popular on-line games. IEEE/ACM Transactions on Networking, 13(3):488–500, June 2005.

[Dat02]

Mayur Datar. Butterflies and Peer-to-Peer Networks. Technical report, Stanford InfoLab, Stanford, USA, 2002.

[DBGM02] H. Deshpande, M. Bawa, and H. Garcia-Molina. Streaming live media over peers. Technical report, Stanford InfoLab, Stanford, USA, 2002.

[DC90] Stephen E. Deering and David R. Cheriton. Multicast routing in datagram internetworks and extended LANs. ACM Transactions on Computer Systems, 8(2):85–110, May 1990.

[DDKN11]

Sebastian Deterding, Dan Dixon, Rilla Khaled, and Lennart Nacke. From Game Design Elements to Gamefulness : Defining “Gamification”. In Proceedings of the 15th International Academic MindTrek Conference: Envisioning Future Media Environments - MindTrek ’11, pages 9–15. ACM Press, 2011.

[DESS11]

A. Dworak, F. Ehm, W. Sliwinski, and M. Sobczak. Middleware Trends and Market Leaders 2011. In Proceedings of the 13th International Conference on Accelerator and Large Experimental Physics Control Systems, Grenoble, France, 2011. CERN Press.

[DFC12]

DFC Intelligence. Worldwide Market Forecasts for the Video Game and Interactive Entertainment Industry. Technical report, DFC Intelligence, 2012.

[DGH+06]

Alan Demers, Johannes Gehrke, Mingsheng Hong, Mirek Riedewald, and Walker White. Towards Expressive Publish/Subscribe Systems. In Yannis Ioannidis, Marc H Scholl, Joachim W Schmidt, Florian Matthes, Mike Hatzopoulos, Klemens Boehm, Alfons Kemper, Torsten Grust, and Christian Boehm, editors, Advances in Database Technology - EDBT 2006, volume 3896 of Lecture Notes in Computer Science, chapter 38, pages 627–644. Springer-Verlag, Berlin/Heidelberg, 2006.


[DGK+09]

Alan Demers, Johannes Gehrke, Christoph Koch, Ben Sowell, and Walker White. Database research in computer games. In Proceedings of the 35th SIGMOD international conference on Management of data - SIGMOD ’09, pages 1011–1014. ACM Press, 2009.

[DLS88]

Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. Journal of the ACM, 35(2):288–323, April 1988.

[Dob07]

Jason Dobson. SIG Analysts Examine The Deteriorating Video Game Life Cycle. http://www.gamasutra.com/view/news/103540/SIG_Analysts_Examine_The_Deteriorating_Video_Game_Life_Cycle.php, 2007. Online Magazine, Accessed: 2013-08-25.

[Dol07]

George Dolbier. Massively multiplayer online games, Part 1: A performance-based approach to sizing infrastructure. http://www.ibm.com/developerworks/library/wa-mmogame1/, 2007. Online Magazine, Accessed: 2013-05-22.

[DP10]

Waltenegus Dargie and Christian Poellabauer. Fundamentals of Wireless Sensor Networks - Theory and Practice. John Wiley & Sons Ltd., 2010.

[DSU04]

Xavier Defago, Andre Schiper, and Peter Urban. Total order broadcast and multicast algorithms: Taxonomy and survey. ACM Computing Surveys (CSUR), 36(4):372–421, 2004.

[DZD+03]

Frank Dabek, B. Zhao, Peter Druschel, John Kubiatowicz, and I. Stoica. Towards a common API for structured peer-to-peer overlays. In Peer-to-Peer Systems II, Lecture Notes in Computer Science, pages 33–44. Springer, 2003.

[ECR13]

Christian Esposito, Domenico Cotroneo, and Stefano Russo. On reliability in publish/subscribe services. Computer Networks, 57(5):1318–1343, January 2013.

[EFGK03]

Patrick Th. Eugster, Pascal Felber, Rachid Guerraoui, and Anne-Marie Kermarrec. The many faces of publish/subscribe. ACM Computing Surveys, 35(2):114–131, June 2003.

[EGH05]

Patrick Th. Eugster, B. Garbinato, and A. Holzer. Location-based Publish/Subscribe. In Proceedings of the Fourth IEEE International Symposium on Network Computing and Applications, pages 279–282. IEEE Computer Society, 2005.

[EN10]

O Etzion and P Niblett. Event processing in action. Manning Publications Company, 2010.

[ETJ09]

Peter S. Excell, Wanqing Tu, and Xing Jin. Performance Analysis for Overlay Multimedia Multicast on r-ary Tree and m-D Mesh Topologies. IEEE Transactions on Multimedia, 11(4):696–706, June 2009.

[Eug07]

Patrick Th. Eugster. Type-based publish/subscribe. ACM Transactions on Programming Languages and Systems, 29(1), January 2007.

[FCBS08]

Halldor Fannar, Victoria Coleman, Randy Breen, and Brandon Van Slyke. The server technology of EVE Online: How to cope with 300,000 players on one server. Talk at the 2008 Game Developer Conference, Austin, Texas, 2008.

[FDI+10]

Thomas Fischer, Michael Daum, Florian Irmert, Christoph Neumann, and Richard Lenz. Exploitation of event-semantics for distributed publish/subscribe systems in massively multiuser virtual environments. In Proceedings of the Fourteenth International Database Engineering & Applications Symposium, pages 90–97, Montreal, Canada, 2010. ACM Press.

[FHL11]

Thomas Fischer, Johannes Held, and Richard Lenz. M2etis: An adaptable Publish/Subscribe System for MMVEs based on Event Semantics. In Peter M Fischer, Hagen Höpfner, Joachim Klein, Daniela Nicklas, Bernhard Seeger, Tobias Umblia, and Matthias Virgin, editors, Proceedings of the BTW 2011, Workshops und Studierendenprogramm, pages 23–32. Schriftenreihe des Fachbereichs Informatik, 2011.

[FHLL11]

Thomas Fischer, Johannes Held, Frank Lauterwald, and Richard Lenz. Towards an adaptive event dissemination middleware for MMVEs. In Proceedings of the 5th ACM international conference on Distributed event-based systems - DEBS '11, pages 397–398. ACM Press, 2011.

[FHM+09]

Thomas Fischer, Gregor Hohmann, Tino Münster, Frank Lauterwald, and Richard Lenz. Improving data quality of routing data for clinical research. In Tagungsband der 54. GMDS Jahrestagung, Essen, Germany, 2009. Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie.

[Fis08]

Thomas Fischer. Laufzeitadaption in einem serviceorientierten Komponentenframework. In Informatiktage 2008 - Fachwissenschaftlicher Informatik-Kongress, Lecture Notes in Informatics, pages 161–165. Gesellschaft für Informatik, Bonn, Germany, 2008.

[FJL+01]

Françoise Fabret, H. Arno Jacobsen, François Llirbat, João Pereira, Kenneth A. Ross, and Dennis Shasha. Filtering algorithms and implementation for very fast publish/subscribe systems. ACM SIGMOD Record, 30(2):115–126, June 2001.

[FK04]

Sonia Fahmy and Mineseok Kwon. Characterizing overlay multicast networks and their costs. Technical report, Purdue University, West Lafayette, 2004.

[FK07]

Sonia Fahmy and Mineseok Kwon. Characterizing overlay multicast networks and their costs. IEEE/ACM Transactions on Networking (TON), 15(2):373–386, 2007.

[FL10a]

Thomas Fischer and Richard Lenz. Event semantics in event dissemination architectures for massive multiuser virtual environments. In Proceedings of the Fourth ACM International Conference on Distributed Event-Based Systems - DEBS ’10, pages 93–94, Cambridge, United Kingdom, 2010. ACM Press.

[FL10b]

Thomas Fischer and Richard Lenz. Towards Exploitation of Event Semantics in Event Dissemination Architectures for Massive Multiuser Virtual Environments. In DEBS PhD Workshops, Fourth ACM international Conference on Distributed Event-Based Systems, Cambridge, United Kingdom, 2010.

[FLS10]

Kai-Tai Fang, Run-ze Li, and Agus Sudjianto. Design and Modeling for Computer Experiments. CRC Press, 2010.

[FM90]

Alan O. Freier and Keith Marzullo. MTP: An Atomic Multicast Transport Protocol. Technical report, Cornell University, 1990.

[FR05]

Roberto S. Silva Filho and David F. Redmiles. Striving for versatility in publish/subscribe infrastructures. In Proceedings of the 5th international workshop on Software engineering and middleware - SEM ’05, pages 17–24. ACM Press, 2005.

[Fra00]

Paul Francis. Yoid: Extending the Internet Multicast Architecture. Technical report, Information Sciences Institute, University of Southern California, 2000.

[Fuj90]

Richard M Fujimoto. Parallel discrete event simulation. Commun. ACM, 33(10):30–53, 1990.

[FZB+04]

L. Fiege, A. Zeidler, A. Buchmann, R. Kilian-Kehr, and G. Muhl. Security aspects in publish/subscribe systems. In Proceedings of the 26th International Conference on Software Engineering - W18L Workshop International Workshop on Distributed Event-based Systems (DEBS 2004), pages 44–49. IET Digital Library, 2004.

[Gau05]

Wilhelm Gaus. Dokumentations- und Ordnungslehre: Theorie und Praxis des Information Retrieval (German Edition). eXamen.press. Springer, 5th edition, 2005.

[GEW06]

Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, March 2006.

[GHJV94]

Erich Gamma, Richard Helm, Ralph Johnson, and John M Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, November 1994.

[GL02]

Seth Gilbert and N. Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, 2002.

[GL12]

Seth Gilbert and Nancy Lynch. Perspectives on the CAP Theorem. Computer, 45(2):30–36, February 2012.

[GMS91]

Hector Garcia-Molina and AnneMarie Spauster. Ordered and reliable multicast communication. ACM Transactions on Computer Systems, 9(3):242–271, August 1991.

[GRW05]

S Götz, Simon Rieche, and Klaus Wehrle. Selected DHT Algorithms. In Ralf Steinmetz and Klaus Wehrle, editors, Peer-to-Peer Systems and Applications, Lecture Notes on Computer Science, chapter 8, pages 95–117. Springer, 2005.

[GS95]

John Gough and Glenn Smith. Efficient Recognition of Events in a Distributed System. In Proceedings of the 18th Australasian Computer Science Conference (ACSC 1995), 1995.

[HA90]

P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–309. IEEE Computer Society, 1990.

[HASG07]

Mojtaba Hosseini, Dewan Tanvir Ahmed, Shervin Shirmohammadi, and Nicolas D. Georganas. A Survey of Application-Layer Multicast Protocols. IEEE Communications Surveys & Tutorials, 9(3):58–74, 2007.

[HCC06]

Shun-Yun Hu, Jui-Fa Chen, and Tsu-Han Chen. VON: A Scalable Peer-to-Peer Network for Virtual Environments. IEEE Network, 20(4):22–31, 2006.

[Hel10]

Johannes Held. Entwicklung eines Frameworks für die Verteilungsoptimierung in Publish/Subscribe Systemen auf Basis eines strukturierten P2P-Overlay Netzwerks. Diploma thesis, Friedrich-Alexander University, Erlangen, Germany, 2010.

[Hen10]

Robert Henjes. Performance Evaluation of Publish/Subscribe Middleware Architectures. PhD thesis, Julius-Maximilians-Universität, Würzburg, Germany, 2010.

[HLMS06]

Andreas Hanemann, Athanassios Liakopoulos, Maurizio Molina, and D. Martin Swany. A study on network performance metrics and their composition. Campus-Wide Information Systems, 23(4):268–282, 2006.

[HMS09]

Joe Hoffert, Daniel Mack, and Douglas Schmidt. Using machine learning to maintain pub/sub system QoS in dynamic environments. In Proceedings of the 8th International Workshop on Adaptive and Reflective Middleware - ARM '09, pages 1–6, New York, New York, USA, 2009. ACM Press.

[HMS10]

Joe Hoffert, Daniel Mack, and Douglas Schmidt. Integrating Machine Learning Techniques to Adapt Protocols for QoS-enabled Distributed Realtime and Embedded Publish/Subscribe Middleware. Network Protocols and Algorithms, 2(3):1–33, October 2010.

[HPM09]

Richard S Hall, Karl Pauls, and Stuart McCulloch. OSGi in Action. Manning Publications Company, 1st edition, 2009.

[HSB09]

Annika Hinze, Kai Sachs, and Alejandro Buchmann. Event-based applications and enabling technologies. In Proceedings of the Third ACM International Conference on Distributed Event-Based Systems - DEBS ’09, pages 1–15. ACM Press, 2009.

[HSG07]

Joe Hoffert, Douglas Schmidt, and Aniruddha Gokhale. A QoS policy configuration modeling language for publish/subscribe middleware platforms. In Proceedings of the 2007 inaugural international conference on Distributed event-based systems - DEBS ’07, pages 140–145. ACM Press, 2007.

[HSG08]

Joe Hoffert, Douglas Schmidt, and Aniruddha Gokhale. DQML: A Modeling Language for Configuring Distributed Publish/Subscribe Quality of Service Policies. In On the Move to Meaningful Internet Systems: OTM 2008, Lecture Notes in Computer Science Volume 5331, pages 515–534. Springer, 2008.

[HSG10]

Joe Hoffert, Douglas Schmidt, and Aniruddha Gokhale. Adapting distributed real-time and embedded pub/sub middleware for cloud computing environments. In Middleware 2010, Lecture Notes in Computer Science Volume 6452, pages 21–41. Springer, 2010.

[HW03]

Gregor Hohpe and Bobby Woolf. Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley Professional, 2003.

[HY05]

TY Hsiao and SM Yuan. Practical middleware for massively multiplayer online games. IEEE Internet Computing, 9(5):47–54, 2005.

[IFLMW08]

Florian Irmert, Thomas Fischer, Frank Lauterwald, and Klaus Meyer-Wegener. The Storage System of a Runtime Adaptable DBMS. In Software Engineering for Tailor-made Data Management - Dagstuhl Seminar Proceedings, page 6, Dagstuhl, Germany, 2008. Dagstuhl: Leibniz-Zentrum für Informatik.

[IFLMW09]

Florian Irmert, Thomas Fischer, Frank Lauterwald, and Klaus Meyer-Wegener. The Adaptation Model of a Runtime Adaptable DBMS. In Proceedings of the 26th British National Conference on Databases: Dataspace: The Final Frontier - BNCOD 26, pages 189–192. Springer, 2009.

[IFMW08]

Florian Irmert, Thomas Fischer, and Klaus Meyer-Wegener. Runtime adaptation in a service-oriented component model. In Proceedings of the 2008 international workshop on Software engineering for adaptive and self-managing systems - SEAMS '08, pages 97–104. ACM, 2008.

[ILB+08]

Florian Irmert, Frank Lauterwald, Matthias Bott, Thomas Fischer, and Klaus Meyer-Wegener. Integration of dynamic AOP into the OSGi service platform. In Proceedings of the 2nd workshop on Middleware-application interaction affiliated with the DisCoTec federated conferences 2008 - MAI ’08, pages 25–30. ACM Press, 2008.

[IUKB04]

M Izal, G Urvoy-Keller, and EW Biersack. Dissecting bittorrent: Five months in a torrent’s lifetime. In Passive and Active Network Measurement, Lecture Notes in Computer Science Volume 3015, pages 1–11. Springer, 2004.

[Jac09]

A Jacobs. The pathologies of big data. Communications of the ACM, 52(8):36–44, 2009.

[JCL+10]

Hans-Arno Jacobsen, Alex Cheung, Guoli Li, Balasubramaneyam Maniymaran, Vinod Muthusamy, and Reza Sherafat Kazemzadeh. The PADRES Publish/Subscribe System. Technical report, Middleware Systems Research Group, University of Toronto, Toronto, Canada, 2010.

[Jef85]

David R Jefferson. Virtual Time. ACM Transactions on Programming Languages and Systems (TOPLAS), 7(3):404–425, 1985.

[JF06]

Z. Jerzak and C. Fetzer. Handling Overload in Publish/Subscribe Systems. In Proceedings of the 26th IEEE International Conference on Distributed Computing Systems Workshops (ICDCSW'06), pages 32–37. IEEE Computer Society, 2006.

[JMWP06]

MA Jaeger, G Mühl, Matthias Werner, and Helge Parzyjegla. Reconfiguring Self-stabilizing Publish/Subscribe Systems. In Large Scale Management of Distributed Systems, Lecture Notes in Computer Science Volume 4269, pages 233–238. Springer, 2006.

[JWB+04]

Daniel James, Gordon Walton, Nova Barlow, Elonka Dunin, Edward Castronova, David Kennerly, George Dolbier, and Justin Quimby. 2004 Persistent Worlds Whitepaper. Technical report, IGDA Online Games SIG, 2004.

[JWM00]

Prasad Jogalekar and Murray Woodside. Evaluating the Scalability of Distributed Systems. IEEE Transactions on Parallel and Distributed Systems, 11(6):589–603, 2000.

[KCP+13]

Kyriakos Kritikos, Manuel Carro, Barbara Pernici, Pierluigi Plebani, Cinzia Cappiello, Marco Comuzzi, Salima Benbernou, Ivona Brandic, Attila Kertész, and Michael Parkin. A survey on service quality description. ACM Computing Surveys, 46(1):1–58, October 2013.

[Keo02]

James Keogh. J2EE: the complete reference. McGraw-Hill/Osborne, 2002.

[KLXH04]

Björn Knutsson, Honghui Lu, W. Xu, and Bryan Hopkins. Peer-to-peer Support for Massively Multiplayer Games. In Proceedings of the Twenty-Third Annual Joint Conference of the IEEE Computer and Communications Societies - INFOCOM 2004, volume 1, pages 96–107. IEEE Computer Society, 2004.

[KR05]

Sandeep S. Kulkarni and Ravikant. Stabilizing causal deterministic merge. Journal of High Speed Networks, 14(2):155–183, 2005.

[KRT97]

Anssi Karhinen, Alexander Ran, and Tapio Tallgren. Configuring designs for reuse. ACM SIGSOFT Software Engineering Notes, 22(3):199–208, May 1997.

[Lam78]

Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.

[Lam79]

Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 100(9):690–691, 1979.

[LBC12]

Huaiyu Liu, Mic Bowman, and Francis Chang. Survey of State Melding in Virtual Worlds. ACM Computing Surveys, 44(4):21:1–21:25, 2012.

[LCC+02]

Qin Lv, Pei Cao, Edith Cohen, Kai Li, and Scott Shenker. Search and replication in unstructured peer-to-peer networks. In Proceedings of the 16th international conference on Supercomputing - ICS ’02, pages 84–95. ACM Press, 2002.

[LCGM05]

Li Lao, Jun-hong Cui, Mario Gerla, and Dario Maggiorini. A Comparative Study of Multicast Protocols: Top, Bottom, or In the Middle? In Proceedings of the 24th Annual Joint Conference of the IEEE Computer and Communications Societies - INFOCOM 2005, pages 2809–2814. IEEE Computer Society, 2005.

[Len97]

Richard Lenz. Adaptive Datenreplikation in verteilten Systemen. PhD thesis, Friedrich-Alexander University, Erlangen, Germany, 1997.

[LHJ05]

Guoli Li, Shuang Hou, and Hans-Arno Jacobsen. A Unified Approach to Routing, Covering and Merging in Publish/Subscribe Systems Based on Modified Binary Decision Diagrams. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS’05), pages 447–457. IEEE Computer Society, June 2005.

[Loh12]

S Lohr. The age of big data. http://wolfweb.unr.edu/homepage/ania/NYTFeb12.pdf, 2012. Newspaper article, online archive, accessed: 2014-01-08.

[LP03]

Y Liu and Beth Plale. Survey of publish subscribe event systems. Technical report, Computer Science Dept, Indiana University, Bloomington, USA, 2003.

[Luc02]

D C Luckham. The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems. Addison-Wesley, 2002.

[LVZ+10]

Dmitrij Lagutin, Kari Visala, Andras Zahemszky, Trevor Burbridge, and Giannis F. Marias. Roles and security in a publish/subscribe network architecture. In Proceedings of the IEEE Symposium on Computers and Communications, pages 68–74. IEEE Computer Society, June 2010.

[Mau00]

Martin Mauve. Consistency in replicated continuous interactive media. In Proceedings of the 2000 ACM conference on Computer supported cooperative work - CSCW ’00, pages 181–190. ACM Press, 2000.

[MBC79]

M. D. McKay, R. J. Beckman, and W. J. Conover. Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics, 21(2):239–245, May 1979.

[MBCK12]

Tobias R. Mayer, Lionel Brunie, David Coquil, and Harald Kosch. On reliability in publish/subscribe systems: a survey. International Journal of Parallel, Emergent and Distributed Systems, 27(5):369–386, 2012.

[MBE10]

Ken Moody, Jean Bacon, and David Evans. Implementing a practical spatiotemporal composite event language. In Kai Sachs, Ilia Petrov, and Pablo Guerrero, editors, From Active Data Management to Event-Based Systems and More - Papers in Honor of Alejandro Buchmann on the Occasion of His 60th Birthday, Lecture Notes in Computer Science Volume 6462, pages 108–123. Springer, 2010.

[MD10]

JL Martins and S Duarte. Routing algorithms for content-based publish/subscribe systems. IEEE Communications Surveys & Tutorials, 12(1):39–58, 2010.

[MFP06]

G Mühl, L Fiege, and P Pietzuch. Distributed Event-Based Systems. Springer, 2006.

[MG08]

Christoph P Mayer and Thomas Gamer. Integrating real world applications into OMNeT++. Technical report, Karlsruhe Institute of Technology, Karlsruhe, Germany, 2008.

[MKB07]

Shruti P. Mahambre, Madhu Kumar S.D., and Umesh Bellur. A Taxonomy of QoS-Aware, Adaptive Event-Dissemination Middleware. IEEE Internet Computing, 11(4):35–44, July 2007.

[MM06]

Reinhard Müller and Frank Mackenroth. German Entertainment and Media Outlook: 2006-2010. Technical report, PricewaterhouseCoopers, 2006.

[MMSW07]

Maged Michael, Jose E. Moreira, Doron Shiloach, and Robert W. Wisniewski. Scale-up x Scale-out: A Case Study using Nutch/Lucene. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, pages 1–8. IEEE Computer Society, 2007.

[Mos93]

D. Mosberger. Memory consistency models. ACM SIGOPS Operating Systems Review, 27(1):18–26, 1993.

[MTA09]

Rene Mueller, Jens Teubner, and Gustavo Alonso. Data processing on FPGAs. Proceedings of the VLDB Endowment, 2(1):910–921, 2009.

[Mü02]

Gero Mühl. Large-scale content-based publish-subscribe systems. PhD thesis, Technische Universität Darmstadt, Darmstadt, Germany, 2002.

[NDM12]

Elisabetta Di Nitto, Daniel J. Dubois, and Alessandro Margara. Reconfiguration Primitives for Self-Adapting Overlays in Distributed Publish-Subscribe Systems. In Proceedings of the IEEE Sixth International Conference on Self-Adaptive and Self-Organizing Systems, pages 99–108. IEEE Computer Society, September 2012.

[New12]

New York Stock Exchange Technologies. Thinking Outside the Stack: Data Fabric - a high performance, low latency, message-oriented middleware platform. Technical report, NYSE Technologies, New York, New York, USA, 2012.

[NFL10]

C.P. Neumann, Thomas Fischer, and Richard Lenz. OXDBS: extension of a native XML database system with validation by consistency checking of OWL-DL ontologies. In Proceedings of the Fourteenth International Database Engineering & Applications Symposium, pages 143–148. ACM, 2010.

[NPVS07]

Christoph Neumann, Nicolas Prigent, Matteo Varvello, and Kyoungwon Suh. Challenges in peer-to-peer gaming. ACM SIGCOMM Computer Communication Review, 37(1):79, January 2007.

[Obj12]

Object Management Group. Data distribution service for real-time systems - Version 1.2. Technical report, Object Management Group, 2012.

[Ora02]

Oracle. The Java Message Service Specification. Technical report, Oracle, 2002.

[Pap79]

Christos H. Papadimitriou. The serializability of concurrent database updates. Journal of the ACM, 26(4):631–653, October 1979.

[PB02]

P.R. Pietzuch and Jean Bacon. Hermes: A distributed event-based middleware architecture. In Proceedings of the 22nd International Conference on Distributed Computing Systems Workshops, pages 611–618. IEEE Computer Society, 2002.

[PB12]

Davy Preuveneers and Yolande Berbers. Towards energy-aware semantic publish/subscribe for wireless embedded systems. ICST Transactions on Ubiquitous Environments, 12(10-12):1–15, 2012.

[PBJ03]

Milenko Petrovic, Ioana Burcea, and Hans-Arno Jacobsen. S-ToPSS: semantic Toronto publish/subscribe system. In Proceedings of the 29th international conference on Very large data bases - VLDB ’03, pages 1101–1104. ACM Press, September 2003.

[PCEI07]

Adrian Popescu, Doru Constantinescu, David Erman, and Dragos Ilie. A Survey of Reliable Multicast Communication. In Proceedings of the 3rd EuroNGI Conference on Next Generation Internet Networks, pages 111–118. IEEE Computer Society, May 2007.

[Peh12]

Hristiyan Pehlivanov. Vergleich und Implementierung von Routing-Strategien zur Event-Verarbeitung im Kontext von MMVEs. Master's thesis, Friedrich-Alexander University, Erlangen, Germany, 2012.

[PEKS07]

Peter Pietzuch, David Eyers, Samuel Kounev, and Brian Shand. Towards a common API for publish/subscribe. In Proceedings of the 2007 inaugural international conference on Distributed event-based systems - DEBS ’07, pages 152–157. ACM Press, June 2007.

[PGS+10]

Helge Parzyjegla, Daniel Graff, A. Schröter, J. Richling, and Gero Mühl. Design and implementation of the rebeca publish/subscribe middleware. In Kai Sachs, Ilia Petrov, and Pablo Guerrero, editors, From Active Data Management to Event-Based Systems and More - Papers in Honor of Alejandro Buchmann on the Occasion of His 60th Birthday, Lecture Notes on Computer Science, pages 124–140. Springer, 2010.

[PHH+12]

Andreas Petlund, Pål Halvorsen, Pål Frogner Hansen, Torbjörn Lindgren, Rui Casais, and Carsten Griwodz. Network traffic from Anarchy Online. In Proceedings of the 3rd Conference on Multimedia Systems - MMSys '12, pages 95–100. ACM Press, February 2012.

[Pie04]

Peter Robert Pietzuch. Hermes: A Scalable Event-Based Middleware. PhD thesis, Cambridge University, Cambridge, United Kingdom, 2004.

[Pon11]

T Pongthawornkamol. Reliability and timeliness analysis of content-based publish/subscribe systems. PhD thesis, University of Illinois, Urbana, USA, 2011.

[Poo00]

Steven Poole. Trigger Happy: Videogames and the Entertainment Revolution. Arcade Publishing, 2000.

[PRR99]

CG Plaxton, R Rajaraman, and AW Richa. Accessing nearby copies of replicated objects in a distributed environment. Theory of Computing Systems, 32:241–280, 1999.

[PRU03]

G. Pandurangan, P. Raghavan, and E. Upfal. Building low-diameter peerto-peer networks. IEEE Journal on Selected Areas in Communications, 21(6):995–1002, August 2003.

[PW02a]

Lothar Pantel and Lars C. Wolf. On the suitability of dead reckoning schemes for games. In Proceedings of the 1st workshop on Network and system support for games - NETGAMES ’02, pages 79–84. ACM Press, April 2002.

[PW02b]

Lothar Pantel and L.C. Wolf. On Impact of Delay on Real-Time Multiplayer Games. In Proceedings of the 12th international workshop on Network and operating systems support for digital audio and video, pages 23–29. ACM, 2002.

[Qui86]

J.R. Quinlan. Induction of Decision Trees. Machine Learning, 1(1):81–106, March 1986.

[Ray96]

M Raynal. Logical time: Capturing causality in distributed systems. Computer, 29(2):49–56, 1996.

[RD01]

Antony Rowstron and Peter Druschel. Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. In Rachid Guerraoui, editor, Middleware 2001, Lecture Notes in Computer Science Volume 2218, pages 329–350. Springer, October 2001.

[RFH01]

Sylvia Ratnasamy, Paul Francis, and Mark Handley. A scalable content-addressable network. In Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications - SIGCOMM '01, pages 161–172. ACM Press, 2001.

[RKCD01]

Antony Rowstron, Anne-Marie Kermarrec, Miguel Castro, and Peter Druschel. Scribe: The Design of a Large-Scale Event Notification Infrastructure. In Jon Crowcroft and Markus Hofmann, editors, Networked Group Communication, volume 2233 of Lecture Notes in Computer Science, pages 30–43. Springer Berlin / Heidelberg, 2001.

[RS02]

S.G. Rao and S. Seshan. A case for end system multicast. IEEE Journal on Selected Areas in Communications, 20(8):1456–1471, October 2002.

[RS07]

Laura Ricci and Andrea Salvadori. Nomad: Virtual environments on P2P Voronoi overlays. In Proceedings of the 2007 OTM Confederated international conference on On the move to meaningful internet systems OTM’07, pages 911–920. ACM Press, 2007.

[RSS07]

Szabolcs Rozsnyai, Josef Schiefer, and Alexander Schatten. Concepts and models for typing events for event-based systems. In Proceedings of the 2007 inaugural international conference on Distributed event-based systems - DEBS ’07, page 62. ACM Press, 2007.

[RW06]

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, December 2006.

[RWY02]

A. Riabov, J.L. Wolf, and P.S. Yu. Clustering algorithms for content-based publication-subscription systems. In Proceedings of the 22nd International Conference on Distributed Computing Systems, pages 133–142. IEEE Computer Society, 2002.

[SAB+00]

Bill Segall, David Arnold, Julian Boot, Michael Henderson, and Ted Phelps. Content Based Routing with Elvin4. In Proceedings of the AUUG2K conference, pages 55–65, Canberra, Australia, 2000. AUUG Inc.

[SAKB10]

Kai Sachs, Stefan Appel, Samuel Kounev, and Alejandro Buchmann. Benchmarking publish/subscribe-based messaging systems. In Database Systems for Advanced Applications, Lecture Notes in Computer Science Volume 6193, pages 203–214. Springer, 2010.

[SBC05]

Thirunavukkarasu Sivaharan, Gordon Blair, and Geoff Coulson. Green: A configurable and re-configurable publish-subscribe middleware for pervasive computing. In On the Move to Meaningful Internet Systems 2005: CoopIS, DOA, and ODBASE, Lecture Notes in Computer Science Volume 3760, pages 732–749. Springer, 2005.

[Sch06]

D.C. Schmidt. Guest Editor’s Introduction: Model-Driven Engineering. Computer, 39(2):25–31, February 2006.

[Sez05]

Ali Sezgin. On the definition of sequential consistency. Information processing letters, 96(6):193–196, 2005.

[SF11]

Michael Schulze and Marcus Förster. AFFIX – Ein ressourcenbewusstes Framework für Kontextinformationen in eingebetteten verteilten Systemen. In Proceedings of the Workshops der wissenschaftlichen Konferenz Kommunikation in verteilten Systemen 2011 (WowKiVS 2011). European Association of Software Science and Technology, 2011.

[SFFF03]

G. Siganos, M. Faloutsos, P. Faloutsos, and C. Faloutsos. Power laws and the AS-level internet topology. IEEE/ACM Transactions on Networking, 11(4):514–524, August 2003.

[SKH02]

Jouni Smed, T. Kaukoranta, and Harri Hakonen. A review on networking and multiplayer computer games. Technical Report 454, Turku Centre for Computer Science, Turku, Finland, 2002.

[Ski05]

Max Skibinsky. The Quest for Holy Scale - Part 1: Large Scale Computing. In Thor Alexander, editor, Massively Multiplayer Game Development 2, chapter 2.13, pages 339–355. Charles River Media, 2005.

[SLI11]

Mudhakar Srivatsa, Ling Liu, and Arun Iyengar. EventGuard. ACM Transactions on Computer Systems, 29(4):1–40, December 2011.

[SMHA12]

Daniel Schreiber, Max Mühlhäuser, Aristotelis Hadjakos, and Erwin Aitenbichler. Configurable Middleware for Multimedia Collaboration Applications. In Proceedings of the IEEE International Symposium on Multimedia, pages 332–339. IEEE Computer Society, December 2012.

[SMK+01]

Ion Stoica, Robert Morris, David Karger, M.F. Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review, 31(4):149–160, 2001.

[Som07]

Ian Sommerville. Software Engineering. Addison-Wesley, 8th edition, 2007.

[Sto12]

Michael Stonebraker. What Does 'Big Data' Mean? http://cacm.acm.org/blogs/blog-cacm/155468-what-does-big-data-mean/fulltext, 2012. Blog@CACM, accessed 2014-01-02.

[SW05]

Ralf Steinmetz and Klaus Wehrle. What is this "Peer-to-Peer" about? In Ralf Steinmetz and Klaus Wehrle, editors, Peer-to-peer Systems and Applications, Lecture Notes on Computer Science Volume 3485, chapter 2, pages 9–16. Springer, 2005.

[SZ99]

Sandeep Singhal and Michael Zyda. Networked Virtual Environments: Design and Implementation. Addison-Wesley Professional, New York, NY, USA, 1999.

[Tan95]

Andrew S. Tanenbaum. Distributed Operating Systems. Prentice-Hall, Inc., New Jersey, 1995.

[Tar12]

Sasu Tarkoma. Publish/Subscribe Systems: Design and Principles. Wiley Series on Communications Networking & Distributed Systems. Wiley, 2012.

[TBF+03]

Wesley W. Terpstra, Stefan Behnel, Ludger Fiege, Andreas Zeidler, and Alejandro P. Buchmann. A peer-to-peer approach to content-based publish/subscribe. In Proceedings of the 2nd international workshop on Distributed event-based systems - DEBS ’03, pages 1–8. ACM Press, June 2003.

[TKK11]

MA Tariq, Boris Koldehofe, and GG Koch. Meeting subscriber-defined QoS constraints in publish/subscribe systems. Concurrency and Computation: Practice and Experience, 23(17):2140–2153, 2011.

[UPS11]

Guido Urdaneta, Guillaume Pierre, and Maarten Van Steen. A survey of DHT security techniques. ACM Computing Surveys, 43(2):1–49, January 2011.

[Val13]

Roland Vallery. Vergleich und Implementierung von inhaltsbasierten Filterstrategien zur Event-Verarbeitung im Kontext von MMVEs. Bachelor's thesis, Friedrich-Alexander University, Erlangen, Germany, 2013.

[Vel95]

Todd Veldhuizen. Expression Templates. C++ Report, 7(5):26–31, 1995.

[Vir03]

Antonino Virgillito. Publish/Subscribe Communication Systems: from Models to Applications. PhD thesis, Università degli Studi di Roma "La Sapienza", 2003.

[Vog09]

W. Vogels. Eventually Consistent. Communications of the ACM, 52(1):40–44, 2009.

[Wah13]

Andreas M. Wahl. Automatisierung der dienstgütebezogenen Konfiguration einer Publish-Subscribe Middleware. Master's thesis, Friedrich-Alexander University, Erlangen, Germany, 2013.

[Wey98]

E.J. Weyuker. Testing component-based software: a cautionary tale. IEEE Software, 15(5):54–59, 1998.

[WFL14]

Andreas M. Wahl, Thomas Fischer, and Richard Lenz. MATINEE: A Quality-of-Service-aware Event Semantics Modeling Language. Technical report, Department of Computer Science, Friedrich Alexander University, Erlangen, Germany, 2014.

[WGR05]

Klaus Wehrle, Stefan Götz, and Simon Rieche. Distributed Hash Tables. In Ralf Steinmetz and Klaus Wehrle, editors, Peer-to-peer Systems and Applications, Lecture Notes on Computer Science Volume 3485, chapter 7, pages 79–93. Springer Berlin Heidelberg, 2005.

[WJL04]

Jinling Wang, Beihong Jin, and Jing Li. An ontology-based publish/subscribe system. In Proceedings of the 5th ACM/IFIP/USENIX international conference on Middleware, pages 232–253. ACM Press, October 2004.

[WJLS04]

Jinling Wang, Beihong Jin, Jing Li, and Danhua Shao. A semantic-aware publish/subscribe system with RDF patterns. In Proceedings of the 28th Annual International Computer Software and Applications Conference, COMPSAC, pages 141–146. IEEE, 2004.

[WKG+07]

Walker White, Christoph Koch, Nitin Gupta, Johannes Gehrke, and Alan Demers. Database Research Opportunities in Computer Games. ACM SIGMOD Record, 36(3):7–13, 2007.

[WL03]

Chonggang Wang and Bo Li. Peer-to-peer overlay networks: A survey. Technical report, The Hong Kong University of Science and Technology, Hong Kong, 2003.

[WS98]

Duncan J. Watts and Steven H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, June 1998.

[WYY05]

Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu. Finding Collisions in the Full SHA-1. In Advances in Cryptology – CRYPTO 2005, Lecture Notes in Computer Science Volume 3621, pages 17–36. Springer, 2005.

[XG03]

Xiao Fan Wang and Guanrong Chen. Complex networks: Small-world, scale-free and beyond. IEEE Circuits and Systems Magazine, 3(1):6–20, 2003.

[YGM94]

Tak W. Yan and Héctor García-Molina. Index structures for selective dissemination of information under the Boolean model. ACM Transactions on Database Systems, 19(2):332–364, June 1994.

[YK13]

Amir Yahyavi and Bettina Kemme. Peer-to-peer architectures for massively multiplayer online games: A survey. ACM Computing Surveys (CSUR), 46(1):9, October 2013.

[YMG08]

Jennifer Yick, Biswanath Mukherjee, and Dipak Ghosal. Wireless sensor network survey. Computer Networks, 52(12):2292–2330, August 2008.

[ZER09]

N. Zolotorevsky, Opher Etzion, and Y.G. Rabinovich. Spatial perspectives in event processing. In Proceedings of the Third ACM International Conference on Distributed Event-Based Systems, pages 32–54. ACM Press, 2009.

[Zha11]

Qinping Zhao. 10 Scientific Problems in Virtual Reality. Communications of the ACM, 54(2):116, February 2011.

[ZKJ01]

Ben Y. Zhao, John D. Kubiatowicz, and Anthony D. Joseph. Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing. Technical report, University of California at Berkeley, Berkeley, CA, USA, April 2001.

[ZMJ12]

Kaiwen Zhang, Vinod Muthusamy, and Hans-Arno Jacobsen. Total Order in Content-Based Publish/Subscribe Systems. In Proceedings of the 32nd IEEE International Conference on Distributed Computing Systems, pages 335–344. IEEE Computer Society, June 2012.

[ZZJ+ 01]

Shelley Q. Zhuang, Ben Y. Zhao, Anthony D. Joseph, Randy H. Katz, and John D. Kubiatowicz. Bayeux: an architecture for scalable and fault-tolerant wide-area data dissemination. In Proceedings of the 11th international workshop on Network and operating systems support for digital audio and video - NOSSDAV ’01, pages 11–20. ACM Press, January 2001.

Symbols

D_A : Multidimensional classification of an application A
A : Symbol for an application

c : Symbol for a channel
C : Set of channels
Ψ : Symbol for an instance of a multidimensional classification

Γ_A : Symbol for an application description for an application A
Γ_domain : Symbol for a domain profile
Γ_τ : Symbol for an event-type description for event-type τ
Γ_net : Symbol for a network profile
Γ_y : Symbol for a strategy description of strategy y
D : Symbol for a dimension

o : Symbol for an edge in an overlay
O : Set of edges in an overlay
e : Symbol for an event
E : Set of events
τ : Symbol for an event-type
D_τ : Multidimensional classification of event-type τ
U : Set of event-types

f : Symbol for a filter
F : Symbol for a filter predicate
F : Set of filters

k : Symbol for a key
K : Set of keys (key space)

m : Symbol for a message
Θ : Symbol for a message dissemination sequence
M : Set of messages
D : Set of delivered messages
Ω_τ : Set of QoS requirements for event-type τ
Ω_limit : Set of default limits for metrics
Ω_system : Set of system metrics

G : Symbol for an overlay network
l_App : Symbol for an application node
L_App : Set of application nodes
l_Net : Symbol for an overlay node
L_Net : Set of overlay nodes
l_Tree : Symbol for a tree node
L_Tree : Set of tree nodes

λ : Symbol for a parameter
Λ_conf : Set of configuration parameters
Λ_D : Set of parameters that describe dimension D
Λ_τ : Set of parameters, specific to event-type τ
Λ_net : Set of network parameters
Λ_system : Set of system attributes
Λ_terms : Set of domain-specific parameters, called terms

y : Symbol for a strategy
Y : Symbol for a configuration consisting of strategies
y : Set of strategies
Y : Symbol for a strategy-type
Y : Set of strategy-types

T : Symbol for a multicast tree
R : Set of receiving nodes in a multicast tree
s : Symbol for the root node of a multicast tree
T : Set of multicast trees

Acronyms

ALM : application-layer multicast
AoI : area-of-interest
API : application programming interface
AS : autonomous system
CAN : Content Addressable Network
CDN : content delivery network
CEP : complex event processing
DDS : Data Distribution Service
DEBS : distributed event-based system
DHT : distributed hash-table
DSL : domain-specific language
DVE : Distributed Virtual Environment
FPS : first person shooter
HPC : high performance computing
ISP : internet service provider
KBR : key-based routing
LSH : Latin Hypercube Sampling
MAESTRO : M²etis Adaptive System Configurator
MATINEE : M²etis Quality-of-Service-aware Semantics Modeling Language
MIT : Massachusetts Institute of Technology
MMOG : Massively Multiplayer Online Game
MMVE : Massively Multiuser Virtual Environment
NVE : Networked Virtual Environment
NYSE : New York Stock Exchange
OMG : Object Management Group
P2P : peer-to-peer
QoS : Quality-of-Service
RDMA : Remote Direct Memory Access
RPG : role playing game
SPS : stream processing system
VPN : virtual private network
WAN : wide area network
WoW : World of Warcraft
WSN : wireless sensor network

Glossary

application-layer multicast
Application-layer multicast provides multicast capabilities realized on OSI layer 7. It is not as efficient as IP multicast, which resides on OSI layer 3, but it does not require router support.

area-of-interest
The area-of-interest is the area around an entity in a virtual world for which the entity is interested in events produced by other entities. An entity may have more than one area-of-interest, for example one for interaction and one for visibility.

autonomous system
An autonomous system (AS) is a connected group of one or more IP prefixes run by one or more network operators under a single and clearly defined routing policy (RFC 1930).

Content Addressable Network
The Content Addressable Network (CAN) is a structured overlay network that uses an n-dimensional key-space to address content.

Content Delivery Network
Content delivery networks are globally spanning networks for the distribution of digital objects such as pictures or videos.

Data Distribution Service
The Data Distribution Service is a standard for an anonymous publish-subscribe architecture, proposed by the Object Management Group (OMG).

distributed hash-table
A distributed hash-table provides an addressing scheme for distributed systems that works similarly to a hash-table. It is commonly used in the construction of structured peer-to-peer overlays to abstract the address space from IP.

first person shooter
A first person shooter is a game genre in which the player controls an avatar from a first-person view and uses different kinds of weapons. Games of this genre are particularly fast paced and require fast hand-eye coordination.

Quality-of-Service
Quality-of-Service describes the quality a certain service offers. A user can specify QoS requirements for a service, and a service can give QoS guarantees. Such guarantees are quantified by QoS metrics.

role playing game
In a role playing game, the player controls an avatar in a virtual world. He assumes the role of the avatar and plays through a story by performing certain tasks in that role.

virtual private network
A virtual private network extends a private IP space over the borders of a public network like the internet by means of encryption. It enables, for example, the connection of different private LANs over the internet without the need for dedicated WAN connections.

World of Warcraft
World of Warcraft was the most successful Massively Multiplayer Online Game (MMOG) produced until 2012, with a peak of over 12 million paying subscribers in 2011.

gamification
Gamification refers to the use of design elements characteristic for games (rather than full-fledged games) in non-game contexts [DDKN11].

M²etis
M²etis is a publish-subscribe middleware exploiting event semantics to gain optimized dissemination characteristics.

publish-subscribe
Publish-subscribe is a design paradigm for building distributed and loosely coupled systems. It realizes asynchronous one-to-many communication.

List of Figures

3.1 Trend of subscriptions for World of Warcraft (Source: Activision Blizzard)
3.2 MMVE Reference Architecture following [JWB+04]
3.3 Architecture of Eve Online following [FCBS08]
3.4 Application model with partitions
3.5 Area-of-Interest and event distribution
3.6 Screenshots of the Tri6 game

4.1 KBR routing scheme
4.2 Chord routing example (6-bit key-space)
4.3 Pastry routing state example for node 34F1 (16-bit key-space, b=4)
4.4 Pastry routing example (16-bit key-space, b=4)
4.5 2-dimensional CAN key-space with 6-bit keys along each dimension
4.6 Exemplary internet topology on domain level following [CDZ97]
4.7 Illustrations of the different network models

5.1 Event-based system components following Mühl [MFP06]
5.2 Decision tree example with two filters
5.3 Outline of the matching algorithm for general boolean expressions [Bit08]
5.4 Structure of broker-based routing topologies
5.5 Routing operations in broker-based routing topologies
5.6 Tree-like overlay structure
5.7 Mesh-first vs. tree-first group management

8.1 Abstraction layers
8.2 Reference model of the overlay network layer

9.1 Basic components of the notification service framework
9.2 Strategies and their responsibility during dissemination
9.3 Broker-based vs. hierarchical topologies
9.4 Example for a tree with filtered attributes
9.5 Message type responsibilities
9.6 Abstract routing process
9.7 Abstract dissemination process
9.8 Abstract forwarding process
9.9 Abstract delivery process
9.10 Abstract local delivery process
9.11 Coarse configuration workflow

10.1 Overview of the configuration model
10.2 Abstraction of the multidimensional application classification

11.1 Basic solution workflow
11.2 Structure of the simulation model
11.3 Overview of the naive workflow
11.4 Candidate identification process
11.5 Location of parameters for the deduction of system attributes [Wah13]
11.6 Overview of the optimized workflow
11.7 Mathematical representation of physical systems and their meta-modeling [FLS10]
11.8 Meta model generation

12.1 Architectural overview of M²etis

13.1 Namespaces and interfaces of the M²etis library
13.2 Thread model of the M²etis library
13.3 Logical structure of a message
13.4 Simplified network abstraction layer of the M²etis library
13.5 Simplified notification service layer of the M²etis library
13.6 Routing publish messages
13.7 Delivery of publish and notify messages
13.8 Routing of subscribe and unsubscribe messages
13.9 Delivery of subscribe, unsubscribe and control messages
13.10 Forwarding subscribe and unsubscribe messages

14.1 Layers of the simulator architecture

15.1 Architecture of MAESTRO [Wah13]

17.1 Size of the library with different routing strategies
17.2 Size of additional strategies
17.3 Influence of the number of channels on the library's size
17.4 Resource consumption of M²etis
17.5 Simulations vs. real-world measurements
17.6 Performance for one-to-many distribution
17.7 Throughput for one-to-many distribution
17.8 Performance for many-to-many distribution [Wah13]
17.9 Throughput for many-to-many distribution
17.10 Impact of order strategies on SpreadIt routing (one-to-many)
17.11 Lines of code required for configuration
17.12 Comparison of M²etis and ADAMANT - average path latency
17.13 Throughput comparison for test 1 [msg/sec]
17.14 Cumulative latency comparison for test 2
17.15 Regression for different routing strategies [Wah13]
17.16 Central server - average path latency [Wah13]
17.17 Central server - average event loss [Wah13]
17.18 Mean absolute error depending on the ratio used from the full training set (SpreadIt routing)
17.19 Simulation duration for direct broadcast [Wah13]
17.20 Speedup of simulation duration [Wah13]

List of Tables

3.1 Event-type characteristics for the movement use case
3.2 Event-type characteristics for the collision use case
3.3 Event-type characteristics for the chat use case
3.4 Event-type characteristics for the match coordination use case

4.1 Comparison of lookup concepts following [WGR05]
4.2 Comparison of selected structured overlay schemes following [CPSL05, GRW05]
4.3 Selected network characteristics following [XG03]

5.1 Routing algorithms and their use-cases [MFP06]
5.2 Classification of selected ALM algorithms, following [HASG07, BB02]
5.3 Classification of selected total order algorithms, following [DSU04]
5.4 Taxonomy of filter and routing capabilities for publish-subscribe middleware – part 1
5.5 Taxonomy of QoS and reliability capabilities for publish-subscribe middleware – part 1
5.6 Taxonomy of adaptability capabilities for publish-subscribe middleware – part 1
5.7 Taxonomy of filter and routing capabilities for publish-subscribe middleware – part 2
5.8 Taxonomy of QoS and reliability capabilities for publish-subscribe middleware – part 2
5.9 Taxonomy of adaptability capabilities for publish-subscribe middleware – part 2

9.1 Matrix for strategy application in the interaction model, based on [FHL11]

10.1 Important system attributes and metrics
10.2 Classification of the context dimension
10.3 Classification of the validity dimension
10.4 Classification of the synchronization dimension

17.1 Values for each system attribute used in simulations, following [Wah13]
17.2 Quality of meta-models for the average path latency
