ENFORCING USER-DEFINED MANAGEMENT LOGIC IN LARGE SCALE SYSTEMS

Hemapani Srinath Perera Department of Computer Science Indiana University

Submitted to the faculty of the Graduate School in partial fulfillment of the requirements for the degree Doctor of Philosophy in the Department of Computer Science Indiana University

May 2009


Accepted by the Graduate Faculty, Indiana University, in partial fulfillment of the requirements of the degree of Doctor of Philosophy.

Doctoral Committee

Prof. Dennis Gannon, PhD. (Principal Advisor)
Prof. Beth Plale, PhD.
Prof. Geoffrey Fox, PhD.
Prof. David Leake, PhD.
Dr. Sanjiva Weerawarana, PhD.

March 25, 2009

Copyright © 2009
Hemapani Srinath Perera
Department of Computer Science, Indiana University
ALL RIGHTS RESERVED

To my wife, to my parents, and to Dr. Sanjiva Weerawarana with gratitude


Acknowledgements

This dissertation would have been impossible without the guidance of my advisor, Prof. Dennis Gannon, whose insights and encouragement made this thesis possible. I would like to thank him for his unceasing advice and support throughout this period. I also want to thank my committee, Prof. Geoffrey Fox, Prof. Beth Plale, Prof. David Leake, and Dr. Sanjiva Weerawarana. Furthermore, I would like to thank Dr. Sanjiva Weerawarana for convincing me to read for a PhD and for his unceasing attention and advice through the years. I owe my deepest gratitude to my wife, Miyuru, for her care, for her support, and for spending long years away from her home and parents on my behalf. I am indebted to her and to my parents, who have been beacons of my life and will ever remain so. I also want to thank Jaliya Ekanayake for his feedback and support throughout my course of study, Lindsey Pauley for help with editing, and Suresh Marru and my fellow members of the Extreme Lab for their insights, feedback, help, friendship, and, last but not least, extreme coffee hours.

Abstract

Due to advances in distributed systems, social motivations, and economic motivations, the scale of systems is on the rise. In large-scale systems, changes—caused by failures, maintenance, and additions—are a norm rather than an exception, and therefore, manually keeping these systems running is difficult, if not impossible. System management, which monitors and controls systems, is a prominent solution to this problem. However, management usecases differ from system to system, yet developing a specific management framework for each system defeats the purpose of building system management frameworks in the first place. Management frameworks that enforce management logic authored by users provide a solution to this problem. These frameworks enable users to change the framework's decision logic to cater to their specific requirements, and once deployed, they monitor and control target systems in accordance with the user-defined management logic. If such logic asserts only a single component of the system, we call it local logic, and if it asserts multiple components of the system, we call it global logic. Global logic depends on a global view of the system, which is non-trivial to support in large-scale systems. However, it enables users to reason about the target system explicitly and, therefore, provides a natural way to express management usecases.

This dissertation presents a new, dynamic, and robust management architecture that manages large-scale systems by enforcing user-defined management logic that depends on a global view of the managed system. Using empirical analysis, we have shown that it scales to manage 100,000 resources, which demonstrates that the architecture can manage most practical systems. This is a testament that despite its dependency on a global view of the managed system, a system management framework can manage systems in accordance with user-defined management logic and can still scale to manage most real world systems. Furthermore, we have demonstrated that the architecture is robust in the face of failures and stable with respect to different operational conditions.

Contents

Acknowledgements

Abstract

1 Introduction
  1.1 Motivation
    1.1.1 Large Scale Systems
    1.1.2 Sustaining Large Scale Systems
  1.2 Background and Terminology
  1.3 The Problem
  1.4 Contributions
  1.5 Outline

2 Background and Related Works
  2.1 Outline
  2.2 Survey of Design Choices
    2.2.1 Instrumentation
    2.2.2 Resource Level Data Collection
    2.2.3 Monitoring
    2.2.4 Resource Level Control
    2.2.5 System Level Control
  2.3 Management Systems
    2.3.1 Centralized Managers
    2.3.2 Group of Managers
    2.3.3 Group of Managers with Global Control
    2.3.4 Monitoring Systems
  2.4 Summary

3 Hasthi Management Framework Architecture
  3.1 Hasthi Overview
  3.2 Manager-Cloud (MCloud)
  3.3 Meta-Model: An Abstraction for Monitored Information
  3.4 Case for Delta-Consistency
  3.5 Decision Framework
  3.6 Programming Hasthi to Manage a System
    3.6.1 Management Actions
    3.6.2 Management Rules
    3.6.3 Resource Life Cycle within Decision Model
  3.7 How does it all work?
    3.7.1 User Perception
    3.7.2 Motivating Usecase
  3.8 Discussion

4 Managed Resources and Instrumentations
  4.1 Introduction
  4.2 Hasthi WSDM-Runtime
  4.3 Instrumentations Levels
    4.3.1 In-Memory Agent to Instrument Services
    4.3.2 Host Agent
    4.3.3 Polling Based Agent
    4.3.4 Process Monitor
    4.3.5 JMX Based Agent
    4.3.6 Script Based Agent
    4.3.7 Logging Based Agent
  4.4 Implementing Management Actions
    4.4.1 WSDM-Runtime Based Actions
    4.4.2 Shell Scripts Based Actions
    4.4.3 User Interactions
  4.5 Summary

5 Proof Of Correctness
  5.1 Introduction
  5.2 System Definition
    5.2.1 Basic Definition and Notations
    5.2.2 Basic System Definition and Notations
    5.2.3 Representing State
  5.3 Manager-Cloud Algorithm
    5.3.1 Managed System Definition
    5.3.2 Constants in a Managed System
    5.3.3 Algorithm Pseudo Code
    5.3.4 Terminology
  5.4 Proof
    5.4.1 Assumptions
    5.4.2 Resource Behavior
    5.4.3 Manager Consistency
    5.4.4 Election
    5.4.5 System Consistency
    5.4.6 Final Results
  5.5 Application to Hasthi
  5.6 Availability of the Manager-Cloud Algorithm
  5.7 Discussion

6 Empirical Analysis
  6.1 Experiment Setup
    6.1.1 Workload
    6.1.2 Factors and Metrics
    6.1.3 Test Environment and Settings
  6.2 Scalability Analysis
    6.2.1 Limits of a Manager
    6.2.2 Load Behavior of Hasthi
    6.2.3 Limits of the Coordinator
    6.2.4 Verifying Independence of Managers
    6.2.5 Scalability of Hasthi
  6.3 Sensitivity to Operational Conditions
    6.3.1 Sensitivity to Management Workload
    6.3.2 Sensitivity to Epoch time intervals
    6.3.3 Sensitivity to Rules
  6.4 Election and Recovery Behavior
  6.5 Comparative Analysis
  6.6 Application to a Real Life Usecase
  6.7 Discussion

7 Managing Systems Using Hasthi
  7.1 Definitions
  7.2 Managing Systems
    7.2.1 Management Scenarios
  7.3 Application Domain of Hasthi
    7.3.1 Effects of Changes and Recovery
    7.3.2 Architectural Solutions for Effects of Changes and Recovery
    7.3.3 Handling Effects-of-Changes with Hasthi
  7.4 Application Domain of Hasthi and Required Guarantees
    7.4.1 Characteristics of a System
    7.4.2 Methods used for Preserving State
    7.4.3 Required Guarantees from Systems
  7.5 Pitfalls and Complexities
  7.6 Summary

8 Managing Distributed Computations
  8.1 Challenges
  8.2 Generic Solution
  8.3 Utilizing Application Behavior
    8.3.1 Tightly Coupled applications
    8.3.2 Iterative Applications
    8.3.3 Applications with Limited State
    8.3.4 Loosely coupled Applications
  8.4 Summary

9 Motivating Usecases
  9.1 The Primary Usecase: LEAD System
    9.1.1 Implementing Workflow Recovery Usecase
    9.1.2 Implementing Data Transfer Recovery Usecase
  9.2 Stream Processing Systems
  9.3 Internet telephony, Video Conferencing or Internet TV systems
  9.4 Distributed Service Container

10 Conclusion and Future Work
  10.1 Outline
  10.2 Contributions
  10.3 Future work
  10.4 Conclusion

A Appendix

List of Figures

1.1 Error Frequency Distribution of LEAD Errors
2.1 Architectural Stack for System Management
3.1 Hasthi Outline
3.2 Hasthi Architecture
3.3 Meta-Model Architecture
3.4 Decision Framework
3.5 Resource Hierarchy and Resource Lifecycle
3.6 Hasthi from User's Perspective
4.1 Architecture of the WSDM-runtime
4.2 Different Types of Management Agents
4.3 Management Action Implementations
5.1 State and Lifecycles of Components
5.2 Resource Time line
5.3 Availability of Hasthi
6.1 Overhead on a Host while running Test-Services
6.2 Limits of a Manager
6.3 Hasthi Load Behavior
6.4 Test Setup of Hasthi with and without Test-Managers
6.5 Limits of the Coordinator
6.6 Correlation between Resources per Manager and Manager Overheads
6.7 Response to Management Workload
6.8 Sensitivity to Epoch Time
6.9 Sensitivity to Rule Complexity
6.10 Election and Recovery Behavior of Hasthi
6.11 Single Manager Overhead, CGLM and Hasthi
6.12 Multiple Manager Overhead of CGLM system
6.13 LEAD Recovery Times with Hasthi
7.1 Hidden Complexities of System Management
7.2 Methodology to Integrate Hasthi With a System
7.3 Characteristics of a System
7.4 Outline of Hasthi Application Domain
9.1 LEAD Architecture
9.2 LEAD Errors and Corrective Actions
9.3 Stream Processing Usecase

1 Introduction

The ubiquity of Information Technology (IT) drives a feedback loop that nudges everything else towards IT solutions, thus boosting IT presence day by day. Hence, the presence of ordinary people on the Web is skyrocketing, and consequently, the potential user bases of systems are escalating. To support those systems and many other usecases, which we shall explore later, large-scale systems are needed. This first chapter starts by asserting that there is technological know-how, as well as social and economic motivation, for building large-scale systems and demonstrates, based on past experiences, that keeping large-scale systems running has been a daunting task. Moreover, drawing observations from previous works, this chapter argues that system management could be a potential solution to this problem, thus establishing motivations for system management frameworks that can manage large-scale systems. Furthermore, this chapter argues that in order to be useful, system management frameworks should support user-defined, custom management logic. However, since interpreting large amounts of monitoring information is hard, supporting user-defined logic in large-scale systems gives rise to many challenges. Using these observations as a basis, this chapter presents the motivation for the thesis, describes the problem that underlies this thesis work, and discusses contributions.

1.1 Motivation

The goal of this section is to establish the motivation for the thesis problem. To that end, this section discusses motivations for large-scale systems, the technological advances and trends that made them possible, and examples of these systems. The following subsections illustrate problems that arise while keeping these systems running and cast system management frameworks as a potential solution.

1.1.1 Large Scale Systems

Information technology based solutions are crucial for every national and international infrastructure, and there is a growing need for systems that can serve millions and even billions of users. With the rise of the Internet and high bandwidth communications, everyone is one click away, which results in virtual communities and new possibilities, both of which have brought forth social and economic changes. For example, as illustrated in the book "The Long Tail" [32], unlike in the past, most money is made neither by selling a few hit products (e.g. the top 100 box office movies) nor by selling products to an elite few customers, but rather by selling a wide variety of things to a large customer base. To support these trends and incorporate ever-increasing user bases, large-scale systems are essential.

Large-scale systems are made possible by increased bandwidth, wider use of Information Technology, cheap yet powerful (multi-core) computers, the availability of architectures like SOA and the Grid, and utility computing trends like Cloud computing. Therefore, driven by necessity, large-scale systems are becoming universal. For example, with the rise of utility computing efforts like the Amazon Elastic Compute Cloud (EC2), large-scale systems are becoming increasingly practical for small and medium size organizations. For instance, with EC2, running a thousand machines for a few days costs only a few thousand dollars. Therefore, if those organizations have a good reason to have a large-scale system, they can afford one. Among several motivating examples, the New York Times used EC2 to convert their archives from TIFF to PDF [123], and to handle a surge of popularity, Animoto, which was running on EC2, scaled its system from 40 nodes to 3400 nodes within a week [3]. Furthermore, with computation power becoming more and more accessible, small and medium size organizations may potentially perform complex analyses in the near future, thus giving rise to large-scale systems. For example, clouds like EC2 could be used by a film crew producing a complex special effect, a newspaper analyzing statistics for a story, a financial firm doing an audit, or a small research group doing a simulation. All these cases would have been impossible a year before. Many similar motivating examples are listed on the Amazon Web Services site [2].

Large-scale systems are common in areas like online services, national and worldwide infrastructures, large-scale sensors, and data intensive systems. Let us briefly look at a few of these systems.

The Internet has given rise to an avalanche of information, and Internet services provide information processing in many ways. Google has been the leader and the symbol of large-scale information processing, and as it turns out, handling information is more an I/O bound job than a computation bound job, so it made sense to use thousands of commodity machines (e.g. the Google architecture [37]), which gave rise to large-scale systems. It is well-known that large-scale systems are present behind online services for search, sales, online games, auctions, Internet messaging, Internet telephony, banks, etc. For instance, HighScalability.com [12] includes a number of case studies of large-scale systems, based on well-known systems like YouTube, PlentyOfFish, Flickr, Amazon, and Twitter.

Furthermore, national and worldwide infrastructures have given rise to large-scale systems. Among those systems are air traffic control, traffic sensors and control, border protection sensors, battlefield observations, intelligence gathering, cyber-security, and e-governance. Two related classes are systems that provide real time predictions using sensor data (e.g. meteorology, oceanography, and earthquake data), and complex event processing systems like fraud detection, monitoring stocks, tracking sales, online patient monitoring from home, detecting forest fires using satellite images or sensor networks, and home security detection systems.

Data intensive systems—a rising branch of e-science, which handles everything from gigabytes to petabytes of data—give rise to large-scale systems. For example, in bioinformatics, a human genome related experimental station can generate 1TB/day, some observation telescopes can generate 200GB/sec, and particle physics experiments done at CERN can store, process, and access around 10PB/year [71]. There are infrastructures built around these and many other usecases to collect raw data, generate metadata, archive, search, visualize, generate information by processing, and derive knowledge from information. Furthermore, these systems support moving, accessing, and caching data, and they typically comprise hundreds to thousands of machines and services working together. A detailed discussion of this class of usecases can be found in Hey et al. [71].

1.1.2 Sustaining Large Scale Systems

Even though new developments have made large-scale systems possible, keeping those systems running has been an arduous task; it is more a black art than a science. According to Patterson et al. [101], the Total Cost of Ownership can be as high as 5 to 10 times the software and hardware cost, and the cost incurred by management complexity is a significant portion of operational costs (e.g. Ganek et al. [61]). Both the Autonomic Computing initiative [61] and the Recovery Oriented Computing initiative [101] have cited much evidence that demonstrates the cost of sustaining systems. The following are a few facts about large-scale systems.

1. Failures are a norm rather than an exception – In the 2008 Google I/O sessions, Jeff Dean, a Google Fellow, said "with 10,000 servers, each having a Mean Time to Failure (MTTF) of a thousand days, 10 failures/day should be expected" [23]. As derived by Baumann [39], when a system is composed of n independent critical components (in series), each having an MTTF of f, the system's MTTF is f/n. Therefore, the MTTF of the above servers is (1000 days / 10,000) = 0.1 days, which confirms Dean's statement. However, it is worth noting that hardware failures are only a fraction of all errors.

2. Expensive – As illustrated in Ganek et al. [61], even now, industries spend nearly half of the IT budget recovering from failures. These costs increase in large-scale systems because when a system scales up, the complexity increases and not all parts scale the same way; therefore, new bottlenecks surface.

3. Unreliability of Composite Operations – When the number of components participating in a composite operation increases, the reliability of composite operations (e.g. workflows) decreases. For example, as derived by Pierre [92], if a workflow is composed of n services, each having a probability of failure f_i, then the probability of success of the workflow is ∏_{i=1}^{n} (1 − f_i). That is, when n = 6 and component success rates are 0.99, 0.9, and 0.8, workflow success rates are 0.94, 0.53, and 0.26 respectively (a worked example of this arithmetic follows the list).

4. Unreliable Middleware – The Grid is one of the primary platforms for building large-scale systems, but its reliability has been less than satisfactory. For example, Khalili et al. [76] have observed that success rates among all Grid operations are as low as 55-80%, and Gannon et al. [62] and Fraser et al. [59] have reported similar observations.

5. Manual Control is Difficult – In a large-scale system, manually keeping track of and controlling components is hard, if not impossible. With automation and monitoring, operations like verifying the health of a system, performing recovery, and changing configurations are greatly simplified.

6. Low Utilization – Utilization of computing resources has been estimated to be as low as 15% to 20% (e.g. [125]), and system management can improve utilization by automatically allocating and optimizing resources based on trends.

7. Need for a High System-to-Admin Ratio – Cheap hardware and high administrator salaries have motivated a high system-to-admin ratio, and according to James Hamilton, the architect of the Microsoft Data Center Future group, a system-to-admin ratio of 10:1 is insufficient, and they have achieved values as high as 2500:1 [69]. Management and monitoring are a prerequisite for achieving high system-to-admin ratios.

8. Geographical and Administrative Distribution – Most large-scale systems are distributed geographically and administratively. Among other complexities, scheduled and unscheduled downtimes spread out due to varying maintenance policies and peak loads, thus reducing availability.

9. Human Error – As pointed out by Patterson [16, 43], when decisions have to be taken at crunch time under pressure, 40% of all downtimes result from operator errors. Even though automation can help, they observed that automation does not necessarily fix the problem because the complex problems are typically left for human users.

10. Need for QOS and SLA – The importance of Quality of Service (QOS) and Service Level Agreements (SLA) has increased with new concepts like software and hardware as a service gaining popularity. Ensuring QOS and SLA calls for automation and monitoring.

11. Downtimes are expensive – The cost of a downtime ranges from about two hundred thousand dollars per hour for an Internet Service Provider (ISP) to six million dollars per hour for a brokerage [101].
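To make the arithmetic in facts 1 and 3 concrete, the following minimal sketch reproduces it in Java; the class name is illustrative, and the sample values are the ones quoted above.

    public class ReliabilityMath {
        // Fact 1: MTTF of a series system of n independent critical components,
        // each with mean time to failure f, is f / n.
        static double systemMttf(double componentMttfDays, int componentCount) {
            return componentMttfDays / componentCount;
        }

        // Fact 3: a workflow of n services, each succeeding with probability p,
        // succeeds with probability p^n.
        static double workflowSuccess(double componentSuccess, int serviceCount) {
            return Math.pow(componentSuccess, serviceCount);
        }

        public static void main(String[] args) {
            // 10,000 servers, each with an MTTF of 1,000 days -> 0.1 days between failures.
            System.out.println(systemMttf(1000, 10000));
            // Six-service workflows with component success rates 0.99, 0.9, and 0.8
            // succeed with probability ~0.94, ~0.53, and ~0.26 respectively.
            for (double p : new double[]{0.99, 0.9, 0.8}) {
                System.out.printf("p=%.2f -> %.2f%n", p, workflowSuccess(p, 6));
            }
        }
    }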

In order to address these problems, there is a need for dynamic systems that can recover from failures and adapt to changes, and both designing highly available, fault-tolerant systems and monitoring and managing systems are potential solutions to the problem. Usually, the former approach is applicable only to new systems because it needs a complete redesign of the target system. In practice, it is not cost effective to build a specific fault tolerance architecture for each system, and specific solutions are affordable only for large organizations like Google and Yahoo. Furthermore, highly available systems need to understand all failure scenarios at design time, which is difficult, if not impossible, to accomplish. On the other hand, management frameworks provide a middleware-based solution to this problem. Since monitoring and management are generic problems, it is fitting that they are solved with generic middleware solutions. Furthermore, monitoring and management is a complex problem, and solving it takes expertise. Because middleware avoids reinventing the wheel again and again, the saved effort allows the middleware developed to handle the problem to hide details, perfect solutions, and reduce the cost of development. Given that the underlying management logic in these solutions is sufficiently user-customizable, they will be applicable to a wide variety of systems. However, this thesis does not suggest that building highly reliable components (e.g. OceanStore [79]) is not useful. Building those components is practical only when they have well-defined behaviors and interfaces, whereas systems have complex behaviors and are typically composed of many components. Therefore, systems are typically not designed from scratch as a highly available unit, but are instead monitored and managed.

On the other hand, computer scientists have labored to build systems that avoid failures. However, experiences from real world deployments between 1999 and 2008 (e.g. [117, 21, 22, 1]) suggest that even with vast resources, expertise, and experience, failures have persisted. A different solution to this problem has been proposed by the Recovery Oriented Computing initiative, which argues that failures are a fact that a system has to deal with, and therefore, that Peres's Law is applicable in this setting.

"If a problem has no solution, it may not be a problem, but a fact not to be solved, but to be coped with over time." — Shimon Peres ("Peres's Law")

With this perspective, system failures are unavoidable, so coping with them over time requires solutions (management frameworks) that are dynamic and highly user-configurable. With such systems, administrators can efficiently address uncovered problems. A management framework monitors and controls a system based on assigned management logic. However, even with a management framework, users can define management logic only for those scenarios that are known, and identifying all failure scenarios in a system is tedious. However, as pointed out by Gray et al. [68], a remarkable observation by Adams [28] provides a breakthrough to this problem. They observe that some bugs (software faults) occur rarely (benign bugs), whereas some occur frequently (virulent bugs), and virulent bugs are a small portion of all error types but cover a large portion of all occurrences, approximating the well-known 80-20 rule (a.k.a. the Pareto Principle). We have observed similar results with a large-scale e-science cyber-infrastructure, and Figure 1.1 illustrates the frequency distribution of 5000 errors that occurred in the LEAD cyberinfrastructure project [53] over 20 months, which follows the Pareto Principle.

[Figure 1.1: Error Frequency Distribution of LEAD Errors. The plot shows the frequency of each error type (y-axis: Frequency) with error types sorted by frequency (x-axis).]

These results make a very strong case in favor of system management, suggesting that system designers can study failures and customize the management system (e.g. by writing rules) to handle the most frequent bugs. High availability can be achieved with this approach, which addresses the most frequent failures while ignoring other errors (e.g. by just restarting the component or doing nothing at all). Management is a luxury at a small scale, but a necessity at a large scale. Even with a moderate deployment of 10-20 services, management can simplify deployment and maintenance, potentially reducing the total cost of ownership. With the scale of systems rising and large-scale systems potentially becoming accessible to a wider range of audiences, the future of these large-scale systems will depend on middleware solutions to sustain them. Based on the previously mentioned observations, we have identified the following requirements that should be supported by a system management framework.

1. User-defined management logic – Since management scenarios change from system to system, a generic management framework must support user-defined management logic, like rules, which enables users to change the framework's behavior according to the requirements of the target system.

2. Robust – The management framework should recover from failures of its components; otherwise, the reliability of a system may be reduced when management is added. Furthermore, since resources often leave a large-scale system, the framework should not be affected when resources leave the system.

3. Dynamic – Since it is hard to keep track of the components of a large-scale system, and components often join and leave, the framework should discover the components of the system and also allow them to join and leave the system.

4. Scalable – The framework should be able to manage a sufficient number of resources by adding more managers.

To summarize, while handling real life usecases, large-scale systems are becoming more and more common, and sustaining them requires user-customizable system management middleware that can monitor and control these systems.

1.2 Background and Terminology

We say a system is composed of independent components called resources that are working together to achieve a common goal. We define a “system management framework” (a.k.a. management system, management framework) as a framework that facilitates automated, semi-automated, or manual processes of monitoring and controlling a system as a whole to keep it within acceptable bounds. This process is defined as management. The system being managed by a management framework is called a “managed system”.

To be managed with a management framework, each resource in a system must expose a representative subset of its state and enable external control, so the framework can monitor and control resources. Such a resource is called a "manageable resource," and if such a resource is being managed, we call it a "managed resource". Typically, the state of a managed resource is exposed as properties, and we call these properties "resource properties". Furthermore, resources are controlled by performing actions on them, and these actions are called "management actions".

Furthermore, given a resource, we call an externally (remotely) stored snapshot of its resource properties a meta-object of the resource, and a collection of meta-objects that stands in one-to-one correspondence with the resources of the managed system is called a meta-model of the system. A management framework is composed of one or more services called managers, which work together to manage a given system.

We call logic that evaluates a managed system using information collected from resources and carries out management actions management logic; it is a function from the state of a managed system to sets of management actions, F : {s | s is a system state} → 2^{a | a is a management action}. We say management logic depends on a global view when either the function or the resulting management actions depend on properties or information about the overall system (global information). For example, consider the following management logic, which says, "If the system does not have 5 message brokers, create new brokers, and connect them to the broker network." The logic should detect when the system has fewer than five brokers, find the best place to create a new one, create it, and connect it to the existing brokers. This process depends on information about multiple resources of the system; hence, the above logic depends on a global view of the system. If logic depends on a global view, we call it global logic; otherwise, we call it local logic.
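To make these definitions concrete, the following is a minimal Java sketch of management logic as a function from system state to a set of management actions, including a global-logic example based on the broker scenario above; the interfaces and names are hypothetical and are not Hasthi's actual API.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical types illustrating the terminology; not Hasthi's real interfaces.
    interface ManagementAction { String describe(); }

    interface SystemState {                 // a global view (meta-model) of the managed system
        List<String> resourcesOfType(String type);
        String bestHostForNewResource(String type);
    }

    // Management logic: F : {system states} -> 2^{management actions}
    interface ManagementLogic {
        List<ManagementAction> evaluate(SystemState state);
    }

    // Global logic: "if the system does not have 5 message brokers, create new ones
    // and connect them to the broker network." It needs the whole system view.
    class BrokerCountLogic implements ManagementLogic {
        public List<ManagementAction> evaluate(SystemState state) {
            List<ManagementAction> actions = new ArrayList<>();
            int missing = 5 - state.resourcesOfType("broker").size();
            for (int i = 0; i < missing; i++) {
                String host = state.bestHostForNewResource("broker");
                actions.add(() -> "create broker on " + host + " and join it to the broker network");
            }
            return actions;
        }
    }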

1.3 The Problem

Often, users think about a system in terms of overall system health. For example, if we manage a broker hierarchy, we often care whether the hierarchy works as a whole, not whether a specific part of it works. Hence, the goal of management logic is to keep the system healthy as a whole, and the resulting management logic, therefore, depends on a global view of the system. Although it is possible to achieve global control using emergent approaches—where local decisions give rise to global behavior—composing such logic is challenging even for a researcher, let alone a user. Therefore, supporting management logic that depends on a global view simplifies management logic authoring, and hence it is a preferred property of a management framework.

A management framework is composed of services called managers, and they monitor and control the resources assigned to them. However, a single manager process cannot scale up to manage a large-scale system; hence, the resources of a large-scale system are assigned to a group of managers. This distribution of resources among managers leads to a split brain: each manager has only a partial view of the system and, therefore, can only make local decisions about resources (e.g. evaluate local management functions). If a piece of global management logic depends on two resources and those resources are assigned to two different managers, neither manager can evaluate that logic. This thesis addresses the enforcement of global management functions in a scalable manner, and the following research question illustrates the problem.

“How can a dynamic and robust management framework, which manages a large-scale system by enforcing user-defined management logic that depends on a global view of the managed system state, be implemented?”

Apart from addressing the aforementioned global management function evaluation, the research problem incorporates the requirements for a management framework discussed earlier.

1.4 Contributions

The primary contribution of this thesis is proposing, implementing, and analyzing a dynamic and robust management architecture, which manages large-scale systems by enforcing user-defined management logic that depends on a global view of the managed system state, and discussing its applications. Moreover, we demonstrate that despite its dependency on a global view of the managed system state, the proposed approach can scale to handle most practical systems. Chapter 10 illustrates the contributions in detail.

1.5 Outline

We have illustrated the rise of large-scale systems, discussed evidence to motivate management frameworks for large-scale systems, and identified a research problem. This problem is addressed by a system management framework called "Hasthi," whose name means Elephant in Sanskrit, denoting robustness and stability. The thesis begins with a survey of existing distributed system management frameworks and other related work, thus establishing the state of the art in the system management arena. The proposed solution to the aforementioned research problem is illustrated in the chapter that follows the survey, which presents the Hasthi architecture and the different architectural choices available. Chapter 4 then presents the instrumentation choices provided by Hasthi, which aid users in exposing their resources as manageable resources.

Chapters 5 and 6 are devoted to demonstrating the claims made in earlier chapters: the former includes a formal proof of manager-cloud robustness and of the delta-consistency exhibited by the meta-model, and the latter includes a scalability analysis, a series of empirical analyses designed to measure the sensitivity of the system to different operational conditions, an application of Hasthi to a real life usecase, and a comparison of Hasthi to another management system. To explore the application domain of Hasthi, Chapter 7 presents a taxonomy of systems and seeks to identify the different guarantees required by systems in different classes of the taxonomy. Then, each class of systems is studied for the applicability of Hasthi and to identify what Hasthi expects from systems in each class. Chapter 8 discusses possibilities of using Hasthi to manage distributed computations. Furthermore, Chapter 9 presents our experiences with applying Hasthi to manage a large-scale e-science infrastructure and a series of motivating usecases. Finally, Chapter 10 concludes the thesis by revisiting the results, the contributions, and their implications.

2 Background and Related Works

To establish the state of the art in system management, this chapter surveys existing management frameworks and compares and contrasts their architectures with Hasthi's, paying special attention to the frameworks that support global control. We start with an enumeration of the different areas of distributed systems that have contributed to system management and propose an architectural stack for system management, thus establishing a reference model against which related works are aligned. Using this architectural stack, we present the system management literature, identify key insights, and discuss the pros and cons associated with different architectural choices. Finally, having established the state of the art in system management, we present, compare, and contrast individual management frameworks with Hasthi.

2.1 Outline

System management can be loosely defined as a field of study concerned with building tools and architectures that facilitate the monitoring and controlling of distributed systems. We call the parts of a distributed system "resources". The job of a system management framework (a.k.a. management system, management framework) is to facilitate the automated, semi-automated, or manual process of monitoring and controlling a system as a whole to keep it within acceptable bounds. The primary motivation for system management is that it is difficult, if not impossible, to manually keep track of the resources in a system, and system management is concerned with instrumenting resources with sensors for measuring and controlling them, collecting and analyzing sensor data, making decisions on system health, deciding on corrective actions, and controlling the system according to those corrective actions. As explained in Chapter 1, monitoring and controlling are implemented using one or more processes called "managers," which perform all or a subset of the aforementioned tasks.

Most initial work was done in telecommunication and network management, where these techniques were used for managing large networks composed of hundreds to thousands of nodes. However, with the rise of complex systems, system management has been used for managing a wide variety of systems. In recent years, the Autonomic Computing grand challenge [55] presented by IBM Research has given rise to many related efforts in both industry and academia. Therefore, management frameworks are found under diverse sub-fields, the most prominent among them being network management, system management, adaptive systems, and autonomic systems. Several literature surveys discuss these sub-fields: Huebscher et al. [89] discuss autonomic systems, Philippe et al. [86] discuss network management, Papazoglou et al. [99] discuss Web Services management, Zanikolas et al. [129] discuss grid monitoring systems, Sadjadi [106] discusses adaptive systems, and Ghosh et al. [66] discuss self-healing systems.

Based on our survey, we have proposed the architectural stack illustrated in Figure 2.1 to aid in understanding system management.

[Figure 2.1: Architectural Stack for System Management]

A scalable, fully-automated management framework should support all levels of the stack, but some systems support only a part of the stack. In such cases, humans usually fill in for the missing parts. The stack represents the process of collecting data, processing it, and arriving at decisions, where each level is built on top of the levels below it and provides abstractions for the levels above. As shown in Figure 2.1, the stack can be explained as follows. Instrumentations collect data from resources, and the resource level data collection groups the instrumentation data collected under each resource and exposes it to upper layers through one of the specifications defined in the next layer. A management framework collects data from resources, provides resource level controls, and also provides aggregation and summarization across resources. Finally, the system level control provides global control using these aggregated and summarized data.

The stack is parallel to the autonomic control-loop, where each resource (an autonomic element) exposes sensors and actuators and is managed by the Monitor-Analyze-Plan-Execute control-loop, which is sometimes called the Sensor-Brain-Actuator loop. However, since the autonomic control-loop is generic and does not capture both local and global control, instead of using the loop, we propose this stack with a focus on global control. In the next section, we shall discuss each layer in detail.

2.2 Survey of Design Choices

2.2.1 Instrumentation

Since a management framework watches over and controls a given system, it needs means of collecting data from the resources of the system and controlling those resources to keep them within acceptable bounds; this process is called instrumentation. Instrumentations are usually resource-specific: they measure and expose sensor data related to resources and enable the management actions that are used to control resources. They are implemented at the hardware, operating system, middleware, and application levels of systems. Examples of sensors are hard disk manufacturer instrumentations (e.g. exposing load and seek errors), Windows Management Instrumentation (WMI) [24], UNIX shell commands (e.g. the ps and top commands) and monitoring kernel patches, Java virtual machine instrumentations (e.g. memory and CPU usage), middleware instrumentations (e.g. Tomcat exposes data about web applications), and application-specific instrumentations.

Let us briefly discuss different implementations of management actions. The most common method for implementing management actions is using UNIX shell commands. Since shell commands must be executed locally on the host where the target resource resides, these commands are executed either by an agent running on each host that accepts shell command requests and executes them, or by remote execution mechanisms like Grid fork jobs or SSH. Furthermore, deployment frameworks like SmartFrog [67] and Plush [31] provide powerful middleware that supports management actions related to service deployments. Adaptive software and service containers provide support for programmatic deployment, un-deployment, updates, and configuration of services that target these containers. However, runtime control gives rise to the "hot configuration" problem, where the running system needs to be translated from the old configuration to a new configuration while it is in use. This topic is addressed in detail under adaptive middleware, and Sadjadi [106] provides a detailed description.

Due to the support for on-demand virtualized hardware and VM level pausing, virtualization technologies are a versatile tool for management actions. For example, unlike typical systems where an order for hardware takes days, virtualized hardware can expand on demand. Furthermore, VM level pausing has simplified check-pointing and migration significantly.
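As an illustration of the shell-command option, the sketch below shows a hypothetical agent-side helper (not part of SmartFrog, Plush, or any other framework mentioned here) that runs a command locally; remote execution via SSH or Grid fork jobs would substitute the corresponding client library for ProcessBuilder, and the script path is illustrative only.

    import java.io.IOException;

    // Hypothetical agent-side helper that executes a shell command as a management action.
    public class ShellAction {
        public static int run(String... command) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(command)
                    .inheritIO()          // forward stdout/stderr so operators can see the output
                    .start();
            return p.waitFor();           // exit code tells the manager whether the action succeeded
        }

        public static void main(String[] args) throws Exception {
            // e.g. restart a service using its control script (path is illustrative only)
            int exit = run("/bin/sh", "-c", "/opt/service/bin/restart.sh");
            System.out.println("action exit code: " + exit);
        }
    }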

2.2.2 Resource Level Data Collection

This layer groups all instrumentations of one resource together and exposes the resource as a single entity to upper layers using management standards. Examples of management standards are WSDM [26] and WS-Management [27] for Web Services, SNMP [46] for network devices, and CIMP [25] for OSI management [15]. A resource that supports one or more of these specifications for monitoring and control is called a "manageable resource".

A manageable resource has properties and operations and generates events; the resource usually represents sensor data as properties, management actions as operations or mutable properties, and changes to some properties as events. The resource level data collection code, which is called an agent, can be placed either inside a resource or externally to the resource. The former approach is preferred since the agent, sitting close to the resource, has access to more sensor data. However, due to security concerns, simplicity, and legacy resources, external agents are also in use. Manageable resources expose information to managers using either the pull model, where managers have to explicitly ask for information, or the push model, where resources push information as events to managers. With the advent of event and notification systems, the push approach is favored. However, there are cases, like legacy systems monitoring, where the pull-based approach is useful. Furthermore, in adaptive or autonomic resources, the data collection code may include a control-loop that makes adjustments to the resource based on sensor data. We will discuss control-loops in greater depth under decision models.

2.2.3 Monitoring

This section discusses system level data collection. We shall explore monitoring a system under three topics: the abstractions provided to upper layers, the architectures used, and the update frequency.

2.2.3.1 Abstractions to Upper Layers

As discussed in the earlier section, manageable resources expose sensor data as resource properties and events, and the monitoring layer collects that sensor data and presents it to higher layers via one of the following abstractions.

The first type of abstraction is events, where decision layers see sensor data as events and make decisions using complex event-processing methods. Examples of this include DREAM [44], Hifi [30], and Smart Subscriptions [57]. The second type of abstraction is meta-models, where the monitoring layer builds a meta-model—an external representation of the managed system—and keeps updating it to reflect changes that happen in the managed system. Decision layers reason using the meta-model, and since the meta-model is updated to reflect changes to the managed system, this approximates reasoning over the system itself. Meta-models come in three flavors: in-memory (e.g. Marvel [78]), database-based (e.g. Hyperic [13]), and repository-based (e.g. MIBs in Astrolabe [103], the CIM repository in Vambenepe et al. [120], and the event recording service in Adams et al. [42]). The third type of abstraction is an on-demand query interface, where decision layers can query sensor data. For example, ACME [97] sends queries down a spanning tree and collects results bottom up, and Lim et al. [110] and PIPER [72] also use an on-demand model. Moreover, Autopilot [121] and P2P Network Management [98] provide a front-end to sensors, and both allow a user to discover sensors and perform on-demand queries. Among these abstractions, the third type is not common, and most systems use a push-based approach with the first or the second type.
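The second abstraction can be sketched minimally as follows, assuming the meta-model is an in-memory map from resource identifiers to meta-objects (property snapshots); the class names are illustrative and do not reflect any cited system.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Meta-object: an externally stored snapshot of one resource's properties.
    class MetaObject {
        final Map<String, Object> properties = new HashMap<>();
    }

    // Meta-model: one meta-object per managed resource, updated as monitoring data arrives.
    class MetaModel {
        private final Map<String, MetaObject> resources = new ConcurrentHashMap<>();

        // Apply a property update reported by the monitoring layer.
        void update(String resourceId, String property, Object value) {
            resources.computeIfAbsent(resourceId, id -> new MetaObject())
                     .properties.put(property, value);
        }

        // Decision layers reason over this view instead of querying live resources.
        Object read(String resourceId, String property) {
            MetaObject mo = resources.get(resourceId);
            return mo == null ? null : mo.properties.get(property);
        }
    }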

2.2.3.2 Monitoring Architecture

Among the preferred properties of an architecture are scalable data collection, resource discovery, the ability to seamlessly handle resources that join and leave, and the absence of a single point of failure. However, scalability is the primary challenge: achieving the above properties while managing tens of thousands of resources.

A wide variety of architectures have been tried in monitoring frameworks, a few of which are described below. The naive implementation, in which one manager processes events generated by all resources, is neither scalable nor robust. Consequently, a group of managers, where each resource is assigned to a manager, is used. However, to derive a global view of the system, some form of a control structure is required, and a few different structures have been proposed. A hierarchy of managers with resources assigned to leaf nodes is the most common control structure; each node sends sensor data as events to its parent, which aggregates the sensor data from its children. There are a few more variations on this idea. For example, Gadgil et al. [60] use a publish/subscribe broker hierarchy for communication between nodes, Astrolabe [42] uses a gossip protocol to aggregate information over the hierarchy and to replicate data onto other nodes at the same level, and Ganglia [49] uses a hierarchy of clusters. In the last two cases, monitoring information about a node is replicated across all nodes at the same level, and therefore, in those systems, information is not lost even if some nodes fail. Another level of data collection, called gauges, sits between resources and managers and processes events, filters events, and generates composite events. Rainbow [63] and eXtreme [74] are two examples that use gauges.

The above methods assume a hierarchy exists and, therefore, do not discuss how to establish and maintain the control structure. Setting up a control structure manually or using something already in existence, like the DNS hierarchy, are two possible solutions. Furthermore, to recover from failures, possible solutions include recovering failed nodes, electing new nodes in the place of failed nodes, or using replication. Among more advanced solutions, P2P systems—a scalable, robust, and dynamic approach—can be used to establish a control structure. The routing algorithm of a P2P network gives rise to an inherent spanning tree, which automatically recovers from failures, and this spanning tree can be used as the hierarchy for monitoring. For example, Yalagandula et al. [128] have extended Astrolabe [42] to use the spanning tree of a P2P network as its hierarchy. Furthermore, Hasthi uses managers and a coordinator elected among them to provide a control structure, where elections are performed using broadcast over a P2P network composed of managers. An alternative approach is publishing sensor data to a publish/subscribe broker hierarchy and using Complex Event Processing (event stream processing) to monitor events and fire corrective actions. A positive aspect of this approach is that because all communication happens via topics, resources and managers do not need to know each other's addresses. On the other hand, the Galaxy [124] cluster-management framework uses a hybrid model where a group-communication based system, which provides tightly-coupled control, manages each cluster, and a gossip architecture is used for inter-cluster management.

Finally, since it is difficult to manually keep track of the resources in a system, automatic discovery of resources is important. A management framework discovers new resources via messages sent to a well-known location, broadcasts, or a well-known topic in a publish/subscribe system.

2.2.3.3 Update Frequency

The frequency of update propagation has a major effect on the scalability of a manager, making it an important parameter in a monitoring system. There are four basic options: heartbeats (data is propagated every T time units), propagate on read (pull), propagate on write (push), and gossiping. However, interesting extensions exist. For example, in Yalagandula et al. [128], each agent chooses one of the first three options based on the read-write ratio of the property, and in a priority-based model, high-priority events trigger an immediate data transfer while low-priority events are propagated with heartbeat messages or piggybacked on high-priority events.
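To make these options concrete, the following is a minimal Java sketch, not taken from any of the cited systems, of how an agent might pick among push, pull, and heartbeat propagation for a single attribute based on its observed read-write ratio; the class name, strategy names, and thresholds are illustrative assumptions.

// Illustrative only: strategy names and thresholds are assumptions, not from the cited systems.
public class PropagationSelector {

    public enum Strategy { PUSH, PULL, HEARTBEAT }

    /**
     * Chooses a propagation strategy for one monitored attribute.
     * Frequently read, rarely written attributes favor push (propagate on write);
     * rarely read attributes favor pull (propagate on read);
     * otherwise fall back to periodic heartbeats.
     */
    public static Strategy select(long reads, long writes) {
        if (writes == 0) {
            return Strategy.PULL;          // nothing to push yet
        }
        double readWriteRatio = (double) reads / writes;
        if (readWriteRatio > 10.0) {
            return Strategy.PUSH;          // reads dominate: keep the parent up to date
        } else if (readWriteRatio < 0.1) {
            return Strategy.PULL;          // writes dominate: fetch only when asked
        }
        return Strategy.HEARTBEAT;         // mixed workload: periodic propagation
    }

    public static void main(String[] args) {
        System.out.println(select(1000, 5));   // PUSH
        System.out.println(select(2, 500));    // PULL
        System.out.println(select(50, 40));    // HEARTBEAT
    }
}

A real agent would also revisit the choice periodically as the read and write counters evolve.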

2.2.3.4 Data Aggregation and Summarization

If all the collected data are transferred to the upper layers of control, the sensor data will overwhelm those layers as the system scales. Therefore, at each layer, data should be processed and analyzed, and only the necessary high-level data should be propagated upward. Sensor data can be processed and compacted by aggregation or summarization: aggregation applies one of the aggregation functions SUM, MEAN, AVG, MIN, MAX, MEDIAN, COUNT, or VALUE to the data (e.g. Astrolabe [42]), whereas summarization keeps only the most informative properties in the sensor data. Alternatively, upper levels may inject code into lower layers that processes and compacts data (e.g. Monalisa [96]). However, executing remote code may lead to security issues, so this method requires lower-level nodes to trust upper-level nodes. Another possibility is to limit the injected code to queries (e.g. SQL or an event query language), which greatly reduces the risk of injected code.
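As a concrete illustration of layer-by-layer compaction, the following minimal Java sketch aggregates a batch of child readings into a single compact record before forwarding it upward; the record shape and the chosen functions are assumptions for illustration and do not reproduce the formats used by Astrolabe or Monalisa.

import java.util.DoubleSummaryStatistics;
import java.util.List;

// Illustrative only: the AggregateRecord shape is an assumption, not a format from the cited systems.
public class LevelAggregator {

    /** A compact summary propagated to the parent instead of the raw child readings. */
    public record AggregateRecord(long count, double sum, double mean, double min, double max) {}

    /** Aggregates raw child readings (e.g. CPU usage samples) into one record. */
    public static AggregateRecord aggregate(List<Double> childReadings) {
        DoubleSummaryStatistics stats = childReadings.stream()
                .mapToDouble(Double::doubleValue)
                .summaryStatistics();
        return new AggregateRecord(stats.getCount(), stats.getSum(),
                stats.getAverage(), stats.getMin(), stats.getMax());
    }

    public static void main(String[] args) {
        // One level of the hierarchy compacts its children's data before forwarding it.
        AggregateRecord summary = aggregate(List.of(0.42, 0.87, 0.13, 0.55));
        System.out.println(summary);
    }
}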

2.2.4 Resource Level Control

We call the brain of a manager, the unit that recommends management actions based on sensor data and fits inside one manager, a decision-unit; this section discusses the associated design choices and uses. The decision-unit is a function that accepts the state of a system as input and outputs management actions, and it can be realized using a wide variety of technologies such as decision tables, programmatic logic, forecasting models, rule-based systems, utility functions, control theory, decision trees, and artificial neural networks. Often provided by users, the logic that decides how a decision unit behaves is called


management logic; decision tables, rules in a rule-based model, code in a programmatic model, and utility functions are examples of management logic provided by users. On the other hand, using supervised or unsupervised learning, some models figure out (evolve) the management logic by themselves; artificial neural networks are an example. Furthermore, models with fixed (static) management logic (e.g. [70], [55]) are typically rigid and inflexible; the following are two efforts to make them more flexible. The JADE framework [41] provides management logic as static yet reusable components that can be composed, and policy-based approaches parameterize static implementations with policies (e.g. [35]), an approach used by many agent-based systems (e.g. [73]). Let us look at a few examples. A decision table is a simple lookup table (e.g. a problem-solution database), utility-function-based models try to optimize a user-defined utility function (e.g. Kumar et al. [80]), and forecasting models predict outcomes and use them for decisions. For example, the ACME architectural model ?? uses a meta-model of the system to simulate expected outcomes and makes decisions accordingly. Rule-based models, which are also used by Hasthi, fire actions when the conditions specified in rules are met; they use simple IF/THEN rules or Prolog-like rules that are capable of deriving facts. Furthermore, programmatic logic, presented as code fragments, is also used to make decisions. For example, Gadgil et al. [60] execute user-provided Java code to manage a service.
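To make the decision-unit abstraction concrete, the following is a toy Java sketch of an IF/THEN rule-based decision unit; the ResourceState and Rule types are illustrative assumptions, and the sketch is deliberately far simpler than the Rete-based engines discussed later.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Illustrative only: a toy IF/THEN decision unit, far simpler than Rete-based engines such as Drools.
public class SimpleDecisionUnit {

    /** The manageable state of one resource as seen by the decision unit. */
    public record ResourceState(String name, String status, double cpuUsage) {}

    /** An IF/THEN rule: when the condition holds, the action (a management action) is fired. */
    public record Rule(String description, Predicate<ResourceState> condition,
                       Consumer<ResourceState> action) {}

    private final List<Rule> rules = new ArrayList<>();

    public void addRule(Rule rule) { rules.add(rule); }

    /** Evaluates every rule against the given resource state and fires matching actions. */
    public void evaluate(ResourceState state) {
        for (Rule rule : rules) {
            if (rule.condition().test(state)) {
                rule.action().accept(state);
            }
        }
    }

    public static void main(String[] args) {
        SimpleDecisionUnit unit = new SimpleDecisionUnit();
        unit.addRule(new Rule("restart crashed services",
                s -> "Crashed".equals(s.status()),
                s -> System.out.println("ACTION: restart " + s.name())));
        unit.addRule(new Rule("flag overloaded services",
                s -> s.cpuUsage() > 0.95,
                s -> System.out.println("ACTION: notify admin about " + s.name())));
        unit.evaluate(new ResourceState("broker-1", "Crashed", 0.10));
        unit.evaluate(new ResourceState("broker-2", "Running", 0.97));
    }
}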

Following are two design decisions associated with decision models. The first is deciding the number of decision models in the system. A system can be managed with one decision model or with several specialized decision models, each addressing a different aspect of control. In the latter method, the models can be placed in different managers (e.g. Sweitzer et al. [100]). However, specialization raises the problem of coordination between decision models; among the solutions are planning actions, conflict resolution by priority (e.g. Jagadish et al. [125]), and conflict resolution by first-come-first-served methods (e.g. Steenkiste et al. [114]).

Table 2.1: Summary of Rule-based and Related Decision Models. (The original table layout could not be recovered here.) The table compares DIOS++ [83], Rainbow [63], InfoSpect [104], Marvel-1995 [78], Naik et al. [93], Sophia [126], RecipeBased [114], HiFi [30] & DREAM [44], IrisLog [94], ACME [97], ReactiveSys [87], and Policy2Action [35] along six dimensions: the nature of the rules (IF/THEN, Prolog-like, pre- and post-conditions, pub/sub filters and actions, database triggers, policies, or Java code), support for conflict resolution, verifier support, batch-mode evaluation, the presence of a meta-model, and planning support.



The second is when to evaluate the decision-unit. Two possible approaches are evaluating whenever a change happens and evaluating periodically. In the latter, changes are collected over a period of time and processed as a batch, primarily for performance reasons.

Table 2.1 presents a summary of rule-based and related decision frameworks. The first column presents the nature of the rules used in each system, the second indicates support for conflict resolution, and the third presents the ability to introduce sanity checks into the system. Rules are evaluated either in batch mode, once per time period, or whenever an update takes place, and the fourth column specifies this choice. If rules are evaluated against a meta-model of the system, they have a broader view of the system and consequently provide better global control; the fifth column indicates the existence of a meta-model. Finally, the sixth column indicates the availability of planning, which resolves conflicts among actions and selects the best time to perform them.

In summary, compared to the decision frameworks discussed, the main distinction of the decision framework of Hasthi is its distributed nature. Specifically, we support local rules for resource control and global rules for global control. Furthermore, in contrast to IF/THEN rules, which can only evaluate facts, the proposed system uses a Prolog-like, object-oriented rule language (Drools [9]) that is capable of deriving information by processing the presented facts. Moreover, rules are evaluated against a meta-model using batch processing; thus, rules have a complete view of the system. Finally, the system is backed by a robust and scalable architecture that lends robustness and scalability to the decision model.

2.2.5 System Level Control

This section is concerned with global (system-level) control over a system. In systems managed with one manager, the manager has a global view of the system; therefore, one of the decision units discussed in the earlier section can manage them. However, one manager cannot scale to handle a large-scale system, so resource monitoring and control must be distributed across many managers. As a result, no single manager has a global view of the system, which complicates system-level control tremendously. In this setting, each manager has its own decision unit and operates with a partial view of the system, and system-level control must be achieved through these managers. There are many proposed solutions. One solution is arranging managers in a hierarchy, assigning resources to leaf nodes, and aggregating monitoring information from child nodes to parents as described in Section 2.2.3.2. In this setting, nodes at higher levels take increasingly high-level decisions using the aggregated data available at their level. A downside of this approach is that with data aggregation, the available data is averaged across resources, and information about individual resources, which can be useful in some situations, is lost. Another solution is a coordinator, typically elected from among the managers, that provides global control. For example, the model proposed by Hasthi summarizes the sensor data so that it fits inside the coordinator's memory, and using the summarized data, the coordinator issues commands to coordinate managers and control resources. In another variation of the elected coordinator, proposed by Schoenwaelder [109], managers work as peers and the coordinator is used for conflict resolution and synchronization. Decentralized control, where each manager acts on local knowledge in such a way that global control emerges from local decisions, is a very scalable approach, and DMonA [91] is an example. However, designing local management logic so that global control will


emerge from those local decisions is far from trivial. Therefore, this method can only be used for some use cases.

Two more interesting approaches are Georgiadis et al. [64], where all managers are given a consistent view of the system by broadcasting sensor data and management actions to all managers using a reliable ordered multicast, and Wildcat [73], where a hierarchy of managers is built using an agent framework, with blackboards at each level of the hierarchy for communication among managers at the same level.

Among coordination efforts, in Naik et al. [93], all triggered actions are sent to a planner, which collects, analyzes, and refines them to ensure system consistency, and if refinements are not possible, human intervention is sought.

Furthermore, the following three cases focus on coordination between managers working as peers. Cheng et al. [47] provide a translation infrastructure between types, elements, operations, and errors for heterogeneous managers; they also discuss the possibility of using voting mechanisms to arrive at decisions. Kephart et al. [75] discuss coordination between two heterogeneous managers. Deugo et al. [51] present a "negotiating agents pattern" that independent agents can use to arrive at a conclusion: agents share action plans with other agents, and each agent may accept, counter-propose, or reject actions, re-planning based on policies that decide which actions take precedence. However, systems that work without a coordinating structure (e.g. a coordinator or a hierarchy) and depend on agreements between individual managers do not scale, because with n managers the number of required agreements is n(n − 1)/2; thus, the overhead of decision making grows rapidly with more managers.

2.3 Management Systems

Earlier sections focused on system management in general, and this section compares and contrasts Hasthi with specific systems and classes of systems.

Table 2.2 classifies management frameworks: rows are organized by communication pattern (e.g. publish/subscribe, peer-to-peer, group communication) and columns by functionality and global control. For this discussion, we arrange systems based on the global control they provide.

2.3.1 Centralized Managers

RefArchi [38] presents an instance of the sensor-brain-actuator loop using a centralized decision-making unit. Gauges, an intermediate aggregation layer between resources and managers, are found in Rainbow [63] and ACME-CMU [108]: the former aggregates sensor data using gauges, and the latter aggregates sensor data published to a sensor bus and republishes it to a gauge bus. In both systems, the information collected by the gauges is used by a centralized decision-making unit that triggers corrective actions using IF/THEN rules. Using a different approach, Autopilot [121] registers sensors in a registry, and a client, a centralized fuzzy-logic-based decision-making unit, finds sensors in the registry and subscribes to them to trigger corrective actions based on sensor events. Vambenepe et al. [120] provide a framework for deploying, monitoring, and redeploying components where each node runs a deployment service, which provides deployment and redeployment functionality, and a health monitoring service that generates heartbeats and failure events for the services in the node. An adaptation engine listens to events via an event bus and updates a repository to reflect the state of the system, which is used by a centralized decision-making engine. The aforementioned systems use a centralized decision-making unit and, therefore, avoid the need for global control. However, they do not scale to manage systems with thousands of nodes.

Table 2.2: Management Frameworks by Architecture and Global Control. (The original table layout could not be recovered here.) The table organizes frameworks by communication pattern (events/pull, publish/subscribe hierarchy, P2P, hierarchy, gossip, spanning tree over a network or P2P overlay, group communication/multicast, and distributed queue) against the degree of global control (monitoring only, one manager without coordination, one manager with coordination, managers without coordination, and managers with coordination). Systems placed in the table include, among others, Monalisa [96], Ganglia [49], MDS [129], Paradyn MRNet [129], WRMFDS [127], Astrolabe [103], GEMS [116], InfoSpect [104], Sophia [126], Rainbow [63], Naik et al. [93], Unity [48], RefArchi [38], Autopilot [121], Vambenepe et al. [120], ACME-CMU [108], PIPER [72], JADE [41], DMonA [91], Tivoli [36], DREAM [44], SmartSub [57], Automate & Accord [29], Mng4P2P [98], WSRFContainer [102], HiFi [30], IrisLog [94], JINI-Fed [34], eXtreme [74], ScWARM [55], NaradaMng [60], WildCat [73], Dubey et al. [54], ANDREA [105], Willow [77], Galaxy [124], ScInfoMng [128], ACME [97], ReactiveSys [87], McstCoord [109], SfOrgArchi [64], and AutWFEng [70].



2.3.2 Group of Managers

The eXtreme system [74] uses sensors and gauges similar to ACME-CMU [108]; however, it has controllers (managers) that evaluate rules, and management actions are performed by an agent framework. Furthermore, JiniFed [34] provides a Java JINI-based architecture that enables a set of managers to manage SNMP-enabled resources using events generated by the resources' sensors. However, neither system discusses global control among managers. The Tivoli management suite [36], by IBM Corporation, uses an Enterprise Service Bus in which each service includes a management agent that publishes management events to the message bus; these events are received and processed by managers. Global control is not provided automatically but by users, who execute actions using a management console. Similarly, in Adams et al. [42], sensors in managed resources publish events to a topic in a decentralized publish/subscribe broker hierarchy, and a distributed event-recording service listens to events and captures the system state in an information model, which is used by managers to make management decisions. However, the system is designed as a framework for management services, and it addresses neither the details of the decision models nor the global control associated with management services. DREAM [44], HiFi [30], and Smart Subscriptions [57] are built on top of a distributed


publish/subscribe broker hierarchy, and they use actions triggered by Complex Event Processing to manage a system. Furthermore, IrisLog [94] uses triggers in a distributed XML database to initiate management actions. Moreover, ACME [97] is based on a spanning tree of a P2P network where each node runs a query processor, sensors, and actuators. The root node runs a trigger engine, and queries are sent down the spanning tree and aggregated from the bottom up. This system also supports continuous queries: if the query conditions are met, the trigger engine executes the designated actions. Even though these systems are scalable, the event-action model has limited memory and, therefore, limited knowledge of the overall system. As a result, providing global control among decisions in this model is non-trivial, yet none of the above systems discuss global control. JADE [41] provides a component model for building autonomic systems, and each component has controllers to support introspection and reconfiguration. Management control loops are implemented as reusable JADE components, which are in turn managed using the same infrastructure. However, JADE does not provide global control among management components, even though there are interesting possibilities in that direction. Galaxy [124] uses a hybrid approach, where the managed system consists of tightly coupled clusters and loosely coupled farms. Clusters use group communication for intra-cluster communication, and inter-cluster communication is carried out by placing all nodes in loosely coupled groups that use a gossip protocol. A farm typically represents a single administrative unit, and administrators may form clusters at runtime. Management is provided by a cluster called the management cluster; Galaxy provides a framework for management services but still does not address global control. ReactiveSys [88] provides a group-communication-based process management framework in which each process has sensors and actuators, and managers monitor processes by subscribing to process groups and process sensors. When a process is added or


removed, the group-communication middleware notifies the managers, who perform corrective actions (using actuators) based on rules. Passive replication is used with managers to guard against failures. Nevertheless, group communication usually does not scale to large systems, and the authors do not discuss global control.

2.3.3 Group of Managers with Global Control

In our classification, only a few systems provide global control; let us look at them in detail. Based on an agent framework, Wildcat [73] groups agents (managers) into a management hierarchy. Agents use blackboards at each level of the hierarchy to coordinate with each other, and top-level agents (managers) control the next levels by modifying policies. However, this system suffers from a single point of failure at the top of the hierarchy, and as the authors point out, the scalability of blackboards is not clearly established. Our solution differs from Wildcat because it employs an election-based model for robustness, maintains a meta-model, and uses rules instead of policies to build the decision model. DMonA [91] provides a decentralized network management system where each node monitors itself as well as its neighborhood and makes decisions. Each node keeps track of the topology of its neighbors and their state while running a control loop (monitor, process, plan, and execute) to manage itself and its neighbors, and global control is expected to emerge from the local decisions. However, designing local control to ensure emergent global control is far from trivial, so this model cannot be used for every use case. Gadgil et al. [60] provide a management hierarchy in which the topmost layer is replicated to guard against failures. Resource-level management is performed by user-provided code, which provides fault tolerance among other management functions, while the static global-level control ensures that enough managers are present and provides fault tolerance.


However, this system assumes the existence of a scalable and reliable registry, whereas our solution does not depend on a central entity and also provides rule support that enables users to define custom management rules. Moreover, an election-based approach can tolerate more failures than a replicated approach. The Willow [77] survivability architecture consists of a hierarchy of state machines, and as it receives monitoring events, it updates those state machines. Upper levels of the hierarchy detect increasingly higher-level conditions. Users describe state machines using a policy language. This approach has some similarities to detecting composite events using Complex Event Processing, which usually evaluates user-defined composite queries by maintaining a state machine that tracks composite events. In contrast to detecting composite events, Hasthi provides a global view of the system using a meta-model, which provides more flexibility for implementing management actions. For instance, in contrast to Willow, Hasthi can use the global view while carrying out management actions after they are triggered; the global view is useful for deciding the optimal action (e.g. finding the best place to connect a new broker) or for taking specific decisions (e.g. pinpointing an error using more details) after an action is triggered. Furthermore, Willow does not discuss recovery from failures of the management framework itself. In the field of network management, Schoenwaelder [109] presents a group of cooperating managers where each manager broadcasts heartbeats to the other managers using IP multicast and maintains group-membership details locally, adding new members for new senders and removing members when the corresponding heartbeats are missing. A master agent (coordinator) is elected among the managers, but managers make independent decisions and the coordinator is used for synchronizing operations and resolving conflicts; therefore, the coordinator depends on multicast for normal operation, in contrast to Hasthi, which only uses P2P broadcast for coordinator recovery. The main difference between Hasthi and Schoenwaelder [109] is that the latter has a group of independent managers,


which resolve conflicts using the coordinator, whereas the former has a coordinator that runs a global control loop to manage the system. Moreover, Schoenwaelder [109] does not provide details about its decision model. ANDREA [105] is a hybrid between hierarchical (bottom-up) and collaborative (peer-to-peer) control. Each manager delegates tasks it cannot handle to other managers. To implement the delegation, the management logic includes delegation statements, and the system delegates those tasks to other managers; therefore, ANDREA creates the control hierarchy on demand from other managers in a collaborative manner. This approach is similar to hierarchy-based approaches and can potentially scale, but higher levels will only have a general view, as in other hierarchy-based approaches. Moreover, users have to handle the added complexity of delegation and the resulting scenarios; in contrast, Hasthi enables users to provide explicit management logic using a global view of the managed system. Furthermore, ANDREA does not discuss recovery from failures of the management framework itself. Georgiadis et al. [64] present a self-organizing architecture in which managers collaborate via group communication to preserve the architectural constraints of the system while nodes join and leave. Changes are triggered by "join" and "leave" messages provided by the group-communication middleware. All managers are equal, and they perform actions using a totally ordered multicast, so every manager has the same view of the system and is able to make decisions with complete knowledge of the system. However, since group communication is used, the scalability of the system is limited. Dubey et al. [54] present a model-based management system, which uses a state machine to track the current state of the system and performs corrective actions when state transitions occur. The system is geared toward clusters and real-time systems and consists of local managers that install sensors in the nodes of a cluster, regional managers that listen to heartbeats, maintain a state machine for the region, and take corrective actions, and


a global manager, typically collocated with the cluster head node, that supports resource planning and new job submissions across regional managers. This system includes special techniques to control jitter and to synchronize sensor activations. One downside of this approach is that for large and complex systems, formulating the state machine may be difficult, because doing so requires understanding most of the system's behaviors. Therefore, compared to rules, specifying management logic is harder for the user with this approach.

Table 2.3 outlines the different approaches that have been used for global control in terms of scalability, robustness, and ease of writing management logic. In contrast, using Hasthi, users can write management logic that utilizes a global view of the managed system to detect failures, to carry out corrective actions, or both. The main difference between Hasthi and the other approaches is the support for a global view, which simplifies the authoring of management logic. For example, although Complex Event Processing supports detecting global conditions, the management actions triggered by such a condition do not have a global view. Moreover, it is not easy to identify the behaviors of a complex system in enough detail to define a state model for it, which limits the adoption of state-machine-based management systems. It is true that supporting a global view imposes scalability limits, but our results suggest that the limit is acceptable.

2.3.4 Monitoring Systems

Finally, the first column of Table 2.2 presents monitoring systems, and we briefly visit a few interesting systems among them. Astrolabe [42] is based on a hierarchy of nodes, where managed resources are placed as leaf nodes that expose sensor data as MIBs (Management Information Bases), and each node has a copy of the MIBs of all other nodes at the same level. MIB values are aggregated over the hierarchy, and the aggregated values are updated using a gossip protocol that synchronizes the MIB of each node with the nodes at the same level and its immediate children. Yalagandula et al. [128] use a similar gossip protocol; however, they utilize the internal tree of an Autonomous Distributed Hash Table to build the hierarchy of nodes, and the update frequencies of aggregated values are decided by the read-write ratio of attributes. Furthermore, the Ganglia [66] cluster monitoring framework uses broadcast to monitor a cluster, so each node in the cluster keeps track of monitoring data from all nodes in the cluster, and a hierarchy, which provides data aggregation, is built over clusters. InfoSpect [104] provides a centralized monitoring-and-diagnosis system in which monitoring information is fed in as Prolog facts and Prolog rules diagnose problems. Furthermore, Sophia [126] allows users to evaluate logic expressions in a distributed system; when a query is submitted, it is decomposed and distributed to nodes closer to its dependencies. Unlike these monitoring systems, Hasthi provides a decision model and global control.


Decentralized control (e.g. DMonA [91], Deugo et al. [51]). Scalable: highly; robust: yes; ease of writing management logic: hard. It is hard for users to write rules that achieve the desired emergent behavior.
Complex Event Processing (e.g. DREAM [44]). Scalable: yes; robust: possible; ease of writing management logic: not easy. The event model has limited memory, and the global view is not available to actions.
Consistent view across managers (e.g. Georgiadis et al. [64]). Scalable: no; robust: yes; ease of writing management logic: yes. Needs ordered reliable multicast, which does not scale.
Hierarchical control with aggregation (e.g. Monalisa [96]). Scalable: highly; robust: possible; ease of writing management logic: not easy. Loses the identity of individual resources due to aggregation.
Hierarchy with policies at each level (e.g. WildCat [73]). Scalable: yes; robust: possible; ease of writing management logic: possible. Policies are not as explicit as rules.
State machine (e.g. Dubey et al. [54]). Scalable: yes; robust: possible; ease of writing management logic: not easy. Users have to construct the state machine, which is hard.
Collaborative managers (agent negotiations and dynamic hierarchies, e.g. ANDREA [105]). Scalable: possible; robust: not clear; ease of writing management logic: possible.

Table 2.3: Summary of Other Approaches to Global Control



2.4 Summary

We have presented an architectural stack for system management, and using that stack as a base, we have provided a survey of the design choices associated with a management framework. We have compared and contrasted Hasthi with the classes of management systems available in the literature, and we have also provided a one-on-one comparison with systems that support global control over managed systems.

3 Hasthi Management Framework Architecture

This chapter presents the design and the architecture of the Hasthi management framework and lays the foundation for the rest of the discussion. We start with an overview of Hasthi, which sketches the basic subsystems and how they are related. Hasthi's three subsystems are illustrated in the following four sections. The sixth section demonstrates possible implementations of management actions and rules that can be used to program the user-defined management logic. The seventh section revisits the architecture. Finally, the discussion section recapitulates the architecture, discusses limitations, and examines how our assumptions affect real-life use cases. For this discussion, unless otherwise specified, we assume that network partitions do not occur, that any transient transport error can be recovered by simple retrying, and that the instrumentation that converts a resource to a manageable resource does not fail independently. The discussion section revisits these issues and analyzes their ramifications for real-life use cases.

3.1 Hasthi Overview

As described in the introduction, the goal of Hasthi is to enforce user-defined management logic (functions) in large-scale systems. There are many ways to express these functions; however, due to their declarative nature, we have chosen production rules to express management logic. There are two types of management functions: local functions that make decisions based on the state of a single resource, and global functions that make decisions based on the states of multiple resources. The basic building block of Hasthi is a service called a "manager," and apart from supporting a service interface, each manager has a control thread that activates periodically, performs bookkeeping, evaluates assigned resources, and performs corrective actions. There is also a designated manager called the "coordinator," which oversees the other managers. This section describes the behaviors and life cycles of managers, the coordinator, and managed resources. The process for making resources compatible with Hasthi is described in Chapter 4. Figure 3.1 presents a 10,000-foot view of Hasthi. Hasthi maintains a meta-model of the system, a remotely stored snapshot of the system's management state, with a guarantee that the information in the meta-model is not older than a predefined time interval. We call this freshness guarantee delta-consistency [112], a term borrowed from shared-memory processors. Furthermore, a decision framework periodically evaluates user-defined rules against the meta-model and performs corrective actions. Figure 3.2 presents a more detailed view of Hasthi. The components of Hasthi (managers, resources, and the coordinator) are arranged according to the hierarchy given in Figure 3.2, where managers are assigned to the coordinator and resources are assigned to a manager. Furthermore, managers and resources send periodic heartbeats to the coordinator and to the assigned manager, respectively.

Figure 3.1: Hasthi Outline

Figure 3.2: Hasthi Architecture

We define the Hasthi architecture in terms of three subsystems. The first subsystem, the "manager-cloud," maintains the aforementioned hierarchy by dynamically adding new components and recovering from component failures. The second subsystem, the "meta-model," builds the meta-model described previously and maintains delta-consistency. We provide a correctness proof for these two subsystems in Chapter 5, where we prove that regardless of the initial state, if managers do not join or leave for a defined time interval, Hasthi will build the aforementioned hierarchy and a meta-model that exhibits


delta-consistency. In addition, we prove that Hasthi will continue to maintain the structure and delta-consistency as long as managers do not fail.

Finally, the third subsystem, “the decision framework,” utilizes the hierarchy and the meta-model to enforce user-defined management logic on the managed system.

Let us briefly walk through the normal operation of Hasthi. The coordinator is elected from among the managers to control them and to enforce global management logic. At startup, each resource is assigned to a manager, which locally creates a meta-object of the resource representing the manageable state the resource exposes. The manager keeps the meta-object up to date by listening to heartbeat messages sent by the resource and updating the meta-object accordingly. Furthermore, in the absence of persistent communication failures, Hasthi guarantees that any change to a managed resource, including failures, is reflected in the meta-object within a fixed time interval (delta-consistency). Hasthi represents the management logic as rules and categorizes them as local rules and global rules: each local rule only needs to evaluate a single resource, while each global rule needs to evaluate multiple resources. Managers periodically evaluate local rules against the meta-objects of the resources they maintain locally. To evaluate global rules, the coordinator creates a summary of each meta-object locally, keeps it up to date by applying all major changes to meta-objects that are reported via manager heartbeat messages, and periodically evaluates global rules against these summarized meta-objects, which fit inside the coordinator's memory. Both evaluations may trigger actions, and the corresponding manager or the coordinator executes these actions.
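The following Java sketch, a simplification with assumed class and method names rather than Hasthi's actual implementation, illustrates the shape of the periodic manager control thread described above: heartbeats refresh meta-objects, the control thread evaluates local rules over them, and a summary of major changes is relayed to the coordinator.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only: class and method names are assumptions, not Hasthi's real API.
public class ManagerControlLoop {

    /** Meta-objects for the resources assigned to this manager, keyed by resource name. */
    private final Map<String, MetaObject> metaObjects = new ConcurrentHashMap<>();
    private final LocalDecisionUnit localRules;
    private final CoordinatorClient coordinator;

    public ManagerControlLoop(LocalDecisionUnit localRules, CoordinatorClient coordinator) {
        this.localRules = localRules;
        this.coordinator = coordinator;
    }

    /** Called when a resource heartbeat arrives; keeps the local meta-object up to date. */
    public void onResourceHeartbeat(String resource, Map<String, Object> properties) {
        metaObjects.computeIfAbsent(resource, MetaObject::new).update(properties);
    }

    /** Starts the periodic control thread: bookkeeping, rule evaluation, coordinator heartbeat. */
    public void start(long epochSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            for (MetaObject metaObject : metaObjects.values()) {
                metaObject.markStaleIfHeartbeatMissing(epochSeconds);
                localRules.evaluate(metaObject);               // may trigger corrective actions
            }
            coordinator.sendManagerHeartbeat(summaryChanges()); // relay major changes upward
        }, epochSeconds, epochSeconds, TimeUnit.SECONDS);
    }

    private Map<String, String> summaryChanges() {
        Map<String, String> changes = new ConcurrentHashMap<>();
        metaObjects.forEach((name, mo) -> changes.put(name, mo.status()));
        return changes;
    }

    // Minimal collaborators so the sketch compiles; real counterparts would be richer.
    public interface LocalDecisionUnit { void evaluate(MetaObject metaObject); }
    public interface CoordinatorClient { void sendManagerHeartbeat(Map<String, String> changes); }

    public static class MetaObject {
        private final String name;
        private volatile String status = "Unknown";
        private volatile long lastUpdateMillis = System.currentTimeMillis();
        public MetaObject(String name) { this.name = name; }
        public void update(Map<String, Object> properties) {
            status = String.valueOf(properties.getOrDefault("status", status));
            lastUpdateMillis = System.currentTimeMillis();
        }
        public void markStaleIfHeartbeatMissing(long epochSeconds) {
            if (System.currentTimeMillis() - lastUpdateMillis > 2 * epochSeconds * 1000) {
                status = "Unreachable";   // a failure-detection hook would run here
            }
        }
        public String status() { return status; }
        public String name() { return name; }
    }
}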

The following sections discuss each subsystem in detail.

3.2 Manager-Cloud (MCloud)

The manager-cloud, which we call the "MCloud," is comprised of managers and bootstrap nodes. We discussed managers earlier. Bootstrap nodes run on well-known ports and act as entry points to the cloud by routing messages to managers, thus helping outsiders and newcomers find the current coordinator. The cloud is a logical collection of managers acting as a unit, and it provides the architectural foundation of Hasthi. Specifically, the MCloud recovers from failures, discovers new managers and resources, and permits new managers and resources to join at runtime. However, it is important to note that despite its name, the MCloud does not have any relationship to the cloud computing initiative. Let us now briefly look at the associated design choices. As described in Chapter 2, many inter-manager control structures have been proposed, among them a hierarchy (e.g. Monalisa [96], Ganglia [49]), a shared consistent view of the system (e.g. Georgiadis et al. [64]), peers working together to achieve emergent control (e.g. [91]), and a system built around a designated controller (e.g. [60]). For Hasthi, we have chosen the latter, where a designated manager, called the "coordinator," controls the managers and acts as the point of contact for the manager-cloud. Furthermore, replication (e.g. [60]), self-healing structures (e.g. [128]), and recovery using elections (e.g. [109]) have been used to achieve high availability and robustness, and we have chosen elections due to their flexibility and self-recovering nature. The following discussion illustrates how initialization, normal operation, and recovery are achieved in the MCloud; a formal proof of MCloud correctness is given in Chapter 5. Managers and the coordinator are part of a peer-to-peer (P2P) network, and therefore, they can talk to each other either over the P2P network or using HTTP-based SOAP. In normal operation, for instance when sending heartbeats, they use HTTP-based SOAP, but they use broadcast and anycast over the P2P network for initialization, recovery, and advertising


the current coordinator. Bootstrap nodes run on well-known addresses that all other components know; therefore, they act as entry points to the MCloud. Furthermore, bootstrap nodes remember the current coordinator, and if the coordinator is not known, whenever a bootstrap node needs the coordinator, it performs an anycast to ask the other managers in the system about the coordinator. At initialization, each manager asks a bootstrap node about the current coordinator; the bootstrap node tries to find the coordinator if it is not already known and either responds with the coordinator or notifies the manager that a coordinator does not exist. In the absence of a coordinator, new managers periodically retry the search and, after a timeout, assume the coordinator role. It is possible that two coordinators are elected by the above algorithm. To mitigate this, each coordinator announces itself by periodically broadcasting "coordinator heartbeat" messages, and if two coordinators are present, they discover each other from these heartbeats, break the tie using hashes of their addresses, and one coordinator resigns. Furthermore, if a network partition has occurred and recovered, the above algorithm ensures that the manager-clouds in each partition will merge. On the other hand, if a new manager receives a coordinator from the bootstrap node, it joins the MCloud by sending a join message. The coordinator keeps track of membership in the cloud using a soft-state protocol: each manager sends periodic heartbeats to the coordinator, and if heartbeat messages from a manager are missing, the coordinator initiates failure detection and removes the manager if it has failed. Coordinator failures are addressed by electing a new coordinator among the managers. If the coordinator fails, manager heartbeats will fail, and each manager will start an election after a random wait, in which the initiator of the election broadcasts a nomination request. Subsequently, based on the responses to the nomination request (e.g. nominations), it invites the next manager in line to become the coordinator. Currently, we use the age of managers


to choose the new coordinator and approximate the age using local clocks; therefore, this age might differ from the real age. However, the exact order of managers is significant neither for the algorithm nor for the functionality of the MCloud; hence, we can ignore unsynchronized clocks. Since the P2P broadcast is not reliable, two coordinators may be elected despite the above algorithm. However, similar to initialization, they eventually discover each other via coordinator heartbeats, and one steps down in favor of the manager with the higher age. Furthermore, if the manager that stepped down had received heartbeats from other managers, it redirects those managers to the new coordinator. The new coordinator rebuilds the assignment information by collecting it from the existing managers. To join the manager-cloud, a resource sends a "ManageMe" message to a bootstrap node, the bootstrap node routes it to the coordinator via other managers, the coordinator assigns the resource to a manager, and the manager notifies the resource about the assignment. If a coordinator is not available, the message is dropped. However, the resource retries the message until a manager is assigned, and the coordinator filters duplicate messages; therefore, when a new coordinator is eventually elected, the resource will be assigned to a manager. Furthermore, if the assigned manager has failed, the resource resumes sending "ManageMe" messages, thus restarting the join process. Managers keep track of assigned resources via a soft-state protocol in which resources send periodic heartbeat messages to the manager, and if heartbeat messages are missing, the manager starts failure detection and updates the resource monitoring data accordingly. Table 3.1 outlines the initialization, operation, and recovery procedures described in this section.

Coordinator. Initialization: the first manager becomes the coordinator, and the others find the coordinator via bootstrap nodes. Normal operation: controls the managers. Recovery from failures: failures are detected by the managers, and the next manager in line is elected.
Manager. Initialization: searches for the coordinator and joins the manager-cloud. Normal operation: manages assigned resources and relays updates about assigned resources to the coordinator via periodic heartbeats. Recovery from failures: failures are detected by the coordinator when heartbeats from the manager are absent, and by resources when sending heartbeats to the manager fails; the coordinator removes the manager from the manager-cloud, and the manager's resources restart the join process and rejoin the manager-cloud.
Resource. Initialization: sends "ManageMe" messages to the coordinator via bootstrap nodes. Normal operation: sends periodic heartbeats, which include changes to the resource state, to the assigned manager. Recovery from failures: failures are detected by the manager and acted upon by the decision framework.
Bootstrap node. Initialization: not applicable. Normal operation: acts as an entry point and routes messages to the coordinator. Recovery from failures: bootstrap nodes are replicated and stateless, so no recovery is needed.

Table 3.1: Summary of the Manager-Cloud
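As a concrete illustration of the two deterministic choices described above, picking the next coordinator by manager age and breaking a two-coordinator tie by comparing hashes of manager addresses, the following Java sketch uses assumed names and omits the P2P broadcast, timeouts, and retries that the real election involves.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative only: names are assumptions; broadcast, timeouts, and retries are omitted.
public class ElectionSketch {

    /** A manager as seen by the election: its address and its locally approximated age. */
    public record ManagerInfo(String address, long ageMillis) {}

    /** Picks the next coordinator: the oldest manager wins (local clocks are good enough here). */
    public static Optional<ManagerInfo> nextCoordinator(List<ManagerInfo> liveManagers) {
        return liveManagers.stream().max(Comparator.comparingLong(ManagerInfo::ageMillis));
    }

    /**
     * Tie-break when two coordinators discover each other via coordinator heartbeats:
     * the one whose address hash is smaller resigns. Any deterministic total order works;
     * hashing the address is just one simple choice.
     */
    public static boolean shouldResign(String myAddress, String otherCoordinatorAddress) {
        return Integer.compare(myAddress.hashCode(), otherCoordinatorAddress.hashCode()) < 0;
    }

    public static void main(String[] args) {
        List<ManagerInfo> managers = List.of(
                new ManagerInfo("http://silktree:8080/manager", 120_000),
                new ManagerInfo("http://tyr16:8080/manager", 540_000));
        nextCoordinator(managers).ifPresent(m -> System.out.println("elect " + m.address()));
        System.out.println(shouldResign("http://silktree:8080/manager",
                                        "http://tyr16:8080/manager"));
    }
}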

3.3 Meta-Model: An Abstraction for Monitored Information

Each resource in a managed system is monitored by integrated instrumentation that collects monitoring information, and this section explains an abstraction that exposes the collected monitoring information through well-defined interfaces. As explained in Chapter 2, related works use three approaches to expose monitoring information: exposing information via events and processing them


using Complex Event Processing (e.g. HiFi [30], DREAM [44]), maintaining a meta-model of the system that reflects the system state (e.g. Marvel [78]), and collecting the information on demand (e.g. [97]). We have chosen a meta-model-based approach for Hasthi because it provides a representation similar to the actual system, thus enabling users to reason about the meta-model the same way they reason about the actual system. Furthermore, there are a few different varieties of meta-models, such as in-memory (e.g. Marvel [78]), database-based (e.g. Hyperic [13]), and remote-repository-based (e.g. CIM, MIBs, LDAP). We have chosen an in-memory meta-model distributed across managers and extended it with a summarized version that can fit inside the coordinator's memory. The in-memory model fits efficiently with the Rete-algorithm-based decision model of Hasthi, and since the manager-cloud is capable of recreating the meta-model in case of a failure, preserving the meta-model state is not required.

Figure 3.3: Meta-Model Architecture

Figure 3.3 depicts an outline of the Hasthi meta-model, and as described earlier, the


meta-model is a collection of meta-objects, where the meta-object of a resource is a remotely stored copy of the resource properties exposed by the managed resource. Resource properties are categorized as configurations and metrics: the former represent resource state explicitly set by users (e.g. the maximum number of threads), and the latter represent metrics collected from the resource (e.g. memory usage or the number of pending requests). After a resource is assigned to a manager, it sends periodic heartbeats to the manager. The heartbeat period is called the "resource epoch," and a resource heartbeat includes the current metric values and the changes to configurations since the last heartbeat message. Once a resource is assigned to a manager, the manager creates a meta-object for the resource and updates it using the resource properties contained in the heartbeats. The set of all meta-objects in all managers constitutes a meta-model of the system. However, as illustrated in Figure 3.3, those meta-objects are distributed across managers. As described earlier in this chapter, global rules depend on more than one resource, and there is no guarantee that any two given resources will be assigned to the same manager; hence, a single manager cannot evaluate global rules. Therefore, to facilitate global decisions, a summary of each meta-object is kept in the coordinator's memory. The summary includes a few properties such as the name, the management endpoint, and the operational status (e.g. Busy, Saturated, or Crashed); therefore, a resource summary does not take excessive memory, and it is seldom updated. Furthermore, if changes to a meta-object change its summary, those changes are sent to the coordinator with the periodic manager heartbeats, and the coordinator updates its local summarized meta-object. This summarization is motivated by the fact that, for system-level decisions, a high-level summary of resources suffices, and it is a crucial contribution that enables efficient global control over a large system. It is true that the summarized data held by the coordinator places an upper limit on the scale of the


managed system; however, empirical results have demonstrated that this limit is high. We present empirical results in Chapter 6. If heartbeat messages from a resource are missing, the manager performs failure detection; the failure-detection algorithms are pluggable and can be chosen based on the characteristics of the managed system. Given that the communication is reliable, the following two properties hold.

1. The monitored data is reflected in the meta-model within a bounded time, so the management system can react to the data.
2. The changes from the same resource reach the meta-model in the order in which they are collected.

These two properties are analogous to the guarantees provided by the delta-consistency model [112] used in shared-memory processors, and a proof of this result is given in Chapter 5. To summarize, the following are characteristics of the meta-model.
1. The meta-model is read-only; however, it reflects the changes that have occurred in its sources (the actual managed resources).
2. The state of each meta-object can always be retrieved from its source; therefore, the state need not be preserved, and it can be reconstructed after a failure.
3. The meta-model exhibits delta-consistency [112] (if the network and managers do not fail), meaning that changes are reflected in the meta-model within a bounded time.
4. Data-collection times spanning different objects are not synchronized; however, the time difference is always less than the resource epoch (collection period).
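The following minimal Java sketch, with assumed names, illustrates the heartbeat payload and update path implied above: each heartbeat carries the current metric values plus the configuration changes since the last heartbeat, and applying heartbeats in order preserves the per-resource ordering on which delta-consistency relies.

import java.util.HashMap;
import java.util.Map;

// Illustrative only: payload and field names are assumptions about the scheme described in the text.
public class MetaObjectSketch {

    /** One resource heartbeat: full current metrics plus configuration deltas since the last one. */
    public record ResourceHeartbeat(String resourceName,
                                    long sequenceNumber,
                                    Map<String, Double> metrics,
                                    Map<String, String> configChanges) {}

    private final Map<String, Double> metrics = new HashMap<>();
    private final Map<String, String> configurations = new HashMap<>();
    private long lastApplied = -1;
    private long lastHeartbeatMillis = -1;

    /** Applies a heartbeat; per-resource ordering is preserved by rejecting stale sequence numbers. */
    public synchronized void apply(ResourceHeartbeat heartbeat) {
        if (heartbeat.sequenceNumber() <= lastApplied) {
            return;                                            // duplicate or out-of-order: ignore
        }
        metrics.putAll(heartbeat.metrics());                   // metric values are replaced wholesale
        configurations.putAll(heartbeat.configChanges());      // configurations change incrementally
        lastApplied = heartbeat.sequenceNumber();
        lastHeartbeatMillis = System.currentTimeMillis();
    }

    /** True if no heartbeat arrived within the freshness bound (one input to failure detection). */
    public synchronized boolean isStale(long resourceEpochMillis) {
        return lastHeartbeatMillis >= 0
                && System.currentTimeMillis() - lastHeartbeatMillis > 2 * resourceEpochMillis;
    }
}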

3.4 Case for Delta-Consistency

We have observed that Hasthi provides a delta-consistency guarantee, and this section makes the case for delta-consistency by arguing that it is sufficient for representing monitoring information. Since the state of each resource must be transferred via messages, the information seen by a manager always lags behind the real system by at least the message latency, and in the absence of a global clock, data-collection times from different resources cannot be synchronized. This means, essentially, that delta-consistency cannot be avoided, and the rest of the section argues that it does not impose major limitations. While monitoring a system, monitoring data are collected periodically and transferred to a manager, effectively sampling the property values of system resources (e.g. CPU usage). In this context, the sampling theorem [19] says that a signal should be sampled at a frequency at least twice as high as the highest frequency component of the signal. However, since the change frequencies of properties are not known, the theorem cannot be used directly to calculate the sampling frequency. But if a property changes slowly, its highest frequency component is small, and therefore, according to the theorem, smaller sampling rates are sufficient for properties that change slowly. Resource properties are categorized as configurations and metrics, and in this section, we argue that both types change slowly. Using this result, we then argue that small sampling rates suffice for system management. Let us look at each type.

1. Configurations are usually static and change only if the resource is reconfigured, which happens rarely. Because of this, they change slowly.
2. Metrics are usually either relatively stable properties (like memory consumption and operational status) or averages collected over a period of time, which are also stable. This is because management is concerned with stable conditions, and since


corrective operations are expensive, it is not feasible to respond to momentary conditions in the system. For example, if a service was overloaded for a short period of time (e.g. 10 seconds) and recovers, the management framework does not need to act on it. However, if it has been overloaded for a longer time period (e.g. 5 minutes) and continues to be overloaded, the framework may need to act. Furthermore, CPU usage is always measured over time, not instantaneously, because 100% CPU usage at a given point in time has little significance; however, an average of 95% CPU usage over 5 minutes might suggest that the system is overloaded. Similar arguments can be made for properties like load average, the number of successful requests, the number of failed requests, maximum request time, and average request time, which are collected over time, not instantaneously. Therefore, from points 1 and 2 above, we argue that resource properties are stable, and hence the sampling theorem suggests that low sampling rates are sufficient. As a result, even though properties are only periodically updated in the meta-model, they remain representative of the load on the original resource. With lower sampling rates, Hasthi may not see a condition that occurred only for a short time, and it takes at least one sample period before it responds to a change. However, as illustrated earlier, management is concerned with stable conditions; therefore, it can tolerate missing a condition that only lasted for a brief period. In contrast, with manual error correction, recovery from failures may take hours, and in most cases, automated recovery that takes only minutes is a substantial improvement. More often than not, response times of seconds or minutes are acceptable; therefore, a management framework can make decisions based on stable conditions that occur over time. As a result, we argue that delta-consistency does not significantly affect decisions and, therefore, exposing information with delta-consistency is sufficient for system management.
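As a rough, illustrative application of this argument (the 5-minute averaging window below is an assumption used only for the arithmetic, not a measured value): if the metric of interest is an average computed over a 5-minute window, its highest meaningful frequency component is on the order of the reciprocal of that window, and the sampling theorem then bounds the required sampling period:

\[
f_s \;\ge\; 2 f_{\max}, \qquad
f_{\max} \approx \frac{1}{300\,\mathrm{s}} \;\Rightarrow\; T_s = \frac{1}{f_s} \;\le\; \frac{300\,\mathrm{s}}{2} = 150\,\mathrm{s}.
\]

That is, under this assumption, sampling such a metric every couple of minutes or faster already captures the behavior the management logic cares about.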

3.5 Decision Framework

Hasthi monitors a system through sensors and controls it through actuators, and the decision framework acts as the bridge between the sensor information and the actuators. As described earlier, the goal of Hasthi is to enforce user-defined global management logic over large-scale systems. Among the techniques used to realize control loops are rules, static logic, decision tables, fuzzy logic, and artificial intelligence techniques such as neural networks. However, since the control instructions are provided by end users, a declarative medium (e.g. rules) is preferred. Hasthi uses a forward-chaining rule engine because it provides an intuitive, declarative, and powerful medium that allows users to define how the system should be managed. In earlier efforts, as described in Chapter 2, global control has been realized using decentralized control (e.g. [91], [111]), a hierarchy of managers (e.g. Wildcat [73]), and Complex Event Processing (e.g. HiFi [30], DREAM [44]). Among these, Hasthi takes a coordinator-based approach and implements a global control loop, which is evaluated against the summary of the system meta-model. Since the summary of the meta-model is contained within the coordinator, the global control loop has a complete view of the system. Furthermore, each resource is controlled using a local loop, which provides resource-level control. As depicted in Figure 3.4, the decision framework is composed of decision units distributed across the managers and the coordinator. A decision unit accepts management rules and a set of meta-objects at initialization, and when triggered, it evaluates the meta-objects using the management rules and triggers corrective actions in response to failure conditions. Decision units use the Drools rule engine [9], which uses the Rete algorithm [58], to evaluate rules. There are two types of decision units: a global decision unit, which is initialized with the global rules and periodically triggered by the coordinator control loop, and local decision units, which are initialized with the local rules and periodically triggered by manager control loops.

Figure 3.4: Decision Framework

Local decision units are placed in every manager, and they evaluate the meta-objects of the resources assigned to that manager; the local rules associated with them make local decisions and perform resource-level corrective actions. For example, a local rule can check the state of a resource, decide whether or not it is faulty, and then restart the service if needed. On the other hand, the global decision unit resides in the coordinator. It manages all resources in the system by evaluating the summarized meta-model of the system stored locally in the coordinator, and the associated global rules make system-level decisions. Each global rule needs to assert more than one resource to make decisions and carry out actions. For example, a global rule may be designed to ensure that the system has five message brokers, and it may enforce this condition by creating new brokers when a broker has failed. However, to do so, it must examine all brokers in the system and must find and interact with other brokers to add a new broker to the system; therefore, it needs a global view of the system. This global view is provided by the summarized meta-model. Furthermore, if management


actions need properties that are not included in the summary, the summarized meta-objects automatically fetch those properties from the associated meta-objects on demand, and this happens transparently to the management logic. If the coordinator fails, a new coordinator is elected; however, while the election is in progress, the system may briefly have two coordinators, giving rise to two global control loops. To mitigate this problem, for a few initial epochs after being elected, the new coordinator turns off the decision framework. As we will demonstrate in Chapter 5, if broadcasts over the P2P network are intact, all coordinators except the one with the highest rank resign within this time period. Therefore, it is unlikely that two coordinators will evaluate the global control loop at the same time.
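To illustrate the kind of logic a global rule carries, the following Java sketch restates the five-brokers example in plain code against a hypothetical summarized meta-model API; the type names and the ActionExecutor interface are assumptions, and in Hasthi this logic would be written as a Drools rule rather than Java.

import java.util.Collection;

// Illustrative only: a hypothetical global check written as plain Java instead of a Drools rule.
public class BrokerCountRule {

    /** The per-resource summary held in the coordinator's memory (name, type, status, endpoint). */
    public record ResourceSummary(String name, String type, String status, String endpoint) {}

    public interface ActionExecutor {
        void createService(String type);   // stands in for a create-service management action
    }

    private static final int REQUIRED_BROKERS = 5;

    /**
     * Global check: count brokers that are not crashed across the whole (summarized) system
     * and create replacements if fewer than the required number remain.
     */
    public static void enforce(Collection<ResourceSummary> summarizedMetaModel,
                               ActionExecutor actions) {
        long healthyBrokers = summarizedMetaModel.stream()
                .filter(r -> "MessageBroker".equals(r.type()))
                .filter(r -> !"Crashed".equals(r.status()))
                .count();
        for (long i = healthyBrokers; i < REQUIRED_BROKERS; i++) {
            actions.createService("MessageBroker");   // placement could also use the global view
        }
    }
}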

3.6 Programming Hasthi to Manage a System

As described in the previous sections, using management rules, users can program Hasthi to manage a given system by activating management actions when faulty conditions are detected. This section describes the associated programming model by illustrating management actions, rules, and the resource life cycle.

3.6.1 Management Actions

A management action is initiated by a management rule in Hasthi to change the state of a resource or a group of resources; management actions are the effectors that enforce the will of the management framework on a managed system. Typically, resource-level actions are performed through the actuator interfaces of manageable resources, and composite actions are created by composing these resource-level actions. There are many possible management actions; the following are a few common classes.


1. Create a new service – either start a service that is already installed or install the service and start it.

2. Restart a running service or recover a failed service.

3. Relocate a service – this action includes saving any state that needs to be preserved and moving a service to a different host or container.

4. Tune and configure a resource – this action has two variations: the first changes the configuration of a resource, and the second changes the structure of the system (typically by adding or removing inter-service dependencies).

5. Initiate a checkpoint – the service state might be checkpointed before performing a potentially damaging action on a service.

6. Perform a micro-reboot [45] – this action reboots subsystems or parts of resources.

7. Diagnose a potential error or perform a functional ping to assess the health of a service.

8. Send notifications or perform user interactions – this action sends an email to administrators and receives their inputs, thus incorporating user inputs into the decision process. This action is described in Chapter 4.

Chapter 4 illustrates the implementation of the management actions supported by Hasthi. However, management actions are highly dependent on the nature of the managed system; therefore, to support a wide range of management actions, Hasthi introduces an action framework that enables users to define custom management actions. Hasthi provides an initial set of management actions, and users can write their own actions in Java and make them available by placing them in the system classpath.
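For example, a user-defined action might implement a functional ping (action class 7 above). The sketch below is illustrative only: the small Action interface shown here is a stand-in for Hasthi's actual action API, which is not reproduced in this chapter.

import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical action contract used only for this sketch.
interface Action {
    void execute() throws Exception;
}

// A custom action that probes a service URL and fails if the service answers
// with a server error; a rule could invoke it like any other action.
public class FunctionalPingAction implements Action {
    private final String serviceUrl;

    public FunctionalPingAction(String serviceUrl) {
        this.serviceUrl = serviceUrl;
    }

    public void execute() throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(serviceUrl).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        int status = conn.getResponseCode();
        if (status >= 500) {
            throw new Exception("Functional ping failed with HTTP " + status);
        }
    }
}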


Listing 3.1: A Sample System Profile

<installDir>/usr/local/xregistry
xreg-start.sh SRCFILE
xreg-stop.sh
silktree.cs.indiana.edu
tyr16.cs.indiana.edu
mysql
...

Create a New Service – using the start command defined in the resource profile.
Stop a Service – using the stop command defined in the resource profile.
Configure a service or edit the system structure – using WSDM setProperty().
Relocate a Service – Hasthi performs a shutdown and passes the storage location to the restart command if a location is defined.
User Interaction – sends an email message to the user; the message presents an HTTP form, and when the form is submitted, the results are sent back to Hasthi as a REST Web Service invocation, which triggers a callback registered with the action.
Send Email – using an Email API.

Table 3.2: Summary of Management Action Implementations

Every management action includes two types of information: the logic that outlines how a given action should be carried out and configurations that define targets of actions.


For example, the create-service action has programming logic that starts a new service, but configurations decide where the service should be started. By defining a configuration file that describes a profile for each resource and implementing management actions that read configurations from the associated resource profile, Hasthi separates configurations from logic. An example of a resource profile is given in Listing 3.1. For example, the profile of the resource to be created is passed into the create-service action, which reads the configurations from the profile and creates a service according to those configurations. Therefore, by using different resource profiles, the same action implementation can be applied to different resources.

On the other hand, the main concern with management actions is that their execution can take a comparatively long time, and if decision loops have to wait for management actions to complete, the decision process slows down. To mitigate this problem, the action framework of Hasthi supports asynchronous execution of management actions, and different actions are composed together using callbacks, which are triggered when a given action completes. Due to its asynchronous nature, the resulting programming model is more complex than a synchronous programming model, yet it is more efficient and frees the main decision loops from the overhead of executing actions.

The Hasthi rule environment provides a global object called “system,” which provides an API to perform management actions and register callbacks. Furthermore, management actions can be composed using callbacks to create composite actions. An example of such a composition is found in the then-clause of Listing 3.2. Table 3.2 illustrates the different management actions supported by Hasthi and how they are implemented. To run remote shell commands, Hasthi uses either host agents running in each host or grid-based invocations. We shall revisit action implementations in Chapter 4.

3.6.2 Management Rules

Hasthi has enabled end users to express recovery actions for abnormal system conditions using management rules written with the Drools rule language [9]. To aid in evaluating a system, Hasthi has mapped the system into a meta-model, which resides in managers, and by evaluating the meta-model using rules, Hasthi effectively evaluates the system.

Listing 3.2: A Sample Rule

rule "Restart Failed Services"
when
    service : ManagedService( state == "CrashedState" );
    host : Host( state != "CrashedState", service.host == name );
then
    final ManagedService failedService = service;
    final ActionCenter finalSystem = system;
    system.invoke( new RestartAction( service ), new ActionCallback() {
        public void actionSucessful( ManagementAction action ) {
            MngActionUtils.setResourceState( action.getActionContext(),
                    failedService, "RepairedState" );
        }
        public void actionFailed( ManagementAction action, Throwable e ) {
            MngActionUtils.setResourceState( action.getActionContext(),
                    failedService, "UnRepairableState" );
            try {
                finalSystem.invoke( new UserInteractionAction(
                        finalSystem, failedService, action, e ) );
            } catch ( Exception e1 ) {
                e1.printStackTrace();
            }
        }
    } );
end

Listing 3.2 presents a sample management rule. Each rule has two parts: a "when-clause," which defines a condition, and a "then-clause," which defines an action to be carried out. The when-clause is based on an object query language in which the query expression host:Host(state == "CrashedState") selects any object of the type Host whose state is


“CrashedState”. A when-expression may include multiple query expressions, and for each rule, the rule engine searches for objects among the meta-objects that match all query expressions defined by the rule. For every object set that matches the query expressions, the rule engine executes the then-clause. For example, in the rule given in Listing 3.2, the when-clause searches for any crashed service whose corresponding host has not failed, and if a match is found, the rule restarts those matching services. Furthermore, this rule illustrates how to use callbacks to compose management actions; for example, the rule registers a callback with the action, and if the restart action fails, the actionFailed() method of the callback is invoked, which in turn performs a user interaction to notify a user about the failure and ask him to fix the problem. The procedure of matching when-clauses and triggering then-clauses is performed by the Rete algorithm [58], which is the state of the art in implementing forward-chaining rule engines. The algorithm remembers the results of old evaluations of facts and, therefore, works incrementally. For example, if a resource in the system changes, the rule engine only needs to evaluate that resource and any changes to old data caused by the resource change. From a design point of view, the Rete algorithm trades memory for speed by remembering the results of old evaluations and, therefore, is suited to evaluating a system that undergoes incremental changes. Any Java code can go in the then-clause, and it can use any Java library placed in the system classpath. Furthermore, since they support arbitrary Java logic, management rules are Turing complete; therefore, if enough information is exposed by resources, any deterministic management algorithm can be coded using rules.

3.6.3 Resource Life Cycle within Decision Model

With Hasthi, each resource in the system is represented by a meta-object in the meta-model that exhibits delta-consistency [112]. This section presents the lifecycle and details


of meta-objects, which represent resources within Hasthi. Each meta-object has the properties “name,” “state,” and “category”: the “name” is a unique identifier of the resource, the “state” is the operational state of the resource, and the “category” is the membership of the resource in one of the categories defined in Figure 3.5(a).

(a) Resource Hierarchy supported by the Decision Framework

(b) Lifecycle of a Resource

Figure 3.5: Resource Hierarchy and Resource Lifecycle

Figure 3.5(a) depicts a categorization of managed resources, and the meta-objects of each category are mapped to a particular type (e.g. managed services are mapped to the ManagedService class, and managed hosts are mapped to the Host class). Furthermore, managed services have a property called “type,” which is a name for an abstract description of their functionality; therefore, if the “type” property of two services is the same, both services are functionally equivalent. Moreover, meta-objects contain custom properties defined by each resource, and those properties are illustrated in Table 8.1, Chapter 4. Figure 3.5(b) illustrates the lifecycle of a resource, and the current lifecycle state of a resource is represented by the “state” property of the meta-object. If a resource is healthy and exhibits normal behavior, it is in one of the operational states: “Idle,” “Busy,” or “Saturated,” and the management agent of the resource, which is integrated with the resource,


decides among these states. On the other hand, if a resource has crashed or become faulty, it is in the corresponding “Crashed” or “Faulty” state. When heartbeat messages are missing from a particular resource, or when someone (e.g. another resource) notifies Hasthi that a resource is suspected to have failed, the manager assigned to the resource triggers a failure detector. If the resource has failed, the detector marks the resource as “Crashed” or “Faulty”. Furthermore, rules may identify faulty resources by detecting certain abnormal conditions (e.g., the ratio of failed requests to successful requests is high). A resource is in the “Repairing” state while it is being repaired by rules, and when the repair has completed, either the resource returns to an operational state, or the repair has created an alternative resource in its place. In the latter case, the original is marked as “Repaired” and removed after a few heartbeat epochs. However, if the recovery process fails at any point, the resource is marked as “Unrecoverable”. To summarize, users can author management rules that instruct Hasthi how it should react to different faulty conditions observed via the monitoring information collected from the managed system. Most error conditions are observed as transitions in the lifecycle of resources, and using rules, users connect the observed conditions to management actions. We will look at some example scenarios in Chapter 9.
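For instance, the transition into the “Faulty” state on a high failure ratio could be expressed roughly as below; the property names mirror the request counters listed in Table 4.1, but the exact field names, state strings, and the setter used in the consequence are illustrative assumptions, not Hasthi's actual meta-object API.

rule "Mark Service Faulty On High Failure Ratio"
when
    // an operational service whose failed requests outnumber its successful ones
    service : ManagedService( state != "CrashedState", state != "FaultyState",
                              successfulRequests > 0,
                              failedRequests > successfulRequests )
then
    modify( service ) { setState( "FaultyState" ) }
end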

3.7 How does it all work?

This section recapitulates Hasthi architecture and illustrates how different parts of the system work together to manage a system. To that end, the next subsection illustrates Hasthi as a black box by defining end-user interfaces, and the following subsection exemplifies an application of Hasthi by presenting a usecase.


Figure 3.6: Hasthi from User’s Perspective

3.7.1 User Perception

Figure 3.6 illustrates Hasthi as perceived by end-users, and we shall describe the figure by walking through the process of using Hasthi to manage a system. We have identified three user roles for Hasthi: system designers, system deployers, and day-to-day administrators. It is possible for all three roles to be played by the same group. To determine if a system can be managed with Hasthi, system designers analyze the system to understand the necessary guarantees and management usecases, and Chapter 7 presents a detailed discussion of the application scope of Hasthi. If the system falls within the application scope of Hasthi, it can be managed with Hasthi. The next step is to understand the monitoring information—resource properties exposed by managed resources of the system—where system designers instrument each resource to


support WSDM and expose a representative subset of its state as resource properties. As illustrated in Chapter 4, Hasthi provides extensible and generic instrumentation tools to aid the instrumentation. Subsequently, system designers must identify common management usecases and author local and global rules to manage the system.

After system designers integrate Hasthi and the target system, the system deployers take over the deployment, and it is their responsibility to provide configurations for Hasthi. Configurations include a system profile, which provides information about resources to management actions, and settings, which enable users to set up Hasthi by changing properties like epoch times and providing management rules. Examples of other properties are failure detectors, which enable users to embed custom failure-detection logic into Hasthi, and value converters, which decide the format of monitoring information sent over messages. After the configurations are prepared, Hasthi can be deployed and initialized by setting up configurations, starting bootstrap nodes and managers, configuring each resource to use bootstrap nodes for discovering the management framework, and starting the resources.

The arrowed boxes on the left side of Figure 3.6 depict user interfaces exposed by Hasthi, and the arrowed boxes on the right side show management actions initiated by it. Specifically, managed resources join the Manager-Cloud using the “resource join” operation, send updates using the “sending heartbeat” operation, and discover other services using the “dependency-discovery” operation. To illustrate the last case, let us look at an example. If a service depends on a registry, it may find a registry instance via the dependency-discovery lookup operation, either at startup or as an alternative for a failed registry, and use the new registry. Furthermore, if a resource suspects that another resource is faulty, it may notify Hasthi using the “failure suspect” operation. Moreover, system administrators can use actions and event-triggering ports to signal events or perform actions, and they can query the managed system by querying the meta-model provided by Hasthi.

3.7.2 Motivating Usecase

Let us consider a distributed workflow system as a motivating usecase. Assume there exists a workflow engine that checkpoints the progress of each workflow it executes, orchestrates workflows composed of stateless services whose service invocations have a small cost, and supports workflow-engine state recovery in case of a failure. In such a system, after a failure, services can be restarted, the workflow engine can be recovered, and workflows can be resumed from a checkpoint. Hasthi, if assigned to manage the workflow system, assigns each resource to a manager and creates a meta-model of the system that exhibits delta-consistency. Managers monitor resources using heartbeat messages, and failure detectors are initiated by the absence of heartbeats or the reception of faulty-suspect messages from other services. Each manager periodically runs a local control-loop, which inspects assigned resources and performs local actions like detecting faulty services and shutting down old services. Also, the coordinator keeps a summary of each resource in memory and periodically evaluates global rules, which provides global control. Among examples of supported usecases are recovering failed services, relocating services if a host is overloaded, creating or shutting down services if a particular service type is over- or under-used, and resurrecting failed workflows after the system has recovered from a failure. This particular example is simple because the services are stateless and an occasional wrong decision (e.g. a false positive) is not critical; such a simple example was chosen for clarity. In Chapter 7, we shall dive into more complex systems. It is important to note that the decision model is an abstract management rule framework, and the way the system behaves is up to the users who write the management rules. As a rule of thumb, Hasthi is unlikely to make decisions better than a human administrator, but since the rule language is Turing complete, any deterministic decision process performed by a human administrator can be implemented with rules. For example, consider a service that is unreachable: if a human cannot decide between a network failure and a service failure, neither can rules,


but if there is a test that a human can always use to detect the failure, the same test can be performed using a rule.

3.8 Discussion

Hasthi is a dynamic and robust management framework, which enables users to maintain a global view of large-scale systems and to control them using user-defined rules. By integrating management agents with resources and authoring management rules that capture common management scenarios, Hasthi can be used to manage a system. Once integrated, Hasthi keeps track of and controls the resources of the system and is capable of recovering from resource and management-infrastructure failures. Furthermore, by maintaining a global view of the system, Hasthi enables users to define their management rules naturally, similar to the way they would typically reason about the system. This chapter detailed the architecture and demonstrated it using a motivating usecase. In addition, this section revisits the assumptions made in this chapter and analyzes them to identify their ramifications on real-life usecases. The results associated with Hasthi hold if the communication is reliable, and there are two forms of errors in the communication media: transient errors and network partitions. The effects of the former can be mitigated by building retries into the heartbeat-sending code, which would cover transient errors for most practical cases. Furthermore, if failure detectors are available, they can be used to help with retry choices. The latter type of error is caused by network partitions, but since large-scale systems are typically backed by reliable and redundant networks, network partitions are rare and only caused by severe failures from which the chances of recovery are small. Nevertheless, if a partition occurs, unless information about network health is available, it is indistinguishable from service failures. There are two ways to approach the problem.


One is to compensate and try to run an independent system using the resources available in each partition. The other option is to do nothing until the partition is resolved. By default, Hasthi uses the first approach and automatically performs an election where each partition elects a coordinator, and if the partitions are connected later, the coordinators merge to form a single system. In the partitioned system, the coordinator in each partition will try to compensate for disconnected services, and the outcome will depend on the management rules. On the other hand, if forming independent systems in different partitions is undesirable, Hasthi can be extended to turn off control-loops until the partition is resolved. This method needs to detect partitions, and it can use network health information and a sudden loss of a significant portion of the resources in the system to detect a network partition. Furthermore, the above discussion assumes that a resource and the associated management agent do not fail independently. To approximate this assumption, tests can be built into agents that periodically perform active probes on the resource, notify failures, and turn off the agent if the resource has failed. On the other hand, to guard against independent agent failures, failure detection algorithms can be used. For example, if heartbeat messages are missing but the service is deemed healthy by failure detectors, this would suggest an agent failure, and to approximate atomic failures of the agent and the resource, either the agent can be restarted or the resource can be stopped. This chapter discussed the Hasthi architecture, and the following chapters demonstrate the claims we made in this chapter and also illustrate applications of Hasthi. The next chapter discusses the instrumentations supported by the framework, and the following chapters present a correctness proof of the design and empirical analysis results. Chapters 7, 8, and 9 discuss applications of Hasthi, and Chapter 10 concludes the discussion.

4 Managed Resources and Instrumentations

4.1 Introduction

As described in Chapter 3, Hasthi uses the Web Services Distributed Management (WSDM) specification [26] as the interface between itself and resources. Furthermore, the Hasthi project includes tools that help users expose their resources as WSDM resources. This chapter discusses those tools. As detailed in Chapter 2 under related work, each resource-to-be-managed has specific instrumentations, which can be used to measure its health and state at any given point of time. These instrumentations are defined at different levels, like hardware, the operating system, service containers, and applications. Hasthi uses those values to assess the health of the system and to decide on corrective actions. WSDM groups all instrumentations of a given resource together and exposes them through a well-known interface, which includes resource properties that expose the state of the resource, operations that expose management actions supported by the resource, and events that expose changes to the resource


state. We call a resource that is integrated with WSDM a “Managed Resource” or “Manageable Resource”. Hasthi includes a WSDM-runtime—an implementation of the WSDM specification—which aids users in converting their resources to manageable resources. This chapter describes the WSDM-runtime and identifies different approaches used to integrate resources with the runtime. Identifying, documenting, and implementing the different instrumentations are contributions of this chapter. The rest of the chapter is organized as follows. The second section discusses the WSDM-runtime provided by Hasthi, and the following section discusses its integration with resources-to-be-managed. The fourth section describes the implementation of management actions, and finally Section 4.5 concludes the discussion.

4.2 Hasthi WSDM-Runtime

The WSDM-runtime is a Web-service-based interface that allows outside authorities to perform the following actions via Web Service invocations.

1. Monitoring information of the resource is represented as resource properties, and the runtime collects that information from different instrumentations, enables users to query those properties, and sends events notifying changes to those properties.

2. Some resource properties, like resource configurations, are mutable, and the runtime enables managers or human administrators to change those properties via Web service invocations.

3. The resource may support management actions like shutting down a service, and the runtime enables users to execute such actions via Web service invocations.


Figure 4.1 presents the WSDM-runtime, which has many extension points. It also has a Web service interface, which is either a dedicated service or an existing service shared with the underlying managed resource. Let us explore how a management request is processed by the runtime. As shown by Figure 4.1, when a user sends a management request to a managed resource, it is directed to the WSDM-core, which verifies that it is targeted at this resource and searches for a capability that can support the request. Capabilities are the extensibility mechanism of the WSDM-runtime, and the idea was implemented earlier by the Apache Muse project [7]. In Hasthi, all functionalities of a managed resource are implemented as capabilities. A capability is a specific piece of code added via configurations, which locates zero or more properties, performs management actions, and generates events. For example, the metrics capability knows how to locate the metrics of the resource, such as the number of successful or pending requests or the memory consumption, and the shutdown capability knows how to shut down the resource. Furthermore, there are some capabilities that work in the background: for example, the heartbeat capability starts a thread that periodically sends heartbeats. Hasthi provides a set of default capabilities, but users may write their own custom capabilities to extend Hasthi.

Figure 4.1: Architecture of the WSDM-runtime


If a matching capability is found, the WSDM-core delegates request processing to the capability, which processes the request and returns the results. For processing requests, capabilities may use an interface called the “system-handle,” which knows the details of a specific resource type. Since the interface is resource-type specific, each resource type needs its own implementation of the system-handle interface. However, Hasthi provides system-handle interfaces for many resource types, and they are described in the next section. Nevertheless, users may extend the WSDM-runtime by writing their own system-handle interfaces. With this architecture, new functionalities can be added by adding new capabilities, and the WSDM-runtime can be integrated with a wide variety of resources by writing appropriate system-handles for those resources. We will present examples in the remainder of this chapter.
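As a sketch of what a user-supplied system-handle might look like, the interface and class below are hypothetical stand-ins (Hasthi's real system-handle interface is not reproduced here); the example bridges the runtime to a resource that reports its status through a plain status file.

import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hypothetical system-handle contract used only for this sketch.
interface SystemHandleSketch {
    Map<String, Object> readProperties() throws Exception;
    void shutdown() throws Exception;
}

public class StatusFileHandle implements SystemHandleSketch {
    private final File statusFile;

    public StatusFileHandle(String path) {
        this.statusFile = new File(path);
    }

    public Map<String, Object> readProperties() throws Exception {
        Map<String, Object> props = new HashMap<String, Object>();
        // a capability would expose these values as WSDM resource properties
        props.put("OperationalStatus", statusFile.exists() ? "Available" : "Unavailable");
        props.put("LastModified", statusFile.lastModified());
        return props;
    }

    public void shutdown() throws Exception {
        // resource-specific stop logic would go here
    }
}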

4.3 Instrumentation Levels

The previous section described the WSDM-runtime architecture, which enables users to convert resources into managed resources. By writing different capabilities and system-handles, the WSDM-runtime can be used to instrument a wide variety of resources, and this section looks at some of the extensions Hasthi has chosen to implement. Figure 4.2 illustrates the different management agents. Agents either reside in the same address space as the managed resource or reside outside of it. Since they are in the same address space, the first type can use method invocations to collect data, whereas the second type has to use remote invocations. As described before, each agent type is implemented by writing a specific system-handle, which acts as a bridge between the resource-to-be-managed and the WSDM-runtime. Each agent periodically reads metrics about the managed resource and exposes them as resource properties, generates events based on changes to those properties, and supports management actions. Table 4.1 presents

Figure 4.2: Different Types of Management Agents

the different properties exposed as well as the resources supported by each agent.

Let us look at each agent briefly.

Table 4.1: Properties exposed by Different Management Agents

Host Agent – supported resources: all hosts. Properties: Operational Status, Uptime Hours, Memory Usage (as a percentage), Swap Usage, Process Count, Thread Count, Load Average in the last 5 minutes, and data transferred via the network in the last 30 seconds.

In-Memory Agent – supported resources: most Web Services. Properties: Operational Status, Last Request Received Time, Number of Pending Requests, Number of Successful Requests, Number of Failed Requests, Last Response Time, Service Thread Count, Current Service File Descriptors, Max Response Time, Host, Type, and Service Memory Usage.

Polling Monitoring Agent – supported resources: any resource that exposes status via a web page. Properties: Operational Status, Number of XML Requests, and Start Time.

Log-Based Agent – supported resources: any resource that uses a supported logging framework (e.g. log4j). Properties: Operational Status, Current Service Memory Usage, Service Thread Count, Number of Warnings, and Number of Errors.

Process Monitor – supported resources: any Unix or Windows process. Properties: Operational Status, Process Memory Usage, and Process CPU Usage.

JMX Agent – supported resources: any resource that supports JMX (e.g. Tomcat). Properties: Current Service Memory Usage, Service Thread Count, and Current Service Open File Descriptors.

Script-Based Monitoring – properties: user-defined properties.

4.3.1 In-Memory Agent to Instrument Services

The in-memory agent is placed in the same address space as the resource-to-be-managed; it typically monitors Web Services and shares the same Web Service interface with the managed service. The agent intercepts all messages coming to the service and redirects all management messages to the underlying WSDM runtime. However, if a message is not


intended for the management code, the overhead induced is minimal. By intercepting messages, this agent collects statistics about requests, like their overhead, successes, or failures, and it may have access to some metrics about the resource, like CPU utilization, memory, and service-container statistics. Since the in-memory agent provides the most flexibility, it is the preferred method for instrumentation. The message interception is implemented using the extensibility mechanisms available with most Web Service frameworks. For example, the Axis2 Web Service container [4] enables users to write an extension called a module that injects code (Axis2 handlers or interceptors) into the Axis2 processing pipeline and, therefore, supports the Chain of Responsibility pattern [122]. In other words, modules enable users to register code with Axis2, and that code is invoked whenever a message is received or sent by Axis2. Indeed, the in-memory agent for Axis2 is implemented as an Axis2 Hasthi module, which consists of a handler that intercepts messages, redirects management messages to the WSDM runtime, and collects statistics about the service by inspecting non-management messages to the service. Consequently, the module can be integrated with an existing Axis2 service purely through Axis2 configurations without any changes to the service implementation. This feature has reduced the adaptation cost and made Hasthi accessible to a larger audience.
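A minimal handler of this kind, written against the standard Axis2 handler API, is sketched below; the test used to recognize management messages and the hand-off to the WSDM runtime are indicated only by comments, since the real module's classes are not reproduced here.

import java.util.concurrent.atomic.AtomicLong;
import org.apache.axis2.AxisFault;
import org.apache.axis2.context.MessageContext;
import org.apache.axis2.handlers.AbstractHandler;

public class MonitoringHandlerSketch extends AbstractHandler {
    private static final AtomicLong requestCount = new AtomicLong();

    public InvocationResponse invoke(MessageContext msgContext) throws AxisFault {
        String action = msgContext.getWSAAction();
        if (action != null && action.contains("wsdm")) {
            // redirect management messages to the WSDM runtime here
        } else {
            requestCount.incrementAndGet();   // statistics about ordinary traffic
        }
        return InvocationResponse.CONTINUE;   // let Axis2 continue its pipeline
    }
}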

4.3.2 Host Agent

The host agent monitors hosts, and as illustrated by Figure 4.2, it runs on the host-to-be-managed. The host agent periodically reads metrics about the host using an instrumentation framework called Sigar [14], which can extract metrics from both Windows and UNIX-based systems, and exposes those metrics as resource properties of the host.
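The following sketch shows the style of metric collection Sigar supports; the printed values roughly correspond to the host properties in Table 4.1, but this is not the host agent's actual code (and getLoadAverage(), for example, is only available on UNIX-like systems).

import org.hyperic.sigar.Mem;
import org.hyperic.sigar.Sigar;
import org.hyperic.sigar.SigarException;
import org.hyperic.sigar.Swap;

public class HostMetricsProbe {
    public static void main(String[] args) throws SigarException {
        Sigar sigar = new Sigar();
        Mem mem = sigar.getMem();
        Swap swap = sigar.getSwap();
        double[] load = sigar.getLoadAverage();          // 1, 5, and 15 minute averages
        System.out.println("Memory used % : " + mem.getUsedPercent());
        System.out.println("Swap used     : " + swap.getUsed());
        System.out.println("Load (5 min)  : " + load[1]);
        System.out.println("Process count : " + sigar.getProcList().length);
    }
}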

4.3.3 Polling Based Agent

This agent is used to monitor any resource that exposes its metrics via a web page. A specific system-handle (written for this agent) periodically reads the web page, extracts the metrics by parsing the response, and exposes them as resource properties.

4.3.4 Process Monitor

This agent monitors a UNIX or Windows process and, therefore, must run on the same host as the process. Using the Sigar [14] instrumentation framework, a specific system-handle periodically reads and exposes the metrics of the process.

4.3.5 JMX Based Agent

This agent is used to monitor any resource that supports Java Management Extensions (JMX) [90]. A specific system-handle written for this agent periodically reads properties from JMX and exposes them via WSDM.
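A sketch of the underlying mechanism is shown below; the JMX service URL is a placeholder, and the attributes read here are standard JVM MBeans rather than the exact properties Hasthi's JMX agent maps to WSDM.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxProbe {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbsc = connector.getMBeanServerConnection();

        CompositeData heap = (CompositeData) mbsc.getAttribute(
            new ObjectName("java.lang:type=Memory"), "HeapMemoryUsage");
        Integer threadCount = (Integer) mbsc.getAttribute(
            new ObjectName("java.lang:type=Threading"), "ThreadCount");

        System.out.println("Heap used    : " + heap.get("used"));
        System.out.println("Thread count : " + threadCount);
        connector.close();
    }
}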

4.3.6 Script Based Agent

This agent periodically executes a shell script, finds metrics by parsing the standard output generated by the execution, and exposes them as resource properties.
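One plausible realization is sketched below, assuming the script prints one name=value pair per line; neither the script location nor the output format is mandated by Hasthi.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

public class ScriptProbe {
    // runs the script and parses "name=value" lines from its standard output
    public static Map<String, String> run(String scriptPath) throws Exception {
        Process process = new ProcessBuilder("/bin/sh", scriptPath).start();
        Map<String, String> props = new HashMap<String, String>();
        BufferedReader out =
            new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line;
        while ((line = out.readLine()) != null) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                props.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        process.waitFor();
        return props;
    }
}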

4.3.7 Logging Based Agent

This agent monitors an application (a resource) by listening to the logging events generated by the application, and it can be used to monitor any application that uses a supported


logging framework. Our current implementation supports any resource that uses the Log4j framework [6], where the agent is implemented as a custom Appender, which is a part of the Log4j extensibility mechanisms. Whenever the application generates a logging event, it is received by the agent, which uses logging events to evaluate the health of the resource. This agent can be integrated with existing applications by adding a new Appender to the Log4j configurations. Consequently, users can integrate this agent through Log4J configurations—the log4j.properties file—without changing existing applications. Hence, the main advantage of this approach is the ease of integration with existing resources.
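The general shape of such an appender is sketched below; counting warnings and errors mirrors the corresponding properties in Table 4.1, but the class is illustrative and not Hasthi's actual appender. Registering an appender of this kind only requires an extra entry in the log4j.properties file, which is what makes this integration path non-intrusive.

import java.util.concurrent.atomic.AtomicLong;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Level;
import org.apache.log4j.spi.LoggingEvent;

public class HealthAppenderSketch extends AppenderSkeleton {
    private static final AtomicLong warnings = new AtomicLong();
    private static final AtomicLong errors = new AtomicLong();

    protected void append(LoggingEvent event) {
        if (event.getLevel().isGreaterOrEqual(Level.ERROR)) {
            errors.incrementAndGet();
        } else if (event.getLevel().isGreaterOrEqual(Level.WARN)) {
            warnings.incrementAndGet();
        }
    }

    public boolean requiresLayout() { return false; }

    public void close() { }
}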

4.4 Implementing Management Actions

Under related work, Chapter 2 described different approaches used to implement management actions, and this section describes the implementations of the management actions supported by Hasthi. As described in Chapter 3, management actions are used by management rules in Hasthi to perform changes to the system. Management actions in Hasthi are extensible; however, Hasthi provides a default set of actions, which are sufficient for most usecases. We group the supported actions into three categories: WSDM-runtime-based actions, shell-script-based actions, and user interactions. Figure 4.3 illustrates their implementations.

4.4.1 WSDM-Runtime Based Actions

The WSDM-runtime supports changing the resource properties of a resource and performing custom management actions. As illustrated in Figure 4.3(a), to execute such actions, rules send a message to the WSDM-runtime, which performs the action on the underlying resource.

Figure 4.3: Management Action Implementations


Changing resource properties is used to tune services, configure services, or change the structure of the system by adding or removing inter-service dependencies. For instance, the management logic may change the size of a thread pool in an event-processing service, which is an example of tuning services via resource property changes. Moreover, configuring other services to use a registry after starting it is an example of changing the system structure via resource property changes.

Among examples for custom management actions are shutting down a service, performing a functional ping to determine the health of a service, and initiating a checkpoint before performing a potentially damaging action.

4.4.2 Shell Scripts Based Actions

Most management actions can be written as shell scripts, and as illustrated in Figure 4.3(b), when these actions are fired, the corresponding scripts are executed either by a host agent running on each host or through a remote invocation mechanism like SSH or Grid job-manager services. Among the examples of shell-script-based actions are deploying and starting a service, restarting a service, relocating a service, and shutting down a service. If a service needs to preserve its state, it should expose its storage location as a resource property. In this setting, if the service has failed and is being recovered, Hasthi passes that storage location as a parameter to the shell command that recovers the failed service. We provide special treatment for Tomcat-based services, for which users only need to provide the location of the Tomcat installation. In this case, Hasthi finds and parses the Tomcat configuration file (the server.xml file) and determines all configurations required to start and stop Tomcat. This task is performed by a host agent running on the same host where Tomcat is or should be running.

4.4.3 User Interactions

Finally, a user-interaction-based action performs an interaction with a human user, retrieves input from the user, and runs a custom action that depends on that input. These actions are used to incorporate user inputs into the decision process. As illustrated by Figure 4.3(c), a user-interaction action is triggered by management rules evaluated by a manager, and the action sends an email that contains a web form, which presents details and asks the user to provide inputs and make selections. When the form is submitted by the user, either the browser or the HTML-capable email client submits the form back to the manager as an HTTP request, and the request is mapped to a


REST-based Web service invocation in the manager. An event correlator, which resides in the manager, identifies the event and executes the code registered with the user interaction, passing the user inputs as parameters. For example, when a resource has failed but the cause of the error cannot be determined, Hasthi may send an email to a user asking him for his input. Subsequently, he can choose between shutting down, restarting, or manually fixing the resource by clicking an associated link, and his choice is transferred to the associated logic when he clicks submit in the form. The following code segment shows a user-interaction action within a rule. When the rule condition is met, Hasthi executes the user-interaction action, and when the user responds by submitting the form, the “OnetimeEventCallback” is invoked. The user inputs are represented by the Hasthi event object, which is passed into the callback.

Listing 4.1: A User Interaction Rule

rule "NotifyUnknownError"
when
    // condition to detect errors
then
    system.invoke( new UserInteractionAction( admin@lead.iu.edu, <html form>,
        new OnetimeEventCallback() {
            public void eventOccuered( HasthiEvent event ) {
                // react based on user input
            }
        } ) );
end

4.5 Summary

We discussed the implementation of WSDM-runtime, different options for its integration with resources-to-be-managed, and implementations of management actions in Hasthi.


We believe the WSDM-runtime and its applications are useful for designers of manageable resources.

5 Proof Of Correctness

5.1 Introduction

This chapter presents a theoretical analysis of the underlying algorithm of Hasthi, which we call the “Manager-Cloud Algorithm”. The analysis proves its robustness, dynamic nature, and delta-consistency properties and derives the availability of the resulting management framework. Sections 5.2 and 5.3 describe basic definitions, assumptions, and the algorithm. Section 5.4 presents a proof of its correctness, and Section 5.5 applies the results to Hasthi. Section 5.6 derives the availability of a management framework that uses the algorithm, and finally, Section 5.7 discusses the impact of the results while revisiting the assumptions and discussing their ramifications.


5.2 System Definition

5.2.1 Basic Definition and Notations

Listed below are the basic sets that define the environment: NAMES – the set of all names, VALS – the set of all values, T – the set of all points of time, and ∆T – the set of all time periods. Furthermore, we define a map, a data structure that allows us to store and look up data via a key, and we define the set of all maps as MAPS = {m ⊆ NAMES × VALS | (k, v1), (k, v2) ∈ m =⇒ v1 = v2}. When m ∈ MAPS, we define the following three functions on maps.

value(m, k) = {v ∈ VALS | (k, v) ∈ m}   (5.2.1)

keys(m) = {k ∈ NAMES | ∃v ∈ VALS s.t. (k, v) ∈ m}   (5.2.2)

values(m) = {v ∈ VALS | ∃k ∈ NAMES s.t. (k, v) ∈ m}   (5.2.3)

5.2.2 Basic System Definition and Notations

A system consists of resources that work together to perform useful tasks, and they communicate only by passing messages to each other. Each resource performs local processing and message interactions with other resources. We write RES to denote the set of all resources. Each message has a name and values, and messages are sent from a resource (the sender) to one or more resources (the receivers). The process of sending a message is called a message interaction. Furthermore, we represent a message as a tuple, where the first entry is the message name and the others are the values contained in the message. For example, the message (CoorHb, c) has “CoorHb” as the name and c as its value. We write M to denote the set of all possible messages, where M ⊂ NAMES × 2^VALS.


Furthermore, we define a message interaction as a tuple (r1, m, r2), where r1, r2 ∈ RES and m ∈ M, and we write r1 −m→ r2. Message interactions take two forms: from a sender to a receiver, and from a sender to all resources in the system (broadcast). We define the following operations to denote both modes.

1. send : RES × M → (M ∪ φ), where send(r, m) = m̂ sends the message m to the resource r and may return a response message m̂.

2. broadcast : M → (2^M ∪ φ), where broadcast(m) = mr ⊂ M sends the message m to all resources in the broadcast set Brbag and optionally returns a set of responses.

5.2.3 Representing State

We assume all resources expose their state as name-value pairs, which we will call properties. For example, the WSDM specification [26] exposes the monitoring information about a resource as properties. Furthermore, when r ∈ RES, p ∈ NAMES, and t ∈ T, we define the following functions to represent r's state.

I(r) = s ∈ MAPS, where s is the state of r   (5.2.4)

I(r, p) = φ, or v ∈ VALS such that (p, v) ∈ I(r)   (5.2.5)

Furthermore, we extend these functions to represent the values of I(r, p) and I(r) captured at a given time.

I(r, p, t) = I(r, p) at the time t   (5.2.6)

S(r, t) = I(r) at the time t   (5.2.7)


5.3 Manager-Cloud Algorithm

We described Hasthi and its architecture in Chapter 3. The manager-cloud, which consists of managers and a coordinator elected among them, provides the architectural basis for Hasthi. As illustrated in Chapter 3, the algorithm binds managers, a coordinator, and management agents running in each resource together as a single cohesive unit. We call the algorithm “Manager-Cloud Algorithm”.

5.3.1 Managed System Definition

We define a managed system at the time t0 as a tuple (C, M, R, B, t0), where the sets C, M, R, and B correspond to the coordinator, managers, resources, and bootstrap nodes respectively. All components are placed in a complete network consisting of M ∪ R ∪ B, and the broadcast set, which defines the recipients of a broadcast, is Brset = M ∪ B.

Figure 5.1: State and Lifecycles of Components

Let us briefly look at resources, managers, and bootstrap nodes.




1. Resources (R ⊆ RES) – R represents the set of all resources managed by the system. The state of each resource r is comprised of a property called manager, which points to an assigned manager, and the resource-specific state represented by S(r). For compactness, the property manager is represented by m in Figure 5.1. Furthermore, the resource has a control-loop, which either periodically sends heartbeat messages if a manager has been assigned or, otherwise, periodically sends “ManageMe” messages to a bootstrap node until a manager is assigned.

2. Managers (M ⊆ RES, but M ∩ R = φ) – M represents the set of all managers. Each manager has the properties coordinator, which points to the coordinator, isCoordinator, which is set only if the manager is also the coordinator, age, which counts the number of timesteps elapsed since the start, and Rbag, which tracks the status of the resources assigned to itself. The properties coordinator and isCoordinator are represented by C and isC in Figure 5.1. Using the algorithm described in Listing 3, Rbag guarantees that it contains the most recent resource snapshots (sent with heartbeats) received from each resource assigned to this manager. In other words, it provides the following guarantee:

I(m, Rbag, t) = {S(r, t1) | r −(ResHb, S(r, t1))→ m, t1 < t, and (ResHb, S(r, t1)) is the most recent heartbeat sent by r and received by m by time t}   (5.3.1)

Furthermore, each manager has a control-loop that periodically sends a heartbeat message to the coordinator if a coordinator is known and starts an election to elect a new coordinator if the coordinator has failed. Moreover, the control-loop removes any resource snapshots of resources that have not sent a heartbeat message for a given timeout interval.


3. Coordinator (C = {ci ∈ M | I(ci, isCoordinator) = true}) – The coordinator is a manager whose isCoordinator property is set to true, and it has the properties Mbag, which tracks the state of the managers that have joined the coordinator, and m2r, a mapping from managers to resources. Using the algorithm described in Listing 4, Mbag guarantees that it contains the most recent manager snapshots (sent with heartbeats) received from each manager. In other words, it provides the following guarantee:

I(c, Mbag, t) = {S(m, t1) | m −(MngHb, S(m, t1))→ c, t1 < t, and (MngHb, S(m, t1)) is the most recent heartbeat sent by m and received by c by time t}   (5.3.2)

Furthermore, the coordinator periodically broadcasts coordinator heartbeats, and its control-loop periodically removes any manager snapshots of managers that have not sent a heartbeat message for a given timeout interval. A coordinator resigns if it receives a “Nominated” or “CoorHb” message from a better coordinator.

4. Bootstrap Nodes (B) – These nodes run on well-known addresses. Others use them as the entry point to the system, and their main responsibility is to forward messages to the coordinator. Each bootstrap node has a property called coordinator, which holds the current coordinator address (c in Figure 5.1).

Figure 5.1 illustrates life cycles and the state of each component, and states of components are defined in terms of aforementioned properties. The figure shows how states change when certain messages are received or sent.


5.3.2 Constants in a Managed System

As described above, managers, the coordinator, and resources have control-loops that are executed periodically once every em, ec, and er respectively. Each manager control-loop sends heartbeat messages to the coordinator, and if heartbeats from a manager are missing for tm, the entry for the manager is removed from the coordinator's Mbag; we call the associated time the manager timeout period. Similarly, each resource sends heartbeat messages to managers, and entries for resources are removed if heartbeat messages are missing for tr. Using these two scenarios, we define the following constants about the management system.

1. top – an upper bound for the time taken by a send-receive operation.

2. tbroadcast(n ∈ N) – an upper bound for the time taken to perform a broadcast operation in a broadcast network with n nodes.

3. tr, tm – resource and manager timeout periods.

4. er, em, ec – resource, manager, and coordinator heartbeat intervals (epoch times).

These constants are subject to the following restrictions.

4top < er < er + 5top < tr   (5.3.3)

em + top < tm   (5.3.4)

4top < tm + ec   (5.3.5)
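As a purely illustrative check of these restrictions, the values top = 1s, er = 5s, tr = 12s, em = 10s, tm = 12s, and ec = 10s (not values prescribed by Hasthi) satisfy all three:

4(1) = 4 < 5 and 5 + 5(1) = 10 < 12, so restriction (5.3.3) holds;
10 + 1 = 11 < 12, so restriction (5.3.4) holds;
4(1) = 4 < 12 + 10 = 22, so restriction (5.3.5) holds.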

5.3.3 Algorithm Pseudocode

The Manager-Cloud Algorithm is illustrated in the following four listings, which depict the logic for four types of components in the algorithm. Each component is written as a


class and has two special methods: start() and receive(). The first method is executed at initialization. The second, receive(), is executed when a component receives a message, and case statements in the method select a code segment based on the message type and execute it. Furthermore, the special call Timer.schedule(demon_loop(), t) periodically executes demon_loop() once every t time. The algorithm was described in Chapter 3.

Listing 5.1: Bootstrap Code

Bootstrap {
  coordinator = null;

  receive() {
    case (ManageMe, r):
      if (coordinator != null) {
        send(coordinator, (ManageMe, r))
      }
    case (CoorHb, c):
      checkAndSet(c);
    case (Nominated, m):
      checkAndSet(m);
    case (GetMng):
      return (GetMng, coordinator)
  }

  synchronized checkAndSet(newc) {
    send((CheckHealth));
    if (coordinator == null || I(coordinator, rank) > I(newc, rank)) {
      coordinator = newc;
    } else {
      try { send(coordinator, (Ping)); } catch (e) { coordinator = newc; }
    }
  }
}

Listing 5.2: Resource Code

Resource {
  manager = null;

  start() {
    Timer.schedule(demon_loop, er);
  }

  demon_loop() {
    try {
      if (manager != null) {
        send(manager, (ResHB, S(self, now)));
      }
    } catch (e) {
      manager = null;
    }
    if (manager == null) {
      send(bootstrap, (ManageMe, self));
    }
  }

  receive() {
    case (IamAssigned, m):
      if (manager == null) {
        I(self, manager) = m;
      } else {
        try { send(manager, (Ping)) }
        catch (e) { I(self, manager) = m; }
      }
  }
}

Listing 5.3: Manager Code

Manager {
  age = 0; starting = true; Rbag = {}
  coordinator = null;
  rank = unique_rank()
  amIcoordinator = false;

  start() {
    Timer.schedule(demon_loop, em);
  }

  demon_loop() {
    age = age + em
    Rbag = Rbag - {S(r, t) | (tnow - t) > tr}
    try {
      if (coordinator == null) {
        (GetCoord, cnew) = send(b, (GetCoord));
        coordinator = cnew
      }
      if (coordinator != null) {
        send(coordinator, (mngHb, S(self, now)));
      } else if (age > nb * ec + em AND starting == true) {
        amIcoordinator = true;   // become coordinator
      }
    } catch (e) {
      if (e == (ResignedError, newcoordinator)) {
        coordinator = newcoordinator
      } else {
        coordinator = null; starting = false;
      }
    }
    synchronized (self) {
      if (coordinator == null) {
        nominations = broadcast((nomination));
        bestcandidate = Max{nominations};
        (InviteRes, tc) = send(bestcandidate, (Invite));
        coordinator = tc;
      }
    }
  }

  receive() {
    case (assign, r):
      send(r, (IamAssigned, m));
    case (CoorHb, c):
      synchronized if (coordinator == null || I(coordinator, rank) > I(c, rank)) {
        coordinator = c;
      } else {
        try { send(coordinator, (Ping)); } catch (e) { coordinator = c; }
      }
    case (nomination):
      return (nomination, self, rank)
    case (Invite):
      synchronized if (coordinator != null && coordinator != self) {
        return (InviteRes, coordinator);
      } else if (amIcoordinator == false) {
        broadcast((Nominated, self));
        amIcoordinator = true;
      }
      return (InviteRes, self);
    case (ResHB, S(r, t)):
      Rbag = (Rbag - {S(r, t̂) | t̂ ∈ T}) ∪ S(r, t)
    case (Nominated, m):
      synchronized if (I(coordinator, rank) > I(m, rank)) {
        coordinator = m;
      }
    case (MngHB, S(m, t)):
      return (ResignedError, coordinator)
    case (Ping):
      return (Ping)
  }
}

Listing 5.4: Coordinator Code

Coordinator {
  start() {
    Timer.schedule(demon_loop, em);
  }

  Mbag = {}
  m2r = {}

  demon_loop() {
    rmbag = {S(m, t) | (tnow - t) > tm}
    Mbag = Mbag - rmbag
    m2r = m2r - {(m, r) | m ∈ keys(rmbag), r ∈ R};

    broadcast(msg_coorhb);

    // other work
    age++;
  }

  receive() {
    case (mngMe, r):
      synchronized {
        if (r ∉ values(m2r)) {
          assignedm = select_manager();
        } else {
          assignedm = value(m2r, k)
        }
        send(assignedm, (assign, r));
        m2r = m2r ∪ (assignedm, r)
      }
    case (CoorHb, c):
      if (I(self, rank) > I(c, rank)) {
        amIcoordinator = false;
      }
    case (MngHB, S(m, t)):
      Mbag = (Mbag - {S(m, t̂) | t̂ ∈ T}) ∪ S(m, t)
  }
}

5.3.4 Terminology

For simplicity, when r ∈ R and m ∈ M, we write r ∈ I(m, Rbag) when there exists a t s.t. S(r, t) ∈ I(m, Rbag).

Definition. Let (C, M, R, B, t) be a managed system. We say that the manager-cloud (C, M, B) is consistent at the time t if the following conditions are met.

• The cloud has one coordinator, and it has the best rank.
• All active managers and bootstrap nodes know about the coordinator.
• All active managers have joined the manager-cloud.
• All failed managers have been removed from the manager-cloud.

Definition. Let (C, M, R, B, t0) be a managed system. At the time t0, if the following conditions are met, we say that the resource r ∈ R is ∆t-consistent.

• The manager-cloud is consistent at the time t0.
• No active resource r ∈ R is assigned to two managers at the time t0.
• Given that the manager-cloud will remain consistent in the future t > t0 and communication failures do not happen, at least one of the following is true.

1. The resource r has already been assigned to a manager, and as long as the resource is active, its state snapshot (S(r, tsnapshot) ∈ I(m, Rbag, tf)) will never lag more than ∆t behind tf (tf − tsnapshot < ∆t). Furthermore, if the resource has failed, the state snapshot will be removed within ∆t.

2. The resource r will be assigned to a manager within a ∆t time of its start, and after being added, it will continue to exhibit the first property.


It is worth noting the future tense used in the second part of the third condition. We say that the system is consistent if it has the potential to exhibit the expected behavior in the absence of failures. For example, even if a resource has just started but not yet joined the resource cloud, we say it is t-consistent given that it will join the system within a time t and then continue to exhibit Property 1.

Definition. Let (C, M, R, B, t) be a managed system. We say the system is healthy at the time t if and only if

• the manager-cloud is consistent, and
• every r ∈ R is (tr + em)-consistent.

5.4 Proof

This section proves that there exists a constant time th , which is a function of constants we defined in Section 5.3.3 and the manager-cloud size, such that regardless of the initial state, if managers do not fail and communication failures do not happen for a continuous th period, after the th interval, the manager-cloud is and will continue to be healthy as long as managers or the coordinator do not fail and communication failures do not occur. Furthermore, we will derive a similar result for recovery from manager failures and discovery of new managers and resources. We make the following assumptions regarding the system and the environment.

5.4.1 Assumptions

• Assumption 1: All resources are in an active or failed state (crash failure semantics). If a resource has failed, it will never come back to an active state, will not generate

5. Proof Of Correctness

93

any messages, and any send() operation trying to send messages to that resource will fail. • Assumption 2: Given that nodes do not join or leave the broadcast network, if a node broadcasts the same message repeatedly, there is a number nb such that every node in the broadcast network will receive the message within nb retires. • Assumption 3: There is always at least one bootstrap node available, and all resources know its address. For the proof, we use two tools. The first, assuming top is the maximum time taken by a service invocation, we argue that in the absence of failures, some events that occur in the system, like a new resource join, give rise to forced sequences of other conditions within a bounded time. The second, by considering all possible initial states, we argue that those force sequences happen regardless of the initial state, thus establishing that the recovery will always happen within a bounded time. Subsections 5.4.2 to 5.4.5 prove the behavior of resources when the manager-cloud is healthy, the recovery behavior of managers, the recovery of the coordinator, and the recovery behavior of the system respectively. Finally Section 5.4.6 puts all the results together by showing that, regardless of its initial state, a system will go through steps of recovery where it will first have one coordinator, then a consistent manager cloud, and finally, a healthy system. In self-stabilization [52] proof techniques, this technique is called convergence stairs.

5.4.2   Resource Behavior

Proposition 5.4.1 Let Sys(t0) = ({c}, M, R, {b}, t0) be a managed system at the time t0, and assume that no communication failures happen after t0. Let the resource r ∈ R be up during the time interval [ts, te], ts < t0 < te, and fail just after te. Then the following conditions are true.


1. For any m and t1 ∈ [t0, te] such that r ∈ I(m, Rbag, t1), I(r, manager, t1) = m, and m does not fail
=⇒ for all t2 ∈ [t1, te], r ∈ I(m, Rbag, t2), I(r, manager, t2) = m, and there exists an S(r, t̂) such that S(r, t̂) ∈ I(m, Rbag, t2) and t2 − t̂ < er + top.

2. If the Manager-Cloud of Sys(t) is consistent for all times t ≥ t0, then regardless of the initial state of r at the time t0, r is assigned to some m ∈ M for all t ∈ [t0 + (er + 5top), te]. (We say r is assigned to m if I(r, manager, t) = m and either r ∈ I(m, Rbag) or r ∈ I(m, Rbag) will happen within a er + top time.)

3. For all t > te + (tr + em), r ∉ I(m, Rbag, t)

4. For any t > t0 + em + tr, active manager m ∈ M, and active resource r ∈ R, r ∈ I(m, Rbag, t) =⇒ I(r, manager, t) = m

[Figure 5.2: Resource Time line]


Part 1: An inspection of the code Resource:6-30 shows that if I(r, manager) = m, it changes only if a heartbeat from r to m fails. However, since m does not fail and communication failures do not happen, heartbeats do not fail. Therefore, in this setting, I(r, manager) does not change once it is set to a manager. On the other hand, let us consider Figure 5.2, which shows the timeline of a resource. Let us assume that the resource sends the first heartbeat at th. In this setting, as long as r is alive, it sends a heartbeat once every er [the code Resource:7-16]; therefore, it sends the xth heartbeat (ResHb, S(r, t(x))) at t(x) = th + x·er as long as t(x) < te. Since all messages are processed within a top time, every heartbeat sent at t(k) is received and processed by the time t(k) + top. It is given that for some t1 ∈ [ts, te], r ∈ I(m, Rbag, t1); therefore, the property Rbag of m has an entry for r at the time t1. Only the code Manager:13 removes entries (resource snapshots) from the Rbag of managers, and it is executed once every em; let us assume that the code is executed at the time t3 > t1. Let tlast be the timestamp of the last resource heartbeat sent by r and received by m before t3; then, by the definition given in Equation (5.3.1), S(r, tlast) ∈ I(m, Rbag, t3). However, it follows from the code Manager:13 that the entry S(r, tlast) is removed only if t3 − tlast > tr. We shall prove that this never happens. Consider any t(k) such that t1 < t(k) < t3, where t(k) is the time the kth heartbeat was sent. An example is shown in Figure 5.2. Then, according to the definition of division and remainder, there exist some integer n and ∆t such that ∆t < er, n ≥ 0, and the following condition is true.

t3 − t(k) = n·er + ∆t    (5.4.1)

=⇒ t3 − t(k + n) = ∆t, since t(x) is an additive function    (5.4.2)


Since r is active, it continues to send heartbeats at times t(k + 1), t(k + 2), . . . , and within a top time, m receives each heartbeat and adds it to I(m, Rbag). Let us consider the following cases based on the sizes of ∆t, k, and n, and we will show in each case that t3 − tlast < er + top.

1. If ∆t ≥ top, then m has received the t(k + n) heartbeat message by t3; thus, tlast ≥ t(k + n). However, ∆t < er, and substituting both of these results into Equation (5.4.2) yields t3 − tlast ≤ ∆t < er.

2. Else if ∆t < top and k + n > 0, then by t3, m has received the earlier heartbeat sent at t(k + n − 1) because er > top. Hence, tlast ≥ t(k + n − 1). However, ∆t < top, and substituting both of these results into Equation (5.4.2) yields that t3 − t(k + n − 1) = (∆t + er) =⇒ (t3 − tlast) ≤ (∆t + er), since tlast ≥ t(k + n − 1), =⇒ (t3 − tlast) < (top + er), since ∆t < top.

3. Otherwise (that is, k + n = 0 and ∆t < top), t(0) = th, and it follows from Equation (5.4.2) that t3 − th = ∆t. In this setting, r has sent only one heartbeat message, and that was at th. Since it is given that the entry has already been added to the Rbag of m, m has received at least one heartbeat. Therefore, tlast ≥ th. Hence, t3 − th = ∆t =⇒ t3 − tlast ≤ ∆t < er, since tlast ≥ th and ∆t < er.

In all 3 cases, the upper limit is t3 − tlast < er + top, and this proves the second part. Furthermore, by the definition of constants (e.g. Equation (5.3.3)), er + top < tr; therefore, t3 − tlast < tr.


Consequently, while r is up, the entry S(r, tlast) is never removed. This completes Part 1. ∎

Part 2: Let us show that regardless of the initial state, if r does not fail and the manager-cloud is consistent, then r joins the resource cloud within a constant time. It is given that the manager-cloud is consistent, bootstrap nodes, managers, and the coordinator do not fail, and communication failures do not occur. Therefore, the following message sequences are forced, and once initiated, they proceed to the end. We shall refer to the line numbers of the algorithm next to each step. At a t0 > ts, r may be in one of the following states. We shall prove that each state gives rise to a forced sequence that adds the resource to the cloud within a fixed time.

1. Case I(r, manager, t0) = null: Then within an er time, r sends a ManageMe message to b [the code Resource:8-10], and b forwards the ManageMe to the coordinator [the code Bootstrap:5-7]. Subsequently, the coordinator assigns r to some manager m [the code Coordinator:20-27], and the manager notifies r [the code Manager:42-43]. Since each operation finishes within a top, the sequence finishes within an er + 4top time. Furthermore, since r will send a heartbeat to m within er and an entry for r will be added to I(m, Rbag) within er + top, by definition, r is assigned to m by that time.

2. Case I(r, manager, t0) = m ∉ Active(m, ts): In this case, within the next er, r tries to send a heartbeat and it fails, and subsequently, it sets I(r, manager) = null [the code Resource:7-13]. Now r sends a ManageMe message, and similar to case 1, r will join within a 4top time from this point. Therefore, r will join within er + 5top.

3. Case I(r, manager, t0) = m ∈ Active(m, ts): The manager property of r is set to m only through the above sequence, and we know m is active. Hence, r is already assigned to m.


Let tjoin = er + 5top; then, in all three cases, r is assigned to m. Furthermore, it follows from Part 1 that once added, r is never removed unless it has failed. Furthermore, since m does not fail, I(r, manager) does not change either. This completes the proof of the second part. ∎

Part 3: Let us prove that if a resource has failed, it is removed from the assigned manager within a tr + em time. If r fails at te, it does not send any heartbeat messages after that point. Let m be the manager to whom r is assigned. If there is no such m, r is already removed and our proof is complete. Let t1 > te + tr + em. Each manager executes the cleanup code once every em [the code Manager:13]. Therefore, at some t4 ∈ [t1 − em, t1], m has executed the cleanup code. Then t4 > t1 − em, which implies t4 > te + tr (by substituting t1 > te + tr + em). On the other hand, if the resource is assigned to m at t4, there is an entry in the Rbag of m: S(r, t2) ∈ I(m, Rbag, t4) for some t2. However, since r failed at te and does not generate any messages after the failure, t2 < te. These two results, t2 < te and t4 > te + tr, imply t4 − t2 > tr. As a result, the manager cleanup code removes S(r, t2) from I(m, Rbag) at the time t4, and since r has failed and does not generate messages, it does not get added back. This completes the proof of Part 3. ∎

Part 4: Let us show that if communication failures do not occur for an em + tr time, then r ∈ I(m, Rbag, t) =⇒ I(r, manager) = m for all future t, as long as communication failures do not happen. Let t1 ≥ t0 + em + tr. We make the following two observations.


Observation 1: r ∈ I(m, Rbag, t) only if r has sent a heartbeat to m. However, an inspection of the manager code shows that a heartbeat is sent from r only after I(r, manager) is set to m.

Observation 2: An inspection of the code Resource:7-26 shows that I(r, manager) = m changes only if either a heartbeat or a ping message to m has failed. Since communications do not fail, either of those messages fails only if m has failed. However, it is given that m does not fail. Therefore, after t0, I(r, manager) will not change once it is set to a manager.

To prove the result, we will consider the following four cases at the time t0. However, since we only need to prove the forward direction, we shall only consider cases where r ∈ I(m, Rbag, t1) at the time t1.

1. Case r ∈ I(m, Rbag, t0) and I(r, manager, t0) ≠ m: Unless I(r, manager) is set to m, r will not send heartbeats to m. Therefore, as shown by Proposition 5.4.1(3), the entry for r will be removed from I(m, Rbag) by t1; therefore, r ∉ I(m, Rbag, t1). On the other hand, the only case where the entry for r is not removed is if I(r, manager) is set to m and a heartbeat was sent from r to m before the entry for r is removed. Furthermore, by Observation 2, once set to m, I(r, manager, t0) does not change. Therefore, in this case, r ∈ I(m, Rbag, t1) and I(r, manager, t1) = m, and the result still holds.

2. Case r ∈ I(m, Rbag, t0) and I(r, manager, t0) = m: It follows from Part 1 that r ∈ I(m, Rbag, t1) and I(r, manager, t1) = m.

3. Case r ∉ I(m, Rbag, t0) and I(r, manager, t0) = m: Within er, r will send a heartbeat to m, and within another top, an entry for r will be added to I(m, Rbag). Furthermore, as shown by Part 1, the entry for r will not be removed, and by Observation 2, I(r, manager) will not change.


Hence, I(r, manager) = m and r ∈ I(m, Rbag, t) at all t > t1.

4. Case r ∉ I(m, Rbag, t0) and I(r, manager, t0) ≠ m: If r ∉ I(m, Rbag, t) at any point of time t > t0, then it does not affect the condition we are trying to prove. Otherwise, if r ∈ I(m, Rbag, t) at any time t, then by Observation 1, I(r, manager) must have been assigned to m before t. However, by Observation 2, I(r, manager) does not change after being assigned to m, and therefore, r ∈ I(m, Rbag, t) =⇒ I(r, manager) = m.

This completes the proof. ∎
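Although not part of the thesis, the bounds in Parts 1 and 3 are easy to sanity-check with a toy discrete-time simulation. In the following Java sketch, all numeric values are arbitrary examples chosen so that er + top < tr (as required by the constants of Section 5.3.3), and message delivery is assumed to always take the worst-case top.

    // Toy discrete-time check of Proposition 5.4.1 (Parts 1 and 3). Not from the thesis;
    // parameter values are illustrative only.
    public class StalenessCheck {
        public static void main(String[] args) {
            final int er = 30, em = 30, top = 1, tr = 62;  // seconds (illustrative)
            final int te = 500;                            // resource fails just after te
            Integer snapshotTime = null;                   // send time of newest heartbeat held by m
            int maxStaleness = 0, removalTime = -1;

            for (int now = 0; now <= 700; now++) {
                int sendTime = now - top;                  // a heartbeat sent at sendTime arrives now
                if (sendTime >= 0 && sendTime <= te && sendTime % er == 0) {
                    snapshotTime = sendTime;               // manager stores S(r, sendTime)
                }
                if (now % em == 0 && snapshotTime != null && now - snapshotTime > tr) {
                    snapshotTime = null;                   // manager cleanup (the code Manager:13)
                    removalTime = now;
                }
                if (now <= te && snapshotTime != null) {
                    maxStaleness = Math.max(maxStaleness, now - snapshotTime);
                }
            }
            System.out.println("max staleness while alive = " + maxStaleness
                    + "  (bound er + top = " + (er + top) + ")");
            System.out.println("entry removed at t = " + removalTime
                    + "  (bound te + tr + em = " + (te + tr + em) + ")");
        }
    }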

5.4.3   Manager Consistency

Proposition 5.4.2 Let Sys(t0) = ({c}, M1, R1, {b}, t0) be a managed system with a single coordinator at the time t0, where bootstrap nodes know about the coordinator, the coordinator does not fail, and communication failures do not happen. Then any new manager joins the coordinator within a 2top time.

Proof According to Assumption 3, the manager knows about a bootstrap node, and once started, it sends a (GetCoord) message to the bootstrap node, and the bootstrap node, which knows the coordinator, responds back with the current coordinator address. Subsequently, the new manager sends a heartbeat to the coordinator and joins the coordinator. This process takes two invocations and completes within a 2top time. ∎

Proposition 5.4.3 Let Sys(t0) = ({c}, M, R, {b}, t0) be a managed system with one coordinator c that does not fail, where communication failures do not happen, and the manager m ∈ M is up in the interval [ts, te] but fails just after te. Then the following conditions are true.


1. For some t1 ∈ [t0, te], m ∈ I(c, Mbag, t1) and I(m, coordinator, t1) = c
=⇒ for all t ∈ [t1, te], m ∈ I(c, Mbag, t) and I(m, coordinator) = c, and there is an S(m, tlast) such that S(m, tlast) ∈ I(c, Mbag, t) and t − tlast ≤ em + top.

2. If both m and the bootstrap node know about the coordinator, then for all t ∈ [t0 + (em + top), te], m ∈ I(c, Mbag, t) and I(m, coordinator) = c.

3. For all times t > te + (tm + ec), m ∉ I(c, Mbag, t).

Proof Part 1: The proof is similar to the proof of the first part of Proposition 5.4.1.

Part 2: Let t1 > t0 + em + top. If the manager m knows about the coordinator, it sends a heartbeat message within the next em, and the coordinator adds an entry to I(c, Mbag) [the code Coordinator:33]. This happens within a top time from when the heartbeat was sent, and therefore, the entry is added by t1. Furthermore, according to Part 1, once added, the entry for m is never removed from I(c, Mbag), and since the coordinator does not fail and communication failures do not happen for all t < te, all future heartbeats from m to c will be successful; therefore, I(m, coordinator) does not change. Consequently, for all t ∈ [ts + em + top, te], m ∈ I(c, Mbag, t) and I(m, coordinator) = c.

Part 3: The proof is similar to the proof of Proposition 5.4.1, Part 3.

5.4.4   Election

Proposition 5.4.4 Let Sys(t0) = (C, M, R, {b}, t0) be a managed system at the time t0, where communication failures do not happen, mbest is the manager with the best rank at t0, and every new manager who joins the system after t0 has a lesser rank. Then, regardless of the state of the system at t0, the following conditions hold.

1. If mbest does not fail, within an em + 3top + 2tbroadcast(|M| + 1) time period, mbest will be a coordinator and continue to be one as long as it does not fail.

2. If mbest has become a coordinator and managers do not join or leave the system for a max(tm + ec, nb × ec) time period, mbest will become the only coordinator, and all active managers and bootstrap nodes will know about it. Also, all failed managers will be removed from the coordinator mbest by that time.

Part 1: Each manager m executes the manager loop once every em. If I(m, coordinator) is not null, the manager loop sends a heartbeat to the coordinator, and if the heartbeat failed, it sets I(m, coordinator) to null. On the other hand, if I(m, coordinator) is null, it starts a nomination, selects the next coordinator from responses to the nomination, and invites the selected manager to become the coordinator. By inspecting the manager code, we can make the following assertions about managers.

Fact 1: I(m, coordinator) is set to null only within the manager loop [the code Manager:14-29], which happens only if sending a heartbeat has failed, and since communication failures do not occur, this happens only if the coordinator has failed.

Fact 2: A manager never accepts a coordinator with an inferior rank to itself [the code Manager:44-47, 60-63], and it will never resign from the coordinator position for a coordinator with an inferior rank [the code Coordinator:29-32].


At the time t0, mbest is the manager with the best rank, and because all new managers will have lesser ranks, mbest will continue to be the manager with the best rank as long as it is up. Therefore, according to Fact 2, mbest neither accepts another coordinator nor resigns from the coordinator position if it has become one. Since mbest never accepts a coordinator with a lesser rank than itself, mbest is in one of the following states at t0. Let us consider each case and show that it will become a coordinator.

1. Case mbest is already the coordinator: Nothing to prove; go to the next.

2. Case mbest is not the coordinator and I(mbest, coordinator) = cfail, which has failed: Then within the next em, it will try to send a heartbeat, the heartbeat will fail, and I(mbest, coordinator) will be set to null. Go to case 3.

3. Case mbest is not the coordinator and I(mbest, coordinator) = null: mbest starts the nomination process by broadcasting a nomination message, and any node that receives it responds with its own rank. mbest searches for a candidate among the responses and itself, and since it has the best rank, it will choose and invite itself regardless of other responses. When it receives the nomination, it broadcasts a nominated-message and becomes the coordinator. While the nomination process is in progress, mbest may receive an invitation from someone else and become the coordinator earlier, and that yields the same outcome. Furthermore, it might receive nominated messages or heartbeat messages for other coordinators, but since it does not accept a lesser coordinator, it neglects all those messages.

Once it becomes the coordinator, it never accepts another coordinator or resigns (from Fact 2). Therefore, it will be a coordinator as long as it is up, and the longest sequence takes an em + 3top + 2tbroadcast(|M| + 1) time (due to two broadcasts). This completes the first part of the proof. ∎


Part 2: Having completed Part 1, the system has a set of coordinators C, |C| ≥ 1, every coordinator repeats a heartbeat message (CoorHb, c) once every ec, and mbest is the coordinator with the best rank. It follows from Assumption 2 that every manager and bootstrap node receives (CoorHb, mbest) within a telect = max(tm + ec, nb × ec) time.

1. Due to the code Coordinator:29-32, on the reception of (CoorHb, mbest), every other coordinator in the system resigns. By the code Manager:45-47, every manager m sets I(m, coordinator) = mbest because mbest is the live manager with the best rank.

2. Assume some manager m other than mbest received an invitation to become the coordinator. Since the code Manager:44-47, 60-63, and 50-56 are synchronized, m receives and processes (CoorHb, mbest) either before or after processing the invitation. If the invitation is processed before, m becomes a coordinator but resigns on the reception of (CoorHb, mbest). Otherwise, m ignores the invitation due to the condition at the code Manager:61. Therefore, after the reception of (CoorHb, mbest), no other manager m will stay as or become a coordinator.

3. As explained by Fact 1, a manager sets I(m, coordinator) to null if and only if the current coordinator has failed or it has resigned. But mbest does not fail or resign. Therefore, after receiving (CoorHb, mbest) and setting I(m, coordinator) = mbest, managers do not set their I(m, coordinator) to null.

4. Inspecting the code Manager:60-63, 44-47 shows that every manager m updates the property I(m, coordinator) in such a way that the new coordinator has a better rank than the old one, but the current coordinator mbest has the best rank. Therefore, every manager will have I(m, coordinator) = mbest after the reception of (CoorHb, mbest), and will not accept any other coordinator after that point. Consequently, after a telect time, I(m, coordinator) does not change for any manager m. Furthermore, since heartbeats to mbest do not fail, new elections do not start.


Based on the above 1-4, it follows that within a telect time period, every other coordinator resigns, no new coordinators are elected, and all managers have I(m, coordinator) = mbest. Furthermore, for any manager m, the property I(m, coordinator) does not change after that point of time. Therefore, after a telect time, mbest will be the only coordinator, and other managers will know about it. Since the broadcast is received by all nodes, the bootstrap nodes that are in the broadcast group also receive the (CoorHb, mbest) message and, therefore, will know about the current coordinator. Hence, after a telect time, there is only one coordinator and all managers and bootstrap nodes know about it. Since managers do not fail after t0, if mbest becomes a coordinator after t0, it does not have failed managers. Otherwise, if mbest was the coordinator at t0, by Proposition 5.4.3(3), all failed managers are removed within a tm + ec time period, where tm + ec ≤ telect. Therefore, in both cases, mbest does not have any failed managers after telect. This completes the proof.

5.4.5   System Consistency

Proposition 5.4.5 Let Sys(t0) = ({c}, M1, R1, {b}, t0) be a managed system with a single coordinator, where all managers and bootstrap nodes know about the coordinator and communication failures do not happen.

1. If managers do not join or leave the system, the manager-cloud reaches a consistent state within tm + ec and remains consistent.

2. If managers do not join or leave the system and all failed managers had been removed from the coordinator by the time t0, the manager-cloud reaches a consistent state within em + top and remains consistent.


Proof It follows from Proposition 5.4.3(2) that all managers will join the manager-cloud in an em + top time, and according to Proposition 5.4.3(1), they will not leave once joined. Furthermore, it follows from Proposition 5.4.3(3) that any failed managers are removed within a tm + ec time. Therefore, the manager-cloud is consistent by t0 + tm + ec and continues to be consistent. On the other hand, it follows from Propositions 5.4.3(1) and 5.4.3(2) that if failed managers do not need to be removed, the system becomes consistent within an em + top time period. This completes the proof. ∎

Proposition 5.4.6 Let Sys(t0) = ({c}, M1, R1, {b}, t0) be a managed system where the manager-cloud of the system is consistent at the time t0 and communication failures have not happened since t0 − em − tr. If the manager-cloud will remain consistent and communication failures will not happen for all t > t0, then after t0 + er + 5top, all resources of the system are (tr + em)-consistent.


to at most one manager for t > t0 . Furthermore, since the manager-cloud is consistent and communications failures do not occur, r ∈ Rjoin that joins after t0 will be assigned to only one manager. 2. Since the manager-cloud is consistant after t0 , by Perposition 5.4.1(2), any new resource that joins after t0 will be assigned to a manager within a er + 5top < ∆t, and by Perposition 5.4.1(1), any changes to the resource will be reflected in the manager within a ∆t time. Therefore, those resource are ∆t-consistent for all t > t0 . 3. According to Preposition 5.4.1(3), any failed resource will be removed within a ∆t time from their failure; therefore, all failed resources are ∆t-consistent for all t > t0 . 4. Let r ∈ Rt1 . If a manager is not already assigned to r, it follows from Preposition 5.4.1(2) that the resources will be assigned to a manager within a er + 5top . After r is assigned to a manager (say m), an entry for r will be added to I(m, Rbag ), and it follows from Preposition 5.4.1(1) that as long as the resource is active, its state snapshot in m, (S(r, tsnapshot ) ∈ I(m, Rbag , tf )), will never lag more than ∆t from tf for any tf (that is tf − tsnapshot < ∆t). Therefore, all r is ∆t-consistent for all t > t0 + er + 5top . These conditions complete the requirements for ∆t consistency for all resources. This completes the proof. 

5.4.6   Final Results

We shall bring together all the results under the following two theorems.

Theorem 5.4.7 Let Sys(t0) = ({c}, M1, R1, {b}, t0) be a system managed by the Manager-Cloud Algorithm.


Then there exist values trstart = em + 3top + 2tbroadcast(|M| + 1) and theal = 6top + em + er + max(nb × ec, tm + ec, tr + em) such that regardless of the system state at t0, if managers do not join or fail and communication errors do not happen, the system will reach a healthy state within a trstart + theal time period and will remain in the healthy state as long as managers do not join or leave and communication errors do not happen. The theorem has a stronger form.

Remark If the manager with the best rank does not fail before or on the time t0 + trstart , and after t0 + trstart managers do not join or fail for a theal time period, then regardless of its state at t0 , the system will reach a healthy state by t0 + trstart + theal and remain in the healthy state as long as managers do not join or leave the system and communication failures do not happen.

Proof We will prove the stronger form of the theorem, and the relaxed version trivially follows from it. Let trstart = em + 3top + 2tbroadcast(|M| + 1), and assume the manager with the best rank does not fail and that after ts = t0 + trstart, managers do not join or leave. Let t1 > ts + max(tm + ec, nb × ec, tr + em); then, by Proposition 5.4.4, there will be one coordinator at the time t1, all managers and bootstrap nodes will know about the coordinator, and all failed managers will have been removed from the coordinator. Let t2 > t1 + em + top; then, by Proposition 5.4.5, at t2, the manager-cloud will be consistent and remain so as long as managers do not join or leave the system. Let t3 > t2 + er + 5top. The manager-cloud is consistent and communication errors have not happened for the last em + tr time period, and therefore, by Proposition 5.4.6, Sys(t) is ∆t-consistent for all t > t3, and theal = t3 − ts = 6top + em + er + max(nb × ec, tm + ec, tr + em). This completes the proof. ∎


Theorem 5.4.8 Let Sys(t0) = ({c}, M1, R1, {b}, t0) be a system managed by the Manager-Cloud Algorithm, where the system is healthy at the time t0 and the coordinator does not fail. However, by the time t1, Mjoin managers and Rjoin resources have joined and Mfail managers and Rfail resources have failed. Then there exists a value trecovery = er + 5top + max(tm + ec, tr + em) such that if managers do not join or leave and communication failures do not happen for a trecovery time period after t1, then the system reaches a healthy state and remains healthy as long as managers do not join or leave the system and communication failures do not happen.

Proof It is given that managers do not join or leave the system after t1. Let t2 > t1 + max(tm + ec, tr + em).

1. By Proposition 5.4.3(3), all failed managers (Mfail) are removed from the manager-cloud by t2.

2. By Proposition 5.4.3(1), active managers remain in the manager-cloud at t2.

3. By Proposition 5.4.2, all new managers (Mjoin) have joined the manager-cloud by t2.

Therefore, from 1-3, it follows that all failed managers are removed, new managers are added, and the others remain in the manager-cloud. In other words, at the time t2, all active managers have joined the coordinator and all failed managers have been removed. Since the coordinator does not change, bootstrap nodes know about the coordinator. This completes the requirements for manager-cloud consistency, and therefore, the manager-cloud is consistent at t2.

Let t3 > t2 + er + 5top. The manager-cloud is consistent and communication errors have not happened for the last em + tr time, and therefore, it follows from Proposition 5.4.6 that Sys(t) is ∆t-consistent for all t > t3. The time to recovery is t3 − t1 = er + 5top + max(tm + ec, tr + em). This completes the proof. ∎

5.5   Application to Hasthi

This section extends the above results to the Hasthi architecture described in Chapter 3. As proved in Proposition 5.4.1, if the system is healthy, every resource r has a copy of its state stored in the Rbag of the assigned manager m, and that copy is updated within er + top. Therefore, the copy exhibits delta-consistency. Consequently, the Rbag acts as the partial meta-model described in Chapter 3. Furthermore, Hasthi has defined the function S(m, t), which is the state snapshot for a manager m, to include a summary of each entry in I(m, Rbag, t), and therefore, the Mbag of the coordinator, which keeps these snapshots, becomes a collection of those resource summaries that includes all information that should be stored in a summarized meta-model of the system described in Chapter 3. Using Mbag, we define the summarized meta-model as follows. (Here S(r, t̂) is a state snapshot of some resource.)

Definition I(c, Rsum, t) = {S(r, t̂) | there is Rbag ∈ I(c, Mbag, t) such that S(r, t̂) ∈ Rbag}

In this setting, we make the following observation.

Corollary 5.5.1 Let ({c}, M, R, {b}, t0) be a managed system at the time t0, where for all times t > t0 the system is healthy, communication failures do not happen, and we define S(m, t), which is the state snapshot for a manager m, to include a summary of each entry in I(m, Rbag, t). Then, given any resource r ∈ R and for all t1 > t0 + er + em + 6top, all changes to r are reflected in Rsum within a tr + em + top time period.

Proof Periodically, each manager sends I(m, Rbag, t) to the coordinator with heartbeats, and it is added to the Mbag of the coordinator. Since S(m, t) = I(m, Rbag, t), as described before, the Rsum defined using Mbag acts as a summarized meta-model of the system. Given a resource, there are three stages of its life-cycle: creation, changes, and failure. Let us see how changes are propagated to Rsum in each case.


1. According to Proposition 5.4.1(2), for any resource, a snapshot for the resource is added to the Rbag of the assigned manager within an er + 5top time period. Subsequently, since the system is healthy, there is a coordinator, and the code Manager:20 sends a snapshot of Rbag to the coordinator within an em time, where the coordinator adds the snapshot of Rbag to the Mbag. This operation finishes within a top time, and therefore, an entry for r is added to I(c, Mbag) within an er + 6top + em time period. Consequently, by definition, the entry becomes a part of I(c, Rsum).

2. According to Proposition 5.4.1(1), if r is active, changes to r are reflected in I(m, Rbag) within an er + top time. Similar to the first part, changes are sent to the coordinator within an em time period and are received and added to I(c, Mbag) within another top. Therefore, changes to an active resource are reflected in I(c, Rsum) within er + em + 2top.

3. On the other hand, if r has failed, according to Proposition 5.4.1(3), the resource is removed from I(m, Rbag) within a tr + em time. Furthermore, as shown in the proof of 5.4.1(3), this update is done by the manager control loop. Therefore, once I(m, Rbag) is updated, the manager logic that immediately follows the logic that removes the resource [the code Manager:20] sends the updated I(m, Rbag) to the coordinator. Hence, the removal of the resource propagates to I(c, Mbag) within a tr + em + top time, and according to the definition of Rsum, the resource leaves the Rsum as a result.

Therefore, any changes to a resource are reflected in the summarized meta-model within a max(er + em + 2top, tr + em + top) = tr + em + top time period. ∎

Furthermore, using rules, each manager evaluates its Rbag once every em, and the coordinator evaluates its Rsum once every ec. Both carry out corrective actions triggered by rule evaluations. Therefore, the following observation holds.


Remark Changes to a resource are evaluated by a manager within tr + em, and evaluated by the coordinator loop within a tr + em + ec + top.

While the recovery is taking place, the system may have more than one coordinator; therefore, the system may have more than one control-loop at a time. To mitigate this problem, each coordinator waits for a theal = 6top + er + em + max(nb·ec, tm + ec) time to start the control-loop. This is the time period in the proof where the best coordinator waits for other coordinators to resign, and it is already a part of the derived recovery time. In this setting, we make the following observation.

Corollary 5.5.2 Given that the broadcast is intact despite failures, if each coordinator runs a control-loop from its demon thread but only starts it after waiting for a theal time, two control-loops do not exist in the system simultaneously.

Proof If there are two coordinators running control-loops, both have been the coordinator for a theal time. However, after they become the coordinator, coordinators broadcast heartbeat messages periodically, and according to Assumption 2, both have received heartbeat messages from the other coordinator. Therefore, one of them must have resigned. Hence, this is a contradiction. ∎

5.6   Availability of the Manager-Cloud Algorithm

This section calculates the availability of the management framework resulting from the algorithm. Let us assume a system managed with n managers, each manager having an MTTF (Mean Time To Failure) of θ. Furthermore, let us assume that managers are independent and communication failures do not happen. Therefore, we can use an exponential distribution to model their failures, which, according to Srinivasan [113], is the most common approach used to model failures in a system.


Furthermore, according to Srinivasan [113], modeling failures with an exponential distribution means that given a component with an MTTF of θ and a time period ∆t, the probability that failures do not occur in that ∆t time period (the reliability) is given by the following equation.

R(∆t) = e^(−∆t/θ), ∆t > 0    (5.6.1)

Furthermore, the availability is defined by the following equation.

Availability(A) = MTTF / (MTTF + MTTR)    (5.6.2)

Although the MTTF of Hasthi can be easily derived from θ, to calculate availability, we have to calculate the MTTR (mean time to recovery) for Hasthi. The theorems we proved describe the recovery behavior, and using these theorems, let us derive the MTTR of Hasthi as a function of the MTTF of a manager (θ). Since the theorems present upper bounds, the results will yield an upper bound for the MTTR, and we shall use that to derive a lower bound on the availability of Hasthi. As described by the two theorems we derived, recovery in Hasthi has two cases: recovering from manager failures and recovering from coordinator failures; let us denote the MTTR in these cases as hm and hc respectively. Assuming both the coordinator and managers have the same MTTF, manager failures happen n − 1 times more often, and therefore, we can calculate the MTTR using the following equation.

MTTR = (hc + (n − 1)·hm) / n    (5.6.3)

The first theorem says that regardless of the initial state, if managers do not fail for a continuous time interval r, the system will be healthy. The second says that regardless of the initial state, if managers do not fail for a continuous time interval m, the system will recover from manager failures and additions. To derive hc and hm in terms of m and r, let us define the following random variable associated with this problem.

Definition NF(x) = the time that elapses until the first continuous time interval of length x with no manager failures occurs.

If x is given in seconds, then for an x-long continuous time interval without an error to happen, x one-second intervals without an error should occur back to back, and if a failure occurs, we have to restart the counting. This is analogous to the event of x continuous HEADs occurring with a biased coin, and for x continuous HEADs to occur, we most probably need more than x throws of the coin. Once an r or m continuous time interval occurs, the system recovers. Since hc and hm are defined as mean values (e.g. MTTR), we can calculate both of them as expected values of NF(x), where hc = E[NF(r)] and hm = E[NF(m)]. Modeling NF(x) as a discrete process, we can show that E[NF(x)] is given by the following equation; the derivation of the equation is given in Appendix D. Here, x is measured in seconds, all MTTFs are measured in seconds, and p is the probability that a manager is healthy throughout a time period of a second. Hence, the expected value is also in seconds.

E[NF(x)] = (1 − p^x) / (p^x(1 − p))    (5.6.4)
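The full derivation of Equation (5.6.4) is given in Appendix D; one standard way to obtain the same expression (sketched here, and not necessarily the route taken in the appendix) is via a recurrence on E_k, the expected remaining waiting time after k consecutive failure-free seconds:

    E_x = 0, \qquad E_k = 1 + p\,E_{k+1} + (1-p)\,E_0 \quad (0 \le k < x)

    E_0 = \bigl(1 + (1-p)E_0\bigr)\,\frac{1-p^x}{1-p}
    \;\Longrightarrow\; E_0\,(1-p)\,p^x = 1 - p^x
    \;\Longrightarrow\; E[NF(x)] = E_0 = \frac{1-p^x}{p^x(1-p)}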

As derived by Baumann [39], the MTTF of n independent managers is given by the MTTF of one manager divided by n. Hence, by applying these results to Equation (5.6.1) for R(t), p can be calculated as follows.

p = e^(−(n−1)/θ)    (5.6.5)

Therefore

E[NF(x)] = (1 − e^(−x(n−1)/θ)) / (e^(−x(n−1)/θ) (1 − e^(−(n−1)/θ)))    (5.6.6)

hc = (1 − e^(−r(n−1)/θ)) / (e^(−r(n−1)/θ) (1 − e^(−(n−1)/θ)))    (5.6.7)

hm = (1 − e^(−m(n−1)/θ)) / (e^(−m(n−1)/θ) (1 − e^(−(n−1)/θ)))    (5.6.8)

Applying these results to Equation (5.6.2) and Equation (5.6.3) yields the following equations.

Availability(A) ≥ (θ/n) / ((θ/n) + (hc + (n − 1)·hm)/n)    (5.6.9)

Availability(A) ≥ θ / (θ + (1 − e^(−r(n−1)/θ)) / (e^(−r(n−1)/θ)(1 − e^(−(n−1)/θ))) + (n − 1)·(1 − e^(−m(n−1)/θ)) / (e^(−m(n−1)/θ)(1 − e^(−(n−1)/θ))))    (5.6.10)
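As a quick numerical illustration (not from the thesis), the following Java sketch evaluates Equations (5.6.4)-(5.6.10) for example inputs. The values of θ and n below are arbitrary choices, and r and m are taken from the 30-second-epoch column of Table 5.1.

    // Evaluates the availability lower bound of Equation (5.6.10) for sample inputs.
    // All times are in seconds; theta is the MTTF of a single manager.
    public class AvailabilityBound {

        // E[NF(x)] from Equation (5.6.4), with p = exp(-(n-1)/theta) per Equation (5.6.5).
        static double expectedFailFreeWait(double x, double n, double theta) {
            double p = Math.exp(-(n - 1) / theta);
            double px = Math.pow(p, x);
            return (1 - px) / (px * (1 - p));
        }

        public static void main(String[] args) {
            double theta = 30L * 24 * 3600;   // example: one-month MTTF per manager
            double n = 10;                    // example: 10 managers
            double r = 221, m = 125;          // recovery intervals for 30-second epochs (Table 5.1)

            double hc = expectedFailFreeWait(r, n, theta);   // Equation (5.6.7)
            double hm = expectedFailFreeWait(m, n, theta);   // Equation (5.6.8)
            double mttr = (hc + (n - 1) * hm) / n;           // Equation (5.6.3)
            double mttf = theta / n;                         // MTTF of n managers
            double availability = mttf / (mttf + mttr);      // Equations (5.6.2) and (5.6.9)

            System.out.printf("MTTR upper bound: %.1f s, availability lower bound: %.6f%n",
                    mttr, availability);
        }
    }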

To understand these results, let us see how the availability behaves for different values. Assuming top is 1 second and nb = 4, Table 5.1 presents values of m and r derived using Theorems 1 and 2. The first row shows different epoch times, and each column shows the size of the continuous time interval required to recover from both manager and coordinator failures when the system is set up with the epoch time associated with the column.


Epoch times                                                  30 seconds    10 seconds    5 seconds
m (required continuous time interval to recover
from manager failures)                                       125           45            25
r (required continuous time interval to recover
from coordinator failure)                                    221           81            46

Table 5.1: Recovery Times Against Epoch times

Availability is defined as the percentage of time a system is up and running. The availability of a system is categorized under availability classes defined by Gray et al. [68]. These classes are named from "unmanaged" to "ultra-available" systems based on the number of nines in the availability.

[Figure 5.3: Availability of Hasthi. (a) Unavailability vs. MTTF of a Manager (hours), plotted for 30-second, 10-second, and 5-second epochs; (b) Effect of Coordinator Recovery: Coordinator MTTR / Overall MTTR vs. Number of Managers, for manager MTTFs of a day, a week, a month, and 3 months.]

By applying the m and r values given in Table 5.1 to Equation (5.6.9), Figure 5.3 shows how the availability of the management framework changes with the MTTF of a single manager. Since a log scale easily enables us to identify different availability classes, for simplicity of presentation, we have plotted unavailability (1 − A) instead of availability (A).


As shown by Figure 5.3(a), the availability of Hasthi is guaranteed to be better than the "managed" or "well-managed" classes for moderate MTTF values (e.g. less than one month MTTF for a manager), and the "well-managed" and "fault-tolerant" classes for higher MTTF values (e.g. higher than 6 months MTTF for a manager). It is worth noting that the X axis represents the MTTF of a manager, i.e., the MTTF of a single unit, not of a system. With reliable hardware, a few months or a year of unit MTTF is not unrealistic; therefore, Hasthi can potentially reach the "well-managed" and "fault-tolerant" classes. We believe this result is significant because most practical systems rarely do better than the "well-managed" and "fault-tolerant" classes. For example, the "well-managed" class has 9 hours of downtime per year and the "fault-tolerant" class has one hour of downtime per year. Furthermore, it is worth noting that the calculated availability is a lower bound, and actual values may be higher. We will revisit these results in the conclusion section of the thesis.

Furthermore, Figure 5.3(b) shows the contribution of the coordinator MTTR to the MTTR of the management framework, which is a measure of the effect of coordinator recovery overhead on the system recovery. As illustrated in the figure, the contribution of the coordinator recovery time rapidly decreases with the number of managers, and it is not affected by the MTTF values of a manager. Therefore, with a large number of managers (> 50 managers), the availability can be approximated by the following simpler version of the above equation.

Availability(A) = θ / (θ + (n − 1)·hm)    (5.6.11)

5.7   Discussion

In this chapter, we proved that given a system managed with the "Manager-Cloud Algorithm," there exists a constant th for that system such that regardless of the initial state of the system, if managers do not join or leave and communication failures do not happen for a continuous th time interval, then after the th interval, Hasthi is and will continue to be healthy as long as the aforementioned errors do not occur. Furthermore, we derived the availability of the resulting framework. This section revisits those results in order to understand their implications and to analyze the assumptions we made in their derivations.


We believe the first result has two significant properties. According to Dolev [52], we say a system is self-stabilizing if the system reaches a safe state regardless of the initial state and continues to be in a safe state. Therefore, the first result shows that Hasthi is self-stabilizing, which is the first property. Furthermore, the result shows that in the absence of failures, the system will self-stabilize in a constant time, which we call "the constant time recovery property". Taken together, these two properties are powerful because they guarantee that the manager-cloud is never left in an inconsistent state.

As we demonstrated, the system recovery requires two conditions: during recovery, managers should not join or leave, and communication failures should not happen. Let us consider their ramifications. The first condition associated with the algorithm is that the system recovers only if managers do not join or leave for a continuous time period of a given length, which we call the fail-free interval. However, we can show using Equation (5.6.6) that the average increase of the actual recovery time due to the need for a fail-free interval is marginal. For example, if we calculate the recovery time using Equation (5.6.3), even with an MTTF of 100 hours per manager (roughly one failure every four days), a fail-free interval of 250 seconds occurs within 259 seconds on average, and this is only a small increase (about 4%). Therefore, we can argue that the requirement for a failure-free continuous time interval only marginally increases the actual recovery time, and hence does not impose significant limitations.

Furthermore, the recovery assumes the absence of communication failures, and we have discussed this problem in the discussion section of Chapter 3. On the other hand, the state-independent-recovery property guarantees that even if a communication failure has happened, once the communication failure is fixed, the system will recover. Consequently, rare communication errors would not impose significant limitations.


We calculated the availability of Hasthi using the above results, and we make the following observations. Even with a moderate number of managers, we observed in Figure 5.3(b) that the contribution of the coordinator MTTR to the final MTTR value is small (e.g. 10% with 10 managers, 1% with 100 managers). Hence, the effect of the election overhead on the availability is also small. Moreover, we say Hasthi is available when it is healthy. We observed in Figure 5.3(a) that Hasthi is available 99%-99.99% of the time, and the exact value is decided based on the MTTF of a manager. Furthermore, as shown in the remark with Corollary 5.5.1, it is guaranteed that Hasthi will respond to a failure in the system within 4 epoch times, which ranges from 20 seconds to 2 minutes depending on the epoch time. We believe these numbers are a significant improvement for most systems, especially considering that without management, recovery may take hours.

On the other hand, the proof has made three assumptions about the managed system. Let us briefly discuss the assumptions and their ramifications on real-life usecases. The first assumption is about crash failures. Since the algorithm is relatively simple to implement, we may be able to approximate the crash failures of managers via careful coding. On the other hand, each resource includes a management agent that exposes the management state and sends heartbeats, which could fail independently. We have discussed mitigating this problem in the discussion section of Chapter 3.

The second assumption states that if nodes do not leave or join the communication network, a broadcast message that is periodically repeated will be delivered to all the active nodes within a constant time. This assumption is motivated by the lazy repair available in some communication topologies.


For example, P2P networks (e.g. FreePastry [10]) repair their underlying routing tables periodically, and if nodes are not added or removed for some time, these networks recover fully, given that no partitions have occurred. Hence, given some time, the network will recover, and broadcasts done on the recovered network will be successful. In fact, we have used P2P networks for the implementation of Hasthi and, therefore, believe that this assumption is a reasonable one. The P2P broadcast implementation we used in Hasthi is successful with high probability, and we are yet to encounter a broadcast failure in our tests. Therefore, in our implementation of Hasthi, we approximate the algorithm by having each coordinator send out heartbeats only in the first four ec periods after it is selected, and then send a heartbeat once every 10 ec after that. The broadcast algorithm uses controlled flooding, and therefore, even if some managers fail during a broadcast, the broadcast will be successful with very high probability. Hence, the above approximation should be sufficient to make sure that the system has one coordinator with very high probability. Sending a heartbeat once every 10 ec should handle network partitions.

The third assumption states that each resource and manager knows an active bootstrap node address, and to support this case, we can use a set of bootstrap nodes where each resource or manager is given a list of bootstrap nodes at startup via configuration. In this setting, if one bootstrap node has failed, another one from the list is used. Furthermore, the only state held by a bootstrap node is the current coordinator address, and every bootstrap node finds the current coordinator from the periodic coordinator heartbeats.

In the final analysis, we have presented the "Manager-Cloud Algorithm", a proof of its correctness, and an analysis of the availability of the management framework built using the algorithm. Furthermore, we have shown that the management framework manages the system 99%-99.99% of the time and that the election overhead has only a small effect on the availability. Moreover, we identified two significant properties: the self-stabilization property and the constant time recovery property. Also, we revisited the assumptions used for the proof and argued that they can be reasonably approximated in real-life usecases.


Finally, in conclusion, we argue that these results demonstrate the soundness of the underlying algorithm of Hasthi and Hasthi's applicability to real-life settings.

6 Empirical Analysis

The goal of this chapter is to empirically measure and establish scalability, recovery, and sensitivity to operational parameters of Hasthi. To that end, it presents a series of experiments and their results. Hasthi targets large-scale systems that have thousands of nodes. Since those systems are complex and resource hungry, designing the experiment was a major challenge. For our experiments, we needed both a suitable test system and a workload, which are both simple enough to explain and light enough to operate with a moderate amount of resources. Using a large-scale e-science workflow system as a basis, we have designed a scalability benchmark, which is one of the contributions of this chapter.

The rest of the chapter is organized as follows. The first section explains the experiment setup, and its subsections illustrate the workload, dependent and independent parameters, associated configurations, and the test environment respectively. Experiments are discussed under five sections: the first experiment studies scalability, and the second studies sensitivity to management workloads, epoch time periods, and rule complexity. The third studies the behavior of elections and recovery, the fourth experiment compares and contrasts Hasthi with a different management framework, and finally, the last experiment tests Hasthi applied to a real-life usecase.


With each experiment result, we highlight our observations and discuss their implications. Finally, we revisit all the results in the discussion section, weaving them together to arrive at system-wide ramifications.

6.1   Experiment Setup

6.1.1   Workload

As noted in the introduction, designing a workload was one of the main challenges of this study, and this subsection describes the workload. A large-scale management workload includes two parts: a test system composed of thousands of resources and management scenarios to manage those resources. Let us look at both these aspects of the proposed workload.

While observing systems that are managed using management frameworks, we made the following observations: A management framework sees a managed resource (a resource being managed by the framework) as an external entity whose state changes with time, and the framework monitors that state and performs corrective actions to keep the resource within acceptable bounds. Furthermore, instrumentation exposes the state of each resource as resource properties, which we call the management state of a resource, and since management actions change the target resource's state, their outcome is also perceived as changes to resource properties.

Using these observations, we have developed a "Test-Service," which mimics the management state and the behavior of a typical managed service using a randomized algorithm. Specifically, using the randomized algorithm, the Test-Service simulates receiving requests, their outcome (e.g. their success or failure), the time taken to process requests, and service failures.


Consequently, the Test-Service has resource properties like Operational Status, Last Request Time, and Number Of Pending Requests, which are calculated using the aforementioned randomized algorithm. Appendix A illustrates the algorithm. Furthermore, we integrated the WSDM runtime described in Chapter 4 with Test-Services, which exposes their resource properties according to the WSDM specification [26]; therefore, Test-Services can be managed using Hasthi. In this setting, since Test-Services mimic the management state and the behavior, the management framework does not see any difference between a Test-Service and a regular managed service. Therefore, we used Test-Services to build a management workload.

Our test system was motivated by an e-science project called Linked Environments for Atmospheric Discovery (LEAD) [53]. Comprised of transient application services that are created on demand and persistent services (e.g. registry, workflow engine) that provide the execution environment, the LEAD project enables meteorologists to wrap command-line weather applications as Web Services, compose those services to create workflows, and execute workflows to forecast weather. The test system, which models a large-scale deployment of LEAD, is composed of self-sufficient replica units, each containing a complete copy of the LEAD stack, and a group of front-end servers that distribute requests among replica units. Since replica units do not share data with each other, this setup is said to follow the Shared Nothing Architecture [115]. We modeled this test system using the aforementioned Test-Services, in which Test-Services mimic different roles in the LEAD architecture. By varying the number of replica units, we changed the scale of the system, and we performed tests using those different sized systems.

Since each Test-Service has real-life semantics, we can write meaningful management rules to manage the system, and we have defined the following management scenarios for managing the test system.

1. If a persistent service fails, create a new service to replace it.


2. If the number of transient services of a particular type is low, create new instances to compensate.

3. If a transient service is overloaded, remove it from the service registry, and later add it back when the service has recovered.

4. Shut down old transient services.

5. After processing 10 requests, if a service has generated more faulty responses than successful ones, decide it is faulty and shut down the service.

Scenarios 1, 2, and 3 were implemented as global rules, and 4 and 5 were implemented as local rules. All management rules can be found in Appendix B.
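The actual Test-Service algorithm is given in Appendix A; the following Java sketch only illustrates the general idea behind the Test-Service described above, a service whose management state drifts under a randomized algorithm. All property names, probabilities, and thresholds here are invented for illustration and are not taken from the thesis.

    import java.util.Random;

    // Illustrative only: a stand-in for a managed service whose management state
    // (operational status, pending requests, last request time) evolves randomly.
    class FakeTestService {
        private final Random random = new Random();
        private String operationalStatus = "Available";
        private int pendingRequests = 0;
        private long lastRequestTime = System.currentTimeMillis();

        // Called once per simulated epoch to mimic request arrivals, completions,
        // and occasional failures.
        void tick(double failureProbabilityPerTick) {
            if (random.nextDouble() < failureProbabilityPerTick) {
                operationalStatus = "Crashed";       // simulated service failure
                return;
            }
            int arrivals = random.nextInt(5);        // 0-4 new requests this tick
            int completions = Math.min(pendingRequests + arrivals, random.nextInt(5));
            pendingRequests = pendingRequests + arrivals - completions;
            if (arrivals > 0) {
                lastRequestTime = System.currentTimeMillis();
            }
            operationalStatus = pendingRequests > 50 ? "Overloaded" : "Available";
        }

        // These getters stand in for the resource properties a manager would read.
        String getOperationalStatus() { return operationalStatus; }
        int getNumberOfPendingRequests() { return pendingRequests; }
        long getLastRequestTime() { return lastRequestTime; }
    }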

6.1.2   Factors and Metrics

Let us briefly revisit the architecture of Hasthi to understand the associated Factors, Metrics, and Configurations. Factors denote independent parameters under human control, Metrics denote dependent or measurable parameters, and Configurations denote fixed settings of the test system that stay fixed throughout the tests.

As explained in Chapter 3, Hasthi consists of a coordinator, managers, and resources, where resources send heartbeats to managers and managers send heartbeats to the coordinator. Managers and the coordinator have control-loops, which periodically wake up to perform book-keeping and evaluate the health of the system. In this setting, the following metrics can be directly measured while Hasthi is in operation.

1. Resource-heartbeat latency – Resources send heartbeat messages to the manager, and the manager processes the message and sends back a response. This Metric is the latency from the time the heartbeat is created until the response is received for the heartbeat. This is a measure of the cost associated with receiving a resource heartbeat and updating the meta-model.


2. Manager-control-loop overhead – Hasthi activates manager control-loops periodically, and this is the latency from the start to the end of a manager-loop execution.

3. Manager-heartbeat latency – Managers send heartbeat messages to the coordinator, and the coordinator processes the message and sends back a response. This is the latency from the time the heartbeat is created until the response is received for the heartbeat. This is a measure of the cost associated with the coordinator receiving manager heartbeats and updating the summarized meta-model.

4. Coordinator-control-loop overhead – The coordinator periodically activates the coordinator control-loop, and this is the latency from the start to the end of a coordinator-loop execution.

5. Memory consumption of managers and the coordinator.

In a Hasthi deployment, the following independent parameters (Factors) are under the control of the test designer, and experiments measure the aforementioned metrics while changing the following Factors.

1. Number of resources in the system.

2. Number of managers in the system.

3. Coordinator, Manager, and Resource epoch time periods – once in this period, each entity either sends heartbeats or executes control-loops. We refer to all three as epoch time periods.

4. Management workload – we measure this in terms of failures that occur per hour, per service (Mean Time To Failure for a service).


5. Complexity of management rules.

Furthermore, host, network, Java virtual machine, and operating system configurations are examples of Configurations, and they are fixed throughout the tests.

6.1.3 Test Environment and Settings

All tests were conducted on a cluster of 128 nodes, each node having a dual AMD 2.0GHz processor, 4GB of memory, 1Gb Ethernet, and Red Hat Linux, and unless otherwise specified, each manager was given a host exclusively.

Because we needed to run thousands of Test-Services to model a large-scale system, efficient use of the test infrastructure is of paramount importance. Therefore, we conducted an experiment to understand how many Test-Services can run on one host without affecting each other. The experiment involves running a group of Test-Services on a host and measuring the resulting network, I/O, and CPU load of that host. We conducted tests for 100, 200, and 300 resources, and for each test run, we ran the test for one hour and measured the amount of data transferred over the network interface and the Load Average [17] once every 30 seconds. Figure 6.1 illustrates the results.

The first measurement is trivial to understand, but the second needs an explanation. The Load Average, which is the standard measure of load on a UNIX system, measures the number of processes waiting for the CPU on average, and it should be interpreted in terms of the number of CPUs in a host. In this instance, since test nodes have two CPUs, a 2.0 load average represents complete utilization.

As shown in Figure 6.1, running 200 services introduced only a minimal overhead. For example, the host used only 0.04MB/s of the data capacity available with a 1Gb/s network and only 0.02 out of the possible 2.0 CPU capacity, thus using less than 1% of the network bandwidth and about 1% of the CPU power. Therefore, we concluded that placing 200 Test-Services on a host would not adversely affect results.

Figure 6.1: Overhead on a Host while running Test-Services (Load Average and Data Transferred vs. Number of Resources per Host)

Test Method: We placed one replica unit, which contains a copy of the LEAD system (a unit in the Shared-Nothing Architecture), on one host. A replica consists of 200 Test-Services, including a workflow engine, persistent services, and transient services. We changed the size of the test system by changing the number of replica units (e.g. 5000 resources = 25 hosts with 200 resources each), and we deployed Hasthi to manage test systems with 0.01 service failures per service per hour (that is, an MTTF of 100 hours per service). Furthermore, we used 30 seconds as the epoch time periods, which are the time periods between two control-loop evaluations or two heartbeats. The management rules used to support the management scenarios are listed in Appendix B. We use the term "MXN test run" to denote a test run where N resources are managed with M + 1 managers (one manager becomes the coordinator), and for each test run, we let Hasthi manage the system for one hour and measured the aforementioned metrics throughout the hour.

Unless otherwise specified, all data points in graphs are averages, and every data point includes the 95% confidence interval plotted as error bars. One interesting point to note is that, while analyzing the coordinator-loop overhead, we observed an initialization cost that led to outliers. In every test, the first ten readings of the coordinator-loop overhead average over a thousand milliseconds, whereas all others average around 10 milliseconds. This behavior can be attributed to the Rete algorithm used in the rule evaluation, which remembers old results and only evaluates new facts at each evaluation. Since all resources join within the first few evaluations, the first few readings are high; after that, in normal operation, evaluations have much lower overhead. Since the first few readings are clearly outliers caused by initialization of the Rete algorithm, we have excluded the first 15 readings from our average calculations.
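As a rough illustration only (Appendix A gives the actual algorithm used by Test-Services), a workload of 0.01 failures per service per hour can be approximated by letting each service make a per-epoch random decision; all names in the sketch below are invented.

import java.util.Random;

// Sketch: convert a failures-per-hour rate into a per-epoch failure decision.
public class FailureSimulatorSketch {

    private final double failuresPerHour; // e.g. 0.01
    private final double epochSeconds;    // e.g. 30
    private final Random random = new Random();

    public FailureSimulatorSketch(double failuresPerHour, double epochSeconds) {
        this.failuresPerHour = failuresPerHour;
        this.epochSeconds = epochSeconds;
    }

    // Called once per epoch; true means the service should simulate a failure now.
    public boolean shouldFailThisEpoch() {
        double perEpochProbability = failuresPerHour * (epochSeconds / 3600.0);
        return random.nextDouble() < perEpochProbability;
    }

    public static void main(String[] args) {
        FailureSimulatorSketch sim = new FailureSimulatorSketch(0.01, 30);
        int failures = 0;
        int epochs = 120 * 100000; // 100,000 simulated hours of 30-second epochs
        for (int i = 0; i < epochs; i++) {
            if (sim.shouldFailThisEpoch()) failures++;
        }
        System.out.println("failures per hour ~= " + failures / 100000.0); // expect ~0.01
    }
}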

6.2 Scalability Analysis

According to Newman [95], the scalability of a system can be defined along many dimensions. However, for a management framework, the critical dimension is the number of resources managed by the framework. Therefore, we define the scalability of a management framework as its ability to manage more resources by adding more managers to the system. To assess the scalability of Hasthi, we explored the limits of a single manager, the system's behavior with multiple managers, and the limits of the coordinator. The experiments are described below.

6.2.1 Limits of a Manager

Experiment 1: Limits of the Manager – To find the maximum number of resources a manager could handle, a series of 1XR test runs were performed in which R had values of 1000-8000 resources. Hasthi had one manager and one coordinator. Figure 6.2 presents the results of the experiment.

Figure 6.2: Limits of a Manager ((a) Resource Heartbeat Latency vs. Resources per Manager; (b) Manager Loop Overhead vs. Resources per Manager; (c) Manager Heartbeat Latency vs. Resource Count)

As depicted in Figure 6.2, a manager can manage 5000-8000 resources, and the resource-heartbeat latency and the manager-loop overhead both exhibit a linear trend, which undergoes a marginal rise at 8000 resources. The linear rise in the heartbeat latency and the manager-loop overhead can be attributed to the increased number of heartbeat messages and the increased overhead of evaluating rules with more resources. However, the linear trend suggests that the overhead is not prohibitive and that Hasthi is able to keep up. Furthermore, at 7000 resources, the maximum observed heartbeat latency was 3.5 seconds and the maximum manager-control-loop overhead was 1.5 seconds (not shown in the graph, which shows the average). These values are around 10% of the epoch time or less. Hence, both periodic resource-heartbeat processing and manager-control-loop executions finish well before the next epoch; therefore, processing in one epoch does not affect the next. This is further evidence that the manager is well within its operational range. On the other hand, the manager-heartbeat latency stayed reasonably constant; with more resources, more information should be transferred and processed via manager heartbeats, which could increase the heartbeat latency. However, we believe that the manager-heartbeat latency stayed uniform because the resulting processing overhead is insignificant compared to the communication overhead in this case. In conclusion, the key observation drawn from the above results is given below.


Observation 1: One manager scales to 5000-8000 resources.

6.2.2 Load Behavior of Hasthi

Experiment 2: Load Behavior of Hasthi – To find how Hasthi behaves under load, a series of MXR test runs were performed, in which the numbers of managers were M = 5, 10, and 20 and the numbers of resources were R = 3000, 5000, 10,000, and 15,000. Figure 6.3 presents the results of the experiment.

Figure 6.3: Hasthi Load Behavior ((a) Coordinator Loop Overhead vs. Resource Count; (b) Manager Heartbeat Latency vs. Resource Count, for 5, 10, and 20 managers)

As depicted in Figure 6.3, with more managers, Hasthi scaled past 8000 resources, and it managed up to 15,000 resources, which was the largest test system we tested in this experiment. Similar to the first experiment, we can see that the coordinator-loop overhead increases linearly and the manager-heartbeat latency stays stable, and we believe the same explanations we offered for the first experiment apply to this case as well. However, adding more than 5 managers does not seem to make a sizable difference. As seen in Figure 6.3, all three lines are clustered together, a behavior that is understandable given that each manager is managing only 1000 to 3000 resources compared to the 8000-resource capacity we observed in the first experiment. To summarize, with more managers, Hasthi scaled to manage all systems we tested, and even when managing 15,000 resources, the largest system we tested, each manager was only handling 10-30% of the 8000-resource capacity observed in Experiment 1. We believe this observation suggests that Hasthi may scale past 15,000 resources, and we conclude that more experiments are required to understand its limits.

6.2.3 Limits of the Coordinator

With 200 services per node, a 128-node cluster only allows us to run a system of close to 20,000 resources. However, the earlier experiment suggests that Hasthi may scale beyond that limit, and therefore, testing Hasthi to its limits called for an alternative test design. While exploring an alternative test design, we made the following observations. Each resource is controlled by a manager, and the managers are controlled by the coordinator. Therefore, if we could write a Test-Manager that mimics all the messages and behaviors exhibited by a real manager managing a group of resources, we could test the coordinator to its limits without having to run tens of thousands of resources. We have developed such a Test-Manager, and its functionality is described below.

In the normal operation of Hasthi, when a managed resource starts up, it sends a "ManageMe" message to the coordinator, which assigns the resource to a manager, and the assigned manager subscribes to resource heartbeats and includes all major resource changes in its manager heartbeats to the coordinator. The coordinator updates a meta-model of the system according to the changes included in manager heartbeats, periodically evaluates the meta-model, and performs management actions.

The Test-Manager acts as a normal manager, but in addition, it also acts as though it is managing a group of resources. For the experiment, the coordinator is set up with Test-Managers, where each Test-Manager is given, as a startup argument, the number of resources it should emulate; each Test-Manager then mimics all the messages exchanged between the coordinator and a conventional manager.
Figure 6.4: Test Setup of Hasthi with and without Test-Managers

Figure 6.4 illustrates a normal Hasthi test setup on the left and a Test-Manager-based setup on the right; as can be seen in the figure, Test-Managers mimic resources by simulating them locally. Having joined the manager-cloud like any other manager at startup, each Test-Manager periodically sends heartbeat messages to the coordinator. Furthermore, for each resource to be emulated, the Test-Manager sends a ManageMe message to the coordinator, and the coordinator assigns the resource described in the message to a manager in the manager-cloud, which is also a Test-Manager in this case. The assigned Test-Manager creates an in-memory object that simulates the properties of the resource using the same algorithm used by Test-Services in Experiment 1. Furthermore, with the heartbeats periodically sent to the coordinator, the Test-Manager sends all major updates that happened to its simulated resources, and the coordinator perceives the resources as real, not as simulated. Moreover, the management endpoint of each resource is mapped to the assigned Test-Manager, and when the coordinator performs an action, the action is also emulated by the Test-Manager. Therefore, with Test-Managers, all messages, their order, and their timing behave as if real resources existed behind the Test-Managers. Hence, we argue that experiments done using the Test-Manager-based workload are representative, from the coordinator's point of view, of a real workload generated by a system managed by Hasthi.

Experiment 3: Limits of the Coordinator – To find the limits of the coordinator, a series of MXR test runs were performed with M = 50, 100, 500, and 1000 Test-Managers and R = 10k, 20k, 30k, 40k...100k resources emulated by the Test-Managers. For each test run, one normal manager was used as the coordinator, and it was set up with 1024MB as the maximum heap size. For each test, Test-Managers were distributed across 10 hosts. The same configurations used for Experiments 1 and 2 were also used in this setting.
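The core of the Test-Manager's emulation loop described above might look roughly like the sketch below. The Coordinator interface, the message shapes, and every name here are invented for illustration; the real Test-Manager speaks Hasthi's actual Web Service interfaces.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a Test-Manager: registers N emulated resources with the coordinator
// and, once per epoch, reports only the resources whose simulated state changed.
public class TestManagerSketch {

    // Hypothetical stand-in for the operations a manager invokes on the coordinator.
    interface Coordinator {
        void manageMe(String resourceId);
        void managerHeartbeat(String managerId, List<String> changedSummaries);
    }

    private final Coordinator coordinator;
    private final String managerId;
    private final List<String> resources = new ArrayList<String>();
    private final Random random = new Random();

    public TestManagerSketch(Coordinator coordinator, String managerId, int resourceCount) {
        this.coordinator = coordinator;
        this.managerId = managerId;
        for (int i = 0; i < resourceCount; i++) {
            resources.add(managerId + "-resource-" + i);
        }
    }

    public void start(long epochSeconds) {
        // Each emulated resource announces itself, as a real resource would with a ManageMe message.
        for (String id : resources) {
            coordinator.manageMe(id);
        }
        // Once per epoch, send a manager heartbeat carrying only the changed summaries.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(new Runnable() {
            public void run() {
                List<String> changed = new ArrayList<String>();
                for (String id : resources) {
                    if (random.nextDouble() < 0.001) {   // pretend a rare state change occurred
                        changed.add(id + ":Crashed");
                    }
                }
                coordinator.managerHeartbeat(managerId, changed);
            }
        }, epochSeconds, epochSeconds, TimeUnit.SECONDS);
    }
}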

Figure 6.5: Limits of the Coordinator ((a) Manager Heartbeat Latency vs. Resource Count; (b) Coordinator Loop Overhead vs. Resource Count, for 50, 100, 500, and 1000 Test-Managers)

As illustrated by Figure 6.5, the manager-heartbeat latency grows linearly with minor disturbances toward the end, and the coordinator-loop overhead grows linearly with the curve turning slightly upward. Both behaviors can be attributed to the fact that with more resources, more information needs to be transferred and processed via heartbeats and evaluated in the coordinator-loop. Most lines are clustered together, indicating that the number of managers makes a minimal difference to the system, which also suggests that the system is limited by the coordinator.

Furthermore, the overheads were well within the 30-second epoch times. For example, with 1000 managers and 100,000 resources, the maximum coordinator-loop overhead was less than 1% of the epoch time (not shown in the figure, which shows averages), and the maximum heartbeat latency was less than 10% of the epoch time. Therefore, each periodic heartbeat or evaluation finishes well before the next one, which suggests that Hasthi is within its operational range. Our key observation from these results, which we discuss further below, is the following.

Observation 2: The coordinator scales to manage 100,000 resources and up to 1000 managers.

This is one of the key results of this dissertation and therefore warrants a detailed analysis. Let us analyze the coordinator for bottlenecks and try to identify the characteristics of the architecture that made these results possible. Even though the number of resources handled by Hasthi can be increased by adding more managers, Hasthi depends on the coordinator for the control of managers and for global control. However, the coordinator is a single node and hence has limited resources, which may lead to several bottlenecks. Let us look at the possible bottlenecks and the architectural traits of Hasthi that mitigate them.

The first bottleneck is that Hasthi has to track the state of all resources in the system, and the size of all that state could be prohibitive. As discussed in the architecture chapter, to mitigate this, the coordinator keeps only a summary of each resource locally in memory. This summary is small and is updated only when the resource's behavior has significantly changed (e.g. Crashed, Saturated). Therefore, this reduces the memory required per resource.

The second bottleneck is that the coordinator must receive heartbeat messages and keep the summarized meta-model of the system up to date by applying the changes included in those messages, and this process incurs a significant overhead. To mitigate this, only changes to resource summaries are propagated to the coordinator. Resource summaries change slowly; therefore, the amount of data that needs to be sent with heartbeats is limited.

The third bottleneck is that the coordinator has to periodically analyze the information about resources kept locally, and the cost of this evaluation could be prohibitive. Hasthi mitigates this problem by using the Rete algorithm for evaluating management rules. The algorithm trades space for time (processing overhead) by remembering evaluated results; hence, at each evaluation, only new or changed facts need to be evaluated. In this setting, as explained in the previous paragraph, the information about a resource stored in the coordinator changes only when the summary of the resource changes, which happens only when the resource undergoes a major change. Therefore, at each step, the coordinator receives only a few changes, and the Rete algorithm only has to evaluate those changes. Consequently, this approach increases the scalability of the coordinator.
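The "propagate only changed summaries" idea behind the second bottleneck can be pictured with a small sketch; class and method names here are invented, and Hasthi's real summaries and heartbeat payloads are richer than a single status string.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: remember the last summary reported for each resource and put only
// the entries that changed into the next heartbeat.
public class SummaryDeltaSketch {

    private final Map<String, String> lastReported = new HashMap<String, String>();

    public List<String> deltaForNextHeartbeat(Map<String, String> currentSummaries) {
        List<String> delta = new ArrayList<String>();
        for (Map.Entry<String, String> e : currentSummaries.entrySet()) {
            String previous = lastReported.get(e.getKey());
            if (previous == null || !previous.equals(e.getValue())) {
                delta.add(e.getKey() + "=" + e.getValue());
                lastReported.put(e.getKey(), e.getValue());
            }
        }
        return delta;
    }

    public static void main(String[] args) {
        SummaryDeltaSketch tracker = new SummaryDeltaSketch();
        Map<String, String> epoch1 = new HashMap<String, String>();
        epoch1.put("svc-1", "Running");
        epoch1.put("svc-2", "Running");
        Map<String, String> epoch2 = new HashMap<String, String>();
        epoch2.put("svc-1", "Running");
        epoch2.put("svc-2", "Crashed");
        System.out.println(tracker.deltaForNextHeartbeat(epoch1)); // both entries, on first report
        System.out.println(tracker.deltaForNextHeartbeat(epoch2)); // only svc-2=Crashed
    }
}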

6.2.4 Verifying Independence of Managers

We have observed that a single manager can scale up to 5000-8000 resources and that the coordinator can scale up to 1000 managers and 100,000 resources. However, these results do not necessarily mean that the system as a whole would scale over the same range. To verify that the system scales up to 100,000 resources, we need to assert that the load on a manager depends only on the resources assigned to it and is not affected by the existence of other managers.

In favor of this hypothesis, we provide two pieces of evidence. First, managers do not have any interactions with other managers, except for elections, which only happen if the coordinator has failed. However, coordinator failures occur rarely and do not pose any serious performance burden on normal operation. On the other hand, every manager sends heartbeat messages to the coordinator, and this is the only time when managers could be affected by other managers and by the existence of other resources. But, as seen from the clustered lines in Figure 6.5, adding managers does not make a difference to heartbeats, and as seen from the slowly rising curve in Figure 6.5(a), adding more resources has only a marginal effect on the heartbeat latency of managers. Both observations suggest that managers are not affected by other managers in the system.

As the second piece of evidence, we verified the hypothesis using the data available from Experiments 1 and 2. Figure 6.6 shows the resource-heartbeat latency, the manager-loop overhead, and the manager-heartbeat latency plotted as scatter plots against the number of resources assigned to a manager (calculated as the number of resources in the system divided by the number of managers).

Figure 6.6: Correlation between Resources per Manager and Manager Overheads ((a) Resource Heartbeat Latency; (b) Manager Loop Overhead; (c) Manager Heartbeat Latency, each plotted against Resources per Manager)

We can clearly see a correlation and a trend in the data, and different values measured with the same number of resources per manager (X values) are reasonably close to each other. Therefore, we argue that the behavior of managers is not significantly affected by other managers or resources in the system, which is strong evidence in favor of the hypothesis. Therefore, our third observation is the following.

Observation 3: The overhead of a manager is primarily determined by the number of resources assigned to it, and the effects of other managers and resources in the system are minimal. We empirically verified this result up to 2000 resources per manager.

6.2.5 Scalability of Hasthi

Observations 1 and 2 demonstrate that one manager can scale to manage 5000-8000 resources and that the coordinator scales to manage 1000 managers and 100,000 resources. Furthermore, Observation 3 states that the load on a manager is independent of other managers and resources in the system and depends only on the number of resources assigned to that manager; we empirically verified this result up to 2000 resources per manager. If we have more than 50 managers, the number of resources per manager will be less than 2000, which is easily arranged given that Hasthi can have up to 1000 managers. Observations 1 and 3 suggest that more resources can be handled by adding more managers and distributing resources among them, and Observation 2 suggests that this can be done up to 100,000 resources, the limit of the coordinator. Therefore, these observations provide strong evidence that Hasthi can scale to 100,000 resources.

6.3 Sensitivity to Operational Conditions

The goal of this section is to identify the sensitivity of Hasthi to different operational variables, thus establishing its effective operational range. We have chosen the management workload, the epoch time intervals, and the complexity of rules as the three parameters of the system that we explore. To assess the response of Hasthi to different conditions, we conducted the following experiments.

6.3.1 Sensitivity to Management Workload

Management workload—the amount of correction a management framework has to perform—could affect the framework's performance and response time and could potentially overwhelm the framework. Furthermore, from time to time, errors and other conditions in systems may cause high management workloads. To measure the behavior of Hasthi while handling such workloads, we conducted the following experiment.

Management workload is closely related to the failure probability of the system. For this experiment, we measure the workload in terms of the failure probability, and we changed the failure probability (MTTF) of services in the test system used for the scalability tests to create different management workloads. When a failure occurs, it is propagated to the coordinator, which may create an alternative service depending on the management rules. However, we did not test Hasthi against catastrophic failures (e.g. a failure of 50% or more of all resources) because the chances of recovery in those cases are slim. Rather, we measured the response of Hasthi up to high workloads (e.g. 600 service failures/minute). Similar to earlier experiments, we set up a coordinator and a test system of 20 managers and 40,000 resources, let the coordinator manage the system for an hour, and measured the metrics. Test runs were repeated while changing the failure percentage, which is the percentage of resources that fail within an hour.

Figure 6.7: Response to Management Workload ((a) Manager Heartbeat Latency vs. Failure Rate; (b) Coordinator Loop Overhead vs. Failure Rate, in % of resources failing per hour)

As shown by Figure 6.7, Hasthi is stable under higher workloads, which is demonstrated by the fact that the heartbeat latency kept a constant trend and the coordinator-loop overhead kept an almost linear trend. Specifically, a 90x increase in the failure rate only brought about 200 milliseconds of additional overhead, and the overhead stayed well below values that could affect successive rule evaluations.

These results can be attributed to the asynchronous execution of management actions in Hasthi. To recap, each action is submitted to an action queue, and a group of threads executes the queued actions asynchronously. By doing so, we have decoupled decision-making from corrective-action execution, enabling Hasthi to handle bursts of changes. Consequently, Hasthi handles corrective actions without overloading the coordinator-loop or the coordinator.

As explained before, the Rete algorithm used for rule evaluation only needs to evaluate new facts, and as the failure rate increases, facts change faster. Therefore, the algorithm has to do more work, which increases the coordinator-loop overhead; however, the linear trend suggests that the rule evaluation keeps up with the load. It is important to note that in the management scenario, Hasthi only replaces persistent services and some transient services, meaning that as time passes, the number of resources in the system decreases, easing the coordinator-loop overhead slightly. But even in the test run with a 90% failure rate, at the end of the test the system still had 31,739 of the initial 40,000 resources (80%) and had created 21,090 new resources in the course of the hour. Due to this, we believe the effect of the reduced resource count on the results is small.

Furthermore, even with 90% failures in an hour (600 service failures/minute in this experiment), a very high failure rate for a real system, Hasthi was able to withstand the load. Specifically, rule evaluations took only 200 milliseconds on average, easily keeping up with the load. It is worth noting that Hasthi sustained this behavior throughout the test (over an hour), and we believe these results suggest that it could handle even higher burst loads because actions are queued and processed asynchronously.
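A minimal sketch of this decoupling, assuming a plain thread pool (the names and the error handling are illustrative, not Hasthi's actual implementation):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: the decision loop only enqueues corrective actions; a small pool of
// worker threads executes them asynchronously, so bursts of failures do not
// block rule evaluation.
public class ActionQueueSketch {

    interface ManagementAction {
        void execute() throws Exception;
    }

    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    // Called from the rule-evaluation loop; returns immediately.
    public void submit(final ManagementAction action) {
        workers.submit(new Runnable() {
            public void run() {
                try {
                    action.execute();
                } catch (Exception e) {
                    // A failed action is only logged here; the next evaluation cycle
                    // still sees the unhealthy state and can decide to retry or escalate.
                    System.err.println("management action failed: " + e.getMessage());
                }
            }
        });
    }

    public static void main(String[] args) {
        ActionQueueSketch queue = new ActionQueueSketch();
        queue.submit(new ManagementAction() {
            public void execute() {
                System.out.println("restarting workflow engine (simulated)");
            }
        });
        queue.workers.shutdown();
    }
}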


Therefore, in the final analysis, we conclude that Hasthi is stable with respect to the management workload.

6.3.2 Sensitivity to Epoch Time Intervals

Each manager periodically sends heartbeats to the coordinator, and the coordinator evaluates the system periodically; the associated periods are called epoch times. As demonstrated by the formal proof in Chapter 5, these parameters determine the time interval of delta-consistency and how often rules are evaluated, and therefore they define how fast Hasthi responds to a change in the system. Generally, Hasthi uses 30 seconds as the epoch time. However, some usecases may need faster responses and consequently may need to use lower epoch times. This experiment studies Hasthi under different epoch times.

We performed an experiment to measure the effect of epoch times on the coordinator, in which we used the aforementioned benchmark with 5, 10, 20, and 30 seconds as epoch times. In each case, while measuring the metrics, we let the coordinator manage the system for an hour. We performed two sets of experiments, with 20 managers and 40,000 resources and with 100 managers and 40,000 resources, and Figure 6.8 depicts the results. Since the responsiveness of the system is limited by both the coordinator-loop epoch time and the manager-heartbeat epoch time, it does not make sense in practice to change only one; therefore, we changed both parameters together.

As shown by Figure 6.8(a), the heartbeat latency was almost constant across epoch times, and in both graphs, the number of managers made almost no difference. However, the coordinator-loop overhead was actually reduced with smaller epoch times, which at first glance is counterintuitive. There is, however, a simple explanation for this result. As explained earlier, managers only transfer information when a resource has significantly changed, and the underlying Rete algorithm used for rule evaluation remembers old results and only needs to assert new facts.

Figure 6.8: Sensitivity to Epoch Time ((a) Manager Heartbeat Latency vs. Epoch Time Interval; (b) Coordinator Loop Overhead vs. Epoch Time Interval, for 20 and 100 managers; dotted lines show the total rule-evaluation time per 30 seconds)

When the coordinator-loop evaluates rules more often, the amount of change included in each evaluation decreases; therefore, the rules have less work to do and evaluations are faster. However, with frequent evaluations, the total CPU time may be higher; the dotted lines in Figure 6.8(b) represent the total time spent evaluating rules per 30 seconds. Except for the 5-second epoch time, smaller epoch times have a higher total CPU overhead, yet provide faster responses to changes in the system; the 5-second epoch achieves the best of both worlds.
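One way to read the dotted lines, as a back-of-the-envelope restatement rather than an additional measurement: if an epoch of length $T_e$ seconds incurs an average per-evaluation overhead $c(T_e)$, then the total rule-evaluation time accumulated in a 30-second window is approximately

\[ C_{30}(T_e) \approx \frac{30}{T_e}\, c(T_e), \]

so a shorter epoch lowers $c(T_e)$ (fewer changes per evaluation) but multiplies it by a larger factor $30/T_e$; in the measurements above, that product grows for most epochs shorter than 30 seconds, with the 5-second setting as the observed exception.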

Furthermore, with increasingly frequent heartbeats, the amount of information contained in each heartbeat decreases. However, we believe that the communication overhead dominates the heartbeat latency, and as a result, the latency stays constant even when the amount of transferred information decreases.

In conclusion, our results suggest that Hasthi can be used with a range of epoch times (from 5 to 30 seconds), and thus remains stable with respect to epoch times.

6.3.3 Sensitivity to Rules

Usecases and management scenarios vary from system to system, and as a result, users may want to use Hasthi with different rules. Because of this, we performed the following experiment to measure the sensitivity of Hasthi to different management rules. We repeated the same setup as the scalability tests, with 100 managers and 40,000 resources, but using different rule sets. Since the complexity of rules cannot be quantified, we created 7 rule sets, starting with an empty set (workload 0), with each rule set having successively more rules than the one before. Thus, rule sets with higher numbers are more complex than lower ones, although the difference in complexity between two adjacent sets may vary. We performed these tests with a workload of 0.01 service failures and 0.01 service saturations per service, per hour, and Figure 6.9 depicts the results. The rules used for these tests can be found in Appendix C.

Figure 6.9: Sensitivity to Rule Complexity ((a) Heartbeat Latency vs. Rule Complexity; (b) Coordinator Loop Overhead vs. Rule Complexity, for workloads of increasing complexity, with error bars)

As illustrated by Figure 6.9, Hasthi is relatively stable across different sets of rules. The heartbeat latency is almost constant, which is expected because rules have almost no effect on heartbeats. The coordinator-loop overhead shows a linear trend. In Figure 6.9(b), even though it seems that the coordinator-loop overhead is reduced in the 5th and 6th workloads, we believe the changes are not significant because the 95% confidence intervals of the 4th, 5th, and 6th workloads overlap. The most probable explanation is that the overhead stayed constant after the 4th workload because the new rules introduced by the 5th and 6th workloads do not cause a high overhead. Furthermore, for all workloads, the overheads are in the range of 50 milliseconds, which is well within operational ranges.

The most probable explanation for the stable behavior of the coordinator-loop overhead is the underlying Rete algorithm, in which each rule is decomposed and represented as an evaluation tree, which optimizes the number of evaluations. For example, if two rules need to evaluate the same fact, both share the same node in the tree; therefore, the fact is evaluated only once. As a result, adding more rules may not incur much overhead if some evaluations are shared among them.

Since there are many possible types of rules, limited only by the author's imagination, it is very hard to extend this result to all rules. However, considering the stable behavior for the reasonably complex rule set of the 6th workload, and considering that the overhead of the rules is less than 1% of the 30-second epoch time (on the order of 50 milliseconds), we believe Hasthi could handle most rule sets. Furthermore, to verify our scalability results, we measured a system with 100,000 resources using the most complex rule set (the 6th workload), and Hasthi managed that system with an average coordinator-control-loop overhead of 292 milliseconds and a maximum of 453 milliseconds, which verifies our earlier observations.

In the final analysis, both the heartbeat latency and the coordinator-loop overhead are stable for the rule sets we tested, and this is strong evidence of Hasthi's ability to handle different rules.
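To illustrate the intuition behind shared conditions (a toy sketch only; the actual Rete network built by the rule engine is far more general than this), two rules that test the same condition effectively share one evaluation of it:

// Toy illustration of condition sharing: the "overloaded" test is computed once
// per fact and reused by both rules, so adding the second rule adds little work.
public class SharedConditionSketch {

    static class ServiceFact {
        final String id;
        final int pendingRequests;
        final boolean registered;
        ServiceFact(String id, int pendingRequests, boolean registered) {
            this.id = id;
            this.pendingRequests = pendingRequests;
            this.registered = registered;
        }
    }

    public static void main(String[] args) {
        ServiceFact fact = new ServiceFact("transient-svc-7", 120, true);

        boolean overloaded = fact.pendingRequests > 100;   // shared condition

        if (overloaded && fact.registered) {                // rule A
            System.out.println("rule A: unregister " + fact.id);
        }
        if (overloaded) {                                   // rule B
            System.out.println("rule B: notify administrator about " + fact.id);
        }
    }
}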

6.4 Election and Recovery Behavior

Hasthi depends on elections to recover from failures, and Chapter 5 formally analyzed the soundness of the algorithm and obtained upper bounds on the recovery time. To understand the recovery behavior empirically, we performed the following experiment. We set up a system with managers and resources, and the managers were instrumented to generate events describing major changes in system health, such as coordinator failures, the initiation and completion of elections, and reaching a healthy state. We define the system as healthy when there is a coordinator, all managers have joined the coordinator, and 90% of all resources are assigned to a manager. We did not use 100% here because, given the way the workload is designed, some services may fail in the meantime, meaning that not all resources may be available when the coordinator recovers. We developed a test driver that monitors the system by listening to the events generated by managers, and when the system is healthy, it starts a new manager and kills the current coordinator. Furthermore, by listening to the events generated by the system, the driver measures the following metrics. Our tests were conducted with 20, 50, and 100 managers, and with each set of managers, the above process was repeated 100 times.

1. Detection time – the time to detect a coordinator failure.

2. Election time – the time taken by an election.

3. Recovery time – the time taken by the new coordinator to rebuild the system state; this is the time for all managers to join and for 90% of resources to join.

4. End-to-end time – the sum of the above three.

Each metric was measured using the timestamps associated with each event. Since all hosts are part of a cluster, the clock drift is typically about a millisecond, and therefore the time drifts between different hosts were ignored.
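The measurement side of the driver reduces to timestamp arithmetic over the events described above; the sketch below uses invented event names and made-up numbers purely to show the bookkeeping.

// Sketch: derive the four recovery metrics from event timestamps (milliseconds).
public class ElectionDriverSketch {

    static class RunEvents {
        long coordinatorKilled;
        long failureDetected;
        long electionCompleted;
        long systemHealthy;
    }

    static void report(RunEvents e) {
        long detection = e.failureDetected - e.coordinatorKilled;
        long election = e.electionCompleted - e.failureDetected;
        long recovery = e.systemHealthy - e.electionCompleted;
        long endToEnd = detection + election + recovery;
        System.out.println("detection=" + detection / 1000 + "s"
                + " election=" + election / 1000 + "s"
                + " recovery=" + recovery / 1000 + "s"
                + " endToEnd=" + endToEnd / 1000 + "s");
    }

    public static void main(String[] args) {
        RunEvents run = new RunEvents();   // hypothetical values, not measurements
        run.coordinatorKilled = 0;
        run.failureDetected = 14000;
        run.electionCompleted = 18000;
        run.systemHealthy = 65000;
        report(run);
    }
}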


Test systems used 30 seconds as epoch times and the service failure rate was 0.01 service failures per service, per hour, where either 5 managers or 200 services were placed on each host. This process was repeated for 5000 and 10,000 resources.

Figure 6.10: Election and Recovery Behavior of Hasthi ((a) 5000 resources; (b) 10,000 resources; Detection, Election, Recovery, and End-to-End times vs. Number of Managers)

As shown by Figure 6.10, all measured times follow consistent trends, with the overall time decreasing as more managers are added, which suggests that Hasthi is stable with respect to elections. The end-to-end time for recovery is the sum of the detection time, the new-coordinator election time, and the recovery time of the system. As the number of managers increases, the first decreases, the second increases, and the third stays relatively constant. However, the detection time decreases faster than the election time increases, and therefore the end-to-end time for recovery decreases as the number of managers increases.

Managers detect coordinator failures when they try to send heartbeat messages to the coordinator, and assuming manager heartbeat times are evenly distributed across the heartbeat interval, the detection time decreases roughly in inverse proportion to the number of managers; hence, with more managers, it is more likely that a coordinator failure will be detected sooner. On the other hand, due to the O(log(N)) time complexity of the election algorithm, the election time increases following a log function, which changes more slowly than the 1/N behavior of the detection time. Due to this, the end-to-end recovery time decreases with more managers.

The overall recovery time is marginally higher than a minute, which is a huge improvement over the hours of Mean Time To Recover (MTTR) common in manually maintained systems. If required, it can be decreased further by decreasing the epoch times of the system. In conclusion, Hasthi demonstrates stable behavior with elections, recovers the system from coordinator failures in close to a minute, and its behavior improves with more managers.
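A rough back-of-the-envelope model of this trend (an illustration, not part of the formal analysis in Chapter 5): if each of $N$ managers sends heartbeats with period $T$ at an independent, uniformly random phase, the first heartbeat attempt after a coordinator failure (and hence its detection) occurs after roughly

\[ E[t_{\text{detect}}] \approx \frac{T}{N+1}, \]

which falls approximately as $1/N$, while the election itself grows only as $O(\log N)$, so their sum shrinks as managers are added, consistent with the measured end-to-end times.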

6.5 Comparative Analysis

To understand the practical concerns associated with management frameworks, we have empirically compared and contrasted Hasthi with the management framework described by Gadgil et al. [60], which we will refer to as CGLM henceforth.

Similar to Hasthi, CGLM manages a system using user-defined management logic, and it consists of managers, a registry, and bootstrap services. CGLM assigns each managed resource to a manager, which creates a thread to manage the resource and executes user-defined management logic provided as Java code. This enables user-defined control of individual resources; however, CGLM does not provide user-defined control at the global level. In other words, it does not make decisions based on the state of multiple resources. Resources and components of CGLM use a registry to share information. Managers and resources register themselves in the registry using a soft-state protocol, and most communication is done using a message broker, a highly scalable messaging middleware.

To understand the load on CGLM, we added lightweight instrumentation, which measures the following metrics that represent the performance characteristics of the system.

1. Heartbeat processing time – Resources send heartbeat messages to the assigned manager, and for each message, the time from heartbeat-message initialization to the completion of message processing was measured using timestamps in the messages, ignoring clock drift.

2. Latency for a service to renew with the registry – each service periodically renews with the registry, and the associated latency is measured.

3. Latency for a manager to renew with the registry – each manager periodically renews with the registry, and the associated latency is measured.

4. Resource control-loop overhead – for each assigned resource, a manager periodically executes the associated user-defined management logic to control the resource, and the overhead of each execution is measured.

5. System control overhead – bootstrap services periodically spawn a system-health-check process, which checks the system health and recovers any failed or missing core services such as the registry, managers, and message nodes. The latency for this process to evaluate and recover the system was measured.

For the following experiments, we used the same cluster used for testing Hasthi. To recreate the same testing environment as the Hasthi scalability tests, we ported the workload defined earlier to CGLM. Furthermore, after verifying that it would not overload the host, we placed 200 Test-Services on each host. For the management logic, Java-based code that approximates the logic of the Hasthi rules was used, which, among other things, restarts any failed services. Furthermore, time periods (e.g. the heartbeat interval) were set to 30 seconds, and Test-Services were set up with 0.01 service failures per service, per hour.


The system includes a message node, which, according to Gadgil et al. [60], can handle in excess of 2000 messages/sec. Once every 30 seconds, each resource in the system generates 5 messages (i.e. 2 to renew with the registry, 1 to send a heartbeat to the assigned manager, and 2 for the manager to retrieve the resource state), while the number of other messages generated by the system is small. Therefore, even with 5000 resources, only around 25,000 messages are generated every 30 seconds, which translates to fewer than 1000 messages/sec. Hence, we believe having one message node does not adversely affect the results. Furthermore, to verify that the message node was functional throughout all tests, we generated a test message once every 10 seconds, sent it to a different node through the broker, and at the end of the test verified that all messages were delivered.

For each test run, one registry, two bootstrap nodes, and a message node were set up together with the resources to be managed, and when started, CGLM creates enough managers to manage the resources. We controlled the manager-to-resource ratio using configuration parameters. For each test run, we operated the setup for one hour and collected the aforementioned metrics. We performed the following two experiments using this test setup.

Experiment 1: Limits of a Manager – The system was set up to run with one manager, and test runs were performed with 400, 1000, 2000, 3000, 4000, and 5000 resources. We also ran an identical setup with Hasthi. Figure 6.11 depicts the results.

Figure 6.11(a) presents the behavior of all five measured metrics. In this experiment, a single manager works up to 4000 resources; at 5000 resources, failures started to occur because resources were not getting responses from the registry in a timely manner. We tested the system with 6000 resources and verified that this behavior persists. Furthermore, as shown by the figure, the heartbeat processing time and the registry renew time of resources increase rapidly with more resources, while the other metrics stay constant. As the number of resources increases, both the registry and the manager have to handle more resources; thus, the load increases, and this is a possible explanation for the above results.

Figure 6.11: Single Manager Overhead, CGLM and Hasthi ((a) CGLM: Single Manager Overhead, all five metrics; (b) CGLM vs. Hasthi: Resource Heartbeat Overhead; (c) CGLM vs. Hasthi: Resource Control Overhead)

Figures 6.11(b) and 6.11(c) compare the resource-heartbeat latency and the resource control overhead of the two systems. Other values are not compared because Hasthi does not have a registry, and the system control-loop of CGLM and the global control-loop of Hasthi are semantically different (for the record, on average, the CGLM global control-loop took 224 to 261 milliseconds, whereas Hasthi's took 1 millisecond). Since the architectures of the two systems are not identical, comparing these values should be done carefully.

For resource heartbeats, on average, Hasthi took 4 to 9 milliseconds compared to CGLM's 2 to 21 milliseconds. Among the reasons for this discrepancy, Hasthi uses request-response operations to send resource heartbeats, which also enables Hasthi to verify that the manager is active, whereas CGLM uses one-way operations. Furthermore, in CGLM, communication is done via a messaging system that uses pre-established TCP connections, whereas Hasthi opens a new connection for each invocation; this is a design tradeoff Hasthi made to support more resources.

For resource control, CGLM runs the management logic for each resource in a separate thread; therefore, the value given in Figure 6.11 represents the time to evaluate one resource. In contrast, managers in Hasthi evaluate all resources in one batch. For the values we measured, on average, Hasthi took 100 to 430 milliseconds to evaluate all resources, whereas CGLM took 20 to 27 milliseconds to evaluate a single resource, but did so in parallel. However, with one thread per resource, a CGLM manager has to handle a large number of threads, and with that approach it is difficult to write management logic that depends on more than one resource.

In conclusion, a single manager in Hasthi could scale to manage almost twice as many resources as CGLM (8000 resources against CGLM's 4000) and performs better in terms of heartbeats. In terms of wall-clock time for evaluating an individual resource, CGLM does better. However, Hasthi's batch-processing approach scales and supports global assertions that depend on more than one resource. We believe this observation presents a useful tradeoff for system management design.

Experiment 2: Limits of the System – Test runs were performed for the cross product of 1, 2, 3, 5, and 10 managers and 400, 1000, 2000, 3000, 4000, and 5000 resources, and Figure 6.12 depicts the results.

Figure 6.12: Multiple Manager Overhead of the CGLM System ((a) Resource Heartbeat Latency; (b) Resource Control Overhead; (c) System Control Overhead; (d) Manager Renew with Registry Overhead; (e) Resource Renew with Registry Overhead, each vs. Resource Count for 1, 2, 3, 5, and 10 managers)


As shown by Figure 6.12, adding more managers made almost no difference, and even with 10 managers, the system did not scale beyond 5000 resources. The most probable explanation for this behavior is that the bottleneck is the registry; therefore, adding managers does not help. In conclusion, CGLM does not scale up with more managers, whereas, as demonstrated in the earlier scalability tests, Hasthi clearly does better with multiple managers, significantly increasing its limits. As a result, from the scalability point of view, Hasthi does significantly better. Furthermore, we observed some tradeoffs made by the two systems that are interesting to a designer of a management framework.

6.6 Application to a Real-Life Usecase

Finally, we applied Hasthi to manage a real-life usecase—an e-science cyberinfrastructure called Linked Environments for Atmospheric Discovery (LEAD)—and measured its characteristics. The LEAD-Hasthi integration has been completed, and Hasthi currently manages the LEAD development stack. Furthermore, we performed the following experiments to evaluate the Hasthi-LEAD integration by injecting failures into the system.

The LEAD deployment consists of 26 services deployed on 6 nodes, each having dual AMD 2.0-2.6GHz Opteron CPUs, 16-32GB of memory, Red Hat Linux, and a 1Gb network. Hasthi was deployed with 3 managers, and all control-loop and heartbeat intervals were set to 30 seconds. The LEAD system and the usecase we implemented are described in Chapter 9. In the usecase, Hasthi recovers LEAD from service and host failures, and once the system is recovered, it recovers the workflows that failed due to those same service and host failures.

Based on the usecase, we performed two experiments. The first experiment killed a service in the LEAD system and measured the time for the system to detect the error, to trigger corrective actions, to execute a corrective action, for new resources to join, and to detect that the system had recovered. Similarly, in the second experiment, we simulated a host failure by killing all LEAD- and Hasthi-related processes on a host and measured the same recovery overheads. We performed each test 100 times, and Figure 6.13 illustrates the results. In the figure, all values are averages, and the error bars represent 95% confidence intervals. The measured readings are represented by the labels Detect, Trigger, Recovery, Join, and HealthCheck, respectively, and End2End represents the overall time for recovery.

Figure 6.13: LEAD Recovery Times with Hasthi ((a) Host Recovery (Relocation); (b) Service Recovery (Restarts); bars show the Detect, Trigger, Recovery, Join, HealthCheck, and End2End overheads in seconds, with error bars)

As shown by Figure 6.13, host recovery took on average about 107 seconds, and service recovery took about 89 seconds. In both cases, about 60% of the recovery time was spent detecting the failure, and 25-28% was spent detecting that the system had recovered. Among the actual values we observed, the maximum time to detect a failure, 79 seconds, is below the 90-second upper bound we predicted in the proof in Chapter 5. On the other hand, after the recovery is completed, Hasthi decides that the system is healthy when the control-loop executes in the next period, which happens within about 30 seconds; this explains the roughly 25 seconds spent detecting that the system has recovered.

Furthermore, using the recovery time, which is the Mean Time To Recovery (MTTR) for service failures, and the Hasthi recovery time, which is about 80 seconds according to the election experiments, we can approximate the availability of LEAD managed with Hasthi. Assuming that the scenario captures LEAD downtimes and that services and managers fail independently, each with a Mean Time To Failure (MTTF) of f, then according to Baumann [39] the MTTF of the system is f/29, since LEAD has 26 services and 3 managers. Furthermore, since the MTTFs of managers and services are the same, it follows from the 3:26 ratio of managers to services that failures due to Hasthi and due to services are 3/29 and 26/29 of the total number of failures, respectively. Therefore, the MTTR of the system can be calculated by weighting the MTTR of each case by its ratio: MTTR of the system = (3*80/29) + (26*107/29) = approximately 104 seconds. Therefore, the LEAD system availability is A = MTTF/(MTTF + MTTR) = (f/29)/(f/29 + 104), and with an MTTF of 7 days, 14 days, and a month per service, the availability is 0.995, 0.997, and 0.999, respectively. These availabilities correspond to 43, 26, and 9 hours of downtime per year, respectively. Since we have frequently had hours of MTTR with manual maintenance, these availability numbers are a significant improvement.

Furthermore, Table 6.1 presents the management action overheads collected during two weeks of testing. Among the readings, the create-service and shutdown action times include the time to perform the action as well as the time to verify the successful completion of the action by pinging the service endpoint.

Action              Mean (ms)   Action count   95% Confidence Interval
Send E-Mail         520         137            [462, 578]
ShutDown            3697        57             [1637, 5756]
Create Service      6688        806            [6511, 6865]
User Interaction    1177        99             [677, 1677]

Table 6.1: Management Action Overheads

As shown by the table, all actions took around 5 seconds or less, which is well within acceptable bounds, and they are only a small portion of the overall recovery time. These data are presented as an aid to a prospective user in designing their management scenarios.

6.7 Discussion

This analysis started by designing a scalability benchmark for testing management frameworks, which includes Test-Services that simulate a real-world e-science workflow system replicated using the Shared Nothing architecture. Users can change the size of the test system and its behavior by changing parameters, and the benchmark also defines a few management scenarios that the management framework should support. In addition to Test-Services, the benchmark includes Test-Managers that mimic the workload generated by Hasthi managers and resources; Test-Managers do so by generating the same messages sent by Hasthi managers and resources to the coordinator, in the same order and with reasonably close timing. We use Test-Managers to test the coordinator to its limits without having to run a hundred thousand resources. Even though the benchmark is designed for Hasthi, we believe the idea is general enough that it can be used to test other systems; for example, we were able to use the benchmark to compare Hasthi with the framework presented by Gadgil et al. [60]. Our main result is that Hasthi can scale to manage 100,000 resources, which is one


of the principal results of this thesis. It is obvious from the results that the coordinator is the bottleneck; however, the bar is set high enough that Hasthi can handle most real-world systems. We observe that centralized and decentralized control-loops are two competing approaches in system management, where the first achieves user-defined control and simplicity at the expense of scalability. Another important contribution of this result is demonstrating that the upper limit of a centralized control-loop is in six figures, which is good enough for most real-world systems. Furthermore, we discussed probable bottlenecks in Hasthi and the architectural traits that may have contributed to mitigating those bottlenecks. From the recovery point of view, Hasthi can recover from coordinator failures in about a minute, and the recovery time decreases with more managers, which we believe is a very interesting result because it suggests that Hasthi's recovery does better with bigger systems. Furthermore, Hasthi has much freedom in terms of managers. For example, as seen from the experiments, the number of managers does not affect most results, which is understandable given that managers are handling a light load: a manager can handle about 8000 resources, while the coordinator handles at most 100,000. Therefore, to handle 100,000 resources, 20 managers with 5000 resources apiece should be enough. However, the coordinator can even handle 1000 managers, in which case each manager only has to handle 100 resources, which is much less than its 8000-resource limit. Therefore, in the other experiments we focused on the coordinator, because even if changes in operational conditions affect managers, we can mitigate them by increasing the number of managers. Furthermore, we observed that Hasthi stays stable with respect to changes in the management workload, epoch times, and rule complexity, maintaining a linear or better trend and staying well within acceptable overheads in each case. We believe this result is significant because with such stability, Hasthi will be useful to a wide variety of


usecases. Furthermore, these results provide guidelines for deciding which parameter values should be used in given situations and the associated tradeoffs. For example, some systems may need Hasthi to respond faster to failures, and our epoch time results provide insights into the tradeoffs of using a smaller epoch time. Moreover, we compared Hasthi with another management system, and in the final analysis, Hasthi did much better in terms of scalability; these results and observations also enabled us to identify a few interesting tradeoffs in system management design. Finally, we tested Hasthi applied to a large-scale e-science system, recovering it from both service and host failures. We performed the test by injecting failures into the system and observed that Hasthi fully recovers the managed system from service and host failures within about 2 minutes, which is a significant improvement compared to manual recovery that may take hours. In conclusion, this chapter presented a series of experiments designed to empirically assess Hasthi, and the results suggest that Hasthi can scale to 100,000 resources, has a recovery time of about 1 minute, and is stable against operational conditions such as the management workload, epoch times, and rule complexity.

7 Managing Systems Using Hasthi

Earlier chapters illustrated the Hasthi architecture, its correctness, and its scalability. Unlike most frameworks where users simply download, install, and use the framework (e.g. a Message Broker or a Web Service Container), using Hasthi to manage a system requires users (who are generally system designers or administrators) to also contribute analysis and integration work. We can think of Hasthi as a programming environment for system management, which provides a global view of the system to users and enables them to write management rules. Just as it is the responsibility of the programmer to figure out the logic of a Java program, users should identify the management usecases and logic for managing their systems with Hasthi. As Chapter 5 illustrated, Hasthi provides a self-stabilization guarantee, but like many self-stabilizing systems, it does not provide the safety property. In other words, Hasthi guarantees recovery, but does not guarantee the behavior of the managed system during recovery. Therefore, the integration process and the use of management logic to manage a system both give rise to many complexities; the following iceberg diagram (Figure 7.1) depicts some of them. For example, when a service has failed and been restarted, it may lose state, its address may change if it has moved to a new host, and some requests to the service during recovery may have failed. Furthermore, users are expected to author management rules that handle

Figure 7.1: Hidden Complexities of System Management

possible unexpected conditions in the system, which is a non-trivial task. Moreover, during recovery, the recovery actions themselves can fail, and the management framework should handle those failures gracefully.

The goals of this chapter are to describe the integration process, to address complexities associated with integration, to demonstrate the application domain of Hasthi, and to show that it is useful for managing real-life systems. The rest of the chapter is organized as follows.

The first section defines a system from the management point of view. The following section describes the process of managing systems using Hasthi. Due to failures, management actions, and other factors, a managed system undergoes changes during recovery. Section 7.3 illustrates these changes and discusses how they can be addressed, and Section 7.4 extends the discussion to identify the application domain of Hasthi and the guarantees required from different classes of systems. Section 7.5 illustrates potential pitfalls that can occur while managing systems, and finally, Section 7.6 wraps up the discussion.

7.1 Definitions

In this discussion, if the outcome of a request to a system or a service is affected by information stored in that system or service, we call that information "state," and say that the system or service has state. We define a system as a collection of components that communicate only via messages, and we represent it as a tuple (G, S, Msg, U, R).

1. U – a set of potential users.

2. S – a set of live services in the system.

3. Msg – a message substrate, which is a graph of users and services. A unique address is assigned to each node of the graph. Nodes may generate messages that are annotated with addresses. When a node (the source) sends a message to another node (the destination) in the substrate, the message is delivered only if a path exists from the source to the destination, and the delivery is non-deterministic and not instantaneous; that is, the message is delivered with some probability and after a non-deterministic time interval. Furthermore, if the target has failed, the message is discarded.


4. G – the global state of the system, which affects the outcomes of invocations at multiple services in the system. It is typically stored in storage such as a database or a shared file system (e.g. NFS, the Google File System [65]).

5. R – the set of active resources available to the system.

Furthermore, we define a service s ∈ S as a tuple (SD, uf, a, C, DA, r, Pm).

1. uf – an immutable user-defined function that processes messages. At any given time, uf accepts some messages and generates some messages. We assume that uf has a unique identity with which two services can be compared for identical functionality. For web services, this identity is the service port type name.

2. SD – the service-level state, which only affects the outcomes of invocations to this service.

3. a – the address of the service.

4. C – the configuration of the service, which decides how it behaves.

5. DA – services may depend on other services, and DA is the list of addresses of the services this service depends on. We call this list the "dependent service list". All dependent service lists in the system together decide the structure of the system.

6. r – the set of resources the service is assigned to.

7. Pm – the set of messages being processed in the service.

At any given time, any user of the system or any node may non-deterministically generate messages (which we call requests) and send them to a service; the service processes each message and generates zero or more messages targeted at users or other services, and those services may repeat the process. We call each message reception and processing an "invocation". Furthermore, a related set of requests is called a session, and a session has


its own state called "session state", denoted by SSi, where i corresponds to the ith session. The session state only affects the outcomes of service invocations in the same session. If a system is in a state such that at least some future valid requests will continue to fail, we say that the system has failed. On the other hand, if a system will process valid future requests from users successfully as long as none of its parts fail, we say it is healthy. Using instrumentation, remote authorities may monitor and measure properties such as load, operational state, and CPU utilization of a service, which provide an overall understanding of how a service or a resource behaves at a certain point in time. We call this information "monitoring information".
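To make these definitions concrete, the following is a minimal sketch of how the system and service tuples above might be rendered as plain Java data holders. All class and field names here are illustrative only; they are not part of the Hasthi implementation.

import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a system Sys = (G, S, Msg, U, R) and a service
// s = (SD, uf, a, C, DA, r, Pm) expressed as simple Java data holders.
class ManagedSystem {
    Object globalState;             // G   - shared state (e.g. a database)
    Set<Service> services;          // S   - live services in the system
    MessageSubstrate substrate;     // Msg - delivers messages between nodes
    Set<String> users;              // U   - potential users
    Set<String> resources;          // R   - active resources (e.g. hosts)
}

class Service {
    Object serviceState;            // SD - state local to this service
    String portTypeName;            // identity of the user-defined function uf
    String address;                 // a  - current endpoint address
    Map<String, String> config;     // C  - configuration parameters
    List<String> dependentServices; // DA - addresses of services this one uses
    Set<String> assignedResources;  // r  - resources assigned to the service
    List<Object> messagesInFlight;  // Pm - messages currently being processed
}

interface MessageSubstrate {
    // Delivery is non-deterministic: a message may be delayed or silently
    // dropped, and messages to failed targets are discarded.
    void send(String sourceAddress, String destinationAddress, Object message);
}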

7.2 Managing Systems

A management framework monitors and controls a system based on the management logic provided by users. To aid users in integrating with Hasthi, we have proposed the process described in Figure 7.2. As described in Chapter 1, the key to this process is an observation by Adams [28], which says that most error occurrences (80%) are caused by a few error types (20%), and that we can therefore address most error occurrences by addressing only these most common error types. In addition, as we shall discuss in detail in this chapter, the state model of a system affects system recovery. Here we use the term state model to denote the amount of information the different components of the system remember and the volatility of that information. The key question is: if a service in the system has failed and been restarted, how much state can it recover? Will it lose any useful state? Therefore, we argue that to manage a system with Hasthi, users should identify both the most common errors and the state model of the system, and they should use that information to identify common management usecases for the system. For example, in the LEAD system,

Figure 7.2: Methodology to Integrate Hasthi With a System

which is our primary usecase, we identified that service failures are a common error, and furthermore, we identified that LEAD services write all state to a database and hence do not lose useful state if restarted. Therefore, our management usecase restarts failed services to recover the system when services have failed and, subsequently, restarts workflows that failed due to those service failures. We shall discuss this usecase in Chapter 9. After identifying management scenarios, users should instrument resources in the system to expose the monitoring information required to implement those management scenarios, write rules to implement the management scenarios, deploy Hasthi with these rules, and manage the system. In Chapter 4, we discussed Hasthi agents, which integrate with existing resources and instrument them, and we shall see example rules in Chapter 9. It is likely that once Hasthi handles the most common errors automatically, the next set of common errors will surface and be brought to human attention, as they would then be the most


common errors that cause the system to fail. Then users can handle these newly identified errors. Consequently, human users can identify, implement, and improve management rules by following the above process several times, and each iteration would identify more errors and fully or partially automate recovery from them. In different systems, a wide variety of scenarios is possible; let us briefly look at those scenarios.
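Before turning to the broader scenario classes, the following is a hedged sketch of what the restart-failed-services logic identified above might look like if written as plain Java instead of the actual Hasthi rule syntax. The types, method names, and the raiseAlarm action below are illustrative stand-ins for the meta-model and actions, not the real Hasthi API.

import java.util.List;

// Illustrative sketch only; these types stand in for the Hasthi meta-model.
enum ServiceStatus { RUNNING, FAILED, UNRECOVERABLE }

interface ManagedService {
    String getName();
    ServiceStatus getStatus();
    boolean restart();              // corrective action: returns false on failure
}

interface GlobalView {
    List<ManagedService> getServices();
    void raiseAlarm(String message);
}

class RestartFailedServicesLogic {
    // Evaluated periodically (every management epoch) against the global view.
    void evaluate(GlobalView view) {
        for (ManagedService service : view.getServices()) {
            if (service.getStatus() == ServiceStatus.FAILED) {
                // Restart the failed service; if that fails, escalate to a
                // human instead of retrying the same action forever.
                if (!service.restart()) {
                    view.raiseAlarm("Could not restart " + service.getName());
                }
            }
        }
    }
}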

7.2.1 Management Scenarios

Management scenarios describe usecases, which are identified means of responding to changes and failures in a managed system. Scenarios are implemented by connecting actions and monitoring information together using the management logic. The following are a few examples.

1. Fault tolerance – The management framework can detect failures using monitoring information and perform corrective actions. Among failure detection mechanisms are actively pinging services, missing heartbeats, suggestions from other services, monitoring for unexpected behaviors, and custom failure detectors. Among possible corrective scenarios are restarting faulty services, moving services to a new host if the residing host has failed, reverting to older versions, and rerunning failed sessions or requests. Furthermore, if a repair fails, human intervention can be sought by using user-interaction actions. For an example scenario, in a managed workflow system, the management framework can recover failed services, and after the system comes back to a healthy state, the framework can rerun failed workflows from their last known good states.

2. Load Balancing – The management framework can change the system according to changes in the load. For instance, to handle higher loads, the framework can


create new services and configure other services to use the new services. Alternatively, in response to reduced loads, the framework can shut down services. For example, a management framework can change the size of an application running in the Cloud by allocating and deallocating nodes based on the load.

3. Quality of Service (QoS) enforcement – The management framework can measure QoS and address QoS violations. Among possible remedies are allocating more resources, relocating services, and performing load balancing to meet QoS requirements. For example, the management framework can relocate services in a video conferencing system to ensure a guaranteed bandwidth.

4. Maintenance – The management framework can perform both proactive and reactive maintenance. For example, a system upgrade to a new version can be scheduled, which will checkpoint old state, upgrade the service to a new version, monitor the new service, and revert the service back to the older version if the upgraded system is deemed faulty.

5. Raise alarms – The management framework can raise alarms to direct administrators to potential problems. Usually, problems are diagnosed by detecting abnormal conditions in the system. For example, alarms can be raised in the case of a full hard disk, a hard drive generating too many seek errors, or a network dropping too many packets.

Even though these usecases are useful, the discussion so far has not addressed the effects of changes and recovery (e.g. lost state, lost messages). Failures, load, and other conditions change the system, and management actions further affect it. Therefore, after those changes, a system can continue to function only if it has been designed carefully to handle the effects of those changes. In the next section, we identify the effects of changes and discuss possible remedies.

7.3 Application Domain of Hasthi

Let a system be defined as Sys = (G, S, Msg, U, R), in which the set of services is S = {(SDi, ufi, ai, Ci, DAi, ri, Pmi) | i = 1..n}. Let us assume that Hasthi manages the system. Also, let us assume that the system went through some changes (e.g. a service failed) and recovery actions were performed.

7.3.1 Effects of Changes and Recovery

Due to the initial changes and the resulting recovery actions, the system changes. In the following discussion, given any property X, we use ∆X to denote that the information represented by the property X has changed. For instance, ∆S says that services have been added to or removed from the system. The following is a list of possible effects caused by changes that occur in the system during recovery.

1. Lost State (∆G, ∆SDi, ∆SSi) – global state, service state, or session state may be lost because services have been restarted, failed and recovered, or moved.

2. Lost and Failed Messages (∆Pmi) – while Hasthi recovers the system after a failure, the system will be in an unsafe state. Hence, the system may lose some messages, and some requests or sessions may fail.

3. Lost System Structure (∆ai and ∆DAi) – if a host has failed, Hasthi has to move the services running on that host to a different host, hence their addresses will change; consequently, links to those services from other services are no longer valid, thus breaking the system structure. Furthermore, the dependent service list of a service may be lost because the service has been restarted, failed and recovered, or moved, or the list may become outdated because some services in the list have been moved.


4. Lost Configurations or Resources (∆Ci) – the configuration of a service or the resources assigned to a service may be lost because the service has been restarted, failed and recovered, or moved.

To refer to these effects, we will use the term "effects-of-changes" henceforth. After a system has failed and been repaired, we expect the system to reach a functional (healthy) state at the end of the repair. However, due to the effects-of-changes listed above, the system may not reach a healthy state even though repair actions (e.g. restarts) are carried out. For example, even if a registry service in the system failed and Hasthi recovered it, other services may not be aware of the new registry and might fail while trying to communicate with the old registry. The primary goal of this chapter is to analyze how the aforementioned effects-of-changes can be handled. One option is to delegate handling the effects-of-changes to the management logic, but that would make the management logic complex. Furthermore, handling scenarios like recovering lost service state is impossible without support from the services. Let us explore the possibilities of handling effects-of-changes.

7.3.2 Architectural Solutions for Effects of Changes and Recovery

Some system architectures provide transparency to some effects-of-changes; the following are a few examples.

1. Location transparency – publish/subscribe architectures that communicate using either logical addresses (e.g. topic-based subscriptions) or content-based subscriptions are not affected by service address changes. Therefore, even when a service moves, it will continue to receive messages. Enterprise Service Buses and Message Queues are two other types of such architectures.


2. Self-healing architectures – when nodes fail, self-healing architectures like P2P systems recover automatically.

3. Automatic service and resource discovery – registries and resource brokers can catalog, find, and allocate resources and services; therefore, services in architectures with registries or resource brokers can discover other resources and services. Hence, they can find alternatives for failed services and resources. Moreover, broadcast and gossip-based approaches can also be used for resource discovery.

4. Fixed or dynamic configurations – many services have fixed configurations that are loaded from the file system, and some other services use a registry to locate dynamic configurations. In both cases, even after a restart, the configuration of those services is preserved or can be resurrected.

5. Stateless scopes – some systems are stateless in one or more scopes (e.g. global, session, and service) and, therefore, do not lose state in that scope even if they have failed and recovered.

6. Reliable messaging – some message substrates provide reliable, at-most-once delivery from the source to the destination. For instance, they save messages submitted to them, retry messages until they are delivered, and perform duplicate message detection. In this setting, the sender and the receiver do not need to be online at the same time. Furthermore, some service implementations write received messages to persistent storage. Hence, messages can be recovered even if a service has failed or moved.

If a system architecture is designed to provide transparency to all changes, all the aforementioned effects-of-changes are handled, and therefore, Hasthi can manage the system without complications. However, such systems are rare. Therefore, to support systems that are not designed with these features, Hasthi provides the following functionality.

7.3.3 Handling Effects-of-Changes with Hasthi

Hasthi provides a dependency-discovery operation, which enables services in the system to discover other services. As described in earlier chapters, Hasthi operates with a global view of the system and, therefore, knows about every service in the system and its corresponding runtime status. Hasthi exposes details of the services in the system via the discovery operation, which accepts a functional description of a service and returns the addresses of services that match that description. For example, if a service depends on a registry, the service may discover a registry service through the discovery operation by providing the registry's functional description as the key. However, Hasthi only provides discovery; if there is more than one service of the same type, those services must discover each other using the discovery operation and handle state dissemination between one another. Furthermore, the discovery operation provides a defense against lost configurations and lost resources caused by system changes. A dedicated configuration manager service or a resource broker can be used by other services to locate configurations or resources. In this setting, other services can use the discovery operation of Hasthi to locate the configuration manager or the resource broker in the system and then find the correct configurations or resources using it. Services can perform this process as needed, both at service startup and at runtime. However, Hasthi does not preserve messages, and unless the system is built on a message substrate that preserves messages, any message whose recipient is not active at delivery time is lost. Furthermore, unless services preserve them, any messages being processed at failed services are also lost. To recover lost messages, our recommended remedy is to rerun the affected requests or sessions from the last known best state. However, handling any side effects caused by requests is a responsibility of the system and is out of scope for Hasthi.
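The following sketch shows how a dependent service might use such a discovery operation to locate a registry, both at startup and after a call to a previously known registry address has failed. The interface, method names, and the port type string are hypothetical; they illustrate the idea of discovery by functional description, not the actual operation signature.

import java.util.List;

// Hypothetical client-side view of the dependency-discovery operation.
interface DiscoveryOperation {
    // Given a functional description (e.g. a port type name), return the
    // addresses of currently known services that match it.
    List<String> discover(String functionalDescription);
}

class RegistryClient {
    private final DiscoveryOperation discovery;
    private String registryAddress;

    RegistryClient(DiscoveryOperation discovery) {
        this.discovery = discovery;
    }

    // Look up (or re-look up) the registry when it is needed.
    String locateRegistry() {
        List<String> matches = discovery.discover("urn:example:RegistryPortType");
        if (matches.isEmpty()) {
            throw new IllegalStateException("No registry is currently available");
        }
        registryAddress = matches.get(0);   // pick any matching instance
        return registryAddress;
    }
}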


Furthermore, Hasthi does not directly support preserving state (global, session, or service). However, to support recovery, services can store their state in persistent storage, and in that case, Hasthi helps services locate their storage locations after they are recovered from a failure. To use this feature, a service must expose its storage location (e.g. a database or a file) as a management resource property of the service, and Hasthi relays this property as an argument to the service startup command when it recovers the service. However, Hasthi neither writes state to the storage nor recovers it. Therefore, saving enough information to the persistent storage to enable recovery, and recovering the service state from the storage, are responsibilities of each service. To summarize, Hasthi handles effects-of-changes except for lost/failed messages and lost state, and handling those two is delegated to the managed system, which will be our next topic of discussion.
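A minimal sketch of a service using this mechanism is shown below: the service advertises its state file as a management property and reloads the state when the location is handed back to it on restart. The property name, the command-line convention, and the file format are assumptions made for illustration, not the actual Hasthi conventions.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a service that keeps its state in a file, advertises
// the file's location as a management property, and reloads the state when
// that location is passed back on the startup command during recovery.
class StatefulService {
    private final Path stateFile;

    StatefulService(Path stateFile) {
        this.stateFile = stateFile;
    }

    // Exposed to the management framework as a resource property.
    Map<String, String> getManagementProperties() {
        Map<String, String> props = new HashMap<>();
        props.put("storageLocation", stateFile.toAbsolutePath().toString());
        return props;
    }

    // The service, not the framework, is responsible for writing enough state
    // to the storage and for reading it back after a restart.
    void saveState(byte[] state) throws IOException {
        Files.write(stateFile, state);
    }

    byte[] recoverState() throws IOException {
        return Files.exists(stateFile) ? Files.readAllBytes(stateFile) : new byte[0];
    }

    public static void main(String[] args) throws IOException {
        // Assumption: on recovery, the startup command receives the previously
        // exposed storage location as its first argument.
        Path location = Paths.get(args.length > 0 ? args[0] : "service-state.bin");
        StatefulService service = new StatefulService(location);
        byte[] previousState = service.recoverState();
        System.out.println("Recovered " + previousState.length + " bytes of state");
    }
}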

7.4 Application Domain of Hasthi and Required Guarantees

As explained in the former section, Hasthi does not handle lost/failed messages or the lost state of services. It does, however, handle other changes and assists in finding storage locations when a service has moved or recovered. Therefore, the subset of systems that can be managed by Hasthi (the application domain) consists of systems that can preserve sufficient amounts of state and tolerate request/message failures even when changes and partial failures occur in the system. For example, one subclass of such systems is described by the Recovery Oriented Computing initiative, which proposed rebooting as a tool for recovery (e.g. [90, 127]). So far, our definition of the application domain of Hasthi is a characteristic definition.


However, the rest of the chapter identifies systems belonging to the definition and, by providing examples, demonstrates that the application domain includes many real-life systems, thus arguing that Hasthi is useful for managing many real systems. This section is intended to discuss the question: how much information should a given system (or its services) preserve through failures and changes? However, since the question is complex enough to warrant an entire thesis by itself, we do not intend to answer it fully. Rather, we will provide some guidelines and examples. The answer to the above question is "it depends". For example, a stateless usecase that does not have any side effects outside the system and has inexpensive sessions does not need to preserve any state at all. Therefore, in this case, all failed invocations can be re-executed. However, the same system with expensive sessions may need to perform checkpointing to minimize losing work. On the other hand, systems with reversible side effects (e.g. buying a book from Amazon) may need to go through either all or none of the steps. Therefore, we argue that the answer to the question depends on the characteristics of the managed system.

7.4.1 Characteristics of a System

We have identified state, criticality, side effects, and the cost of a session as characteristics of a system that dictate how much information needs to be preserved by the system. Figure 7.3 illustrates these characteristics. Let us look at each characteristic. We categorize the state of a system under four classes.

1. Stateless – In a stateless system, when a user initiates a request, the results of the request only depend on that request.


Figure 7.3: Characteristics of a System

2. Session state – In a system with session-only state, when a user initiates a request, the results of the request depend only on the requests in that session. Typically, this state is soft, which means it will time out and be removed eventually.

3. Global state – In a system with global state, the results of a request depend on the current state of the system. For example, a banking system has an inherent state that a user expects to see when he comes back, and the results of his requests depend on the global state of the system. Based on user expectations, we have identified the following four classes.

• Read-only global state – requests from users do not change the global state (e.g. a static web site, Google Search).

• Loosely consistent global state – results depend on the global state; however,


clients expect only loose consistency guarantees, such as read-your-writes, from the results.

• Best-effort global state – results depend on the global state, but clients expect only a best-effort service.

• Consistent global state – results depend on the global state, and clients expect the system to preserve it.

As shown in Figure 7.3, we categorize criticality, which is defined by the severity of failure, as mission critical (failures endanger human lives), critical (failures cause financial loss), and best-effort. The cost of a session depends on the number of service invocations in the session as well as the I/O, CPU, and network costs of the service calls. For example, some usecases take between hours and days to run, manipulate gigabytes to terabytes of data, or perform hundreds to thousands of service invocations. We categorize all such sessions as expensive. Finally, a side effect represents changes to the world outside the system boundary that are directly caused by the system. A side effect is reversible if the changes can be undone (reverted) later, and irreversible if they cannot. For example, buying Skype credit has the external effect of deducting money from a bank account, which is reversible, whereas a missile control system has the external effect of firing a missile, which is not reversible.

7.4.2 Methods Used for Preserving State

The following are a few common methods used for preserving state across changes and failures. Elnozahy et al. [56] discuss the first two methods in detail.

1. Checkpoint-based recovery – This method records the state of the system (checkpoints) periodically, and if a failure occurs, the system is rolled back to one of the


checkpoints. To support this, each service must support periodic checkpoints of state and rollback operations.

2. Log-based Recovery – This method records all non-deterministic events in the system in persistent storage, and in the case of a failure, the lost state is recovered by replaying the non-deterministic events and redoing the execution from the last known best state. There are two variations: pessimistic logging and optimistic logging. The former writes the logs to persistent storage before continuing the operation, and the latter writes logs asynchronously. Pessimistic logging can recover the system state up to the most recent non-deterministic event; therefore, it is useful for usecases that have unrecoverable side effects. Log-based recovery depends on the piecewise deterministic assumption [56], which assumes the protocol can identify and log all non-deterministic events. The choice of non-deterministic events (e.g. is a context switch a non-deterministic event?) is a design decision, and the most commonly recorded type of non-deterministic event is message reception. For example, the Recovery Oriented Computing (ROC) initiative has built services that support undo [45] by recording all input messages, which is based on the same idea.

3. Durable State – This method guarantees that when an operation is completed, all changes have been written to persistent storage. This does not provide a state recovery guarantee; rather, it only guarantees that state has been saved before an operation finishes.

These methods preserve state. On the other hand, to recover lost/failed messages, in most cases a user, an intermediate service like a workflow engine, or a front-end service may restart failed requests from the last known best states to recover them. Furthermore, if no state has been preserved, requests might need to be started from scratch.
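As a minimal illustration of the checkpoint-based option, the sketch below periodically serializes a single component's state to disk and rolls back to the most recent checkpoint on restart. It is a toy, single-process example under assumed file names; it is not a consistent distributed checkpointing protocol.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Toy illustration of checkpoint-based recovery for one component: state is
// written to stable storage periodically, and recovery rolls back to the last
// checkpoint that was written.
class CheckpointingCounter {
    private final Path checkpointFile;
    private long counter;   // the "state" of this toy component

    CheckpointingCounter(Path checkpointFile) throws IOException {
        this.checkpointFile = checkpointFile;
        this.counter = restore();           // roll back to the last checkpoint
    }

    void doWork() throws IOException {
        counter++;                          // mutate the state
        if (counter % 100 == 0) {
            checkpoint();                   // checkpoint periodically
        }
    }

    private void checkpoint() throws IOException {
        Files.write(checkpointFile, Long.toString(counter).getBytes(StandardCharsets.UTF_8));
    }

    private long restore() throws IOException {
        if (!Files.exists(checkpointFile)) {
            return 0L;                      // no checkpoint yet: start fresh
        }
        String saved = new String(Files.readAllBytes(checkpointFile), StandardCharsets.UTF_8);
        return Long.parseLong(saved.trim());
    }

    public static void main(String[] args) throws IOException {
        CheckpointingCounter c = new CheckpointingCounter(Paths.get("counter.ckpt"));
        for (int i = 0; i < 1000; i++) {
            c.doWork();
        }
        // Any work done after the last checkpoint (up to 99 increments here)
        // is lost if the process fails before the next checkpoint is taken.
    }
}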

7.4.3 Required Guarantees from Systems

Figure 7.4: Outline of Hasthi Application Domain

Using criticality and side effects to classify systems, Figure 7.4 identifies which classes can be managed with Hasthi and makes recommendations on which state preservation methods are required in each case. In the figure, a checkmark indicates that Hasthi can manage the associated class of systems, and a cross indicates that it cannot. We will discuss these classes based on state and the cost of a session later. As depicted by the figure, handling mission critical usecases and critical yet irreversible usecases is out of the scope of Hasthi. On the other hand, to handle critical yet reversible applications, like banking and online transactions, the managed system should support distributed transactions, which undo any side effects of failed transactions; failed transactions can then be recovered by re-executing them. To handle the other three classes of systems, which are best-effort, one of the recovery methods (checkpoints, logging-based recovery, or durable state) is used. To elaborate, in the following discussion we will consider usecases with different levels of state and discuss their behaviors and required guarantees. For each case, we will discuss the effects of lost state, the effects of lost messages, and how to recover failed invocations. In general, unless otherwise specified, the following rules of thumb apply for each class of


systems. If a session is expensive and the system does not have side effects, checkpoints are used to avoid losing work. However, if there are side effects, logging-based recovery is used. In each case, to recover failed sessions, users can retry sessions from the last known best state. Furthermore, it may be possible to hide failures from users by placing a front-end between the user and the system and retrying failed requests transparently; otherwise, users have to retry explicitly.
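The front-end retry idea mentioned above can be sketched as follows: the front-end re-submits a failed request a bounded number of times, with a simple back-off, before surfacing the failure to the user. The class name and retry policy are illustrative assumptions.

import java.util.function.Supplier;

// Illustrative sketch of a front-end that hides transient back-end failures
// from the user by retrying the request transparently.
class RetryingFrontEnd {
    private final int maxAttempts;

    RetryingFrontEnd(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    <T> T submit(Supplier<T> request) {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.get();             // forward to the back end
            } catch (RuntimeException e) {
                lastFailure = e;                  // back end failed or is recovering
                sleepQuietly(1000L * attempt);    // back off before retrying
            }
        }
        // Only after all retries fail does the user see the failure.
        throw new RuntimeException("Request failed after " + maxAttempts + " attempts", lastFailure);
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }
}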

7.4.3.1 Stateless Systems

Some examples of stateless systems are job submission services, file transfer services, and processing services (e.g. converting images to a movie, creating a movie from a render script, and ray tracing), in which the output only depends on the request. To manage similar systems with Hasthi, the following guarantees are required. In these usecases, there is no state to be lost. Therefore, in systems that do not have external effects, failed executions can be re-executed. However, some usecases, like render farms (e.g. rendering animations for the movie Toy Story), many scientific applications, and data intensive applications, take between hours and days to complete. Therefore, in those cases, execution states should be checkpointed occasionally to avoid losing work. Furthermore, if operations have external effects on the outside world, logging-based recovery methods can be used, which make it possible to recover executions from the failed state. In this setting, lost state is handled and lost/failed messages can be handled by retrying or recovering requests; therefore, Hasthi can manage these systems.

7.4.3.2 Session-Only-State

Among services that have session-only state are scientific workflow systems, Internet messaging (e.g. Yahoo), Internet telephony (e.g. Skype), Internet TV, and two-player games. These systems may depend on account details, which are stored as global state. However, account details are highly static and only serve to log users into the system. Therefore, we have classified these systems under session-only state. Furthermore, some usecases that need session state can be implemented using stateless systems by requiring clients to include the required state in every message (e.g. HTTP cookies and WS-Addressing resource properties). These usecases are handled similarly to stateless systems. To manage other session-only-state systems with Hasthi, the following guarantees are needed. If systems do not have external effects, failed sessions can be recovered by re-executing them. However, if sessions are expensive, they should use checkpoints to avoid losing work. Furthermore, when there are no external effects, replication can also be used (e.g. Ling et al. [82]). On the other hand, systems should make their best effort to recover failed session state if they have external effects. A possible solution is recording all non-deterministic events so the lost state can be reconstructed using log-based recovery. In this setting, lost state is handled and lost/failed messages can be handled by retrying or recovering sessions; therefore, Hasthi can manage these systems.

7.4.3.3 Systems with Read-Only Global State

Even though these systems have a global state, users cannot edit the global state in any way; rather, users retrieve or ingest data from the system. Therefore, regardless of failed requests,


the global state stays consistent. Examples of such systems are static web sites, search engines, and news sites. Typically, these systems operate on a global state stored in persistent storage like a database or a file system (e.g. the Google File System [65]), where replicated copies of services operate on top of the persistent storage. Therefore, even when a service fails, requests can be processed using a different service instance. On the other hand, if sessions are expensive or have side effects, these systems need to save their session state. In either case, front-end servers sitting between the user and the system may be able to mask failures by retrying. Stream processing systems, like Google Alerts, Yahoo Pipes, feed aggregators, and complex event processing systems, can also be placed in this category because users almost never edit the global state of the server. In this setting, we view subscription details (e.g. a subscription to a Google alert) as session state. These systems can handle any session state using one of the methods explained before. Since global and session state are not lost, these systems do not lose any state, and any failed requests can be recovered by re-executing them. The above scenarios handle lost state and failed messages, and therefore, Hasthi can manage these systems.

7.4.3.4 Systems with Loosely Consistent Global State

There are many variations within this class; among the examples are the client-centric consistency models described in Terry et al. [118], delta-consistency models [112], and eventual consistency models [42]. Implementations of these consistency models are well-studied, and most of them are implemented using replicated storage. Let us briefly look at a few examples. Consider a user managing his Amazon account. Since user accounts are independent


from each other, the only required guarantee is that each user reads his own writes (read-your-writes), and this guarantee is enforced by forcing clients to talk to a sufficiently up-to-date replica server. Consider an email account or a shared calendar, where each user wants to see all his changes and monotonically increasing reads. For example, a user wants to see all sent emails in his sent folder, and he does not want an email he saw in the sent folder to have disappeared when he returns. Therefore, as described in Terry et al. [118], these systems need read-your-writes and monotonic-reads. Similarly, RSS aggregators and Internet classifieds (e.g. craigslist), in which the user needs to see an incremental set of reads, need a monotonic-reads guarantee. Consider a social network site like Facebook, and assume that user X has added a comment to user Y's page and user Z has commented on that comment. Z's comment must be seen after X's comment; therefore, this usecase needs a writes-follow-reads guarantee. For some systems, guarantees on the timeliness of data (delta-consistency) are important. Examples of these systems are audio/video applications, virtual environments like multiplayer games, systems like stock markets, and alarm propagation systems. To manage any of these or similar systems with Hasthi, they must provide the state guarantees associated with the corresponding weak consistency model, and failed requests should be restarted. If these requirements are met, lost state and lost/failed messages will not be an issue. Therefore, Hasthi can manage these systems.
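One simple way to realize the read-your-writes guarantee mentioned above is for a client session to remember the version of its last write and to read only from replicas that have applied at least that version. The sketch below illustrates the idea with hypothetical replica and session types and assumes a single, totally ordered write-version counter; real implementations, such as those surveyed by Terry et al. [118], are considerably more involved.

import java.util.List;

// Illustrative replica interface: each replica reports the latest write
// version it has applied.
interface Replica {
    long appliedVersion();
    String read(String key);
    long write(String key, String value);   // returns the version assigned to the write
}

// A client session that enforces read-your-writes: reads are routed only to
// replicas that are at least as up to date as the session's own last write.
class ReadYourWritesSession {
    private long lastWrittenVersion = 0;

    void write(Replica anyReplica, String key, String value) {
        lastWrittenVersion = Math.max(lastWrittenVersion, anyReplica.write(key, value));
    }

    String read(List<Replica> replicas, String key) {
        for (Replica replica : replicas) {
            if (replica.appliedVersion() >= lastWrittenVersion) {
                return replica.read(key);    // sufficiently up-to-date replica
            }
        }
        throw new IllegalStateException("No replica has caught up with this session's writes yet");
    }
}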

7.4.3.5 Systems with Best-Effort Global State

In systems that need best-effort global state, users edit the global state and results for user requests depend on the global state at that particular time. However, an approximation of the global state is sufficient. For example, history-based search results depend on the


history of the user. However, even if some history entries are missing, the results will not be affected significantly. Among examples of these systems are P2P file sharing, e-science data archives, and scientific workflow systems that support searching and processing of results. Typically, these systems are implemented with durable state, which guarantees that the state of all completed requests is saved, and failures are addressed by restarting failed operations from the last known best state. It is possible that, due to failures, some data are only half transferred, never added, or added twice, but occasional lapses like this are acceptable. Furthermore, it is possible to perform lazy repairs of the global state and resolve any inconsistencies asynchronously. For example, a scientific data system can either remove events generated by uncompleted workflow invocations asynchronously or exclude such uncompleted workflow invocations from searches. Another example is that a file sharing system can remove any broken links asynchronously. Typically, since these systems do not need strong guarantees, they can tolerate lost state and lost/failed messages, which are the only guarantees Hasthi needs. Therefore, Hasthi can manage these systems as they are.

7.4.3.6 Systems with Consistent Global State

These systems need a consistent global state; example systems include online auctions and complex distributed scientific computations. To preserve state across service failures and migrations, these systems can use checkpoint-based recovery or log-based recovery. For example, since distributed scientific computations usually do not have any side effects, they can roll back to the last checkpoint, and therefore, they use checkpoint-based recovery. On the other hand, since systems do


not need to roll back with pessimistic logging and only need limited rollbacks with optimistic logging, systems that have side effects use logging-based recovery. To handle lost/failed messages, either a reliable message substrate should be used, or clients should resend lost/failed messages and services should perform duplicate detection to guard against possible duplicate messages. The above guarantees handle lost/failed messages and lost state, which are the only guarantees Hasthi needs. Therefore, Hasthi can manage these systems if the above guarantees are provided by the managed system. In Chapter 8, we will return to this topic while discussing the possibility of using Hasthi to manage distributed computations.

Systems – Required Guarantee
Stateless Systems – If invocations are expensive, checkpoint them, or use logging-based recovery of the invocation state if the system has external effects.
Session State – If sessions are expensive, checkpoint them, or use logging-based recovery of the session state if the system has external effects.
Read-Only Global State – Preserving session state may be required.
Loose-Consistency-Based Global State – An implementation of the respective loose consistency model.
Best-Effort Global State – Usually durable state, and preserving session state may be required.
Sequential or Consistent Global State – Checkpointing or logging-based recovery.

Table 7.1: Summary of Guarantees Required by Different State Models

Table 7.1 summarizes the above discussion, depicting the state guarantees required to support usecases that have different levels of inherent state. Furthermore, Figure 7.4 depicts the different recovery methods that can be used based on the criticality and external effects of each usecase.


If a system provides the guarantees recommended by the above descriptions, Hasthi can manage that system according to user-defined management rules, and it will handle the effects-of-changes. To summarize, we have defined the application domain of Hasthi and made recommendations on how to design systems to be compatible with Hasthi.

7.5 Pitfalls and Complexities

In earlier chapters, we discussed the structure of management rules, the implementation of management actions, and how to compose management scenarios using these actions. This section discusses pitfalls typically associated with rule-based management logic and some remedies. The first pitfall is that Hasthi does not preserve the state of managed resources. It is the responsibility of the user to design the components of the system to preserve enough state so that the managed system as a whole will continue to function even when components have failed and recovered. Section 7.4 discussed the different classes of systems and the guarantees those systems should provide. The second pitfall is that management actions can fail, and consequently, rules may keep retrying an action, starting a never-ending loop (livelock). Hasthi addresses this problem using the resource lifecycle described in Chapter 3. To be specific, if a management action fails, Hasthi changes the associated resource state to "Unrecoverable," thus keeping the same action from being repeated over and over again. If users pass a callback object when executing management actions, Hasthi invokes the corresponding method in the callback to notify them about any action failures. Furthermore, as a possible remedy for failed management actions, we provide support for user interactions as a management action. As we will explain in Chapter 9 with the primary usecase, if this solution is used, the duty of fixing the error can be delegated to a human user.


The third pitfall is that in rule-based programming, rules are triggered based on the state of a system, and this model can easily lead to loops. Therefore, care must be taken while designing rules. To facilitate rule design, we have developed a simple test environment with which users can set up a model of a system, run management rules to evaluate the system model, and verify the management actions fired by the rules. This test environment can be used as a unit-testing environment for rule development, and we have used it for rule design in Hasthi. The fourth pitfall is that communication can fail. Since heartbeats are repeated periodically, the Hasthi meta-model is usually not affected by most transient errors. However, if a network partition occurs, two coordinators will be elected and two independent systems will emerge. If enough resources are available, each coordinator may be successful in building a healthy system in its partition. Furthermore, if the partitions merge, the two coordinators will merge, creating one system. However, management rules decide the behavior of resources in this process, and therefore, if the managed system is expected to go through network partitions, the rule designer should pay close attention to this process. Furthermore, if a transient communication failure lasts longer than the heartbeat timeout, Hasthi may perceive some services in the system as failed. Hasthi provides two ways to address this problem. The first is that when heartbeat messages are missing from a service, Hasthi uses a failure detector to verify the health of the service. Users can plug in a custom failure detector via configuration; therefore, even though the default failure detector pings the service for health, users who have a better algorithm for failure detection may override it. The second is that the rule designer can handle the problem of false positives using rules. In a normal setting, if a registry is perceived as failed, a new registry may be created, and when the transient communication failure is gone, the system may have two registries. By adding a rule to create a registry if there is none and another rule to shut down surplus registries if there is more than one, the designer can anticipate both conditions in the rules.
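That pair of rules can be sketched as follows; the types and action names are illustrative stand-ins rather than the actual Hasthi rule syntax, and the choice of which surplus registry to keep is an arbitrary assumption.

import java.util.List;

// Illustrative sketch of two self-correcting rules that together keep exactly
// one registry running, even across false failure detections and merged
// network partitions. Types and actions are hypothetical.
interface Registry {
    void shutdown();
}

interface SystemView {
    List<Registry> findRegistries();   // registries currently in the global view
    void createRegistry();             // corrective action: start a new registry
}

class RegistryCountRules {
    // Evaluated on every management epoch.
    void evaluate(SystemView view) {
        List<Registry> registries = view.findRegistries();
        if (registries.isEmpty()) {
            // Rule 1: no registry in the system, so create one.
            view.createRegistry();
        } else {
            // Rule 2: more than one registry (e.g. after a partition heals),
            // so shut down the surplus instances and keep the first.
            for (int i = 1; i < registries.size(); i++) {
                registries.get(i).shutdown();
            }
        }
    }
}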


On the other hand, if rules become too complex, it is possible that the services in the system are loosely defined, and it may be possible to simplify the rules by delegating some complexities to the services. For example, let us assume a service depends on a registry. If no registry is found at startup, the service has two options: it can decide for itself that it has failed and exit, or it can wait and periodically check for a registry. The former method needs services to be started up in order, whereas the latter handles dependency order transparently, thereby simplifying the related rules. By building safe and agile services, some of the complexities of managing services can be avoided. For instance, based on his experience of managing Microsoft online services, James Hamilton, the architect of the Microsoft Data Center Futures team, has made many recommendations for building safe services that are simple to manage [69].
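The second option, waiting for the dependency instead of failing, might look like the sketch below. The nested Discovery interface and the polling policy are illustrative assumptions.

// Illustrative sketch of a service that, instead of exiting when its registry
// dependency is missing at startup, waits and periodically checks for one.
class PatientService {

    interface Discovery {
        // Returns the registry address, or null if no registry is known yet.
        String findRegistry();
    }

    private final Discovery discovery;

    PatientService(Discovery discovery) {
        this.discovery = discovery;
    }

    String waitForRegistry(long pollIntervalMillis, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            String registry = discovery.findRegistry();
            if (registry != null) {
                return registry;              // dependency is available: proceed
            }
            Thread.sleep(pollIntervalMillis); // wait and check again later
        }
        // Only give up after a generous timeout; this keeps startup ordering
        // concerns out of the management rules.
        throw new IllegalStateException("No registry appeared within the timeout");
    }
}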

7.6 Summary

We defined a "system" from the management point of view, identified possible management actions, and discussed management scenarios. These scenarios are implemented with management logic, which can be formulated in many ways; Hasthi uses rules to implement management scenarios. We also observed that failures, other changes, and the resulting management actions change a system during recovery, and that handling all these effects-of-changes (e.g. lost state, service endpoint changes, and lost/failed messages) using only management logic is sometimes impossible and at other times leads to complex management logic. However, to ensure continuous operation, these effects should be handled. In this setting, we argued that managed systems should also aid in handling these changes. We enumerated the possible effects-of-changes that could occur in a system during recovery and identified different architectural properties (e.g. reliable communication, location


transparency) that can negate some of those effects. However, for the benefit of systems that do not have all the aforementioned architectural properties, we explored how Hasthi handles these changes and concluded that it handles all changes except for lost/failed messages and lost state in services. Regarding state, if a service exposes its storage location, Hasthi passes this location as an argument when it restarts or migrates the service in order to recover it. Therefore, if an appropriate state recovery method has been put in place, the recovered service can recover its state from the storage. To be managed by Hasthi, then, managed systems should preserve the required service state across failures and tolerate lost/failed messages. We categorized systems and provided recommendations on the guarantees required by each class of systems to handle lost state and messages. Finally, we concluded the discussion by listing a few common pitfalls and complexities.

8 Managing Distributed Computations

This chapter discusses the limits of Hasthi in managing distributed computations. We examine different models of distributed and parallel computing, and for each, we discuss the benefits Hasthi brings and the limits of what it can accomplish. A distributed computation is usually one task distributed across many processes, which interact to carry out the task, and the computation has a well-defined start, end, and correctness criteria. Unlike other systems, if the tasks are expensive, which is often the case with large e-science applications, it is common to set up a system just to perform a single distributed computation. Parallel computing applications (e.g. MPI-based [18]) and Map-Reduce-based [50] computations are examples of such distributed computations.

8.1 Challenges

Often, distributed computations span hundreds or thousands of machines and run for days. Therefore, setting up, monitoring, and handling failures of complex, distributed, and long-running computations pose significant challenges. Given that certain conditions are met (which we shall discuss shortly), users can use Hasthi together with user-defined


management logic to initialize a computation, to monitor services, to recover failed services, and to recover failed computations. As described in earlier chapters, when deployed to manage such a computation, Hasthi monitors the system and periodically evaluates user-defined management logic (rules) that depends on a global view of the managed system. However, Hasthi is useful for managing distributed computations only if it can recover both the services in the computation and the computation itself after a failure. As described in Chapter 7, handling the effects of changes, either through management usecases or within the managed system implementation, is very important. Let us look at this in more detail. Using Hasthi to manage a system involves identifying errors that can happen in the system, identifying remedies (corrective actions), and writing rules to trigger corrective actions when Hasthi detects those errors. However, as we demonstrated in Chapter 5, Hasthi only provides a self-stabilization guarantee, which assures recovery but does not assure safety. In other words, Hasthi does recover following a failure but does not guarantee the behavior of the system while it recovers it. However, distributed computations are highly stateful and have complex interactions, and therefore, even if one message is lost or one process has failed and lost its state, the distributed computation could fail. In this setting, recovering the system with Hasthi is useful only if Hasthi can recover the underlying computation as well. Although it is possible to rerun the computation from the start, these computations may take days to run and consume a vast amount of resources; therefore, rerunning them from scratch is wasteful at best. Hence, managing such computations is useful only if Hasthi can recover the computation, and this is possible only if the application preserves a sufficient amount of state to restart the computation. Therefore, the ability to recover computations in case of a failure is a main factor in using Hasthi to manage distributed computations. In the next section, we will provide a generic solution to this problem using generic checkpointing approaches, and the following section discusses the possibility of incorporating the application behavior (e.g.,


Map-Reduce based applications) into our analysis.

8.2 Generic Solution

The process of recovering a system to an earlier state is called "rollback recovery". As explained in a detailed survey by Elnozahy et al. [56], systems achieve rollback recovery using either checkpoints or log-based recovery. The checkpoint-based approach periodically writes the state to stable storage, and log-based recovery writes all non-deterministic events to stable storage; both are capable of restoring a system to an earlier state, one the system had some time before it failed. Log-based recovery is more expensive than checkpoints but allows users to recover most of the processing up to the failure and, therefore, is useful when the system has side effects on the outside world. Since most distributed computations do not have side effects, they can roll back to the last known good state, and consequently, these computations often use checkpoints instead of log-based recovery. If a managed distributed computation has checkpoints, then when a failure happens in the system, user-defined management logic in Hasthi can recover the failed services and use the checkpoints to recover the computation. In summary, to manage a distributed computation using Hasthi, the computation should satisfy the following three conditions.

1. Each process in the computation should have a Hasthi agent integrated, so that Hasthi can monitor and control the process. Furthermore, as management actions, each process must support start, stop, remote configuration, and resetting its state to an earlier state using a checkpoint.

2. It should be possible to rebuild the lost system structure following a failure. For example, when a failure happens, Hasthi may have to move processes (e.g. because a host has failed), and then the system structure will break. (Among the solutions are (a) using a


communication medium like a publish/subscribe system that provides location transparency, thus avoiding the problem, (b) changing the services to use the Hasthi dependency-discovery operation to discover other services in the system, and (c) having the management logic explicitly notify other services about the new service location.)

3. Either the application should have some way to recover the failed computation (e.g. run only the failed parts of the computation and incorporate them into the final results, as can be done in the Map-Reduce case), or the application should periodically write enough state to stable storage to enable recovery of the computation (in other words, create a consistent checkpoint). If checkpoints are used, each process in the computation must expose the storage location as a management resource property of the resource (user-defined management logic in Hasthi will use it to locate the checkpoints for recovery).

Let us discuss how Hasthi can recover the computation if the above conditions are met. Hasthi monitors the system, captures all the management properties exposed by resources in the meta-model, and keeps it up to date. To manage the computation, users should write rules (management logic) to start, monitor, and control the computation utilizing the global view, and Hasthi periodically evaluates the system using these rules and carries out corrective actions. Rules have an init block, which can be used to initialize the computation, and other rules should follow the format described in Chapter 3. In this setting, the user-defined management logic has access to all the storage locations exposed by the services (processes) according to Condition 3. Hence, if the computation has failed, the management logic can access the checkpoints (from the storage locations) and use those checkpoints to restart the computation by rolling back the state of the services to a healthy state. Since users write the recovery logic within management rules, they may use an algorithm of their choice to identify a consistent snapshot from the checkpoints and reset all processes to that state. On the other hand, if the distributed computation cannot meet at least one of the above conditions,


Hasthi cannot manage it. Let us discuss how a distributed computation can meet the above three conditions. To integrate a Hasthi agent with a process (Condition 1), users can use one of the agents described in Chapter 4. For example, users can integrate the Hasthi agent with Map-Reduce processes, and in theory, it is possible to integrate Hasthi agents into the processes of most parallel frameworks (e.g. MPI [18]), thus enabling Hasthi to monitor applications implemented using that framework. Furthermore, to support Condition 2, users can use one of the three methods described with the condition. Moreover, to support checkpoints (Condition 3), one solution is to perform checkpointing generically, without assuming anything about the behavior of the application. For example, LAM/MPI [107] provides checkpoints for MPI applications. Treaster [119] and Maloney [85] discuss the use of rollback recovery in general and the generic checkpointing mechanisms available for distributed computations. As discussed earlier, it is possible to use Hasthi to initialize, monitor, and control an application if it supports the above three conditions. However, we acknowledge that distributed computations built with generic and established tools (e.g. MPI) are unlikely to adopt Hasthi to steer and manage computations. Some other applications, such as Map-Reduce and search, however, can greatly benefit from Hasthi. We shall discuss such usecases in the next section. To summarize, our problem was that even though Hasthi recovers the infrastructure of a distributed computation from failures, it does not guarantee the behavior of the managed computation. Therefore, as a generic solution to this problem, we propose using a generic checkpointing approach in the computation and using Hasthi rules to detect failures and recover the computation by rolling it back to an immediate checkpoint. Another important aspect of recovery is cascading failures, where one error may cause another, which causes another, and so on, thus potentially bringing down the system. To recover from cascading failures, Hasthi depends on its self-stabilizing properties. In other


words, Hasthi does not guarantee the behavior of the system while failures occur, but it guarantees that once all failures are fixed (e.g. a failed network or file system), the system will return to a healthy state. Therefore, even after cascading failures, if at least a few managers and enough resources are left standing, the management rules will take effect and recover the system once errors have stopped occurring.
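As a rough illustration of the recovery flow described above, the sketch below shows the kind of Java helper a recovery rule's then-clause could delegate to: it reads the checkpoint properties of every process from the global view, picks an iteration for which all processes hold a checkpoint, and rolls the processes back to it. The ManagedProcess interface, the property names, and the restart action are assumptions of this sketch, not part of Hasthi's actual API.

import java.util.List;
import java.util.Map;

// Hypothetical action helper that a recovery rule's then-clause might invoke. It assumes
// each process exposes "checkpointDirectory" and "checkpointIteration" as management
// properties (Condition 3) and that processes retain checkpoints from earlier iterations.
public class ComputationRecoveryHelper {

    public interface ManagedProcess {
        Map<String, String> properties();                   // snapshot taken from the meta-model
        void restartFromCheckpoint(String checkpointFile);  // management action (assumed)
    }

    // Roll every process back to the latest iteration for which all processes hold a
    // checkpoint; under the assumptions above, that iteration is a consistent global snapshot.
    public static void rollbackToConsistentCheckpoint(List<ManagedProcess> processes) {
        long consistentIteration = Long.MAX_VALUE;
        for (ManagedProcess p : processes) {
            long latest = Long.parseLong(
                    p.properties().getOrDefault("checkpointIteration", "-1"));
            consistentIteration = Math.min(consistentIteration, latest);
        }
        if (processes.isEmpty() || consistentIteration < 0) {
            return; // some process has no checkpoint; a generic checkpointing algorithm is needed
        }
        for (ManagedProcess p : processes) {
            String dir = p.properties().get("checkpointDirectory");
            p.restartFromCheckpoint(dir + "/checkpoint-" + consistentIteration + ".bin");
        }
    }
}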

8.3 Utilizing Application Behavior

However, the overhead of these generic checkpointing mechanisms could be prohibitive, which limits their adoption. It is possible to do better by using specific knowledge about distributed computations to provide application-specific recovery and checkpointing. For example, most parallel applications go through steps (or phases); they can save the results (state) of each phase before proceeding to the next, and Hasthi can then use those checkpoints to recover failed applications. Moreover, some applications have independent parts, which can be re-executed independently in case of a failure and merged with the other results (e.g. Map-Reduce [50]).

To explore this idea, let us discuss the 13 patterns for parallel applications (the "13 dwarfs") defined by Asanovic et al. [33] and the possibilities of managing these applications using Hasthi. Based on recovery characteristics, we have identified four categories of applications, and the following tables present each of the 13 patterns, the amount of state each class needs to save, and the possibilities of managing those applications using Hasthi. For the discussion, we will use the following communication patterns. We assume that the application is executed by n processes labeled 0, 1, 2, ..., n-1. Chapter 3 of Kumar et al. [81] (pages 65-116) describes these communication patterns in detail.

1. One-to-One pattern: sends a message from one process to another process; a "circular q-shift" sends data from every node i to node (i + q) mod n.


2. Broadcast operations: "one-to-all broadcast" sends data from one process to all processes, "all-to-all broadcast" performs a one-to-all broadcast from every node, and their personalized versions (e.g. personalized one-to-all and personalized all-to-all) send a different message to each process while performing the broadcast.

3. Accumulation operations: "all-to-one" (single-node accumulation) collects data from all nodes at one node, and "all-node accumulation" repeats all-to-one from every node.

Let us now look at the 13 patterns and discuss the possibilities of using Hasthi to manage each of them.

8.3.1 Tightly Coupled Applications

The following applications have complex communication patterns, and therefore, to recover them, the applications must fall back to a consistent checkpoint. Furthermore, they do not provide clear boundaries (e.g. iterations) that are useful for checkpoints, and hence they need a generic checkpointing algorithm. These are classical parallel applications, and even though Hasthi can manage them in theory (if they support checkpoints), it is unlikely that there will be motivation to move these applications to Hasthi from current well-established approaches like MPI (Message Passing Interface) [18], or to integrate them with Hasthi by instrumenting their processes. This category has two classes of applications.

1. Dense Linear Algebra: Examples of operations are matrix transpose, matrix-vector multiplication, matrix-matrix multiplication, and Gaussian elimination. Parallel implementations typically distribute columns, rows, or blocks across processes, and processes send local data and results to other processes as required by the algorithm. Several implementations exist, and they depend on one or more of the following communication patterns: multiple broadcasts, all-to-all broadcasts (everybody broadcasts),


and circular shifts. Hence, these algorithms are communication intensive. Kumar et al. [81] (pages 98-101) discuss them in detail.

2. Sparse Linear Algebra: These algorithms perform the same tasks as dense linear algebra while exploiting the sparse nature of the matrices. Even though they may reduce the number of broadcasts, these algorithms still depend on multiple broadcasts, all-to-all broadcasts (everybody broadcasts), or circular shifts (Kumar et al. [81], Chapter 11, pages 407-489).

Application Pattern: Dense Linear Algebra; Sparse Linear Algebra
Required Level of State to be Preserved: We were unable to identify any generic checkpointing boundaries. Hence, these applications need a generic checkpointing algorithm.
Application of Hasthi: Hasthi can initiate the application and manage it. User-defined rules use the global view of the system to detect failures, recover the failed processes, and then recover the application using checkpoints (checkpoint locations are also a part of the global view). Hasthi in this setting depends on the existence of checkpointing support.

Table 8.1: Tightly Coupled Applications

8.3.2 Iterative Applications

The following applications provide logical points for checkpoints within the application itself. For example, most of them go through iterations. If each process takes an independent checkpoint before starting an iteration, the checkpoints for the same iteration together form a consistent global state of the system, which Hasthi can use to recover the computation. Since


processes typically synchronize with each other using a barrier, it should not be hard to find places to perform checkpoints. Furthermore, it may be too expensive to checkpoint every iteration, but each process can checkpoint once every fixed number of iterations (e.g. once every 10 iterations). Users may decide the number of iterations between checkpoints based on the cost of each iteration.

Application Pattern: Spectral Methods
Required Level of State to be Preserved: The Binary Exchange algorithm goes through steps, and an application can use these steps as boundaries for checkpoints. The Transpose algorithm needs all-to-all personalized communication only in the middle step, and at the beginning and the end each process computes independently; it may be possible to use this property to perform checkpoints.

Application Pattern: N-Body Problem
Required Level of State to be Preserved: Iterations provide a logical place to perform checkpoints. Moreover, if required, the application may utilize the hierarchy to provide checkpoints between iterations.

Application Pattern: Structured Grid; Unstructured Grids
Required Level of State to be Preserved: Iterations provide a logical place to perform checkpoints.

Application Pattern: Dynamic Programming
Required Level of State to be Preserved: Iterations (calculating each row) may provide logical points for checkpointing.

Application of Hasthi (all patterns above): Hasthi can initiate the application and manage it. User-defined rules use the global view of the system to detect failures, recover the failed processes, and then recover the application by falling back to an iteration (checkpoint locations are a part of the global view).

Table 8.2: Iterative Applications


This category has the following main classes of applications.

1. Spectral Methods: FFT (Fast Fourier Transform) is the main application in this class. Among the solutions, the Binary Exchange algorithm depends on a complex communication pattern where each process talks to the processes whose labels differ from its own by one bit (e.g. processes 5 and 7 talk to each other because their labels, 101 and 111, differ by only one bit). The second approach, the Transpose algorithm, depends on all-to-all personalized communication (Kumar et al. [81], Chapter 10, pages 377-406).

2. N-Body Problem: The N-body problem involves calculating interactions between a set of particles (e.g. modeling planetary systems). The typical algorithm builds a hierarchy by recursively identifying independent clusters of particles and representing each cluster by a single particle at its center of gravity (e.g. the Barnes-Hut or Fast Multipole algorithm [40]). The parallel algorithm distributes different parts of this hierarchy across processes, and the computation typically goes through iterations. One difficulty is that after interactions the placement of particles may have changed, so the hierarchy needs to be updated.

3. Structured Grid: This method places data at grid points and re-computes them step by step. At each step, each grid point calculates its value using its neighborhood. The parallel implementation distributes the grid across processes; each process has to communicate with its immediate neighbors to calculate the boundary values assigned to it, and, moreover, processes communicate to synchronize operations.

4. Unstructured Grids: This approach is similar to structured grids except that the algorithm represents the unstructured grid as a graph, partitions it into sub-graphs, and assigns them to processes.

5. Dynamic Programming: Applications that use dynamic programming cast the underlying problem as a recursive equation that depends on sub-problems of the main


problem. The algorithm remembers the results of the sub-problems and uses them to optimize the computation. Usually, the problem has different levels; we say a problem is serial if each level depends only on the level before, and otherwise we call the problem non-serial. As described in Chapter 9 of Kumar et al. [81] (pages 356-376), the parallel implementation calculates a table where each entry holds the solution to a sub-problem that is part of the recursive definition of the main problem. The algorithm assigns each column of the table to a process and proceeds in iterations, where each iteration calculates a row of the table. For serial problems, each process only needs to remember one old row; for non-serial problems, processes need to remember more rows. Communication patterns across processes are complex and often need all-to-all broadcasts.

Table 8.2 shows the possibilities of managing these applications with Hasthi.
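The following is a minimal sketch, with assumed names and storage paths, of how a process in an iterative application could take independent checkpoints at iteration boundaries; because every process tags its checkpoint with the iteration number, the checkpoints for the same iteration form the consistent snapshot discussed above.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical iterative worker: the computation and communication steps are placeholders,
// and the checkpoint interval and storage path are assumptions of this example.
public class IterativeWorker {

    private static final int CHECKPOINT_INTERVAL = 10;  // tune to the cost of one iteration

    private double[] localState = new double[1024];      // this process's share of the grid/table
    private final Path stableStorage = Paths.get("/stable/storage/rank-0");

    public void run(int totalIterations) throws IOException, InterruptedException {
        for (int iteration = 1; iteration <= totalIterations; iteration++) {
            computeIteration();        // local computation for this step
            exchangeBoundaries();      // communication with neighbours (e.g. via MPI)
            barrier();                 // all processes reach the iteration boundary
            if (iteration % CHECKPOINT_INTERVAL == 0) {
                Files.createDirectories(stableStorage);
                Files.write(stableStorage.resolve("iter-" + iteration + ".ckpt"),
                            serialize(localState));
            }
        }
    }

    private void computeIteration() { /* application-specific update of localState */ }
    private void exchangeBoundaries() { /* application-specific communication */ }
    private void barrier() throws InterruptedException { /* synchronization point */ }

    private byte[] serialize(double[] state) {
        java.nio.ByteBuffer buf = java.nio.ByteBuffer.allocate(state.length * Double.BYTES);
        for (double d : state) buf.putDouble(d);
        return buf.array();
    }
}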

8.3.3 Applications with Limited State

These applications have a compact and identifiable state and often consist of processes that are loosely coupled. For example, applications based on finite state models can capture enough state by just remembering the current state and the inputs left to be processed. Furthermore, applications such as search with backtrack, branch, and bound (e.g. solving the Rubik's Cube, searching a large number of photographs for the best pattern, or mining the web for a pattern) only need to remember limited state, such as which process handles which part of the search space, which parts are completed, and the current results. Interactions between processes in most algorithms of this class are for optimization; therefore, correctness is not affected even if a process fails and restarts. To support recovery with these types of applications, processes can write this state to stable storage and update it periodically or as values change. If a computation has failed, Hasthi can restart the process


while passing the failed process's storage location (which contains its state) as an argument, and the restarted process can resume the execution using that saved state. It is worth noting that we have also placed the LEAD usecase, which we will discuss in Chapter 9, within this class to provide a contrast between continuously running systems and distributed computations. This category has three classes of applications.

1. Backtrack, branch and bound: This class of problems searches a large search space for a solution. If the problem can be cast as a tree search, it belongs in this class; otherwise, it belongs in the graph traversal class. The two main approaches are Depth-First Search (DFS) and Best-First Search (BFS). In DFS, processes partition the search space among themselves, and when a process runs out of work, it requests work from other processes. In the DFS branch-and-bound version, when a process finds a better solution than the current best, it broadcasts the solution to all processes, so they all learn the current best solution. In BFS, a heuristic decides the exploration order; each process has its own open list, and processes share their open lists with each other. Chapter 8 of Kumar et al. [81] (pages 299-353) describes both DFS and BFS in detail.

2. Graph traversal: Graph traversal-based solutions are similar to backtrack, branch and bound solutions. However, in this case, the search space is a graph instead of a tree and therefore needs duplicate detection. To support duplicate detection, processes use a hash function that maps each node to a process. When a process finds a node, it locates the associated process using the hash function, contacts that process to check whether the node has been encountered before, and adds the node to the encountered-node list of the associated process. Chapter 8 of Kumar et al. [81] (pages 335-336) describes this solution.


Application Pattern: Backtrack, branch and bound
Required Level of State to be Preserved: In these applications, processes talk to each other to optimize and load-balance the computation. Hence, even if processes lose messages, correctness is not affected. Therefore, if a process fails and recovers within a few minutes, it can still resume the computation and yield correct results. Each process can independently save the state needed to recreate its assigned parts of the search. This state is usually compact; for instance, it includes the search spaces assigned to the process, the current best solution, and open node lists.
Application of Hasthi: Hasthi can initiate the search, monitor processes, and independently recover any process that fails. Each process can write its state to a file (or other stable storage), and Hasthi will pass this location to the new process if the old process has failed and been recovered. Furthermore, in some usecases approximations suffice; in those cases, results may be useful even if some process has failed and a part of the search space has not been explored. There, user-defined management logic may recover the failed processes only when more than a given threshold of processes has failed.

Application Pattern: Graph traversal; Finite State Automata
Required Level of State to be Preserved: The current execution state of these applications consists of the current state in the state model and the inputs to be received. To support recovery, an application can periodically checkpoint the execution state.
Application of Hasthi: Hasthi can initiate the system and monitor processes. User-defined management logic can detect failures, recover the system, and recover the computation using the saved state.

Application Pattern: LEAD (SOA-based weather processing); presented as a contrast between distributed computations and continuously running systems
Required Level of State to be Preserved: In these systems, state is twofold: the global state of the system and session states (e.g. workflow execution state). LEAD services are either stateless or write all state to a database. If they have failed and restarted, they do not lose any useful state. To recover session state, Hasthi either reruns workflows from scratch or saves the workflow execution state and recovers the workflow using that state.
Application of Hasthi: Hasthi initiates the system and monitors it. User-defined management logic recovers any failed services and restarts workflows that have failed due to those failures.

Table 8.3: Applications with Limited State


3. Finite State Automata: According to Asanovic et al. [33], a general way to parallelize this type of application is yet to be discovered.

Table 8.3 shows the possibilities of managing these applications with Hasthi. As an example, consider an investigation where investigators have only an incomplete photograph and a description of a suspect. They know that he passed through one of many airports, and they have decided to find the 100 individuals who are closest to the description and investigate each manually. Assume that they use a parallel application to find those 100 individuals from surveillance videos. Each process searches different parts of the video archives, and if it finds a match, it goes through different shots from the video to verify it. Each process keeps track of the best 100 matches, and when a new match is found, it broadcasts the match to everyone, which the other processes use to discard inferior matches. Assume further that each process periodically writes the names of assigned but unprocessed videos and the best matches so far to stable storage and exposes the storage location as a management property. As described in Table 8.3, Hasthi can monitor this system, and if a process fails, it can restart the process, passing the stable storage location of the failed process as an argument. From that storage, the new process can find the videos still to be processed and the current best matches, and hence resume the execution.
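The sketch below illustrates, with a hypothetical file layout, the compact state such a search process could persist: the videos still assigned to it and its current best matches. A restarted process receives the same storage path as an argument and reloads this state before continuing; the class and file format are inventions of this example.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class SearchProcessState {

    private final Path stateFile;
    private final Deque<String> pendingVideos = new ArrayDeque<>();
    private final List<String> bestMatches = new ArrayList<>(); // at most 100 entries

    public SearchProcessState(String storageLocation) {
        this.stateFile = Paths.get(storageLocation);
    }

    // Called periodically (e.g. after each processed video) to persist the compact state.
    public void save() throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add("PENDING " + String.join(",", pendingVideos));
        lines.add("MATCHES " + String.join(",", bestMatches));
        Files.write(stateFile, lines);
    }

    // Called on startup; if the file exists, the process resumes where the failed one stopped.
    public void loadIfPresent() throws IOException {
        if (!Files.exists(stateFile)) return;
        for (String line : Files.readAllLines(stateFile)) {
            if (line.startsWith("PENDING ")) {
                String rest = line.substring(8).trim();
                if (!rest.isEmpty()) for (String v : rest.split(",")) pendingVideos.add(v);
            } else if (line.startsWith("MATCHES ")) {
                String rest = line.substring(8).trim();
                if (!rest.isEmpty()) for (String m : rest.split(",")) bestMatches.add(m);
            }
        }
    }
}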

8.3.4 Loosely Coupled Applications

Applications in this class are embarrassingly parallel or highly parallel (they have limited coordination between processes, e.g. Map-Reduce). Often, Hasthi can recover these computations by rerunning failed subtasks independently and integrating their results with the results from the other completed subtasks. This category has three classes of applications.


Application Pattern: Map-Reduce
Required Level of State to be Preserved: Since the Map phase is executed in parallel and processes in this phase are independent, if a process in the Map phase fails, user-defined management logic can rerun that process and incorporate its results into the Reduce phase. However, if the Reduce process fails, Hasthi may have to rerun the Reduce phase using the results from the Map phase. If required, both Map and Reduce phases can save their results and partial work. For example, Apache Hadoop's [5] Map-Reduce implementation works against a file system and therefore preserves partial work by default.
Application of Hasthi: Hasthi can initiate the computation and monitor processes, and user-defined management logic can detect failures and recover any process that fails. Each process can use a storage location to save intermediate work, and Hasthi will pass this location to the new process if the old process has failed, so the new process can resume the execution. In some cases (e.g. Google index generation), losing a small percentage of processes in the Map phase may be acceptable. In such cases, user-defined logic can restart processes only if a threshold of failures has been exceeded. Furthermore, even in the case of a failure, the results may still be useful, and Hasthi can notify users about the failure and get their input on the recovery action.

Application Pattern: Combinational Logic; Graphical methods
Required Level of State to be Preserved: These two types are embarrassingly parallel. In case of a failure, user-defined management logic can rerun only the failed parts of the computation. If the results have a Reduce phase, this becomes a Map-Reduce application.
Application of Hasthi: Hasthi can initiate the computation, monitor the processes, and user-defined management logic can rerun only the failed parts of the computation.

Table 8.4: Loosely Coupled Applications


1. Map-Reduce: The Map-Reduce pattern involves mapping data to different processes, processing them (using data parallelism), and then reducing all results to one or a few processes. We call the initial phase the Map phase and the second phase the Reduce phase. Processing in the Map phase is independent, and usecases may have one or more Reduce phases. The Reduce operation is associative (the order of combining the results from the Map phase does not matter). Dean et al. [50] describe Map-Reduce in detail.

2. Combinational Logic: These applications perform simple operations (e.g. logic operations and counting) on large amounts of data. They achieve parallelism by partitioning the data and performing the operations on each partition.

3. Graphical methods: These models consist of variables (vertices) connected by conditional probabilities (edges); examples are hidden Markov models and neural networks. The reference [11], authored by the same group as Asanovic et al. [33], observes that in practice, often either the same graphical model evaluates different data or different graphical models evaluate the same data. Therefore, usecases based on these models can be easily parallelized.

Table 8.4 shows the possibilities of managing these applications with Hasthi. For example, let us consider a parallel application that searches 10 million documents and creates an inverted index from keywords to document ids based on the keywords contained in the documents. This application matches the Map-Reduce pattern: the Map phase partitions documents among processes, each process creates an inverted index from its assigned documents, and the Reduce phase creates the final inverted index by merging the results from the Map phase. Let us assume that all the documents are placed in a shared file system, that each process writes the updated inverted index and processed document ids to a file once every 100 documents, and that each process exposes that file as a management property.


Hasthi can initiate the computation and monitor the processes. As explained in the table, if a process in the Map phase fails, Hasthi can restart the process, passing the storage location of the failed process as an argument, and using the saved inverted index and other information, the new process can resume the execution. Furthermore, let us assume that the Reduce process writes an updated inverted index to stable storage whenever it incorporates results from a process in the Map phase. Then, similar to the earlier case, if the Reduce process fails, Hasthi can recover the process and resume the execution using the storage location of the failed process.
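The following sketch shows, in the style of the ActionHelper calls used later in the LEAD rules, the kind of helper that user-defined management logic could use to restart a failed Map process with its saved storage location. The interfaces, property names, and command-line flag are assumptions of this sketch, not part of Hasthi's actual API.

import java.util.Map;

public class MapReduceRecoveryHelper {

    public interface ManagedProcess {
        Map<String, String> properties();  // properties captured in the meta-model
        boolean isCrashed();
    }

    public interface ProcessLauncher {
        void start(String host, String... args); // management action to start a process
    }

    // Restart a crashed Map process on a healthy host, handing it the failed
    // process's storage location so it can resume from the partial inverted index.
    public static void recoverFailedMapper(ManagedProcess mapper,
                                           ProcessLauncher launcher,
                                           String healthyHost) {
        if (!mapper.isCrashed()) {
            return;
        }
        String storage = mapper.properties().get("storageLocation");
        launcher.start(healthyHost, "--resume-from", storage);
    }
}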

8.4 Summary

This chapter discusses the limits of applying Hasthi to manage distributed computations. We observed that, similar to earlier usecases, users can write management logic (rules) to identify and correct faulty conditions, and Hasthi monitors the system and periodically evaluates the management logic, which triggers corrective actions when failures occur. Even though this process can recover the managed system from failures, it does not provide any guarantee about the behavior of the system while recovery takes place. However, a single missing message or state lost due to failed processes could cause a distributed computation to fail, and consequently, following a failure in the system, the underlying computation may have failed even if Hasthi has recovered the system. Therefore, to support managing distributed computations, the ability to recover the distributed computation itself is of critical importance. A naive solution to the problem is rerunning the distributed computation; however, these computations are often expensive and may take hours to days to complete. The standard approach to this problem is rollback recovery, which saves the current execution state of the computation from time to time and restarts the computation using the saved state


in the case of a failure. We observed that to manage a distributed computation with Hasthi, the computation must meet three conditions: the ability to instrument the processes in the computation, the ability to rebuild the system structure after a failure, and the ability to recover the computation (by reruns or using checkpoints). These requirements may call for changes to applications, and if an application cannot meet them, Hasthi cannot successfully manage that application. In this setting, as a generic solution, we propose to use a generic checkpointing mechanism to preserve state and to use the resulting checkpoints to recover the computation in case of a failure. Furthermore, we observed that user-defined management logic can detect failures in the system, recover the system, and then recover the computation using the aforementioned checkpoints. However, we also observed that even though this generic solution suits tightly coupled applications with complex interactions among processes, most distributed computations need lesser guarantees. To understand the possibilities, we explored the 13 parallel computing patterns defined by Asanovic et al. [33] and made the following observations.

1. If parallel subtasks write their results to stable storage, Hasthi can recover Map-Reduce, Combinational Logic, and Graphical methods by rerunning only the failed portions of the computation and merging the results with the results from the successful subtasks.

2. To manage Finite State Automata, Backtrack branch and bound, and Graph traversal with Hasthi, these applications should save a small amount of critical state (e.g. the open node list for a graph traversal or the current state of a finite state automaton). Hasthi can recover the computation using that saved state.

3. Dynamic Programming, Structured Grids, Unstructured Grids, the N-Body Problem, and


Spectral Methods go through iterations. If the processes in these computations take independent checkpoints at iteration boundaries (e.g. once every 10 iterations), the checkpoints collected at all processes for the same iteration provide a consistent global snapshot. If they provide such checkpoints, Hasthi can recover the computations using those checkpoints.

4. Dense Linear Algebra and Sparse Linear Algebra have very complex communication patterns, and we were unable to identify any generic checkpointing boundaries. Therefore, if they are to be managed with Hasthi, they would have to incorporate some generic checkpointing mechanism.

Finally, we observe that, even though it is theoretically feasible, it is unlikely that Hasthi would ever be used to manage tightly coupled computations like linear algebra, as there are well-established paradigms like MPI that support such computations and their recovery. However, other applications that need lesser guarantees (e.g. Map-Reduce, search, and dynamic programming) are being implemented outside of those paradigms (e.g. the Google architecture using Map-Reduce, many search usecases, and many embarrassingly parallel tasks like document format conversion), and we believe Hasthi will be useful for managing those computations.

9 Motivating Usecases

This chapter describes a few motivating usecases for Hasthi. The first section describes an application of Hasthi to manage an e-science cyber-infrastructure called LEAD. We have implemented the complete usecase, and Hasthi currently manages the LEAD system. Furthermore, in the following sections, we explore a few potential usecases. We have not implemented those usecases; instead, we discuss a potential design for each.

9.1 The Primary Usecase: LEAD System

LEAD is a large-scale e-science project built on a Service Oriented Architecture. It enables students, faculty, and researchers to perform numerical weather prediction using observational weather data procured from various instruments and locations. A user can log into the portal, search for data, and run workflows to mine, forecast, or post-process data. When a user starts a workflow from the portal, the portal sends a workflow request to the workflow engine, and the workflow engine orchestrates the workflow. Each workflow consists of services, and each service wraps a command-line application. The workflow orchestration invokes services, and when invoked, the services execute the underlying FORTRAN applications using computational resources like the U.S. TeraGrid. The LEAD system


consists of 13 persistent services and many transient application services created on demand and, therefore, gives rise to a complex e-science infrastructure. Figure 9.1 depicts the architecture.

Figure 9.1: LEAD Architecture

From LEAD, we expect continuous availability and a high success rate for submitted workflows. The first ensures that whenever a user tries to use LEAD, it is accessible, and the user can initiate workflows, browse data, or use LEAD in some other way. Since each workflow could involve 40-50 grid operations and could take hours to complete, achieving a high success rate for workflows has been especially challenging. Table 9.1 shows the different types of errors we encountered in LEAD and potential solutions to those errors. Column 1 presents the name of each error category and examples of errors in the category, and Column 2 presents potential solutions. As described in earlier chapters, to detect resource failures (e.g. service failures), Hasthi uses heartbeats


and failure detectors, and management rules can detect more complex error conditions by utilizing the global view of the system. LEAD workflows generate events depicting their progress, and as a supplement to the above error detection mechanisms, Hasthi monitors those workflow events. To categorize workflow errors, we use old LEAD error traces collected over 18 months, which we categorized based on the most common error types. Hasthi compares new errors against the known error traces to diagnose the type of an error and acts accordingly.

Figure 9.2: LEAD Errors and Corrective Actions

Figure 9.2 summarizes how Hasthi handles the different errors that happen in LEAD. For errors like software bugs and deployment errors, whose causes vary and for which automatic recovery is hard, Hasthi notifies the users. Furthermore, these errors are common in the initial part of the system lifetime, but they become less and less common as the system stabilizes and the errors are fixed. Therefore, we believe it is acceptable to consult users when these errors occur. LEAD has faced major reliability issues from the dependent external resources it uses for file transfers and job submissions, and to mitigate them, the LEAD infrastructure already has multiple levels of retries, which either blindly retry or use alternative resources. These retries happen transparently to Hasthi and are handled by the LEAD infrastructure. Finally, Hasthi rules automatically recover the system when hosts or services fail.


Error: Software Bugs, e.g. runtime exceptions (null pointer errors, class cast exceptions) and service aging (memory leaks, connection leaks).
Solution: A wide range of reasons causes software bugs, and hence automated recovery is very hard. Therefore, notifying the users and requesting manual recovery is the most practical solution. However, some errors may only occur in a specific scenario, and if that fault scenario is rare, Hasthi can restart the services after notifying the users, in the hope that the scenario will not repeat.

Error: Deployment and Configuration Errors, e.g. system configuration errors, wrong security settings, and workflow configuration errors.
Solution: A wide range of reasons causes deployment and configuration bugs, and hence automated recovery is very hard. Therefore, Hasthi may notify the users and request manual recovery.

Error: External Resource Errors. LEAD depends on computation and data resources for its operation and, therefore, on job submission and file transfer services. Examples of errors are unavailable services, overloaded services, and transient errors. In addition, the underlying applications may also fail.
Solution: To address these failures, LEAD includes multiple levels of retries. At the first level, LEAD retries all grid operations using an exponential back-off, and this approach deals with transient errors in file transfer and job submission services. At the second level, if a service execution of a workflow fails, LEAD re-executes the service using a different computation resource. LEAD has multiple computation resources; therefore, if one computation node is faulty, these retries will send the jobs to another location. These retries are not part of Hasthi but are built directly into the system.

Error: Infrastructure Errors. Hardware (e.g. faulty or overloaded hosts), network (e.g. unknown host, no route to host, socket exceptions), and file systems (e.g. WAN file system mounts, NFS drives).
Solution: If a host has failed, Hasthi migrates the services residing on the failed host to other hosts. In addition, Hasthi also notifies the user about the failure. Furthermore, if the Wide Area Network (WAN) mount or a Network File System (NFS) mount has failed, the Hasthi agent running on each host detects the failure, and Hasthi notifies users.

Error: Maintenance. Proxy expiration, security configuration changes, and other maintenance issues (e.g. a full hard disk).
Solution: These are maintenance-related error types like full hard disks or expired security credentials. Hasthi can detect these errors using error patterns as described before and notify users.

Error: Capacity and Sustainability. Service failures, overloaded services, and capacity problems (e.g. out of memory and service response timeouts).
Solution: Hasthi recovers service failures, typically caused by service aging, operator errors, or hardware or software glitches, by restarting the services. However, in our LEAD integration, we do not perform load balancing using Hasthi.

Table 9.1: LEAD Errors and Corrective Actions


However, as explained in Chapter 7, due to failures and recovery operations, different kinds of changes can occur in the system. Let us look at how the Hasthi-LEAD integration handles each of these cases.

1. Handling Lost State - If services in a managed system fail and Hasthi recovers them, the failed services may lose their state. However, most LEAD services are either stateless or use a write-through policy with a database (durable state) to hold state. Therefore, even though services fail, they can resume their executions when Hasthi recovers them. Moreover, according to the categorization in Chapter 7, the LEAD system falls under systems with a best-effort global state, and it does not have any external effects. Consequently, even if LEAD services have failed and restarted, they do not lose critical state.

2. Handling Lost and Failed Messages - While Hasthi recovers the system after a failure, the managed system will be in an unsafe state. Hence, while the system is being recovered, it may lose some messages and some workflows may fail. To recover those failed workflows, we re-execute them. LEAD workflows do not have any side effects, and the data subsystem can clean up any duplicate data generated by workflow re-executions. Therefore, recovery by rerunning the workflows is acceptable.

3. Handling Lost System Structure - If a host has failed, Hasthi has to move the services running on that host to a different host. Hence, their addresses will change, but other services are not aware of this change, and requests may fail if they are sent to the old address. As explained above, users go to the LEAD Portal and start a workflow, the LEAD Portal sends a workflow request to the workflow engine, and the workflow engine orchestrates the services by sending messages (service invocation messages). The workflow request, which initiates a workflow, includes a


SOAP header called the "LEAD Context Header," and the workflow engine copies this header from the workflow request message to every service invocation associated with that workflow. Among other information, the header includes the locations of all services in the system, and each service, when invoked by a workflow, finds other services (e.g. the registry and data catalog) from this header. Before launching or recovering a workflow, both the LEAD Portal and the management logic find the most up-to-date service locations using the "dependency discovery" operation of Hasthi and update the LEAD Context Header with the new service locations. Therefore, even if service endpoints change during recovery, workflows are not affected.

4. Handling Lost Configurations or Resources - LEAD services have fixed configurations loaded from the file system; therefore, the configurations are preserved even if services are restarted. Resources (computation and data resources) are brokered on a per-request basis by a special service, which takes into account the history of failures and successes. Therefore, even if a service has failed and recovered, resource assignments are not affected.

We have integrated Hasthi instrumentation with LEAD services, and as management actions, we support creating, restarting, stopping, and configuring services, as well as performing user interactions with human administrators. As an example of how Hasthi can implement the above usecases, let us consider two of them in detail.

9.1.1 Implementing Workflow Recovery Usecase

Let us look at a management scenario that recovers LEAD from service and host failures and recovers any workflows that failed due to those failures. Listed below are the rules Hasthi uses to implement the scenario. As described in Chapter 3, the when-clause of a rule contains an object query language, and the then-clause of a rule


contains Java code. The system.invoke() method is used to schedule an action. For more details about management rules, please refer to the management rules section in Chapter 3. The usecase consists of three parts: 1) detecting failures in the system, 2) recovering specific services, and 3) detecting that the system is healthy again and recovering the workflows that failed due to the aforementioned errors. Let us look at the rules used to implement each part.

Rule 1

rule "LogSystemNonHealthyTime"
salience 10
when
    systemHealth : SystemHealthState( systemHealthy == true );
    exists( ManagedService( state == "CrashedState" || state == "FaultyState" ||
            state == "UnRepairableState" || state == "RepairingState",
            category == "Service" ) );
then
    systemHealth.setSystemFailed();
    update( systemHealth );
end

Rule 1 detects when the system has at least one service that is not in an operational state (Saturated, Idle, Busy, or Repaired) and marks the system as unhealthy.

Rule 2

rule "RecoverFailedServices"
salience 4
when
    service : ManagedService( category == "Service", state == "CrashedState" )
then
    ActionHelper.doRecoverFailedServices( service, host, system );
end

Rule 2 recovers failed services. Hasthi triggers the rule if a service is in the "Crashed" state; the rule restarts the service on the same host if the host is active, or otherwise restarts


the service on a different host defined in the service profile. If an action fails, Hasthi marks the service as "Unrecoverable" and requests human help.

Rule 3

rule "ResurrectWorkflowsAfterRecovery"
salience 5
when
    not exists( ManagedService( state == "CrashedState" || state == "FaultyState" ||
            state == "UnRepairableState" || state == "RepairingState",
            category == "Service" ) );
    systemHealth : SystemHealthState( systemHealthy == false );
then
    long failedTime = systemHealth.getSystemFailedTime();
    systemHealth.setSystemHealthy();
    ActionHelper.doResurrectWorkflowsAfterRecovery( system, failedTime );
    update( systemHealth );
end

Finally, Rule 3 detects when the system returns to a healthy state, marks the system as such, and recovers the workflows that failed due to service or host failures. When the system is deemed faulty, Rule 1 records the timestamp, and when the system has recovered, Rule 3 triggers the workflow recovery code, which recovers the workflows that failed, due to service or host failures, after the system became faulty. Each workflow in the LEAD system advertises its execution progress by publishing events to the message broker, and a separate service collects and stores those events in a database. Among other things, the events include the workflow request message that started the workflow. The workflow recovery code searches the event database for workflows that failed while the system was faulty and selects only those workflows that failed due to service failures by matching their error traces against known error patterns. The outcome of this search is a list of workflow instance names (workflow-ids). To recover the selected workflows, for each workflow, using its workflow-id as the key, the workflow recovery code finds the workflow


request message sent to that workflow from the events stored in the database and replays the message to the workflow engine, which starts a new execution of that workflow. Chapter 6 includes a detailed performance analysis of this usecase.
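The sketch below outlines the workflow recovery code described above. The EventDatabase and WorkflowEngine interfaces are placeholders for the LEAD event database and workflow engine clients; their method names and signatures are assumptions of this sketch, not the actual LEAD APIs.

import java.util.List;

public class WorkflowRecovery {

    public interface EventDatabase {
        List<String> findFailedWorkflowIds(long failedSince);   // workflows failed after this time
        boolean matchesKnownServiceFailure(String workflowId);  // compare traces with known patterns
        String findWorkflowRequestMessage(String workflowId);   // original request that started it
    }

    public interface WorkflowEngine {
        void submit(String workflowRequestMessage);             // replay starts a new execution
    }

    public static void resurrectWorkflows(EventDatabase events,
                                          WorkflowEngine engine,
                                          long systemFailedTime) {
        for (String workflowId : events.findFailedWorkflowIds(systemFailedTime)) {
            // Only recover workflows whose error traces match service or host failures.
            if (events.matchesKnownServiceFailure(workflowId)) {
                String request = events.findWorkflowRequestMessage(workflowId);
                engine.submit(request);
            }
        }
    }
}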

9.1.2 Implementing Data Transfer Recovery Usecase

The LEAD system includes a hidden process (workflow) that is not apparent to the user. Each workflow in LEAD generates notifications depicting its progress, and the data subsystem listens to these notifications, indexes them, and preserves details about experiments and their results. One important part of this process is copying inputs, intermediate results, and final results from temporary working directories to a data repository, so they are preserved and can be used later. This process happens in the background. A LEAD workflow truly completes (in other words, becomes useful to an end user) only when this process has completed and all the data products are preserved. However, due to overloading and infrastructure failures (e.g. failure of the Wide Area Network file system), these data transfers may fail. One solution to this problem is rerunning the workflows related to failed transfers. However, since workflow executions are expensive, we have treated data transfer failures as a separate usecase (an asynchronous sub-workflow of the main workflow) and specifically retried only those transfers without re-executing the workflows. LEAD uses a Wide Area Network file system as its data repository, which we call the data capacitor (or DC-WAN). To transfer files, we use the Reliable File Transfer (RFT) service [84], and for transfers, both the source and the destination must have a GridFTP service or an RFT service running. There is one RFT service running on top of the DC-WAN, and we call it the destination RFT service. The service called the MyLEAD Agent listens to notifications generated by workflows,


finds the inputs and generated data products of workflows, and transfers the data products from their current locations to the DC-WAN. The MyLEAD Agent keeps six months of old data transfers, both successful and failed, in a database. Furthermore, the agent includes a blind recovery loop, which identifies file transfers that failed due to transient problems and retries them using an exponential back-off algorithm. The MyLEAD Agent also supports a "file transfer retry method," which accepts a time period as input; when invoked, the agent retries all the previous file transfers that failed in the given time period. For example, if it receives an invocation with the time period [March 10 2009, 2PM to March 10 2009, 6PM] as input, it will retry all the file transfers that failed between 2PM and 6PM on March 10, 2009. To handle file transfer errors, Hasthi monitors both the DC-WAN mounted on hosts in the system and the RFT service running with the DC-WAN. To handle their failures, Hasthi includes the following three rules.

Rule 4

rule "DetectRFTFailure"
when
    systemHealth : SystemHealthState( RFTHealthy == true );
    service : DAMNService( RFTStatus != "OK" );
then
    ActionHelper.doDetectRFTHealth( system, systemHealth, service );
    update( systemHealth );
end

Rule 5

rule "DetectDCFailure"
when
    systemHealth : SystemHealthState( WANMountHealthy == true );
    host : Host( WANMountHealth != "healthy",
            name == "silktree.cs.indiana.edu" || name == "chinkapin.cs.indiana.edu" );
then
    ActionHelper.doDetectDCFailure( system, systemHealth, host );
    update( systemHealth );
end

Rule 6

rule "RecoverFromRftDCFailures"
when
    systemHealth : SystemHealthState( RFTHealthy == false || WANMountHealthy == false );
    service : DAMNService( RFTStatus == "OK" );
    not( exists( Host( WANMountHealth != "healthy",
            name == "silktree.cs.indiana.edu" || name == "chinkapin.cs.indiana.edu" ) ) );
then
    ActionHelper.doRecoverFromRftDCFailures( system, systemHealth );
    update( systemHealth );
end

The first two rules detect when either the DC-WAN or the RFT service has failed, notify users, and request that they respond when the failure is fixed. Furthermore, following a failure, when the DC-WAN and RFT services have recovered, the third rule detects the condition and uses the "file transfer retry method" of the MyLEAD Agent to recover all the file transfers that failed while the DC-WAN or RFT service was down. To summarize, LEAD transfers the inputs and results generated by workflows to a data repository, and these transfers happen asynchronously. To guard against failures of those transfers, the MyLEAD Agent, which initiates them, performs retries with an exponential back-off. In addition, Hasthi monitors the WAN and RFT services needed to perform those transfers, and if they have failed and recovered, Hasthi restarts the failed transfers that were caused by these WAN or RFT failures.
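A rough sketch of the action behind Rule 6 follows: once the DC-WAN and RFT service are healthy again, it asks the MyLEAD Agent to retry every file transfer that failed during the outage. The MyLeadAgentClient interface and its method are placeholders for the real service call, not the actual MyLEAD API.

public class DataTransferRecoveryHelper {

    public interface MyLeadAgentClient {
        // Corresponds to the "file transfer retry method": retry all transfers
        // that failed between the two timestamps (milliseconds since the epoch).
        void retryFailedTransfers(long fromTimeMillis, long toTimeMillis);
    }

    public static void recoverTransfers(MyLeadAgentClient agent,
                                        long outageStartMillis) {
        long now = System.currentTimeMillis();
        // Retry everything that failed from the start of the DC-WAN/RFT outage until now.
        agent.retryFailedTransfers(outageStartMillis, now);
    }
}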

9.2 Stream Processing Systems

Let us assume a distributed stream processing system that sits between users and event sources and generates alerts when events match user-defined queries. Users define queries


using an event query language (e.g. Esper EQL), and those queries are broken down into sub-queries and assigned to a few stream processing services (subscribers). In this setting, multiple services match parts of each query against events. Figure 9.3 depicts the architecture.

Figure 9.3: Stream Processing Usecase

Let us assume that the system performs all communications using a broker hierarchy that supports a topic-based publish/subscribe model, where publishers and subscribers can publish or subscribe using any node in the hierarchy. The hierarchy routes matching messages to subscriptions regardless of their origin. Let us also assume that the system does not lose events and that services can recover their state if they have failed and been recovered by Hasthi. For instance, a system can implement these two guarantees as follows. Each event originates at an event source, which publishes it to a broker, and processing services match the events and either generate composite events or send notifications to end users. Brokers can store the subscriptions from processing services in a database, and processing services can store the queries submitted by users in a database. To guard against lost messages, both processing services and brokers can write events to a database before acknowledging the message that delivered them. If event subscribers are not available, brokers can keep events stored for a few hours, and on the other hand, if a broker has failed, event sources can publish


to an alternative broker. Furthermore, if each processing service has a logical name, then even if a processing service has failed and moved to a new address, it can contact the message brokers and update its subscriptions to reflect its new location. The processing service can then receive the messages collected by the broker while it was faulty, and it will continue to receive future messages. In this system, even if a broker or a processing service has failed and restarted, it can recover its state (e.g. unprocessed messages, subscriptions, and assigned queries) from the information stored in the database, and therefore it does not lose state. Furthermore, since all communications happen through the publish/subscribe system, which provides location transparency, the system does not lose its structure even if services have moved.
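The following is a minimal sketch, with an assumed broker API, of how a relocated processing service could refresh its subscriptions: because it keeps a stable logical name, after Hasthi restarts it at a new address it only needs to re-register that address with a broker and can then receive both stored and future events.

public class ProcessingServiceRegistration {

    public interface BrokerClient {
        void updateSubscriberAddress(String logicalName, String newEndpoint);
        void replayStoredEvents(String logicalName); // deliver events kept while the service was down
    }

    private final String logicalName;

    public ProcessingServiceRegistration(String logicalName) {
        this.logicalName = logicalName;
    }

    // Called when the service starts (or restarts at a new address after recovery).
    public void onStart(BrokerClient broker, String currentEndpoint) {
        broker.updateSubscriberAddress(logicalName, currentEndpoint);
        broker.replayStoredEvents(logicalName);
    }
}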

Among the possible management scenarios are fault tolerance and load balancing. Similar to the LEAD usecase, management rules can implement these scenarios. For fault tolerance, Hasthi can monitor processing services and message brokers and recover them from failures. For instance, if a service fails while its host is active, Hasthi can restart the service, and on the other hand, if a host has failed, Hasthi can start the service on a new host. As described above, the services can recover state, will not lose messages, and the system structure will be intact after recovery. Furthermore, Hasthi can perform load balancing along several dimensions. For instance, Hasthi can add new brokers to the hierarchy or remove brokers depending on the event rates of the event sources, and Hasthi can add or remove processing services based on the number of active queries running in the system. Hasthi can also keep track of the load on each host and move services to keep the load on hosts under control.

9.3 Internet Telephony, Video Conferencing, or Internet TV Systems

Distributed telephony, Internet TV, and video conferencing systems create a graph of media-streaming nodes, and using the graph, they transfer video streams from sources to destinations while optimizing bandwidth utilization. Since a short break in transmission is not critical, these systems can tolerate the loss of a few packets; however, timely delivery is of the essence. The graph structure, in other words the connections from one media node to other nodes, is the only critical state contained in each node. Let us assume that each node is able to recreate this state if it has failed and restarted. For instance, each node can write or update information about the graph structure to stable storage whenever the structure changes, or each node can build its connections dynamically by talking to other nodes once it is started. In this setting, the system can tolerate lost messages and preserve enough state even if it has failed and recovered; thus, Hasthi can manage the system. Similar to the earlier usecase, Hasthi can provide both fault tolerance and load balancing in these cases. For fault tolerance, Hasthi can monitor and recover failed media nodes as in the earlier usecase. Load balancing also provides many possibilities. Since Hasthi has a global view of the graph of media nodes, it can edit the graph to best suit the current situation. For example, it can adjust the graph by adding, removing, and moving nodes, or by creating or deleting links, in response to changes in the number of users (e.g. watching TV or participating in calls), the distribution of users, the available bandwidth, and quality-of-service requirements like latency, jitter, and bandwidth limits.
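As a small illustration of the first option, the sketch below shows a media node that persists its only critical state, the list of neighboring nodes, whenever the graph changes, and rebuilds its connections from that list after Hasthi restarts it; the file format and method names are assumptions of this example.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class MediaNode {

    private final Path neighboursFile = Paths.get("/stable/storage/media-node-neighbours.txt");

    // Called whenever management logic or the node itself changes the graph structure.
    public void onNeighboursChanged(List<String> neighbourEndpoints) throws IOException {
        Files.write(neighboursFile, neighbourEndpoints);
    }

    // Called on restart: rebuild the streaming links recorded before the failure.
    public void onRestart() throws IOException {
        if (!Files.exists(neighboursFile)) return;
        for (String endpoint : Files.readAllLines(neighboursFile)) {
            connectTo(endpoint);
        }
    }

    private void connectTo(String endpoint) { /* open the streaming connection */ }
}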

9.4 Distributed Service Container

J2EE containers have eased the development of J2EE applications and made wide adoption of those applications possible, and service containers provide the same functionality for services. However, service containers are currently often restricted to one host. The development of a distributed service container, which deploys and controls services on a group of hosts, could open up many possibilities. One main challenge in building such a container is the need for an underlying framework (a kernel) that monitors, controls, and manages the hosts allocated to the container as well as the services running on those hosts. Owing to its dynamic and robust traits and its ability to provide a global view of the system, Hasthi could serve as the underlying kernel of such a container, where it would manage the life cycle of the services deployed in the container, control them based on rules, and allow administrators to control the services via a single entry point. Such a container would allow users to describe a distributed application as one unit (e.g. a single package), like a J2EE application. The description could include a list of the services in the system, the required number of copies of each type of service, the interconnections and dependencies between them, and load balancing policies. The container would translate that description of the application into Hasthi management rules, deploy the services on hosts, start those services, and finally monitor and control them. Moreover, the container can use the "dependency discovery" operation and the global view provided by Hasthi to notify each service about the locations of its dependencies, thus establishing the system structure. In such a setting, Hasthi can recover the applications and the container from service or host failures. Furthermore, Hasthi rules can control the utilization of resources across different applications, and if services support replication, Hasthi can start replicas of services in response to increasing load or shut down replicas that are idling. Finally, with the global view of the system, Hasthi could also enforce Quality of Service (QoS) requirements and


Service Level Agreements (SLAs). As mentioned earlier, such a container would simplify the packaging, distribution, deployment, and management of distributed applications. For example, we distribute the core services of the LEAD project through the OGCE project [20], and a primary difficulty for our users is the hassle of setting up a system that needs multiple services across different hosts. With a container similar to the one we described, the deployment of a distributed application like LEAD, which has multiple services with complex interactions among them, would be much easier.
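To make the idea of describing an application as one unit concrete, the following sketch shows the kind of descriptor such a container could accept; the field names are invented for this example, and the container would translate entries like minimumCopies into Hasthi management rules that create or remove service replicas.

public class ServiceDescriptor {

    public final String serviceType;      // e.g. "message-broker"
    public final int minimumCopies;       // replicas the generated rules must maintain
    public final String[] dependsOn;      // services whose locations must be injected

    public ServiceDescriptor(String serviceType, int minimumCopies, String... dependsOn) {
        this.serviceType = serviceType;
        this.minimumCopies = minimumCopies;
        this.dependsOn = dependsOn;
    }

    public static void main(String[] args) {
        // A distributed application packaged as one unit: a broker tier plus processing services.
        ServiceDescriptor[] application = {
                new ServiceDescriptor("message-broker", 5),
                new ServiceDescriptor("processing-service", 3, "message-broker")
        };
        for (ServiceDescriptor d : application) {
            System.out.println(d.serviceType + " x" + d.minimumCopies);
        }
    }
}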

10 Conclusion and Future Work

10.1 Outline

As discussed in the introduction, large-scale systems are becoming ubiquitous. In those systems, changes (both failures and maintenance operations like discovery, start/stop, and configuration of services) are the norm rather than the exception. Since these systems may have thousands of resources, controlling them manually is difficult, if not impossible. Therefore, to monitor, manage, and sustain these systems, automated management tools are needed. Management usecases differ from system to system, but in practice, only large organizations can afford to develop a specific management solution for each system. In this setting, a possible solution to this problem is to develop generic management frameworks, which manage a given target system according to user-defined management logic. With such a develop-once-use-everywhere approach, these management frameworks can be developed, tested, and hardened once and then used to manage a wide variety of systems. However, different systems have their own management scenarios. Therefore, it is important that generic management frameworks enable users to define their own management logic, which dictates how a framework should respond to changes in the system. Also, in the


introduction chapter, we argued that in order to support large-scale systems, the proposed solution should be scalable, robust, and dynamic. Furthermore, we argued that providing a global view to the management logic simplifies the authoring of that logic. To illustrate the global view, let us consider the following management logic: "If the system does not have 5 message brokers, create new brokers and connect them to the broker network." The logic should detect when the system has fewer than five brokers, find the best place to create a new one, create it, and connect it to the existing brokers. This process depends on information about multiple resources of the system; hence, the above logic depends on a global view of the system. If logic depends on a global view, we call it global logic, and otherwise we call it local logic. This dissertation addresses the enforcement of user-defined management logic that depends on a global view of the managed system state in order to manage large-scale systems. As discussed in the introduction, our solution to this problem enables users to have explicit control over the system being managed.

We propose a dynamic and robust management architecture called "Hasthi" as a solution to this problem. This solution includes a manager-cloud that keeps track of the live components of the system and keeps them connected, a meta-model of the system that exposes the monitoring state of the system with delta-consistency, and a decision framework that enforces local and global user-defined management logic on a managed system. Chapter 3 illustrates the architecture in detail.

We have shown that our solution is both sound and useful. We have demonstrated the soundness of our solution by showing the robustness, dynamic nature, and scalability of the architecture, and we have demonstrated its usefulness by using Hasthi to manage a large-scale e-science cyber-infrastructure called LEAD, by providing an extensive discussion of the application domain of Hasthi, and by presenting a series of usecases.



In Chapter 6, we demonstrated the scalability of Hasthi using a series of experiments. We observed that Hasthi could scale to 100,000 resources and that it is stable with respect to operating conditions like management workload, rule complexity, and epoch time period. Moreover, we compared Hasthi with another management system, and in the final analysis, Hasthi did much better in terms of scalability. Furthermore, we observed that the most likely explanation for the scalability results is the Rete algorithm, which trades space for time by remembering old results so that new facts can be evaluated faster. The Hasthi coordinator, which is the main bottleneck in the system, receives only a summary of the updates that happen in the managed system, and it uses the Rete algorithm to evaluate the managed system. Even though the managed system may have a large number of resources, at each evaluation only a few of them change enough to affect their summary, and therefore the coordinator receives only a few changes. Since the Rete algorithm evaluates only the changes and their effects in each evaluation, the overhead of evaluating these changes is manageable. This is a possible explanation of the scalability results. In Chapter 5, we demonstrated the robustness and dynamic properties using analytical evidence. Specifically, we demonstrated that given a system managed with Hasthi, there exists a constant th for that system such that regardless of the initial state of the system, if managers do not join or leave and communication failures do not happen for a continuous th time interval, then after the th interval, Hasthi is and will continue to be healthy as long as the aforementioned errors do not occur. Furthermore, we derived a similar upper bound for recovery when the coordinator of the system has not failed but managers and resources have either failed or joined. In particular, as we discussed in Chapter 5, these results have two significant properties: a) Hasthi is self-stabilizing, and b) in the absence of failures, it self-stabilizes within a constant time; together, they ensure that Hasthi recovers after the error conditions cease to exist. Moreover, also in Chapter 5, as a function of the MTTF of a single manager in the



system, which is the basic building block of Hasthi, we derived a lower bound for the availability of Hasthi. In addition, using these availability results, we showed that Hasthi belongs to one of the availability classes “managed”, “well-managed”, or “fault tolerant”, as defined by Gray et al. [68]; the specific class is determined by the MTTF of a single manager of Hasthi. Furthermore, we performed an experiment to empirically evaluate recovery and observed that Hasthi recovers in 80 seconds on average, whereas the upper bound calculated from the analytical results was 221 seconds. Therefore, we argue that in real usecases Hasthi would have even higher availability than the analytical results suggest. As Chapter 5 illustrated, Hasthi provides a self-stabilization guarantee, but like many self-stabilizing systems, it does not provide the safety property. In other words, Hasthi guarantees recovery, but it does not guarantee the behavior of the managed system during recovery. In Chapter 7, to demonstrate the usefulness of Hasthi, we identified the effects of changes a managed system may undergo during recovery and categorized them as effects that can and cannot be addressed using Hasthi. Moreover, we defined different classes of systems based on their characteristics (e.g., state and side-effects), and we recommended the guarantees each class should provide if systems of that class are to be managed with Hasthi. Among those classes, Distributed Computations are a class of problems that have stronger consistency requirements, and Chapter 8 discussed the possibilities of using Hasthi to manage distributed computations. Furthermore, Chapter 9 presented our experiences with applying Hasthi to manage a large-scale e-science infrastructure, along with a series of motivating usecases. Finally, since Hasthi monitors and controls a given system, it needs a means of collecting data from the resources of the system and controlling those resources to keep them within acceptable bounds. This process is called instrumentation. Hasthi instruments resources by linking itself and the resources with a software component called an agent, and Chapter 4



identified, discussed, and presented implementations of different instrumentation choices available for users who want to integrate their systems with Hasthi.
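To illustrate the incremental, Rete-based evaluation discussed above, the following minimal sketch shows how summarized resource updates might be fed to a rule session so that each epoch re-evaluates only what has changed. It uses the Drools 5 knowledge API (Drools is the rule engine used in this work), but the ManagedService fact class, the rule file name, and the overall wiring are illustrative assumptions rather than Hasthi's actual coordinator code.

import org.drools.KnowledgeBase;
import org.drools.KnowledgeBaseFactory;
import org.drools.builder.KnowledgeBuilder;
import org.drools.builder.KnowledgeBuilderFactory;
import org.drools.builder.ResourceType;
import org.drools.io.ResourceFactory;
import org.drools.runtime.StatefulKnowledgeSession;
import org.drools.runtime.rule.FactHandle;

public class CoordinatorSketch {
    public static void main(String[] args) {
        // Compile the management rules (hypothetical file name).
        KnowledgeBuilder builder = KnowledgeBuilderFactory.newKnowledgeBuilder();
        builder.add(ResourceFactory.newClassPathResource("management-rules.drl"),
                ResourceType.DRL);
        if (builder.hasErrors()) {
            throw new IllegalStateException(builder.getErrors().toString());
        }
        KnowledgeBase kbase = KnowledgeBaseFactory.newKnowledgeBase();
        kbase.addKnowledgePackages(builder.getKnowledgePackages());

        // A stateful session keeps the Rete network and its partial matches alive
        // across epochs, so unchanged facts are not re-evaluated.
        StatefulKnowledgeSession session = kbase.newStatefulKnowledgeSession();

        // Insert the initial summary of a managed resource.
        ManagedService service = new ManagedService("WorkflowEngine-1", "IdleState");
        FactHandle handle = session.insert(service);
        session.fireAllRules();

        // On a later epoch only the changed summary is pushed into the session;
        // Rete propagates just this delta through the rule network.
        service.setState("CrashedState");
        session.update(handle, service);
        session.fireAllRules();

        session.dispose();
    }
}

// Hypothetical fact class standing in for Hasthi's resource summaries.
class ManagedService {
    private final String name;
    private String state;

    ManagedService(String name, String state) { this.name = name; this.state = state; }
    public String getName() { return name; }
    public String getState() { return state; }
    public void setState(String state) { this.state = state; }
}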

10.2 Contributions

The primary contribution of this thesis is proposing, implementing, and analyzing a dynamic and robust management architecture, which manages large-scale systems by enforcing user-defined management logic that depend on a global view of the managed system state, and discussing its applications. Moreover, we demonstrated that despite its dependency on a global view of the managed system state, the proposed approach can scale to handle most practical systems. This primary contribution can be broken down into the following specific contributions.

1. We propose the “Manager-Cloud Algorithm,” which combines well-known system design techniques in a new way to build a system management architecture (the Manager-Cloud) that keeps track of and keeps connected the live components of the system and exposes monitoring information as a meta-model of the system that exhibits delta-consistency. Furthermore, we have proved that given a system managed with the Manager-Cloud, there exists a constant th for that system such that regardless of the initial state of the system, if managers do not join or leave and communication failures do not happen for a continuous th time interval, then after the th interval, the Manager-Cloud is and will continue to be healthy as long as the aforementioned errors do not occur. This result is useful because it ensures that even if errors happen, the Manager-Cloud recovers once the error conditions pass, which we believe is a useful foundation for building management frameworks.

2. We argue that delta-consistency—a guarantee that the changes to resources will be



reflected in the copy (the meta-model) within a bounded time—is sufficient to represent monitoring information in a system management framework. This observation captures a useful characteristic of monitoring information in abstract terms.

3. We propose a scalable decision framework, which uses the meta-model provided by the manager-cloud and summarization techniques to enforce global and local user-defined management logic to achieve explicit user-defined control of large-scale systems. Furthermore, the decision framework includes an extensible action framework, and we discussed the programming model resulting from the decision framework. One specific contribution is our demonstration that despite its dependency on a global view of the managed system state, a management framework can scale to a hundred thousand resources—a limit that is sufficient for most usecases.

4. We provide an extensive analysis of Hasthi, which includes a theoretical lower bound for availability, an analysis of its scalability, an analysis of its sensitivity to different operating conditions, and a performance comparison with another management system. These bounds and sensitivity results will be useful for a user who wants to apply the proposed approach to manage systems.

5. We propose a scalability benchmark and a series of tests for evaluating the scalability of large-scale management frameworks, which can be useful starting points for evaluating future large-scale management frameworks.

6. We propose a taxonomy of systems, use that taxonomy to identify the application domain of Hasthi, and define the guarantees Hasthi expects from each class of systems in the taxonomy. Identifying application domains and documenting the implicit guarantees assumed while managing those systems have received limited attention, and we believe the discussion presented in Chapter 7 will be a useful starting point.


10.3 Future Work

In the course of our study, we have observed several future directions that could expand this thesis or spawn new research directions, and the following are some of those observations.

Graphical Composition of Management Logic: Even though Hasthi allows users to define rules that explicitly identify failure conditions and contingency plans, rule authoring still requires expertise, which is not always available, especially for individuals or smaller groups. Therefore, the composition of management rules could be the hardest part of integrating Hasthi to manage a given system. Hence, this is a useful research problem. A possible solution would be supporting graphical composition of management scenarios and automatically generating rules to support those scenarios. Moreover, since most of these scenarios can be represented as workflows, such a graphical composition can build on top of the existing work on workflow representation and composition. Ultimately, such a tool would make Hasthi accessible to a larger audience.

Making the Coordinator Lightweight: In Chapter 6, we observed that the coordinator is the main bottleneck in the system. Even in the current design, the coordinator is designed to do a minimal amount of work. Nevertheless, it may still be possible to delegate some of the work the coordinator does, such as executing management actions or receiving and processing heartbeats sent from other managers. Delegating these tasks would reduce the load on the coordinator. For example, one possibility is to elect designated managers as helpers to the coordinator and submit management actions to those managers so they can do the heavy lifting while executing those actions. Furthermore, those helper managers can collect heartbeat messages from managers and transfer the collected data to the coordinator from time to time, thus reducing the load of handling multiple network connections at the coordinator. Moreover, in these cases, it is possible to maintain persistent transport connections between helpers and the coordinator, thus making the communication between



helpers and the coordinator efficient.

Application of Management Frameworks: In Chapter 7, we presented a discussion on the application of management frameworks to manage systems and some of the resulting complexities. Furthermore, we presented the integration with the LEAD cyber-infrastructure and discussed a few possible usecases. However, our treatment focused mainly on Hasthi. Moreover, unlike management architecture design, the application of management frameworks is an area that has received limited attention. Therefore, we believe there is much more to learn in that area.

Hasthi as a Distributed Service Container: In Chapter 7, we discussed the possibility of using Hasthi to build a distributed service container, which would expand the idea of service containers like Apache Axis2 to a distributed setting by supporting applications that span multiple hosts. Such a container would manage the lifecycles of distributed applications, including deployment, start, stop, configuration, and the monitoring of these applications. For example, with such a service container, a distributed application could be developed as one archive, deployed across multiple hosts, and controlled via one control panel. Moreover, the service container could aid services, which would be the building blocks of those distributed applications, in discovering other services in the system and in supporting usecases like fail-over and load balancing.

Exploring Decision Frameworks: Hasthi has been designed with a rule engine as the decision framework, and it would be interesting to explore the possibilities of using other models, like artificial neural networks, case-based reasoning, and fuzzy logic, as decision frameworks with Hasthi. Hasthi would provide a distributed and scalable platform to develop and research those decision frameworks. Furthermore, it may make it possible to use hybrid approaches in which multiple decision frameworks complement each other’s strengths and weaknesses.



Hasthi as a Tool to Collect System State Over Time: Another interesting dimension is that it is relatively easy to capture managed system state over time using Hasthi, and that information can be a useful source for studying reliability, decision frameworks, and system dynamics.

10.4 Conclusion

In conclusion, we observed that in large-scale systems, changes are a norm rather than an exception, and to aid in managing those systems, this dissertation proposes a dynamic and robust architecture that enforces user-defined management logic in order to control large-scale systems, where the management logic depend on a global view of the managed system state. Furthermore, we demonstrated that the solution is both sound and useful using analytical and empirical evidence. With a framework that enforces user-defined management logic that support a global view of the managed system state, users can explicitly author management logic, which detect when the managed system deviates from a healthy state and perform corrective actions. For example, logic could state, “if the system does not have 10 message brokers, create new ones and link them to the current broker hierarchy.” We observed that there exists an architecture (Hasthi) that can scale up to 100,000 resources—a limit that is sufficient to manage most real-life systems—despite enforcing user-defined management logic that depend on a global view of the managed system. In contrast, an alternative scalable solution to this problem is using emergent control, which could scale past 100,000 resources. However, to design user-defined management logic that support the emergent behavior, users have to understand the interactions of local rules and their collective behavior, which is a hard problem even for a scientist, let alone a user. Therefore, we note that with these



two solutions, there is a tradeoff between explicit control and scalability. However, the existence of an architecture with explicit control that can scale to 100,000 resources implies that any system that does not need to scale over that limit can benefit from explicit control.

A Appendix

Appendix A: Algorithm to Generate the Workload

The Hasthi scalability analysis used the following algorithm to simulate a managed service. The update() method was executed once every epoch, and when executed, it updated the resource properties, which are defined as class variables. The WSDM runtime, which was integrated with the service, exposes those properties to Hasthi.

import java.util.Random;

public class ManagedServiceSimulator {
    static final double servicefailureProb = 0.01;
    int newReqCount;
    int failedRequestCount;
    int sucessfulRequestCount;
    int pendingRequestsCount;
    long lastRequestProcessingTime;
    long maxRequestTime;
    long lastRequestReceived;
    String systemStatus;
    Random r = new Random();

    // Called once per epoch; updates the simulated resource properties exposed to Hasthi.
    // Utils.selectWithProbability(p) and shutDown() are helper methods of the test harness.
    public boolean update() throws InterruptedException {
        newReqCount = 10 * (int) Math.abs(r.nextGaussian());
        lastRequestReceived = System.currentTimeMillis()
                - 10 * (int) Math.abs(r.nextGaussian());
        for (int i = 0; i < newReqCount; i++) {
            if (Utils.selectWithProbability(0.1f)) {
                failedRequestCount++;
            } else {
                sucessfulRequestCount++;
            }
            lastRequestProcessingTime = (long) Math.abs(r.nextGaussian()) + 10;
            if (lastRequestProcessingTime > maxRequestTime) {
                maxRequestTime = lastRequestProcessingTime;
            }
        }
        pendingRequestsCount = pendingRequestsCount
                + (r.nextBoolean() ? +1 : -1) * r.nextInt(5);
        if (Utils.selectWithProbability(servicefailureProb / 120f)) {
            // Simulate an occasional service crash.
            shutDown();
            return false;
        }
        Thread.sleep(30000);

        if (pendingRequestsCount > 200) {
            systemStatus = "SaturatedState";
        } else if (pendingRequestsCount > 0) {
            systemStatus = "BusyState";
        } else {
            systemStatus = "IdleState";
        }
        return true;
    }
}

Appendix B: Rules for Scalability Tests

The Hasthi scalability test used the following rules.

rule "Init"
when
then
    insert( new NamedList( "removedServiceList" ) );
    insert( new NamedString( "App", "WRF" ) );
    insert( new NamedString( "App", "ADAS" ) );
    insert( new NamedString( "App", "NAM" ) );
    insert( new NamedString( "App", "WRFStatic" ) );
    insert( new NamedString( "App", "Cluster" ) );
    System.out.println( "Initialized" );
end

rule "CreateAlternativesForFaluires"
    salience 10
when
    service : ManagedService( state == "CrashedState", category == "Service" );
then
    system.invoke( new CreateServiceAction( service ) );
    retract( service );
end

rule "UpdateHostServiceCounts"
    salience 10
when
    ahost : Host( );
    serviceCount : Number( ) from accumulate(
        ManagedService( host == ahost.name ), sum( 1 ) );
then
    ahost.setServiceCount( serviceCount.intValue( ) );
end

rule "UnregisterOverloadedServices"
    salience 10
when
    namelist : NamedList( name == "removedServiceList" );
    service : ManagedService( state == "SaturatedState", category == "TransientService",
        active == true );
    registry : ManagedService( type == "Xregistry", service.group == group );
then
    service.setActive( false );
    update( service );
    system.invoke( registry, "removeService", "urn:removeService",
        "serviceName=" + service.getName( ) );
end

rule "RegisterBackIdleServices"
    salience 10
when
    namelist : NamedList( name == "removedServiceList", rlist : list );
    service : ManagedService( state == "IdleState", active == false );
    registry : ManagedService( type == "Xregistry", service.group == group );
then
    service.setActive( true );
    update( service );
    system.invoke( registry, "addService", "urn:addService",
        "serviceName=" + service.getName( ) );
end

rule "CreateEnoughAppServices"
    salience 10
when
    wfengine : ManagedService( category == "Service", type == "WorkflowEngine" );
    ArrayList( size > 100 ) from collect(
        ManagedService( category == "TransientService", group == wfengine.group ) );
    app : NamedString( name == "App" );
    ArrayList( size < 20 ) from collect(
        ManagedService( category == "TransientService", group == wfengine.group,
            type == app.value,
            ( state == "IdleState" || state == "BusyState" ) ) );
then
    system.invoke( new CreateServiceAction( app.getValue( ), wfengine.getGroup( ) ) );
end

rule "RemoveCrashedServices"
    salience 0
when
    service : ManagedService( state == "CrashedState" );
then
    retract( service );
end

Appendix C: Rules for the Hasthi Rule Sensitivity Test

The Hasthi rule sensitivity test used the following rules.

rule "Init"
when
then
    NamedList removedServiceList = new NamedList( "removedServiceList" );
    removedServiceList.add( "WRF2" );
    insert( removedServiceList );
    insert( new NamedString( "App", "WRF" ) );
    insert( new NamedString( "App", "ADAS" ) );
    insert( new NamedString( "App", "NAM" ) );
    insert( new NamedString( "App", "WRFStatic" ) );
    insert( new NamedString( "App", "Cluster" ) );
    System.out.println( "Initialized" );
end

rule "UpdateHostServiceCounts"
    salience 10
when
    ahost : Host( );
    serviceCount : Number( ) from accumulate(
        ManagedService( host == ahost.name ), sum( 1 ) );
then
    ahost.setServiceCount( serviceCount.intValue( ) );
end

rule "UnregisterOverloadedServices"
    salience 10
when
    service : ManagedService( state == "SaturatedState", category == "TransientService",
        active == true );
    registry : ManagedService( type == "Xregistry", service.group == group );
then
    service.setActive( false );
    update( service );
    system.invoke( registry, "removeService", "urn:removeService",
        "serviceName=" + service.getName( ) );
end

rule "RegisterBackIdleServices"
    salience 10
when
    service : ManagedService( state == "IdleState", active == false );
    registry : ManagedService( type == "Xregistry", service.group == group );
then
    service.setActive( true );
    update( service );
    system.invoke( registry, "addService", "urn:addService",
        "serviceName=" + service.getName( ) );
end

rule "CreateEnoughAppServices"
    salience 10
when
    wfengine : ManagedService( category == "Service", type == "WorkflowEngine" );
    ArrayList( size > 100 ) from collect(
        ManagedService( category == "TransientService", group == wfengine.group ) );
    app : NamedString( name == "App" );
    ArrayList( size < 20 ) from collect(
        ManagedService( category == "TransientService", group == wfengine.group,
            type == app.value,
            ( state == "IdleState" || state == "BusyState" ) ) );
then
    system.invoke( new CreateServiceAction( app.getValue( ), wfengine.getGroup( ) ) );
end

rule "RemoveCrashedServices"
    salience 0
when
    service : ManagedService( state == "CrashedState" );
then
    retract( service );
end

rule "RecoverFailedHost"
    salience 10
when
    host : Host( state == "CrashedState" );
    service : ManagedService( state == "CrashedState", category == "Service",
        host == host.name, type not matches ".*MySQL.*" );
then
    final ActionCenter finalSystem = system;
    final ManagedService failedService = service;
    ActionCallback callback = new ActionCallback( ) {
        public void actionSucessful( ManagementAction action ) {
            MngActionUtils.setResourceState( action.getActionContext( ),
                failedService, "RepairedState" );
        }
        public void actionFailed( ManagementAction action, Throwable e ) {
            e.printStackTrace( );
        }
    };
    MngActionUtils.setResourceState( system.getActionContext( ),
        failedService, "RepairedState" );
    system.invoke( new CreateServiceAction( failedService ), callback );
end

rule "RestartFailedServices"
    salience 10
when
    service : ManagedService( state == "CrashedState" );
    host : Host( state != "CrashedState", service.host == name );
then
    final ManagedService failedService = service;
    final ActionCenter finalSystem = system;
    MngActionUtils.setResourceState( system.getActionContext( ),
        failedService, "RepairedState" );
    system.invoke( new RestartAction( service ), new ActionCallback( ) {
        public void actionSucessful( ManagementAction action ) {
            MngActionUtils.setResourceState( action.getActionContext( ),
                failedService, "RepairedState" );
        }
        public void actionFailed( ManagementAction action, Throwable e ) {
            e.printStackTrace( );
        }
    } );
end

rule "LogSystemNonHealthyTime"
    salience 8
when
    wfengine : ManagedService( type == "WorkflowEngine", state != "CrashedState" );
    sl : ArrayList( size > 0 ) from collect(
        ManagedService( state == "CrashedState", category == "Service",
            wfengine.group == group ) );
then
    system.put( wfengine.getName( ) + "SystemFailed",
        new Long( System.currentTimeMillis( ) ) );
end

rule "ResurrectWorkflowsAfterRecovery"
    salience 5
when
    wfengine : ManagedService( type == "WorkflowEngine", state != "CrashedState" );
    sl : ArrayList( size >= 13 ) from collect(
        ManagedService( state == "BusyState" || state == "IdleState",
            category == "Service", wfengine.group == group ) );
    eval( system.get( wfengine.getName( ) + "SystemFailed" ) != null );
then
    system.invoke( wfengine, "resurrectWorkflow", "urn:resurrectWorkflow",
        "ResurrectWorkflow" );
    system.remove( wfengine.getName( ) + "SystemFailed" );
    system.remove( wfengine.getName( ) + "MailSent" );
end

rule "NotifyDowntimes"
    salience 9
when
    wfengine : ManagedService( type == "WorkflowEngine", state != "CrashedState" );
    eval( system.get( wfengine.getName( ) + "MailSent" ) == null );
    eval( system.get( wfengine.getName( ) + "SystemFailed" ) != null
        && ( ( (Long) system.get( wfengine.getName( ) + "SystemFailed" ) ).longValue( )
            < System.currentTimeMillis( ) - 10 * 60 * 1000 ) );
then
    system.invoke( wfengine, "sendEmail", "urn:sendEmail", "Notify a Downtime" );
    system.put( wfengine.getName( ) + "MailSent", "true" );
end

rule "HandleUnknownErrors"
when
    service : ManagedService( state == "FaultyState" );
    host : Host( state != "CrashedState", service.host == name );
then
    final ManagedService failedService = service;
    final ActionCenter finalSystem = system;
    system.invoke( new ShutDownAction( service ), new ActionCallback( ) {
        public void actionSucessful( ManagementAction action ) {
            MngActionUtils.setResourceState( action.getActionContext( ),
                failedService, "RepairedState" );
        }
        public void actionFailed( ManagementAction action, Throwable e ) {
            e.printStackTrace( );
        }
    } );
end

Appendix D: Expected Time for n Continuous HEADs

Consider a biased coin where the probability of a HEAD is p. Let E_n be the expected number of throws required to get n consecutive HEADs. To find E_n, we define a recurrence relation for E_n and solve it to obtain a closed-form expression. The following proof extends the result for an unbiased coin given in [8] to a biased coin.

Let us first consider a single HEAD. In this case, either a HEAD occurs in the first trial with probability p, or it takes 1 + E_1 trials with probability 1 − p. Then E_1 = 1·p + (1 + E_1)(1 − p) = 1/p.

Now let us consider E_n. For n HEADs to occur, first n − 1 HEADs should occur, and then it will either take one more throw with probability p, or (E_{n−1} + 1 + E_n) throws with probability (1 − p). Note that in the latter case, we have to restart throwing and get another n consecutive HEADs (which takes E_n throws), having already spent E_{n−1} + 1 throws. Therefore,

E_n = p(E_{n−1} + 1) + (1 − p)(E_{n−1} + 1 + E_n), which simplifies to E_n = (1/p)(E_{n−1} + 1).

Expanding the recurrence using E_1 = 1/p gives

E_n = 1/p + 1/p^2 + ... + 1/p^n = (1/p) ∑_{k=0}^{n−1} (1/p)^k,  n ≥ 1,

and summing the geometric series yields

E_n = (1 − p^n) / (p^n (1 − p)).

This completes the derivation.
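As a quick numerical check of this result, the following small Java program (not part of the original experiments) estimates E_n by simulation and compares it with (1 − p^n)/(p^n(1 − p)); the values n = 5 and p = 0.6 are arbitrary.

import java.util.Random;

public class ConsecutiveHeads {

    // Number of throws of a coin with HEAD-probability p until n consecutive HEADs occur.
    static long throwsUntilRun(int n, double p, Random rng) {
        long throwsCount = 0;
        int run = 0;
        while (run < n) {
            throwsCount++;
            run = (rng.nextDouble() < p) ? run + 1 : 0;
        }
        return throwsCount;
    }

    public static void main(String[] args) {
        int n = 5;
        double p = 0.6;
        int trials = 200000;
        Random rng = new Random(42);

        double total = 0;
        for (int i = 0; i < trials; i++) {
            total += throwsUntilRun(n, p, rng);
        }
        double simulated = total / trials;
        double analytical = (1 - Math.pow(p, n)) / (Math.pow(p, n) * (1 - p));

        System.out.printf("simulated  E_n = %.3f%n", simulated);
        System.out.printf("analytical E_n = %.3f%n", analytical);
    }
}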

Bibliography

[1] Amazon downtime. Online, 2008. http://www.techcrunch.com/2008/02/15/amazon-web-services-goes-down-takes-many-startup-sites-with-it/.
[2] Amazon web service case studies. Online. http://aws.amazon.com/solutions/casestudies/.
[3] Animoto - scaling through viral growth. Online. http://aws.typepad.com/aws/2008/04/animoto—scali.html.
[4] Apache Axis2 project. Online. http://ws.apache.org/axis2/.
[5] Apache Hadoop. Online. hadoop.apache.org/core/.
[6] Apache Logging Services project. Online. http://logging.apache.org/log4j/.
[7] Apache Muse. Online. http://ws.apache.org/muse/.
[8] Consecutive heads. Online. http://www.qbyte.org/puzzles/p082s.html.
[9] Drools. Online. http://labs.jboss.com/drools/.
[10] FreePastry. Online. http://freepastry.rice.edu/.
[11] Graphical models. Online. http://view.eecs.berkeley.edu/wiki/Graphical Models.



[12] High scalability: building bigger, faster, more reliable websites. Online. http://highscalability.com/.
[13] Hyperic HQ. Online. http://www.hyperic.com/.
[14] Hyperic SIGAR API. Online. http://www.hyperic.com/products/sigar.html.
[15] ISO/OSI management. Online. http://www.iso.org.
[16] Keynote: Availability and maintainability >> performance: New focus for a new century. USENIX FAST keynote, 2002. Online. http://www.usenix.org/events/fast02/tech.html.
[17] Load average: Wikipedia entry. Online. http://en.wikipedia.org/wiki/Load average.
[18] Message Passing Interface. Online. http://www.mpi-forum.org/.
[19] Nyquist-Shannon sampling theorem. Online. http://en.wikipedia.org/wiki/Nyquist Shannon sampling theorem.
[20] Open Grid Computing Environment. Online. http://www.collab-ogce.org/.
[21] Prevention of online crashes is no easy fix. Los Angeles Times, 1999. Online. http://articles.latimes.com/1999/dec/02/business/fi-39616.
[22] Top twenty sites: Most downtime. Michael Arrington, 2007. Online. http://www.techcrunch.com/2007/04/02/top-twenty-sites-most-downtime/.
[23] Underneath the covers at Google: Current systems and future directions. Jeff Dean, Google IO sessions, 2008. Online. http://sites.google.com/site/io.
[24] Windows Management Instrumentation, MSDN library. Online. http://www.hyperic.com/.



[25] Common Management Information Protocol specification (CMIP), 1991. ITU-T Recommendation X.711. Data Communication Networks - Open Systems Interconnection (OSI); Management.
[26] OASIS Web Services Distributed Management. Online, August 2006. www.oasis-open.org/committees/wsdm/.
[27] Web services for management. Online, April 2006. http://www.dmtf.org/standards/wsman/.
[28] E.N. Adams. Optimizing Preventive Service of Software Products. IBM Journal of Research and Development, 28(1):2–14, 1984.
[29] M. Agarwal, V. Bhat, H. Liu, et al. Automate: Enabling autonomic applications on the grid. In AMS'03: International Workshop on Active Middleware Services, page 48. IEEE Computer Society, 2003.
[30] Ehab Al-Shaer, Hussein Abdel-Wahab, and Kurt Maly. HiFi: A new monitoring architecture for distributed systems management. In ICDCS'99: IEEE International Conference on Distributed Computing Systems. IEEE Computer Society, 1999.
[31] J. Albrecht, C. Tuttle, A.C. Snoeren, and A. Vahdat. PlanetLab application management using Plush. ACM SIGOPS Operating Systems Review, 40(1):33–40, 2006.
[32] C. Anderson. The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion, 2006.
[33] K. Asanovic et al. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, Electrical Engineering and Computer Sciences, University of California at Berkeley.



[34] Gerd Aschemann, Svetlana Domnitcheva, Peer Hasselmeyer, Roger Kehr, and Andreas Zeidler. A framework for the integration of legacy devices into a jini management federation. In DSOM ’99: Proceedings of the 10th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management, pages 257–268. Springer-Verlag, 1999. [35] Raphael M. Bahati, Michael A. Bauer, and Elvis M. Vieira. Mapping policies into autonomic management actions. In ICAS ’06: Proceedings of the International Conference on Autonomic and Autonomous Systems, page 38. IEEE Computer Society, 2006. [36] J B Baker, Darrell Reimer, Sam Spiro, and John Whitfield. Management of serviceoriented architecture ibm tivoli soa management suite, June 2005. [37] L.A. Barroso, J. Dean, and U. H¨olzle. Web Search for a Planet: The Google Cluster Architecture. IEEE MICRO, pages 22–28, 2003. [38] M. A. Bauer, P. J. Finnigan, J. W. Hong, J. A. Rolia, T. J. Teorey, and G. A. Winters. Reference architecture for distributed systems management. IBM System. Journal, 33(3):426–444, 1994. [39] J. Baumann. Mobile Agents: Control Algorithms, section Appendix B, Introduction to Fault Tolerance. Springer Verlag, 2000. [40] G. Blelloch and G. Narlikar. A practical comparison of N-body algorithms. Parallel Algorithms: Third DIMACS Implementation Challenge, October 17-19, 1994, page 81, 1997. [41] S. Bouchenak, N. De Palma, D. Hagimont, and C. Taton. Autonomic management of clustered applications. In Proceedings of the 2006 IEEE International Conference on Cluster Computing, pages 1–11. IEEE Computer Society, 2006.



[42] Robert Paul Brettf, Subu Iyer, Dejan Milojicic, Sandro Rafaeli, and Vanish Talwar. Scalable management. In ICAC ’05: IEEE International Conference on Autonomic Computing, pages 159–170. IEEE Computer Society, 2005. [43] A. Brown and D.A. Patterson. To Err is Human. In Proc. 2001 Workshop on Evaluating and Architecting System dependabilitY. [44] A. Buchmann, C. Bornhovd, M. Cilia, L. Fiege, F. Gartner, and M. Meixner. Dream: Distributed Reliable Event-based Application Management. Web Dynamics, pages 319–352, 2003. [45] G. Candea, A.B. Brown, A. Fox, and D. Patterson. Recovery-Oriented Computing: Building Multitier Dependability. COMPUTER, pages 60–67, 2004. [46] J. D. Case, M. Fedor, M. L. Schoffstall, and J. Davin. Simple network management protocol (snmp), 1990. [47] Shang-Wen Cheng, An-Cheng Huang, David Garlan, Bradley Schmerl, and Peter Steenkiste. An architecture for coordinating multiple self-management systems. In WICSA ’04: Proceedings of the Fourth Working IEEE/IFIP Conference on Software Architecture (WICSA’04), page 243, Washington, DC, USA, 2004. IEEE Computer Society. [48] David M. Chess, Alla Segal, Ian Whalley, and Steve R. White. Unity: Experiences with a prototype autonomic computing system. In ICAC’04: IEEE International Conference on Autonomic Computing, pages 140–147. IEEE Computer Society, 2004. [49] Chun and David E. Culler. The ganglia distributed monitoring system: design, implementation, and experience. Parallel Computing, 30(7):817–840, 2004.



[50] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation-Volume 6 table of contents, pages 10–10. USENIX Association Berkeley, CA, USA, 2004. [51] D. Deugo, M. Weiss, and E. Kendall. Reusable Patterns for Agent Coordination. Coordination of Internet Agents, Springer, 2001. [52] S. Dolev. Self-stabilization. MIT press, 2000. [53] Kelvin Droegemeier et al. Linked environments for atmospheric discovery (lead): Architecture, technology road map and deployment strategy. In 21st International Conference on Interactive Information Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology, 2005. [54] Abhishek Dubey, Steve Nordstrom, Turker Keskinpala, Sandeep Neema, Ted Bapty, and Gabor Karsai. Towards a model-based autonomic reliability framework for computing clusters. In EASE ’08: Proceedings of the Fifth IEEE Workshop on Engineering of Autonomic and Autonomous Systems (ease 2008), pages 75–85, Washington, DC, USA, 2008. IEEE Computer Society. [55] Mohamed El-Darieby and Diwakar Krishnamurthy. A scalable wide-area grid resource management framework. In ICNS’06:International conference on Networking and Services, page 76. IEEE Computer Society, 2006. [56] E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3):375–408, 2002. [57] Robert E. Filman and Diana D. Lee. Managing distributed systems with smart subscriptions. Technical report, 2000.



[58] C.L. Forgy. Rete: a fast algorithm for the many pattern/many object pattern match problem. Ieee Computer Society Reprint Collection, pages 324–341, 1991. [59] D. Fraser, I. Foster, et al. Engaging with the LEAD science gateway project: lessons learned in successfully deploying complex system solutions on teragrid. In 3rd TeraGrid Conference,http://www.tacc.utexas.edu/tg08/, 2008. [60] Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, and Marlon Pierce. Scalable, fault-tolerant management of grid services. In In Proceedings of IEEE Cluster 2007. IEEE Computer Society, 2007. [61] A.G. Ganek and T.A. Corbi. The dawning of the autonomic computing era. IBM SYSTEMS JOURNAL, 42(1):5–18, 2003. [62] D. Gannon, B. Plale, and D.A. Reed. Service Architectures for e-Science Grid Gateways: Opportunities and Challenges. LECTURE NOTES IN COMPUTER SCIENCE, 4804:1179, 2007. [63] David Garlan, Shang-Wen Cheng, An-Cheng Huang, Bradley Schmerl, and Peter Steenkiste. Rainbow: Architecture-based self-adaptation with reusable infrastructure. Computer, 37(10):46–54, 2004. [64] I. Georgiadis, J. Magee, and J. Kramer. Self-organising software architectures for distributed systems. Proceedings of the first workshop on Self-healing systems, pages 33–38, 2002. [65] S. Ghemawat, H. Gobioff, and S.T. Leung. The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29–43, 2003. [66] Debanjan Ghosh, Raj Sharman, H. Raghav Rao, and Shambhu Upadhyaya. Selfhealing systems - survey and synthesis. Decis. Support Syst., 42(4):2164–2185, 2007.



[67] P. Goldsack, J. Guijarro, A. Lain, G. Mecheneau, P. Murray, and P. Toft. SmartFrog: Configuration and Automatic Ignition of Distributed Applications. HP OVUA, 2003.

[68] J. Gray and D.P. Siewiorek. High-availability computer systems. Computer,
24(9):39–48, 1991. [69] James Hamilton. On designing and deploying internet-scale services. In LISA’07: Proceedings of the 21st conference on Large Installation System Administration Conference, pages 1–12, Berkeley, CA, USA, 2007. USENIX Association. [70] Thomas Heinis, Cesare Pautasso, and Gustavo Alonso. Design and evaluation of an autonomic workflow engine. In ICAC ’05: IEEE International Conference on Autonomic Computing, pages 27–38. IEEE Computer Society, 2005. [71] T. Hey and A. Trefethen. The Data Deluge: An e-Science Perspective. Grid Computing: Making the Global Infrastructure a Reality, pages 809–824, 2003. [72] Ryan Huebsch, Joseph M. Hellerstein, Nick Lanham Boon, Thau Loo, Scott Shenker, and Ion Stoica. Querying the internet with pier. In Proceedings of 19th International Conference on Very Large Databases (VLDB), 2003. [73] Michael Jarrett and Rudolph Seviora. Constructing an autonomic computing infrastructure using cougaar. In EASE ’06: Proceedings of the Third IEEE International Workshop on Engineering of Autonomic & Autonomous Systems (EASE’06), pages 119–128. IEEE Computer Society, 2006. [74] Gail Kaiser, Janak Parekh, Philip Gross, and Giuseppe Valetto. Kinesthetics extreme: An external infrastructure for monitoring distributed legacy systems. In AMS’03:International Workshop on Active Middleware Services, page 22. IEEE Computer Society, 2003.



[75] J.O. Kephart, H. Chan, R. Das, D.W. Levine, G. Tesauro, F. Rawson, and C. Lefurgy. Coordinating Multiple Autonomic Managers to Achieve Specified PowerPerformance Tradeoffs. Autonomic Computing, 2007. ICAC’07. Fourth International Conference on, pages 24–24, 2007. [76] O. Khalili, J. He, C. Olschanowsky, A. Snavely, and H. Casanova. Measuring the performance and reliability of production computational grids. In Intl. Conf. on Grid Computing (GRID), IEEE Computer Society, 2006. [77] J. Knight, D. Heimbigner, E.L. Wolf, A. Carzaniga, A. Carzaniga, J. Hill, J. Hill, P. Devanbu, P. Devanbu, M. Gertz, et al. The Willow architecture: comprehensive survivability for large-scale distributed applications. In Distributed Applications., Intrusion Tolerance Workshop, Dependable Systems and Networks (DSN 2002), Washington DC, 2002. [78] T. Koch, B. Kramer, and G. Rohde. On a rule based management architecture. In SDNE ’95: Proceedings of the 2nd International Workshop on Services in Distributed and Networked Environments, page 68. IEEE Computer Society, 1995. [79] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, C. Wells, et al. OceanStore: an architecture for globalscale persistent storage. ACM SIGARCH Computer Architecture News, 28(5):190– 201, 2000. [80] V. Kumar, BF Cooper, and K. Schwan. Distributed Stream Management using Utility-Driven Self-Adaptive Middleware. In Autonomic Computing, 2005. ICAC 2005. Proceedings. Second International Conference on, pages 3–14, 2005. [81] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to parallel computing: design and analysis of algorithms. Benjamin-Cummings Publishing Co., Inc. Redwood City, CA, USA, 1994.



[82] Benjamin C. Ling, Emre Kiciman, and Armando Fox. Session state: beyond soft state. In NSDI'04: Proceedings of the 1st conference on Symposium on Net-
worked Systems Design and Implementation, pages 22–22, Berkeley, CA, USA, 2004. USENIX Association. [83] Hua Liu and Manish Parashar. Rule-based monitoring and steering of distributed scientific applications. International Journal of High Performance Computing and Networking (IJHPCN), 3(4):78–96, 2005. [84] RK Madduri, CS Hood, and WE Allcock. Reliable file transfer in grid environments. In Local Computer Networks, 2002. Proceedings. LCN 2002. 27th Annual IEEE Conference on, pages 737–738, 2002. [85] A. Maloney and A. Goscinski. A survey and review of the current state of rollbackrecovery for cluster systems. Concurrency and Computation: Practice and Experience, 2009. [86] Jean-Philippe Martin-Flatin, Simon Znaty, and Jean-Pierre Hubaux. A survey of distributed enterprise network andsystems management paradigms. J. Netw. Syst. Manage., 7(1):9–26, 1999. [87] Keith Marzullo and Mark D. Wood. Tools for constructing distributed reactive systems. Technical Report TR 91-1193, Ithaca, New York (USA), 1991. [88] Keith Marzullo and Mark D. Wood. Tools for constructing distributed reactive systems. Technical Report TR 91-1193, Ithaca, New York (USA), 1991. [89] Julie McCann and Markus Huebscher. A survey of Autonomic Computing - degrees, models and applications. December 2007. [90] Eamonn McManus et al. Java management extensions (jmx) specification. Technical report, 2006.



[91] Sam Michiels, Nico Janssens, Wouter Joosen, and Pierre Verbaeten. Decentralized cooperative management: a bottom-up approach. In IADIS AC, pages 401–408, 2005. [92] Pierre Mouallem. Fault tolerance and reliability in scientific workflows. Master’s thesis, Master thesis, north Carolina State University. United States, 2005. NCSU, NDLTD Union Catalog [http://alcme.oclc.org/ndltd/servlet/OAIHandler]. [93] V. K. Naik, A. Mohindra, and D. F. Bantz. An architecture for the coordination of system management services. IBM Syst. J., 43(1):78–96, 2004. [94] Suman Nath, Haifeng Yu, Phillip B. Gibbons, and Srinivasan Seshan. Tolerating correlated failures in wide-area monitoring services. Technical report, Intel Corporation, 2004. IRP-TR-04-09. [95] B. Clifford Neuman. Scale in distributed systems. In Readings in Distributed Computing Systems, pages 463–489. IEEE Computer Society Press, 1994. [96] H. B. Newman, I. C. Legrand, P. Galvez, R. Voicu, and C. Cirstoiu. Monalisa : A distributed monitoring service architecture. In Conference for Computing in High Energy and Nuclear Physics, 2003. [97] David Oppenheimer, Vitaliy Vatkovskiy, Hakim, Weatherspoon, et al. Monitoring, analyzing, and controlling internet-scale systems with acme. Technical report, UC Berkelay, 2003. http://techreports.lib.berkeley.edu/accessPages/CSD-03-1276.html. [98] Andre Panisson, Diego Moreira da Rosa, Cristina Melchiors, Lisandro Zambenedetti Granville, Maria Janilce Bosquiroli Almeida, and Liane Margarida Rockenbach Tarouco. Designing the architecture of p2p-based network management systems. In ISCC ’06: Proceedings of the 11th IEEE Symposium on Computers and Communications, pages 69–75. IEEE Computer Society, 2006.



[99] Michael P. Papazoglou and Willem-Jan van den Heuvel. Web services management: A survey. IEEE Internet Computing, 9(6):58–64, 2005. [100] M. Parashar and S. Hariri. Autonomic Grid Computing–Concepts, Requirements, Infrastructures, chapter Architecture Overview for Autonomic Computing. CRC Press, 2006. [101] D. Patterson, A. Brown, P. Broadwell, G. Candea, M. Chen, J. Cutler, P. Enriquez, A. Fox, E. Kiciman, M. Merzbacher, et al. Recovery-Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. 2002. [102] Christoph Reich, Matthias Banholzer, Rajkumar Buyya, and Kris Bubendorfer. Engineering an Autonomic Container for WSRF-based Web Services. In proceedings of the 15th International Conference on Advanced Computing and Communication (ADCOM), Bangalore, India, December 2007. [103] Robbert Van Renesse, Kenneth P. Birman, and Werner Vogels. Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining. ACM Trans. Comput. Systems., 21(2):164–206, 2003. [104] Timothy Roscoe, Richard Mortier, Paul Jardetzky, and Steven Hand. Infospect: using a logic language for system health monitoring in distributed systems. In EW10: Proceedings of the 10th workshop on ACM SIGOPS European workshop, pages 31– 37. ACM Press, 2002. [105] Jonathan Charles Rowanhill. Survivability Management Architecture for Very Large Distributed Systems. PhD thesis, University of Virginia, 2004. [106] S. M. Sadjadi and P. K. McKinley. A survey of adaptive middleware. Technical Report MSU-CSE-03-35, Computer Science and Engineering, Michigan State University, December 2003.



[107] S. Sankaran, J.M. Squyres, B. Barrett, V. Sahay, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman. The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. International Journal of High Performance Computing Applications, 19(4):479, 2005.

[108] Bradley Schmerl and David Garlan. Exploiting architectural design knowledge to support self-repairing systems. In SEKE ’02: Proceedings of the 14th international conference on Software engineering and knowledge engineering, pages 241–248. ACM, 2002.

[109] J. Schoenwaelder. Using multicast snmp to coordinate distributed management agents. In 2nd IEEE International Workshop on Systems Management (SMW’96), page 136. IEEE Computer Society, 1996.

[110] Koon seng Lim, Constantin Adam, and Rolf Stadler. Decentralizing network management. Technical report, Royal Institute of Technology (KTH), 2005.

[111] C.C. Shen, C. Jaikaeo, C. Srisathapornphat, and Z. Huang. The Guerrilla management architecture for ad hoc networks. In MILCOM 2002. Proceedings, volume 1, 2002.

[112] A. Singla, U. Ramachandran, and J. Hodgins. Temporal notions of synchronization and consistency in Beehive. In Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures, pages 211–220. ACM New York, NY, USA, 1997.

[113] N.K. Srinivasan. Reliability: How to quantify and improve? Resonance, 5(5):55–63, 2000.



[114] Peter Steenkiste and An-Cheng Huang. Recipe based Service Configuration and adaptation, chapter 10. CRC Press, 2006. Autonomic Computing: Concepts, Infrastructure and Applications, M. Parashar and S. Hariri. [115] Michael Stonebraker. The Case for Shared Nothing. Database Engineering Bulletin, 9(1):4–9, 1986. [116] Rajagopal Subramaniyan, Pirabhu Raman, and Alan D. George. Gems: Gossipenabled monitoring service for scalable heterogeneous distributed systems. Cluster Computing, 9(1):101–120, 2006. [117] T. Sweeney. No Time for DOWNTIMEIT Managers Feel the Heat to Prevent Outages that Can Cost Millions of Dollars. InternetWeek, pages 104–105, 2000. [118] DB Terry, AJ Demers, K. Petersen, MJ Spreitzer, MM Theimer, and BB Welch. Session guarantees for weakly consistent replicated data. In Parallel and Distributed Information Systems, 1994., Proceedings of the Third International Conference on, pages 140–149, 1994. [119] M. Treaster. A survey of fault-tolerance and fault-recovery techniques in parallel systems. Arxiv preprint cs.DC/0501002, 2005. [120] William Vambenepe, Carol Thompson, Vanish Talwar, Sandro Rafaeli, Bryan Murray, Dejan Milojicic, Subu Iyer, Keith Farkas, and Martin Arlitt. Dealing with scale and adaptation of global web services management. In ICWS ’05: Proceedings of the IEEE International Conference on Web Services (ICWS’05), pages 339–346. IEEE Computer Society, 2005. [121] Jeffrey S. Vetter and Daniel A. Reed. Real-time performance monitoring, adaptive control, and interactive steering of computational grids. Int. J. High Perform. Comput. Appl., 14(4):357–366, 2000.



[122] Steve Vinoski. Chain of responsibility. IEEE Internet Computing, 6(6):80–83, 2002. [123] Werner Vogels. Beyond server consolidation. Queue, 6(1):20–26, 2008. [124] Werner Vogels and Dan Dumitriu. An overview of the galaxy management framework for scalable enterprise cluster computing. Cluster, 00:109, 2000. [125] Abdul Waheed, Warren Smith, Jude George, and Jerry Yan. An infrastructure for monitoring and management in computational grids. International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers, pages 619– 628, 2000. [126] Mike Wawrzoniak, Larry Peterson, and Timothy Roscoe. Sophia: an information plane for networked systems. SIGCOMM Comput. Commun. Rev., 34(1):15–20, 2004. [127] Xing Wu, Jin Chen, Ruqiang Li, and Fucai LiDOI. Web-based remote monitoring and fault diagnosis system. The International Journal of Advanced Manufacturing Technology, 28(1):162–175, 2006. [128] Praveen Yalagandula and Mike Dahlin. A scalable distributed information management system. In SIGCOMM ’04: Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications, pages 379–390. ACM, 2004. [129] Serafeim Zanikolas and Rizos Sakellariou. A taxonomy of grid monitoring systems. Future Gener. Comput. Syst., 21(1):163–188, 2005.

Curriculum Vitae

Srinath Perera was born in Sri Lanka in 1980. He received his Bachelor of Science in Computer Science and Engineering from the University of Moratuwa, Sri Lanka, in September 2004, and he received his Master of Science in Computing Sciences from Indiana University, Bloomington, in May 2007. He has been involved with the Apache Web Services project since 2002. He is an Apache Member and a co-founder of Apache Axis2.
