OPTIMIZING TASK ASSIGNMENT FOR COLLABORATIVE COMPUTING OVER HETEROGENEOUS NETWORK DEVICES
by
Yi-Hsuan Kao
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2016
Copyright 2016
Yi-Hsuan Kao
Dedication
To my beloved family: Yao-Wu Kao, Yen-Ling Hung, Ru-Huei Shiao, Zi-Xin Kao and Mabli Kao
Acknowledgments

I would like to express my sincere gratitude to the many individuals who helped and contributed throughout my PhD study. First, I would like to thank my advisor, Professor Bhaskar Krishnamachari. He gave me extensive freedom to explore everything I found interesting, in both academic research and life beyond school. He gave me crayons to draw on a sheet of paper, but he never minded when I actually drew on the walls. I went through several different stages of my life during these years; he gave me his full trust and understood my situation through these life events. He showed his excellence not only as a teacher and a researcher, but truly as an advisor.

Besides my advisor, I would like to thank Dr. Fan Bai for bringing me industrial insights and giving me valuable advice on the research projects. I would also like to thank the rest of my thesis committee, Professor John Silvester and Professor Leana Golubchik, for their insightful comments and contributions to my thesis. I would also like to thank Professor Alexandros G. Dimakis for his advice during my Master's degree studies. He spent a considerable amount of time with me on research and courses. His patience and encouragement helped me overcome the time when I was not comfortable speaking English.

My sincere thanks also go to other knowledgeable researchers: the ANRG members, Professor Shaddin Dughmi, and Professor Rajgopal Kannan. I thank my ANRG labmates for comprehensive discussions on research and life. We, as a self-motivated group, shared memorable times facing ups and downs, and kept the faith for the best. I want to thank Professor Shaddin Dughmi for serving on my qualifying exam committee and for offering an enlightening course in spring 2014, which opened my eyes to approximation algorithms. I also want to thank Professor Rajgopal Kannan, Dr. Moo-Ryong Ra, and Mr. Kwame Wright for stimulating discussions and collaboration on my research.

The work in this dissertation was supported in part by funding from NSF and GM Research. I would also like to express my appreciation to the Annenberg Fellowship Program, and to my country, Taiwan, for supporting my tuition and offering stipends during my PhD study. Moreover, I thank the Medical Program and the WIC Program for offering medical insurance and nutrition benefits to my dependents.

Last, I want to dedicate this thesis to my family: my parents, Yao-Wu Kao and Yen-Ling Hung, my wife, Ru-Huei Shiao, my brother, Yi-Jiun Kao, and my two kids, Zi-Xin and Mabli. Life is unpredictable. Without your support, it would not have been possible for me to complete my PhD study. I love you.
Table of Contents

Dedication
Acknowledgments
List Of Figures
List Of Tables
Abstract

Chapter 1: Introduction
  1.1 General Settings
    1.1.1 Application Task Graph
    1.1.2 Resource Network
  1.2 Contributions
    1.2.1 Deterministic Optimization with Single Constraint
    1.2.2 Deterministic Optimization with Multiple Constraints
    1.2.3 Stochastic Optimization with Single Constraint
    1.2.4 Online Learning in Stationary Environments
    1.2.5 Online Learning in Non-stationary Environments

Chapter 2: Background
  2.1 Approximation Algorithms
    2.1.1 Two Examples: Set Cover and Vertex Cover
    2.1.2 Hardness of a Problem
      2.1.2.1 Complexity Analysis
      2.1.2.2 Proof of NP-hardness
    2.1.3 LP Relaxation and Rounding Algorithms
      2.1.3.1 Deterministic Rounding Algorithms
      2.1.3.2 Randomized Rounding Algorithms
    2.1.4 Other Approximation Algorithms
  2.2 Dynamic Programming and FPTAS
    2.2.1 The Knapsack Problem
    2.2.2 FPTAS: Fully Polynomial Time Approximation Scheme
    2.2.3 Guidelines to Derive FPTAS from DP
      2.2.3.1 When the optimum cannot be properly bounded
  2.3 Online Learning: The Multi-armed Bandit Problems
    2.3.1 Assumptions on a Bandit Process
    2.3.2 Performance Measurement
    2.3.3 An Example: The UCB1 Algorithm
    2.3.4 Adversarial Multi-armed Bandit Problems

Chapter 3: Related Work
  3.1 Task Assignment Formulations and Algorithms
  3.2 Multi-armed Bandit Problems
  3.3 Our Approach and Proposed Algorithms
  3.4 System Prototypes

Chapter 4: Deterministic Optimization with Single Constraint
  4.1 Problem Formulation
    4.1.1 Task Graph
    4.1.2 Cost and Latency
    4.1.3 Optimization Problem
  4.2 Proof of NP-hardness of SCTA
  4.3 Hermes: FPTAS Algorithms
    4.3.1 Tree-structured Task Graph
    4.3.2 Serial Trees
    4.3.3 Parallel Chains of Trees
    4.3.4 More General Task Graph
  4.4 Numerical Evaluation
    4.4.1 Algorithm Performance
    4.4.2 CPU Time Evaluation
    4.4.3 Benchmark Evaluation
    4.4.4 Discussion

Chapter 5: Deterministic Optimization with Multiple Constraints
  5.1 Problem Formulation
    5.1.1 Hardness of MCTA
  5.2 Sequential Randomized Rounding Algorithm
    5.2.1 Proof of Theorem 2
  5.3 A Bicriteria Approximation Algorithm for MCTA with Bounded Communication Costs¹
  5.4 Numerical Evaluation
  5.5 Conclusion

¹ This section is a joint work with Dr. Rajgopal Kannan, University of Southern California.

Chapter 6: Stochastic Optimization with Single Constraint
  6.1 Problem Formulation
  6.2 PTP: Probabilistic Delay Constrained Task Partitioning
  6.3 Numerical Evaluation
  6.4 Discussion

Chapter 7: Online Learning in Stationary Environments
  7.1 Why Online Learning?
  7.2 Models and Formulation
  7.3 The Algorithm: Hermes with DSEE
  7.4 Numerical Evaluation
  7.5 Discussion

Chapter 8: Online Learning in Non-stationary Environments
  8.1 Problem Formulation
  8.2 MABSTA Algorithm
  8.3 Proof of Theorem 5
    8.3.1 Proof of Lemmas
    8.3.2 Proof of Theorem 5
  8.4 Polynomial Time MABSTA
    8.4.1 Tree-structure Task Graph
    8.4.2 More General Task Graphs
    8.4.3 Marginal Probability
    8.4.4 Sampling
  8.5 Numerical Evaluation
    8.5.1 MABSTA's Adaptivity
    8.5.2 Trace-data Emulation²
  8.6 Discussion

² This section is a joint work with Mr. Kwame Wright, University of Southern California.

Chapter 9: Conclusion

Reference List
List Of Figures

2.1 A vertex cover problem
2.2 Relationship between quantized domain and original domain
2.3 An example of adversarial MAB
4.1 An example of application task graph
4.2 A tree-structured task graph with independent child sub-problems
4.3 Hermes' methodology
4.4 A task graph of serial trees
4.5 Hermes converges to the optimum as ε decreases
4.6 Hermes over 200 different application profiles
4.7 Hermes in dynamic environments
4.8 Hermes: CPU time measurement
4.9 Hermes: 36% improvement for the example task graph
4.10 Hermes: 10% improvement for the face recognition application
4.11 Hermes: 16% improvement for the pose recognition application
5.1 Collaborative computing on face recognition application
5.2 Graphical illustration of MCTA
5.3 SARA: optimal performance in expectation
5.4 BiAPP: (θ + 1, 2β + 2)-approximation (θ = 3, β = 2)
6.1 PTP: error probability
7.1 The task graph with 3 edges as maximum matching
7.2 The performance of Hermes using DSEE in dynamic environment
7.3 Comparison of Hermes using DSEE with other algorithms
8.1 The dependency of weights on a tree-structure task graph
8.2 The dependency of weights on a serial-tree task graph
8.3 MABSTA has better adaptivity to the changes than a myopic algorithm
8.4 Snapshots of measurement result
8.5 MABSTA's performance with upper bounds provided by Corollary 1
8.6 MABSTA compared with other algorithms for 5-device network
8.7 MABSTA compared with other algorithms for 10-device network
List Of Tables

2.1 Common Functions in Complexity Analysis
3.1 Task Assignment Formulations and Algorithms
4.1 Notations of SCTA
5.1 SARA/BiApp: Simulation Profile
6.1 PTP: Simulation Profile
8.1 Parameters Used in Trace-data Measurement
Abstract

The Internet of Things promises to enable a wide range of new applications involving sensors, embedded devices and mobile devices. Different from traditional cloud computing, where centralized and powerful servers offer high quality computing service, in the era of the Internet of Things there are abundant computational resources distributed over the network. These devices are not as powerful as servers, but are easier to access, with faster setup and short-range communication. However, because of energy, computation, and bandwidth constraints on smart things and other edge devices, it will be imperative to collaboratively run a computation-intensive application that a single device cannot support individually. As many IoT applications, like data processing, can be divided into multiple tasks, we study the problem of assigning such tasks to multiple devices, taking into account their abilities and the costs and latencies associated with both task computation and data communication over the network.

A system that leverages collaborative computing over the network faces a highly variable run-time environment. For example, the resources released by a device may suddenly decrease due to state changes of local processes, or the channel quality may degrade due to mobility. Hence, such a system has to learn the available resources, be aware of changes, and flexibly adapt its task assignment strategy to make efficient use of these resources.

We take a step by step approach to achieve these goals. First, we assume that the amounts of resources are deterministic and known. We formulate a task assignment problem that aims to minimize the application latency (system response time) subject to a single cost constraint, so that we will not overuse the available resources. Second, we consider that each device has its own cost budget, and our new multi-constrained formulation clearly attributes the cost to each device separately. Moving a step further, we assume that the amounts of resources are stochastic processes with known distributions, and solve a stochastic optimization with a strong QoS constraint. That is, instead of providing a guarantee on the average latency, our task assignment strategy guarantees that p% of the time the latency is less than t, where p and t are arbitrary numbers. Finally, we assume that the amounts of run-time resources are unknown and stochastic, and design online algorithms that learn the unknown information within a limited amount of time and make competitive task assignments.

We aim to develop algorithms that efficiently make decisions at run time; that is, the computational complexity should be as light as possible, so that running the algorithm does not incur considerable overhead. For the optimizations based on a known resource profile, we show these problems are NP-hard and propose polynomial-time approximation algorithms with performance guarantees, where the performance loss caused by the sub-optimal strategy is bounded. For the online learning formulations, we propose lightweight algorithms for both stationary and non-stationary environments and show their competitiveness by comparing their performance with the optimal offline policy (solved by assuming the resource profile is known). We perform comprehensive numerical evaluations, including simulations based on trace data measured at application run time, and validate our analysis of the algorithms' complexity and performance based on the numerical results. In particular, we compare our algorithms with existing heuristics and show that in some cases the performance loss given by a heuristic is considerable due to its sub-optimal strategy. Hence, we conclude that to efficiently leverage the distributed computational resources over the network, it is essential to formulate a sophisticated optimization problem that well captures the practical scenarios, and to provide an algorithm that is light in complexity and suggests a good assignment strategy with a performance guarantee.
Chapter 1
Introduction
We are entering an era of the Internet of Things (IoT) that connects billions of devices in the world [1]. From ubiquitous sensors and embedded devices to personal devices and cloud servers, computational resources are distributed over the network. As more and more promising applications leverage collaborative computation and data transmission on multiple devices [2], we are more aware of the requested content but less aware of how and where data communication and computation happen. This is a paradigm of pervasive computing, where computation is made to appear anytime and anywhere and the boundaries between devices begin to blur [3, 4].

Suppose you wear a pair of intelligent glasses that capture several images of the world. With the limited processor, memory and storage located on the glasses, applications like face recognition, which require feature extraction techniques and matching against a large data set, are hardly supported by such a wearable device alone. But if you have a mobile phone, face recognition is possible through collaborative computation. The glasses first do some simple processing and send the images to the mobile phone, which can filter out possible faces and match them with the profiles stored on it, or can access the cloud server for matching over a larger database through the cellular network.

With devices that are capable of running light computation spread everywhere, we are thinking from a system perspective: how can we leverage these devices to jointly support a complicated application that a single device fails to run individually? Ideally, data communication and computation happen unobtrusively, so the system must bring a good user experience, considering the resources that are available across the network. The system response time (latency) is a good metric of user experience. Processor cycles, storage and battery are the computation resources on a device; the channel bandwidth is the resource given by the network. We identify several key considerations on how a system can make good use of these resources:
• Awareness of data shipment: Data shipment uses network resources as well as device battery and incurs transmission delay. A system should be aware of these costs and avoid unnecessary data transmission.
• Device-dependent task assignment: Some sensors and embedded devices are equipped with microcontrollers that are designed to perform specific tasks [5]. A system should make proper task assignments based on each device's ability.

• Flexible computing in dynamic environments: The resources on a device (e.g., CPU load) and the channel conditions vary with time. A system should adapt task assignment to changes in the environment. For example, a sensor can transmit the raw data when the channel is good, but consume local processor resources to reduce the data size when the channel is bad.

We take a step by step approach to achieve these goals. We partition an application into executions of multiple tasks and aim to find a good assignment of tasks to available devices. First, given an application profile with the cost metrics and potential data shipment requirements between tasks, we solve for an optimal task assignment that minimizes application latency subject to a single cost constraint. Then, considering that each device may have its own cost budget, we formulate an optimization problem subject to individual constraints on each device. Taking a step further, in the scenario where the cost metrics and channel qualities vary with time, we model them as stochastic processes with known statistics and solve a stochastic optimization problem. Finally, we propose online learning algorithms to learn the unknown statistics and make competitive task assignments.

We envision that a future cyber foraging system [6] will be able to take requests, explore the environment and assign tasks to heterogeneous computing devices, so as to satisfy specified QoS requirements. Furthermore, the concept of macroprogramming enables application developers to code by describing high-level, abstracted functionality that is platform independent [7]. An interpreter plays an important role here, translating the high-level functional description into machine code for different devices. In particular, one crucial component that is closely related to system performance is how to partition an application into tasks and assign them to suitable devices with awareness of resource availability at run time. Hence, we are motivated to develop these pioneering algorithms that will optimize the performance of collaborative computing. In the following, we introduce the general settings of the application model and environment in our formulations, and then describe our major contributions to these problems.
1.1 General Settings

1.1.1 Application Task Graph
We partition an application into N task executions, which can be described by a directed acyclic graph (DAG). Each node specifies a computing task and each directed edge specifies the data dependency between two tasks. For example, the existence of a directed edge (m, n) implies that task n relies on the execution result of task m; that is, task n cannot start until task m finishes. Hence, the directed edges in the graph also imply the precedence constraints of these task executions.
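To make this concrete, the following minimal sketch (an illustration, not taken from the dissertation; the five-task graph is hypothetical) represents a task graph as a predecessor map and recovers an execution order that respects the precedence constraints, using only Python's standard library.

```python
from graphlib import TopologicalSorter   # standard library, Python 3.9+

# A hypothetical five-task application: deps maps each task to the set of
# tasks it depends on, i.e., a directed edge (m, n) appears as m in deps[n].
deps = {
    1: set(),       # e.g., capture a frame
    2: {1},         # task 2 consumes the output of task 1
    3: {1},
    4: {2, 3},      # task 4 joins the results of tasks 2 and 3
    5: {4},
}

# Any topological order respects the precedence constraints: a task never
# appears before one of its predecessors (a cycle would raise CycleError).
order = list(TopologicalSorter(deps).static_order())
print(order)        # e.g., [1, 2, 3, 4, 5]
```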
1.1.2 Resource Network
We use the term "resource network" to describe the environment, including the devices' abilities and the channels' qualities. We are concerned with two metrics, latency and cost. We assume that all devices are connected; that is, the resource network is a mesh network of M connected devices. For each task, a device's ability is described by the task execution latency, subject to an execution cost of performing the task on the device. Each pair of devices defines a channel, whose quality is specified by its data transmission latency and cost. The mesh-network assumption is without loss of generality: if two devices are not connected, the corresponding latency and cost are infinite.
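As an illustration of this model, the sketch below stores a hypothetical resource-network profile for M = 3 devices. All numbers are made up; the infinite entries encode missing links exactly as described above.

```python
import math

M = 3            # a hypothetical mesh of three devices
INF = math.inf   # a missing link carries infinite latency and cost

# Per-device profile for one task: execution latency (s) and execution cost.
exec_latency = [0.9, 0.3, 0.1]
exec_cost    = [0.0, 1.0, 5.0]

# Channel profile: tx_latency[i][j] is the latency of shipping one unit of
# data from device i to device j; devices 0 and 2 are not directly connected.
tx_latency = [
    [0.0, 0.2, INF],
    [0.2, 0.0, 0.5],
    [INF, 0.5, 0.0],
]
tx_cost = [
    [0.0, 1.0, INF],
    [1.0, 0.0, 2.0],
    [INF, 2.0, 0.0],
]
```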
1.2 Contributions
We propose efficient and competitive algorithms for the following task assignment problems.

1. Deterministic Optimization with Single Constraint
2. Deterministic Optimization with Multiple Constraints
3. Stochastic Optimization with Single Constraint
4. Online Learning in Stationary Environments
5. Online Learning in Non-stationary Environments

An efficient algorithm has reasonable complexity that scales well with the problem size. A competitive algorithm, on the other hand, provides a performance guarantee: if it cannot solve the problem exactly, it specifies how closely it approximates the optimum. For the optimization formulations (problems 1, 2 and 3), there exist algorithms that solve them exactly, but their complexity grows exponentially with the problem size. For learning in unknown dynamic environments (problems 4 and 5), the problems cannot be solved exactly without full knowledge of the environment. Hence, we aim to develop efficient (polynomial time) algorithms and show their competitiveness by provable performance guarantees.
1.2.1 Deterministic Optimization with Single Constraint
Given an application task graph described by a directed acyclic graph, we want to find a good task assignment strategy that assigns each task to a device. We measure the overall application latency, i.e., the time when the last task finishes. The performance of a task assignment strategy depends on the devices' performance as well as the channel qualities given by the resource network profile. Hence, we formulate an optimization problem that aims to minimize the overall latency subject to a single cost constraint. We show that the problem is NP-hard; that is, there is no polynomial-time algorithm that solves all instances of the problem unless P = NP [8]. Hence, we propose Hermes¹, which is a fully polynomial time approximation scheme (FPTAS) [9]. For all instances, Hermes guarantees a solution whose objective is no more than (1 + ε) times the minimum, where ε is a positive number, and its complexity is bounded by a polynomial in 1/ε and the problem size. Our analysis clearly illustrates the trade-off between Hermes' accuracy and its complexity: the closer the objective value is to the optimum (smaller ε), the more running time Hermes takes.

Our formulation generalizes the previous formulation in [10], where the application is modeled as a chain-structured task graph. Furthermore, instead of relying on a standard integer programming solver, we propose Hermes as a polynomial time algorithm. We verify our result by comprehensive simulations based on trace data of several benchmarks, and compare Hermes with a heuristic proposed in [11] for dynamic environments. To the best of our knowledge, for this class of task assignment problems, Hermes applies to more sophisticated formulations than prior work and runs in time polynomial in the problem size, yet still provides near-optimal solutions with a performance guarantee.

¹ Because of its focus on minimizing latency, Hermes is named for the Greek messenger of the gods with winged sandals, known for his speed.
1.2.2 Deterministic Optimization with Multiple Constraints
In the scenario where the resource network consists of multiple battery-operated devices, like sensors and mobile phones, the formulation with a single constraint has deficiencies: the assignment strategy given by Hermes may rely mostly on a single device and drain its battery. Hence, we formulate another problem that minimizes the overall application latency subject to individual cost constraints on each device. Our multi-constraint task assignment formulation clearly attributes the cost to each device separately and aims to find a task assignment strategy that considers both system performance and cost-balancing. We show that our formulation is NP-hard, but propose two new polynomial-time algorithms that provide provable guarantees: SARA and BiApp. SARA rounds the fractional solution of an LP relaxation to achieve asymptotically optimal performance from a time-average perspective. BiApp is a bicriteria (θ + 1, 2β + 2) approximation with respect to the latency objective and the cost constraints for each data frame processed by the application (θ and β are parameters quantifying bounds on the communication latency and the communication costs, respectively). Our simulation results show that these algorithms perform extremely well on a wide range of problem instances.
1.2.3 Stochastic Optimization with Single Constraint
We next go a step further to develop an algorithm that makes good task assignments in stochastic environments with known statistics. In a real environment, the CPU load on a device can increase dramatically due to other co-existing processes, and the wireless channel quality also varies with time. Hence, we solve a stochastic optimization problem subject to a probabilistic QoS constraint. That is, our algorithm provides a task assignment strategy that guarantees, with a given confidence, that the latency falls within a given value. For example, the QoS constraint may require that more than 90% of the time the latency is less than 200 ms. Our algorithm runs in polynomial time, and its near-optimal performance is shown by simulation results.
1.2.4 Online Learning in Stationary Environments
At application run time, the on-device resources and channel qualities may be unknown and time-varying. Hence, we design an algorithm that efficiently probes the environment and makes task assignments based on the probing results. However, the probing itself incurs overhead and runs the risk of sampling devices and channels with poor performance, so we have to take these losses into account when analyzing the performance. We formulate this problem of learning in unknown dynamic environments as a multi-armed bandit (MAB) problem [12]. In a typical MAB formulation, each arm represents an unknown stochastic process. At each trial an agent chooses from a set of arms, gets the payoff from the selected arms, and tries to learn statistical information from sensing them. This historical information helps with decisions in future trials. The performance of an online learning algorithm is judged by its regret [13]: the amount of payoff the algorithm loses, due to its sub-optimal strategy, compared to a genie who knows the statistics and can use the optimal strategy. The regret is expected to grow with the number of trials, but should slow down (grow sub-linearly) as the algorithm learns more information and its strategy converges to the optimal one.

In our setting of a dynamic resource network, we model each unknown device and each unknown channel as an "arm". Our algorithm consists of two interleaved phases, an exploration phase and an exploitation phase. In the exploration phase, we follow a deterministic sequence that samples all devices and channels thoroughly. In the exploitation phase, we call Hermes to solve for the optimal assignment based on the sample means collected so far. We show that the performance loss (regret) can be bounded by O(ln T), a sub-linear function of the number of trials, and justify our analysis in simulation.
1.2.5 Online Learning in Non-stationary Environments
We move one more step closer to reality, where the resource network cannot be captured by stationary processes, due to the unpredictable resources released by a device at run time and intermittent connections in a mobile network. Instead of assuming the payoff given by each arm is a stationary stochastic process, we model the payoff as a bounded sequence. At each trial, the value of the chosen sequence is revealed and the others remain unknown. Since a sequence can change arbitrarily over time, without being limited to evolving as a stationary process, our result applies to any dynamic environment.

We propose MABSTA (Multi-Armed Bandit based Systematic Task Assignment) as an online learning algorithm for the non-stationary environment and show that the performance loss is bounded by O(√T). Furthermore, we compare MABSTA with other algorithms via simulations based on trace data measured on an IoT testbed [14], and justify MABSTA's competitiveness and adaptivity to dynamic environments.
Chapter 2
Background
Our main focus is to design algorithms to solve the proposed optimization formulations. The theoretical background includes approximation algorithms and online learning algorithms for multi-armed bandit problems.
2.1 Approximation Algorithms
When we solve an optimization problem, we first want to know how hard it is. There are several classical problems whose hardness has been established [15]. A high-level approach is to map the optimization problem to these "benchmarks" so that we can get an idea of its hardness. Some problems are so hard that they cannot be solved exactly, or that solving them takes an unreasonable amount of time (i.e., a computer may take years). If we are unable to solve a hard problem exactly, we resort to solving it approximately: we propose an approximation algorithm [9] and attempt to give a performance guarantee on how closely it approximates the optimum. The following steps summarize how we study an optimization problem and propose the corresponding approximation algorithm.
1. Determine the hardness of the problem
2. Propose an approximation algorithm (assuming the problem is not trivial)
3. Analyze the algorithm’s complexity
4. Prove the algorithm’s performance guarantee
In the following, we present related examples and some techniques that have been used in our work.
2.1.1 Two Examples: Set Cover and Vertex Cover
We use two well known problems as examples. Given M subsets S1 , · · · , SM , where Si ⊆ {1, · · · , N } for all i, the set cover problem (SC) [16] is to find a minimum number of subsets to cover the universe [N ]. That is,
SC:  min Σ_{i∈[M]} x_i
     s.t.  ⋃_{i: x_i = 1} S_i = [N],
           x_i ∈ {0, 1}, ∀ i ∈ [M].
This is an integer programming formulation (IP). The binary variable xi denotes whether the ith subset is selected or not. An input instance of SC can be described by the tuple, {N, {S1 , S2 , · · · , SM }}, where N is the size of the universe, and {S1 , S2 , · · · , SM } are the subsets of the universe. For example, if N = 4 and we have subsets {1, 4}, {2}, {3} and {2, 3}, then we choose {1, 4} and {2, 3} as the minimum set cover.
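For intuition, here is a minimal brute-force sketch on exactly this instance. It enumerates all subset selections, which is only feasible for tiny M, and is an illustration rather than a practical solver.

```python
from itertools import combinations

subsets = [{1, 4}, {2}, {3}, {2, 3}]   # the instance from the text, N = 4
universe = {1, 2, 3, 4}

def min_set_cover(subsets, universe):
    """Try selections of increasing size; the first cover found is minimum."""
    for size in range(1, len(subsets) + 1):
        for chosen in combinations(range(len(subsets)), size):
            if set().union(*(subsets[i] for i in chosen)) == universe:
                return chosen
    return None                        # the universe cannot be covered

print(min_set_cover(subsets, universe))   # (0, 3): the subsets {1,4} and {2,3}
```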
Another well known problem is the vertex cover problem (VC) [17]. We use G(V, E) to denote an undirected graph, where V = {1, ..., N} is the set of vertices enumerated from 1 to N, and E is the set of edges. If there exists an edge that joins vertices m and n, we use (m, n) to denote this edge. Given an undirected graph G(V, E), the vertex cover problem is to find the minimum set of vertices such that for each edge, at least one of the two vertices it connects is in the set. That is,
VC:  min Σ_{i∈[N]} x_i
     s.t.  x_m + x_n ≥ 1, ∀ (m, n) ∈ E,
           x_i ∈ {0, 1}, ∀ i ∈ [N].

Figure 2.1: A vertex cover problem
Fig. 2.1 shows an example of finding the minimum vertex cover, where the filled circles are the selected vertices. Both solutions cover all the edges, with the solution on the right being the minimum vertex cover. These two problems are "hard" problems in computational theory, and several approximation algorithms have been proposed for them. We will use these two problems as examples to introduce the approximation algorithm paradigm in this dissertation.
2.1.2 Hardness of a Problem
The hardness of a problem is measured by the computing resources required to solve it. If, for a problem, the best algorithm is still very expensive, then we know the problem is hard. Theoretical research generally focuses on two major costs of an algorithm: computational cost and storage cost. Computational cost refers to how many CPU cycles an algorithm consumes, and storage cost refers to how much memory an algorithm uses. There can be a trade-off between these two costs. For example, algorithm A may use more CPU cycles than algorithm B, but less memory. Most of the time, however, storage cost is positively correlated with computational cost in theoretical analysis. Hence, we focus our discussion on computational hardness and refer the readers to the literature on space hardness [18, 19].
2.1.2.1 Complexity Analysis
When analyzing the computational cost (complexity) of an algorithm, we count how many basic tasks are executed during the whole running process. Although the basic tasks may be different for different algorithms, they are usually cheap operations that take only a few CPU cycles. Hence, it is more convenient to count these basic tasks in complexity analysis than actual CPU cycles. A problem may have different input instances. An input instance is characterized not only by its input parameter values but also by its size. For example, an instance of SC consists of a universe of size N and M subsets S_1, ..., S_M. The numbers N and M specify the problem size, and the contents of S_1, ..., S_M are the input parameter values. Given two instances with the same size but different input values, an algorithm may execute a different number of basic tasks. Hence, we usually present worst-case analysis; that is, we present an upper bound on the number of basic tasks necessary for all possible instances of the same size.
More importantly, we would like to know how an algorithm's complexity scales with the problem size, so that we know whether the algorithm becomes too expensive when we solve a big problem. In sum, an algorithm's complexity is an upper bound on the number of basic tasks, expressed as a function of the problem size. We use big-O notation to describe this scaling. A function f(N) is in O(g(N)) if there exist c > 0 and N_0 > 0 such that f(N) ≤ c·g(N) whenever N ≥ N_0. That is, f(N) grows more slowly than g(N) and is surpassed by g(N) whenever N is large enough. Hence, O(g(N)) is a set of functions that grow more slowly than g(N); if we say an algorithm's complexity belongs to O(g(N)), then the number of necessary basic tasks grows more slowly than g(N).

We use the binary search algorithm as an example. Given a sorted array, where each element is a positive integer between 1 and N, and an integer n, 1 ≤ n ≤ N, we can perform binary search to check whether n is an element of the array as follows. First, look at the middle element of the sorted list. If n is smaller than it, check the middle element of the first half of the list; otherwise, check the middle element of the last half. Repeat until n is found, or an empty array is reached (in case n is not an element of the original array). If we define the basic task as "comparing two numbers", which can be done in one CPU cycle, then the complexity of binary search belongs to O(log2 N): the algorithm stops after performing at most log2 N comparisons. Since the complexity measurement is in units of a basic task, which is proportional to running time, we can also say the binary search algorithm runs in O(log2 N) time, or that binary search is a logarithmic-time algorithm.
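To make the counting concrete, here is a minimal sketch of the binary search just described, instrumented to count probes; the array below is hypothetical.

```python
def binary_search(arr, n):
    """Search sorted arr for n; return (found, probes)."""
    lo, hi, probes = 0, len(arr) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1               # one "basic task": examine the middle element
        if arr[mid] == n:
            return True, probes
        elif n < arr[mid]:
            hi = mid - 1          # continue in the first half
        else:
            lo = mid + 1          # continue in the last half
    return False, probes          # empty range reached: n is not in arr

arr = [2, 3, 5, 7, 11, 13, 17, 19]    # sorted positive integers
print(binary_search(arr, 11))         # found after O(log2 N) probes
```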
Table 2.1: Common Functions in Complexity Analysis

    function             N = 10    N = 100         N = 1000
    log2 N               3.32      6.64            9.96
    N (linear)           10        100             1000
    N^2 (polynomial)     100       10000           10^6
    N^3 (polynomial)     1000      1000000         10^9
    2^N (exponential)    1024      1.27 x 10^30    10^301

Table 2.1 lists the common functions that describe an algorithm's complexity. We often use the terms logarithmic algorithm, polynomial algorithm, etc., to describe scalability, and we can see the significant differences when N is large. Assume that a modern processor can complete 10^6 basic tasks per second. Then for a small problem size (N = 10), an exponential algorithm finishes in the blink of an eye, just as a logarithmic algorithm does. However, for big problems (N = 1000), an exponential algorithm would run practically forever (10^290 years) while the other algorithms still finish in seconds. Hence, we definitely want to avoid running an exponential algorithm on a big problem. A polynomial algorithm is much better than an exponential one; if possible, a low-degree polynomial algorithm is even better.

In computational theory research, we classify problems by their hardness. There is a class called NP-hard problems, for which researchers believe there is no polynomial time algorithm¹. Since the running time of an exponential algorithm is not tolerable for big problem sizes, we study polynomial time approximation algorithms for the class of NP-hard problems. For other classes of problems, we refer readers to the literature in [19].

¹ Formally, there is no polynomial time algorithm to solve an NP-hard problem unless P = NP [17].
2.1.2.2 Proof of NP-hardness
The first step in studying approximation algorithms for a problem is to know whether the problem is NP-hard. We prove NP-hardness of a problem A by reducing another NP-hard problem B to it. In general, we say problem B is reducible to problem A if an algorithm for solving A can be "efficiently" used to solve B. For example, if we have a black box for solving A, then by using this black box a polynomial number of times, together with extra steps of polynomial complexity, we can solve B. This reduction implies that A is at least as hard as B; hence, if B is NP-hard, then A is NP-hard as well. The reduction also implies that if the black box were a polynomial time algorithm, then there would exist a polynomial time algorithm that solves B. However, no polynomial time algorithm is known for the NP-hard problem B; hence, no polynomial time algorithm is known for A.

We reduce the vertex cover problem (VC) to the set cover problem (SC) as an example, to show that SC is NP-hard. In VC, the universe to be covered is the set of edges, E, and each subset is represented by a vertex, whose elements are the edges that touch it. Hence, we transform a VC instance into an SC instance. The solution of this SC instance is exactly the minimum set of vertices that covers all edges. Clearly, the transformation takes linear time (O(|E|)) and we use exactly the same algorithm for SC only once to solve the transformed instance. That is, VC is reducible to SC. Since we know VC is NP-hard, SC is NP-hard as well.
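As a concrete illustration of this reduction, here is a minimal sketch; the graph is hypothetical, and the inlined brute-force set-cover solver is for illustration only (by the hardness just argued, no exact solver can be efficient on large instances).

```python
from itertools import combinations

# Reduce a VC instance to an SC instance: the universe is the edge set E,
# and the subset for vertex v collects the edges that touch v.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]              # hypothetical graph
vertices = sorted({v for e in edges for v in e})
universe = set(edges)
subsets = [{e for e in edges if v in e} for v in vertices]

# Solve the transformed SC instance (brute force, illustration only);
# the chosen subsets correspond exactly to a minimum vertex cover.
for size in range(1, len(subsets) + 1):
    hit = next((c for c in combinations(range(len(subsets)), size)
                if set().union(*(subsets[i] for i in c)) == universe), None)
    if hit is not None:
        print([vertices[i] for i in hit])             # e.g., [1, 3]
        break
```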
2.1.3 LP Relaxation and Rounding Algorithms
In this section, we introduce one approach to approximating a problem, called linear programming (LP) relaxation. For each combinatorial optimization problem, there exists a corresponding integer programming (IP) formulation [9]. Solving an IP is NP-hard in general [20]; however, if we relax the integer constraints in the formulation, that is, if we allow fractional solutions, then it becomes an LP formulation, for which polynomial time algorithms exist [21]. Hence, whenever we formulate an integer program, we can solve the corresponding LP formulation and design a rounding algorithm to round a fractional solution to an integer one if necessary. In the following, we introduce two rounding algorithms, using VC and SC as examples.
2.1.3.1 Deterministic Rounding Algorithms
A deterministic rounding algorithm is a fixed mapping from fractional numbers to integers. We use VC as an example. In a VC problem, we choose the minimum number of vertices to cover all edges; the binary variable x_i corresponds to selecting vertex i or not. If we allow each binary variable to take any fractional value between 0 and 1, then we get an optimal solution of the LP, denoted by x*_i, i = 1, ..., N. Let x̂_i, i = 1, ..., N, denote the rounded solution. We define the rounding algorithm as

x̂_i = 1 if x*_i ≥ 1/2, and x̂_i = 0 otherwise.  (2.1)

The LP relaxation combined with this rounding algorithm outputs a vertex cover in which the number of selected vertices is no more than 2 times the minimum [17]. First we verify that {x̂_i}, i = 1, ..., N, is indeed a vertex cover. In the LP formulation, we have x*_m + x*_n ≥ 1 for each edge (m, n) ∈ E. That is, we have either x*_m ≥ 1/2 or x*_n ≥ 1/2, which implies that either vertex m or vertex n will be selected to cover edge (m, n), as required. Now we look at the performance of {x̂_i}. Let OPT denote the minimum objective of the IP. The minimum of the LP relaxation is never larger than OPT, since the LP has fewer constraints; that is, Σ_{i∈[N]} x*_i ≤ OPT. From the rounding algorithm, we have x̂_i ≤ 2x*_i for all i. Hence,

Σ_{i∈[N]} x̂_i ≤ 2 Σ_{i∈[N]} x*_i ≤ 2·OPT.  (2.2)

Solving an LP takes polynomial time, and this rounding algorithm runs in O(N) time. Hence, the above algorithm is a 2-approximation.
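A minimal end-to-end sketch of this 2-approximation, assuming SciPy is available (its scipy.optimize.linprog routine solves the relaxation); the example graph and all numbers are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]   # hypothetical graph, N = 4 vertices
N = 4

# LP relaxation of VC: min sum(x) s.t. x_m + x_n >= 1 per edge, 0 <= x <= 1.
# linprog expects "<=" constraints, so each covering row is negated.
A_ub = np.zeros((len(edges), N))
for row, (m, n) in enumerate(edges):
    A_ub[row, m] = A_ub[row, n] = -1.0
res = linprog(c=np.ones(N), A_ub=A_ub, b_ub=-np.ones(len(edges)),
              bounds=[(0, 1)] * N)

# Deterministic rounding (2.1): keep vertex i iff x*_i >= 1/2.
cover = [i for i, x in enumerate(res.x) if x >= 0.5]
print(cover)   # a vertex cover with at most 2 * OPT vertices
```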
2.1.3.2 Randomized Rounding Algorithms
The LP solution suggests how the variables should be set to achieve the optimal objective. For example, if a fractional solution is very close to 1, the corresponding binary variable is likely to be 1. Instead of using a fixed mapping between fractional numbers and integers, a randomized rounding algorithm rounds each fractional number by following a probability distribution, which usually depends on the optimal LP solution. Since the rounding process is not fixed, the performance may differ from run to run, and in some runs the rounding result may not even be feasible. Hence, we analyze the expected performance of the randomized rounding algorithm and provide a confidence bound on feasibility.
We use the set cover problem SC as an example. In an SC problem, we choose the minimum number of subsets whose union covers the universe [N]; the binary variable x_i corresponds to selecting the ith subset or not. First we solve the LP relaxation and get the optimal LP solution {x*_i}, i = 1, ..., M. Then we design the randomized rounding algorithm as

x̂_i = 1 with probability x*_i, and x̂_i = 0 with probability 1 − x*_i.  (2.3)
That is, we include the ith subset with probability x*_i. This algorithm gives the following expected performance:

E{Σ_{i∈[M]} x̂_i} = Σ_{i∈[M]} E{x̂_i} = Σ_{i∈[M]} x*_i ≤ OPT.  (2.4)
Now, using this algorithm as a basic block, we repeat it 2 log N times. Let C_r denote the output (the indices of the selected subsets) in the rth round. The final output, C = ⋃_{r=1}^{2 log N} C_r, includes all the subsets that have been selected from the first round to the final round. In [16], this algorithm is proposed and verified to be a 2 log N approximation. From (2.4), E{|C_r|} ≤ OPT for all r, and hence E{|C|} ≤ 2 log N · OPT, as required. Now we analyze the confidence of C being a set cover. For any element j in the universe and any r, we have
P{j is not covered by C_r} = ∏_{i: j∈S_i} (1 − x*_i) ≤ ∏_{i: j∈S_i} e^{−x*_i} ≤ 1/e.  (2.5)

We use the fact that 1 − x ≤ e^{−x} for all x ∈ [0, 1], together with the LP covering constraint Σ_{i: j∈S_i} x*_i ≥ 1 (there exists at least one subset that contains j), to get the above result. Then, with log denoting the natural logarithm,

P{j is not covered by C} = ∏_{r=1}^{2 log N} P{j is not covered by C_r} ≤ e^{−2 log N} = 1/N².  (2.6)
Therefore, by the union bound over the N elements,

P{C is not a set cover} = P{∃ j ∈ [N] not covered} ≤ N · (1/N²) = 1/N.  (2.7)

This algorithm involves one LP and 2 log N linear-time rounding rounds. Clearly, this is a polynomial time algorithm with a 2 log N approximation factor, and it outputs a set cover with probability at least 1 − 1/N.
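A sketch of the repeated randomized rounding under the same SciPy assumption; math.log is the natural logarithm, matching the e^{−2 log N} step above, and the instance is the hypothetical one used earlier.

```python
import math, random
import numpy as np
from scipy.optimize import linprog

subsets = [{1, 4}, {2}, {3}, {2, 3}]              # hypothetical SC instance
N, M = 4, len(subsets)

# LP relaxation of SC: min sum(x) s.t. sum over {i : j in S_i} of x_i >= 1
# for every element j; linprog expects "<=", so the rows are negated.
A_ub = np.array([[-1.0 if j in subsets[i] else 0.0 for i in range(M)]
                 for j in range(1, N + 1)])
res = linprog(c=np.ones(M), A_ub=A_ub, b_ub=-np.ones(N), bounds=[(0, 1)] * M)

# Rounding (2.3), repeated ceil(2 ln N) times; C is the union over rounds.
C = set()
for _ in range(math.ceil(2 * math.log(N))):
    C |= {i for i in range(M) if random.random() < res.x[i]}

covered = set().union(*(subsets[i] for i in C))
print(sorted(C), covered == set(range(1, N + 1)))  # a set cover w.h.p.
```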
2.1.4 Other Approximation Algorithms
There are other commonly seen approximation approaches, like greedy algorithms, local search, and dynamic programming (DP). We briefly introduce greedy algorithms and local search here, and discuss dynamic programming in the next section. We refer the readers to [17] for more about these approximation algorithms.

A greedy algorithm makes sequential decisions without considering the future or the past; that is, it makes the decision that is best for the current state. For example, in a network routing problem where the goal is to find the shortest path from a source to a destination, a greedy algorithm chooses the closest point among the neighbors as the next hop. It does not consider that the next hop may be a dead end, or farther from the destination than the current location.

A local search algorithm starts with a feasible solution and tries to improve it by a "local" move, stopping when no local move can further improve the performance. For the same network routing problem, local search starts with a path from source to destination, and then considers whether changing any intermediate node to another node makes the whole solution a better path. Typically, we have to show that the local search algorithm stops within a limited amount of time (polynomial time). Moreover, we have to bound the loss due to the algorithm being trapped at a local optimum.
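To make the greedy idea concrete, here is a minimal sketch of the classic greedy heuristic for set cover, which always picks the subset covering the most still-uncovered elements (it is known to be an O(log N)-approximation); the instance below is hypothetical.

```python
def greedy_set_cover(subsets, universe):
    """Repeatedly pick the subset covering the most uncovered elements."""
    uncovered, chosen = set(universe), []
    while uncovered:
        # The greedy step: best for the current state, blind to the future.
        best = max(range(len(subsets)),
                   key=lambda i: len(subsets[i] & uncovered))
        if not subsets[best] & uncovered:
            return None              # remaining elements cannot be covered
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

print(greedy_set_cover([{1, 4}, {2}, {3}, {2, 3}], {1, 2, 3, 4}))  # [0, 3]
```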
2.2 Dynamic Programming and FPTAS
The dynamic programming (DP) method [22] breaks down a problem by first solving the necessary sub-problems and then combining them. These sub-problems are closely related, in that solving one sub-problem may rely on the solutions of other sub-problems. Hence, each sub-problem is solved once, and the result is stored in memory, accessible when necessary. This approach is often used in sequential decision making processes in order to optimally reach a goal (destination, terminal state, etc.). We will use the knapsack problem as an example and illustrate the dynamic programming algorithm that solves it. Finally, we discuss an approximation scheme that is closely related to dynamic programming.
2.2.1 The Knapsack Problem
The Knapsack Problem (KP) seeks to pack multiple items, as valuable as possible, into a backpack with limited volume [23]. Formally, given a set of items 1, ..., N, with values v_1, ..., v_N and sizes c_1, ..., c_N, we want to select a subset of items such that the total value is maximized and the total size does not exceed the backpack volume C. That is,

KP:  max Σ_{i∈[N]} x_i v_i
     s.t.  Σ_{i∈[N]} x_i c_i ≤ C,
           x_i ∈ {0, 1}.

The binary variable x_i denotes whether the ith item is selected or not. We assume all of the numbers are positive integers, and c_i ≤ C for all i ∈ [N].
The dynamic programming method can solve KP exactly. We first define the sub-problems. Let C[j, v] denote the minimum total size induced by packing a subset of items from {1, ..., j} to achieve value v. Specifically,

KPsub:  min Σ_{i∈[j]} y_i c_i
        s.t.  Σ_{i∈[j]} y_i v_i = v,
              y_i ∈ {0, 1}.
We claim that if we solve all the sub-problems for j = 1, ..., N and v = 1, ..., N·v_max, where v_max = max_{i∈[N]} v_i, then we can solve KP. Clearly, the optimal packing has value at most N·v_max. Suppose we have already solved C[N, v] for all v ∈ {1, ..., N·v_max}. We can find the optimal value by

max v  s.t.  C[N, v] ≤ C.  (2.8)

Let V* be the value given by (2.8). We prove that V* is optimal by contradiction. Assume V* is not optimal; then there exists a packing that has value V′ > V* and total size C′ ≤ C, so that C[N, V′] ≤ C. But since V* is the maximum value satisfying (2.8), we must have C[N, V′] > C, which is the required contradiction.
The following equation is the core idea of dynamic programming. We solve the sub-problem C[j, v]: packing a subset of items in {1, ..., j} that uses the least volume and achieves value v. C[j, v] relies on the packing strategies of C[j−1, v] and C[j−1, v−v_j]. That is,

C[j, v] = min{ C[j−1, v], C[j−1, v−v_j] + c_j }.  (2.9)

For item j, we have two choices. If we do not select item j, then we have to pack a subset of {1, ..., j−1} to achieve total value v. On the other hand, if we select item j, then we have to find the least-size packing of {1, ..., j−1} that achieves value v − v_j. Hence, we first solve C[j−1, v] for all v and then we can solve C[j, v]. The dynamic programming procedure starts from the first item, given

C[1, v] = c_1 if v = v_1, and ∞ otherwise.  (2.10)
In total, we have to solve N × N·v_max sub-problems. Since each sub-problem is as simple as choosing the lesser of two numbers, we define it as a basic task. Hence, the dynamic programming runs in O(N² v_max) time. The complexity depends not only on the problem size N but also on v_max, which is related to the input instance; this is not a polynomial time algorithm, but what we call a pseudo-polynomial time algorithm. Not only may the algorithm run for a long time when the values are big, but if the values were positive real numbers, the number of sub-problems would be unbounded. Although pseudo-polynomial time is good enough for some problem instances, we are interested in a polynomial time algorithm for all instances, sacrificing accuracy to some extent. That is, we seek a polynomial time approximation algorithm with a performance guarantee.
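A minimal sketch of this pseudo-polynomial DP, following recurrence (2.9) and the final maximization (2.8); row 0 of the table plays the role of the base case (2.10), and the item values and sizes below are hypothetical.

```python
import math

def knapsack_dp(values, sizes, C):
    """Pseudo-polynomial DP: table[j][v] = min size using items 1..j for value v."""
    N, vmax = len(values), max(values)
    V = N * vmax                                # the optimum is at most N * vmax
    INF = math.inf
    table = [[INF] * (V + 1) for _ in range(N + 1)]
    table[0][0] = 0                             # value 0 costs no volume
    for j in range(1, N + 1):
        vj, cj = values[j - 1], sizes[j - 1]
        for v in range(V + 1):
            skip = table[j - 1][v]                                # don't take item j
            take = table[j - 1][v - vj] + cj if v >= vj else INF  # take item j
            table[j][v] = min(skip, take)                         # recurrence (2.9)
    # Final step (2.8): the largest value whose minimum size fits in C.
    return max(v for v in range(V + 1) if table[N][v] <= C)

print(knapsack_dp(values=[6, 10, 12], sizes=[1, 2, 3], C=5))  # 22
```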
2.2.2 FPTAS: Fully Polynomial Time Approximation Scheme
A Fully Polynomial Time Approximation Scheme (FPTAS) [9] clearly illustrates the trade-off between solution accuracy and algorithm complexity: for an FPTAS, we can improve the solution accuracy at the price of increased complexity. Hence, an FPTAS can theoretically approach the optimum arbitrarily closely; however, the increased complexity may become considerable. Specifically, for a maximization problem, an FPTAS guarantees a (1 − ε) approximation ratio. That is, it always outputs a solution that achieves at least (1 − ε) times the maximum objective, where ε is a real number between 0 and 1. Its complexity is bounded by a polynomial in the problem size, 1/ε, and the number of bits needed to describe the problem instance, |I|. We have seen that the problem size is closely related to the complexity of an algorithm. Moreover, by the definition above, we can approach the optimum more closely by setting a smaller ε; however, the complexity increases with 1/ε. The last factor is |I|. Since a computer runs binary operations, the number of basic operations increases with the number of bits needed to describe an input instance. For example, if we know all the numbers in an instance are bounded by T, then these numbers can be described by O(N log2 T) = O(|I|) bits in total. Hence, an FPTAS runs in poly(N, 1/ε, log2 T) time².

In the following, we illustrate an FPTAS for KP [24]. In the dynamic programming approach, the number of sub-problems scales as O(N² v_max). With the problem size fixed, the FPTAS comes from quantizing the values of the items so that the number of sub-problems scales more slowly than O(v_max). By the definition of FPTAS, we have to find a suitable quantization step size such that the number of sub-problems grows only polynomially in log2 v_max. We will see that the complexity of the following FPTAS turns out to be independent of v_max, which is even better than the minimum requirement.

² One can see that |I| also depends on the problem size. For the knapsack problem, there are O(N) input numbers. If each of them can be described by log2 v_max bits, then |I| = O(N log2 v_max). Hence, the formal definition of the complexity is that an FPTAS runs in poly(1/ε, |I|) time [9]. We separate the problem size from |I| for cleaner presentation.
Let V* be the maximum value given by the optimal packing of KP. There is a simple bound on V*: v_max ≤ V* ≤ N·v_max. Clearly, the maximum value is at least v_max, achieved by packing only the most valuable item, and is less than N·v_max. We design the quantization step size and the quantizer as

δ = ε·v_max / N,  (2.11)

q_δ(x) = k, if kδ ≤ x < (k+1)δ.  (2.12)
Now we consider the quantized dynamic programming as follows. We solve C[j, k] by

C[j, k] = min{ C[j−1, k], C[j−1, k − q(v_j)] + c_j }.  (2.13)

In the quantized dynamic programming, we no longer have to solve N²·v_max sub-problems. Each quantized value falls in the dynamic range

0 ≤ q(v_j) ≤ q(v_max) ≤ ⌈N/ε⌉.  (2.14)

Hence, the number of sub-problems is bounded by O(N × N·q(v_max)) = O(N³/ε).

The quantized dynamic programming incurs a performance loss. Let V̂′ be the total quantized value achieved by the packing strategy solved by the quantized DP, which implies that its original value V′ is at least δ·V̂′. Let I* be the set of items selected by the optimal strategy, and assume the packing strategy solved by the quantized DP is different from I* (if they are the same, then the quantized DP achieves the optimum). Since the quantized DP outputs a different strategy, we have Σ_{i∈I*} q(v_i) ≤ V̂′; otherwise, the quantized DP would rather pick I*. Hence, we have

V′ ≥ δ·V̂′ ≥ δ Σ_{i∈I*} q(v_i) ≥ V* − Nδ = V* − ε·v_max ≥ (1 − ε)·V*.  (2.15)

We use the fact that in the quantized DP, each quantized value loses at most δ compared to its original value. Hence, comparing V* and δ Σ_{i∈I*} q(v_i), the total loss is at most Nδ = ε·v_max. Since v_max ≤ V*, we achieve the final result: the objective value given by the quantized DP is at least (1 − ε)·V*.

As we have shown, the FPTAS runs in O(N³/ε) time and guarantees a (1 − ε) approximation ratio. Note that by (2.11), δ could be less than one for small ε and v_max, in which case we do not need the quantized DP, because quantization would only increase the number of sub-problems. Hence, for instances with small v_max, the original DP works just fine. However, if v_max is big enough that the running time of the original DP is considerable, we need the quantized DP, which has a shorter running time and a provable approximation guarantee.
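A sketch of the FPTAS on top of the same DP: values are quantized with step δ = ε·v_max/N per (2.11)-(2.12) before running the recurrence (2.13). The items are hypothetical, and when δ ≤ 1 the sketch falls back to the exact DP, as discussed above.

```python
import math

def knapsack_fptas(values, sizes, C, eps):
    """(1 - eps)-approximation: quantize the values by delta, then run the DP."""
    N, vmax = len(values), max(values)
    delta = eps * vmax / N                      # step size (2.11)
    if delta <= 1:
        q, delta = list(values), 1              # quantization would not help
    else:
        q = [int(v // delta) for v in values]   # quantizer (2.12)
    V = sum(q)                                  # upper bound on quantized value
    INF = math.inf
    table = [[INF] * (V + 1) for _ in range(N + 1)]
    table[0][0] = 0
    for j in range(1, N + 1):
        for v in range(V + 1):
            take = (table[j - 1][v - q[j - 1]] + sizes[j - 1]
                    if v >= q[j - 1] else INF)
            table[j][v] = min(table[j - 1][v], take)   # recurrence (2.13)
    best = max(v for v in range(V + 1) if table[N][v] <= C)
    # delta * best lower-bounds the true value of the packed items,
    # which by (2.15) is at least (1 - eps) times the optimum.
    return delta * best

print(knapsack_fptas(values=[60, 100, 120], sizes=[1, 2, 3], C=5, eps=0.1))  # 220.0
```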
Figure 2.2: Relationship between quantized domain and original domain

2.2.3 Guidelines to Derive FPTAS from DP

Since dynamic programming is often used in solving a large set of optimization problems but results in pseudo-polynomial time complexity, we want to know, given an original DP, whether there exists an FPTAS for it. In this section, we present some general guidelines for deriving an FPTAS from the original DP. We refer readers to [25] for sufficient conditions under which a DP formulation possesses an FPTAS.

We start from Fig. 2.2 to examine the values on the quantized domain and the original domain, and see how the performance bound is derived for the knapsack problem in the last section. First, in the quantized domain, we leverage the fact that the quantized DP outputs a strategy whose quantized value $\hat{V}'$ is higher than the quantized objective $\hat{V}^\star$ achieved by the optimal packing strategy. Mapping
these numbers back to the original domain, we have $\delta \hat{V}' \ge \delta \hat{V}^\star$. Our final goal is to get the result that $V' \ge (1 - \epsilon)V^\star$, where $V'$ is the value achieved by the quantized DP's strategy and $V^\star$ is the optimum. On one hand, since the quantizer always underestimates an input value, we have a lower bound on $V'$, i.e., $V' \ge \delta \hat{V}'$. On the other hand, the quantization loss of $V^\star$ is no more than $l\delta$, where $l$ is the number
of quantizations performed when solving the quantized DP. Since $\hat{V}^\star$ is the sum of quantized values of the packed items, we have $l \le N$. Hence, we have
$$V' \ge V^\star - l\delta \ge V^\star - N\delta = V^\star - N \cdot \frac{\epsilon V_{low}}{N} = V^\star - \epsilon V_{low}. \qquad (2.16)$$
We can see that if $V_{low}$ is a lower bound on $V^\star$, we arrive at the final result $V' \ge (1 - \epsilon)V^\star$. The quantization step size plays an important role in the performance bound and the complexity of a quantized DP. The above example gives us some ideas on how to design the step size by reverse engineering. We list the guidelines as follows.

• Find $V_{low}$ and $V_{up}$ such that $V_{low} \le V^\star \le V_{up}$, where $V_{up} = \text{poly}(N) \times V_{low}$.

• Identify $l$ and set $\delta = \epsilon V_{low} / l$.
To summarize, first we need an upper bound and a lower bound on $V^\star$. Given a problem instance, we try to sandwich $V^\star$ by the input numbers of the instance. For a KP with items' values $v_1, \cdots, v_N$, we have $v_{max} \le V^\star \le N v_{max}$, where $v_{max} = \max_{i \in [N]} v_i$. The upper bound $V_{up}$ specifies the dynamic range; that is, we only have to solve the sub-problems with values falling in $[0, V_{up}]$. The lower bound $V_{low}$ should appear in the numerator of $\delta$. We notice that $V_{up}/\delta$ dominates the number of sub-problems of the quantized DP. Hence, if we can find $V_{up}$ and $V_{low}$ such that $V_{up} = \text{poly}(N) \times V_{low}$ (a factor bounded by a polynomial of the problem size), then
Algorithm 1 Binary Search FPTAS
1: procedure FPTAS(N, ε, V_up)
2:    for r ← 1, ⌈log₂ V_up⌉ do
3:        V_r ← V_up/2^{r−1} + c, δ_r ← εV_up/(l · 2^r)
4:        V′ ← DP(q, V_r, δ_r)    . solve space [0, V_r] using step size δ_r
5:        if V′ ≥ V_up/2^r then    . lower bound found, return
6:            return
7:        end if
8:    end for
9: end procedure
the number of sub-problems would be independent of the input values $v_i$. Finally, by designing $\delta$ as mentioned and following the derivation, we achieve the desired result.
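As a quick sanity check, instantiating these guidelines on the knapsack problem from the last section recovers the earlier bound:

$$V_{low} = v_{max}, \quad V_{up} = N v_{max}, \quad l \le N \;\;\Rightarrow\;\; \delta = \frac{\epsilon v_{max}}{N}, \quad \frac{V_{up}}{\delta} = \frac{N^2}{\epsilon},$$

so the quantized DP solves on the order of $N \times N^2/\epsilon = N^3/\epsilon$ sub-problems, independent of the magnitude of the values $v_i$, matching the $O(N^3 \frac{1}{\epsilon})$ complexity derived earlier.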
2.2.3.1 When the optimum cannot be properly bounded
To sandwich $V^\star$ by the bounds as required is sometimes non-trivial. Given a problem instance, finding an upper bound on $V^\star$ is easy, but we then need a non-trivial lower bound that satisfies $V_{up} = \text{poly}(N) \times V_{low}$. In the cases where we fail to find a suitable lower bound, we perform a binary search that iteratively searches for the lower bound of $V^\star$ in the active space and adapts the quantization step size to guarantee that its solution is accurate enough.

Algorithm 1 summarizes the process of searching for $V_{low}$. It calls the quantized DP to solve the spaces $[0, V_{up}], [0, V_{up}/2], [0, V_{up}/4], \cdots$, until $V_{low}$ has been found. We see in line 5 that the termination criterion implies $V^\star \ge V_{up}/2^r$. Assume that this algorithm stops at the $r$th round, using $\delta_r = \epsilon V_{up}/(l \cdot 2^r)$.
From (2.16), with the newly found lower bound on $V^\star$, we have

$$V' \ge V^\star - l\delta_r \ge V^\star - \epsilon \frac{V_{up}}{2^r} \ge (1 - \epsilon)V^\star. \qquad (2.17)$$
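Before bounding the running time, a minimal sketch of this search loop may help fix ideas. It assumes a callable `quantized_dp(delta, v_range)` that returns the value found by the quantized DP on the space [0, v_range] with step size `delta`; `l` is the number of quantizations per solution, and all names are illustrative.

```python
import math

def binary_search_fptas(quantized_dp, v_up, l, eps, c=1.0):
    v = 0.0
    for r in range(1, math.ceil(math.log2(v_up)) + 1):
        v_r = v_up / 2 ** (r - 1) + c        # small constant c guards the boundary
        delta_r = eps * v_up / (l * 2 ** r)  # halve the range and step each round
        v = quantized_dp(delta_r, v_r)       # solve [0, v_r] with step delta_r
        if v >= v_up / 2 ** r:               # lower bound V* >= v_up / 2^r found
            break                            # v >= (1 - eps) * V* by (2.17)
    return v
```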
Now we bound the running time of Algorithm 1. For the $r$th round, Algorithm 1 calls the quantized DP to solve the space $[0, V_r]$, where $V_r = V_{up}/2^{r-1} + c$, with step size $\delta_r$. We add a small constant $c$ to avoid missing the optimal packing strategy. That is, in round $r-1$, if $V'$
p. That is, we have the confidence that more than p% of the time the latency is less than t ms under the stochastic environment, where p and t are arbitrary numbers. We propose PTP, which runs in polynomial time and gives a near-optimal solution. Last, we consider that the resources distributed over the network, including on-device resources and network bandwidth, may vary with time, and hence we have to learn this information and adapt to changes at run time. We formulate the online learning scenarios as multi-armed bandit problems, where devices and channels are modeled as arms that give payoffs with unknown statistics as their performance metrics. Different from the existing MAB algorithms that can freely probe the desired arm at each time slot, our task assignment not only makes decisions on
selecting devices but also affects channel usage. Hence, we have to develop algorithms that jointly consider probing the devices and the channels between them. For stationary bandit processes, we adapt the sampling method DSEE in [45] and design a new sequence to sample both devices and channels. Our performance analysis shows that the new sampling method achieves $O(\ln T)$ regret, which is of the same order as the guarantee provided by DSEE. For non-stationary bandit processes, we adopt the adversarial MAB formulation, which does not make any assumptions on the stochastic processes. Since the Exp3 algorithm proposed in [30] is not applicable to our task assignment scenario, we propose a new algorithm, MABSTA, which jointly learns the performance of unknown devices and channels. Our performance analysis also shows that MABSTA achieves the same order of regret, $O(\sqrt{T})$, as Exp3 does.
3.4 System Prototypes
We classify the system prototypes by the number of devices (servers) that participate in collaborative computing. One extreme is the typical computational offloading approach, where there exists a one-to-one connection between a local device and a cloud server. MAUI [10] and CloneCloud [38] are systems that leverage the resources in the cloud. Odessa [11] identifies the bottleneck stage at run time, suggests an offloading strategy, and leverages data parallelism to mitigate the load on a
mobile phone. The other extreme is to exploit the computational resources on multiple connected devices. Shi et al. [47] investigate nearby mobile helpers reached by intermittent connections. CWC [48] uses idle mobile devices connected in the network, like mobile devices held by company employees, to run certain tasks as an alternative to enterprise servers. Between these two extremes, MapCloud [49] is a hybrid system that makes run-time decisions on using a "local" cloud with fewer computational resources but faster connections, or a "public" cloud that is distant but has more powerful servers at the cost of longer communication delay. Cloudlets [50] is a 3-hierarchy system that introduces a middle tier between mobile devices and the cloud, which forwards the intensive tasks or data to the cloud if necessary or takes the job itself if possible, considering QoS constraints. COSMOS [39] finds the cluster that is customized and economic in its size and setup time, considering the task complexity. These system prototypes have demonstrated the promising applications of collaborative computing over multiple devices. However, an optimization formulation is essential so that a system can make intelligent decisions based on the run-time environment. As the numerical results presented in Chapter 4 show, the optimized strategy significantly outperforms the heuristic strategy in scenarios where staying at the local optimum incurs considerable performance loss. Hence, we are optimistic about integrating these pioneering algorithms with such systems to realize the full value of collaborative computing.
Chapter 4
Deterministic Optimization with Single Constraint
In this study, assuming the application profile and the resource profiles are known and deterministic, we formulate a task assignment problem to minimize the application latency subject to a cost constraint. We show that this formulation is NP-hard and propose Hermes, a Fully Polynomial Time Approximation Scheme (FPTAS) that provides a solution with latency no more than $(1 + \epsilon)$ times the minimum while incurring complexity that is bounded by a polynomial in the problem size and $1/\epsilon$. We evaluate the performance by using a real data set collected from several benchmarks, and show that Hermes improves the latency by 16% compared to a previously published heuristic while increasing CPU computing time by only 0.4% of the overall latency. This chapter is based on our works in [33, 36].
Table 4.1: Notations of SCTA

  Notation           Description
  $m_i$              workload of task i
  $d_{mn}$           the amount of data exchange between task m and n
  $G(V, E)$          task graph with set of nodes V and set of edges E
  $C(i)$             set of children of node i
  $l$                the depth of the task graph (the longest path)
  $d_{in}$           the maximum indegree of the task graph
  $\delta$           quantization step size
  $[N]$              the set $\{1, 2, \cdots, N\}$
  $x \in [M]^N$      assignment strategy of tasks $1 \cdots N$
  $T_i^{(j)}$        latency of executing task i on device j
  $T_{mn}^{(jk)}$    latency of transmitting data between task m and n from device j to k
  $C_i^{(j)}$        cost of executing task i on device j
  $C_{mn}^{(jk)}$    cost of transmitting data between task m and n from device j to k
  $D(i, x)$          accumulated latency when task i finishes, given strategy x

4.1 Problem Formulation
We call our formulation the Single-Constrained Task Assignment problem (SCTA), in contrast to the Multi-Constrained Task Assignment problem (MCTA) presented in the next chapter. Table 4.1 summarizes the notations used in SCTA. We introduce each component of the formulation as follows.
4.1.1 Task Graph
An application profile can be described by a directed graph G(V, E) as shown in Fig. 4.1, where nodes stand for tasks and directed edges stand for data dependencies. A task precedence constraint is described by a directed edge (m, n), which implies
that task n relies on the result of task m. That is, task n cannot start until it gets the result of task m. The weight on each node specifies the workload of the task, while the weight on each edge shows the amount of data communicated between the two tasks.

Figure 4.1: An example of application task graph, where nodes specify tasks with their workloads and edges imply data dependency, labeled with the amount of data exchange

In addition to the application profile, there are some parameters related to the graph measures in our complexity analysis. We use $N$ to denote the number of tasks and $M$ to denote the number of available devices in the network. For each task graph, there is an initial task (task 1) that starts the application and a final task (task $N$) that terminates it. A path from the initial task to the final task can be described by a sequence of nodes, where every pair of consecutive nodes is connected by a directed edge. We use $l$ to denote the maximum number of nodes in a path, i.e., the length of the longest path. Finally, $d_{in}$ denotes the maximum
indegree in the task graph. Using Fig. 4.1 as an example, we have $l = 7$ and $d_{in} = 2$.
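For illustration, both graph measures are easy to compute from an adjacency-list profile. The following is a minimal sketch in plain Python; the dictionary layout `children` mapping each node to its successors is a hypothetical choice, and the graph is assumed acyclic.

```python
def graph_measures(children):
    """Return (l, d_in): longest path in nodes, and maximum indegree."""
    nodes = set(children) | {v for vs in children.values() for v in vs}
    indeg = {v: 0 for v in nodes}
    for u in children:
        for v in children[u]:
            indeg[v] += 1
    d_in = max(indeg.values())

    # longest path (counted in nodes) via memoized DFS; safe on a DAG
    memo = {}
    def depth(u):
        if u not in memo:
            memo[u] = 1 + max((depth(v) for v in children.get(u, [])), default=0)
        return memo[u]

    l = max(depth(u) for u in nodes)
    return l, d_in
```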
4.1.2 Cost and Latency

Let $C_i^{(j)}$ be the execution cost of task $i$ on device $j$ and $C_{mn}^{(jk)}$ be the transmission cost of data between task $m$ and $n$ through the channel from device $j$ to $k$. Similarly, the latency consists of the execution latency $T_i^{(j)}$ and the transmission latency $T_{mn}^{(jk)}$.
Given a task assignment strategy $x \in \{1 \cdots M\}^N$, where the $i$th component, $x_i$, specifies the device that task $i$ is assigned to, the total cost can be described as follows:

$$\text{Cost} = \sum_{i \in [N]} C_i^{(x_i)} + \sum_{(m,n) \in E} C_{mn}^{(x_m x_n)} \qquad (4.1)$$
As described in the equation, the total cost is additive over the nodes (tasks) and edges of the graph. For a tree-structured task graph, the accumulated latency up to task $i$ depends on its preceding tasks. Let $D(i, x)$ be the accumulated latency when task $i$ finishes given the assignment strategy $x$, which can be recursively defined as
$$D(i, x) = \max_{m \in C(i)} \left\{ D(m, x) + T_{mi}^{(x_m x_i)} + T_i^{(x_i)} \right\}. \qquad (4.2)$$
We use $C(i)$ to denote the set of children of node $i$. For example, in Fig. 4.2, the children of task 6 are task 4 and task 5. For each branch led by node $m$, the latency accumulates as the latency up to task $m$ plus the latency caused by data transmission between $m$ and $i$. $D(i, x)$ is determined by the slowest branch.
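As a concrete rendering of (4.2), the recursion below evaluates $D(i, x)$ for a fixed strategy $x$ on a tree. The containers are illustrative assumptions: `children[i]` lists $C(i)$, `T_exec[i][j]` holds $T_i^{(j)}$, and `T_tx[(m, i)][(j, k)]` holds $T_{mi}^{(jk)}$.

```python
def accumulated_latency(i, x, children, T_exec, T_tx):
    """Evaluate D(i, x) of (4.2): the slowest branch dominates."""
    if not children.get(i):                       # leaf task: execution only
        return T_exec[i][x[i]]
    return max(
        accumulated_latency(m, x, children, T_exec, T_tx)
        + T_tx[(m, i)][(x[m], x[i])]              # transmission from m to i
        + T_exec[i][x[i]]                         # execution of task i
        for m in children[i]
    )
```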
4.1.3 Optimization Problem
Consider an application, described by a task graph, and a resource network, described by $\{C_i^{(j)}, C_{mn}^{(jk)}, T_i^{(j)}, T_{mn}^{(jk)}\}$. Our goal is to find a task assignment strategy $x$ that minimizes the total latency and satisfies the cost constraint, that is,

$$\text{SCTA}: \quad \min_{x \in [M]^N} D(N, x) \quad \text{s.t. } \text{Cost} \le B.$$
The Cost and $D(N, x)$ are defined in (4.1) and (4.2), respectively. The constant $B$ specifies the cost constraint, for example, the energy consumption of battery-operated devices. In Section 4.3, we propose an approximation algorithm based on dynamic programming to solve this problem and show that its running time is bounded by a polynomial of $1/\epsilon$ with approximation ratio $(1 + \epsilon)$.
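For very small instances, SCTA can also be sanity-checked by exhaustive search. The sketch below assumes callables `cost(x)` implementing (4.1) and `latency(x)` implementing $D(N, x)$ from (4.2); it is exponential in $N$ and intended only for validating approximate solvers.

```python
from itertools import product

def scta_brute_force(N, M, B, cost, latency):
    best, best_x = float('inf'), None
    for x in product(range(M), repeat=N):   # enumerate all M^N strategies
        if cost(x) <= B and latency(x) < best:
            best, best_x = latency(x), x
    return best, best_x
```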
4.2 Proof of NP-hardness of SCTA
We reduce the 0-1 knapsack problem to a special case of SCTA, where a binary partition is made on a serial task graph without considering data transmission. Since the 0-1 knapsack problem is NP-hard [15], SCTA is at least NP-hard.
Assume that $C_i^{(0)} = 0$ for all $i$; this special case of SCTA can be written as

$$\text{SCTA}': \quad \min_{x_i \in \{0,1\}} \sum_{i=1}^{N} (1 - x_i)T_i^{(0)} + x_i T_i^{(1)}$$

$$\text{s.t. } \sum_{i=1}^{N} x_i C_i^{(1)} \le B.$$
Given $N$ items with values $\{v_1, \cdots, v_N\}$ and weights $\{w_1, \cdots, w_N\}$, one wants to decide which items to pack to maximize the overall value while satisfying the total weight constraint, that is,

$$\text{KP}: \quad \max_{x_i \in \{0,1\}} \sum_{i=1}^{N} x_i v_i$$

$$\text{s.t. } \sum_{i=1}^{N} x_i w_i \le B.$$
Now KP can be reduced to SCTA′ by the following encoding:

$$T_i^{(0)} = 0 \;\;\forall i, \qquad T_i^{(1)} = -v_i, \qquad C_i^{(1)} = w_i.$$

By giving these inputs to SCTA′, we can solve KP exactly; hence,

$$\text{KP} \le_p \text{SCTA}' \le_p \text{SCTA}. \qquad (4.3)$$

4.3 Hermes: FPTAS Algorithms
We first propose the approximation scheme to solve SCTA for a tree-structured task graph and prove that this simplest version of the Hermes algorithm is an FPTAS. Then we solve more general task graphs by calling the proposed algorithm for trees a polynomial number of times.
4.3.1 Tree-structured Task Graph
We propose a dynamic programming method to solve the problem on tree-structured task graphs. For example, in Fig. 4.2, the minimum latency when task 6 finishes depends on when and where task 4 and 5 finish. Hence, prior to solving the minimum latency of task 6, we want to solve both task 4 and 5 first. We exploit the
fact that the sub-trees rooted at task 4 and task 5 are independent. That is, the assignment strategy on tasks 1, 2 and 4 does not affect the strategy on tasks 3 and 5. Hence, we can solve the sub-problems independently and combine them when considering task 6.

Figure 4.2: A tree-structured task graph, in which the two sub-problems can be independently solved

Figure 4.3: Hermes' methodology with independent child sub-problems. The algorithm solves each sub-problem for the minimum cost within latency constraint t (the area under the horizontal line y = t). The filled circles are the optimums of each sub-problem. Finally, it looks for the one that has the minimum latency of all filled circles in the left half-plane x ≤ B.

Algorithm 2 Find maximum latency given the problem instance
1: procedure FIND∆(N)
2:    q ← BFS(G, N)    . run BFS from node N and store visited nodes in order
3:    for i ← q.end, q.start do    . start from the last element in q
4:        if i is a leaf then    . L[i, j]: max latency finishing task i on device j
5:            L[i, j] ← T_i^{(j)}, ∀j ∈ [M]
6:        else
7:            for j ← 1, M do
8:                L[i, j] ← T_i^{(j)} + Σ_{m∈C(i)} max_{k∈[M]} {L[m, k] + T_{mi}^{(kj)}}
9:            end for
10:       end if
11:   end for
12:   ∆ ← max_{j∈[M]} L[N, j]
13: end procedure

Algorithm 3 Hermes FPTAS for tree-structured task graph
1: procedure FPTAS_tree(N, ε)    . solve sub-problems for task N
2:    ∆ ← FIND∆(N)    . find the dynamic range [0, ∆] that covers all assignment strategies
3:    q ← BFS(G, N)    . run BFS from node N and store visited nodes in order
4:    for r ← 1, ⌈log₂ ∆⌉ do
5:        ∆_r ← ∆/2^{r−1}, δ_r ← ε∆/(l · 2^r)
6:        x̃ ← DP(q, ∆_r, δ_r)    . solve sub-problems in [0, ∆_r] using step size δ_r
7:        if L(x̃) ≥ (1 + ε)∆_r/2 then    . L(x̃) ≡ D(N, x̃)
8:            return
9:        end if
10:   end for
11: end procedure
12: procedure DP(q, T_up, δ)
13:   K ← ⌈T_up/δ⌉
14:   for i ← q.end, q.start do    . start from the last element in q
15:       if i is a leaf then    . initialize C values of leaves
16:           C[i, j, k] ← C_i^{(j)} for all k ≥ q_δ(T_i^{(j)}), and ∞ otherwise
17:       else
18:           for j ← 1, M and k ← 1, K do
19:               calculate C[i, j, k] from (4.7)
20:           end for
21:       end if
22:   end for
23:   k_min ← min_{j∈[M]} k s.t. C[N, j, k] ≤ B
24: end procedure

We define the sub-problem as follows. Let C[i, j, t] denote the minimum cost when finishing task i on device j within latency t. We will show that by solving
all of the sub-problems for $i \in [N]$, $j \in [M]$ and $t \in [0, \Delta]$ with sufficiently large $\Delta$, the optimal strategy can be obtained by combining the solutions of these sub-problems. Fig. 4.3 shows our methodology. Each circle marks the performance given by an assignment strategy, with its x-component as cost and y-component as latency. Our goal is to find the red circle, that is, the strategy that results in the minimum latency and satisfies the cost constraint. Under each horizontal line
y = t, we first identify the circle with minimum x-component, which specifies the least-cost strategy among all strategies that result in latency at most t. These solutions are denoted by the filled circles. In the end, we look at the one in the left half-plane (x ≤ B) whose latency is the minimum. Instead of solving an infinite number of sub-problems for all $t \in [0, \Delta]$, we discretize the time domain by using the quantization function
$$q_\delta(x) = k, \quad \text{if } (k-1)\delta < x \le k\delta. \qquad (4.4)$$
It suffices to solve all the sub-problems for $k \in \{1, \cdots, K\}$, where $K = \lceil \Delta/\delta \rceil$. We will analyze how the performance is affected by the loss of precision due to quantization, and the trade-off with algorithm complexity, after we present our algorithm. Suppose we are solving the sub-problem C[i, j, k], given that all sub-problems of the preceding tasks have been solved; the recursive relation can be described as follows:
$$C[i, j, k] = C_i^{(j)} + \min_{x_m : m \in C(i)} \sum_{m \in C(i)} \left\{ C[m, x_m, k - k_m] + C_{mi}^{(x_m j)} \right\}, \qquad (4.5)$$

$$k_m = q_\delta\!\left(T_i^{(j)} + T_{mi}^{(x_m j)}\right). \qquad (4.6)$$
That is, to find the minimum cost within latency k at task i, we trace back to its child tasks and find the minimum cost over all possible strategies, with a latency budget that excludes the execution delay of task i and the data transmission delay.
As the cost function is additive over tasks and the decisions on the child tasks are independent of each other, we can further reduce the solution space from $M^z$ to $zM$, where $z$ is the number of child tasks of task $i$. That is, by making the decision on each child task independently, we have

$$C[i, j, k] = C_i^{(j)} + \sum_{m \in C(i)} \min_{x_m \in [M]} \left\{ C[m, x_m, k - k_m] + C_{mi}^{(x_m j)} \right\}. \qquad (4.7)$$
After solving all the sub-problems C[i, j, k], we solve for the optimal strategy by performing the combining step

$$\min_{j \in [M]} k \quad \text{s.t. } C[N, j, k] \le B. \qquad (4.8)$$
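Putting (4.4), (4.6), (4.7) and (4.8) together, a minimal sketch of the DP procedure might look as follows. It assumes tasks arrive in reverse-BFS order from the root task N (leaves first); the nested-container layout and names are illustrative, not the dissertation's implementation.

```python
import math

def hermes_tree(order, children, T_exec, T_tx, C_exec, C_tx, M, K, delta, B):
    INF = float('inf')
    q = lambda t: math.ceil(t / delta)           # quantizer (4.4)
    C = {}
    for i in order:                              # leaves first, root last
        C[i] = [[INF] * (K + 1) for _ in range(M)]
        for j in range(M):
            if not children.get(i):              # leaf: execution latency only
                for k in range(q(T_exec[i][j]), K + 1):
                    C[i][j][k] = C_exec[i][j]
                continue
            for k in range(1, K + 1):            # recursion (4.7)
                total = C_exec[i][j]
                for m in children[i]:            # each child decided independently
                    best = INF
                    for xm in range(M):
                        km = q(T_exec[i][j] + T_tx[(m, i)][(xm, j)])  # (4.6)
                        if k >= km:
                            cand = C[m][xm][k - km] + C_tx[(m, i)][(xm, j)]
                            best = min(best, cand)
                    total += best
                C[i][j][k] = total
    root = order[-1]
    # combining step (4.8): smallest latency level k whose cost fits budget B
    feas = [k for j in range(M) for k in range(K + 1) if C[root][j][k] <= B]
    return min(feas) if feas else None
```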
Let $|I|$ be the number of bits required to represent an instance of our problem. As an FPTAS runs in time bounded by a polynomial of the problem size, $|I|$ and $1/\epsilon$ [9], we have to bound $K$ by choosing $\Delta$ large enough to cover the dynamic range, and by choosing the quantization step size $\delta$ to achieve the required approximation ratio. To find $\Delta$, we solve an unconstrained problem for the maximum latency given the input instance. We also propose a polynomial-time dynamic programming method to solve this problem exactly, which is summarized in Algorithm 2. To make the solution provided by Hermes approximate the minimum latency, we take an iterative approach and reduce the dynamic range and step size in each iteration until the solution is close enough to the minimum.
We summarize Hermes for tree-structured task graphs in Algorithm 3. In the $r$th iteration, we solve for half of the dynamic range with half of the step size compared to the last iteration. The procedure DP solves for the minimum quantized latency based on the dynamic programming described in (4.7). Let $\tilde{x}$ be the output strategy suggested by the procedure and $L(\tilde{x})$ be its total latency. Algorithm 3 stops when $L(\tilde{x}) \ge (1 + \epsilon)\Delta_r/2$, or after running $\log_2 \Delta$ iterations, which implies the smallest precision has been reached.
Theorem 1. Algorithm 3 runs in $O(d_{in} N M^2 \frac{l}{\epsilon} \log_2 \Delta)$ time and admits a $(1 + \epsilon)$ approximation ratio.
Proof. From Algorithm 3, each DP procedure solves $NMK$ sub-problems, where $K = \lceil \Delta_r / \delta_r \rceil = O(\frac{l}{\epsilon})$. Let $d_{in}$ denote the maximum indegree of the task graph. For solving each sub-problem in (4.7), there are at most $d_{in}$ minimization problems over $M$ devices. Hence, the overall complexity of a DP procedure can be bounded by

$$O(NMK \times d_{in} M) = O\!\left(d_{in} N M^2 \frac{l}{\epsilon}\right). \qquad (4.9)$$
Algorithm 3 involves at most $\log_2 \Delta$ iterations; hence, it runs in $O(d_{in} N M^2 \frac{l}{\epsilon} \log_2 \Delta)$ time. Since both $l$ and $d_{in}$ of a tree can be bounded by $N$, and $\log_2 \Delta$ is bounded by the number of bits used to represent the instance, Algorithm 3 runs in time polynomial in the problem size, $|I|$ and $1/\epsilon$.
Now we prove the performance guarantee provided by Algorithm 3. For a given strategy $x$, let $\hat{L}(x)$ denote the quantized latency and $L(x)$ denote the original one; that is, $L(x) = D(N, x)$. Assume that Algorithm 3 stops at the $r$th iteration and outputs the assignment strategy $\tilde{x}$. As $\tilde{x}$ is the strategy with minimum quantized latency solved by Algorithm 3, we have $\hat{L}(\tilde{x}) \le \hat{L}(x^\star)$, where $x^\star$ denotes the optimal strategy. For a task graph with depth $l$, at most $l$ quantization procedures have been taken. By the quantization defined in (4.4), it always over-estimates by at most $\delta_r$. Hence, we have
$$L(\tilde{x}) \le \delta_r \hat{L}(\tilde{x}) \le \delta_r \hat{L}(x^\star) \le L(x^\star) + l\delta_r. \qquad (4.10)$$
Since Algorithm 3 stops at the $r$th iteration, we have

$$(1 + \epsilon)\frac{\Delta}{2^r} \le L(\tilde{x}) \le L(x^\star) + l\delta_r = L(x^\star) + \epsilon\frac{\Delta}{2^r}. \qquad (4.11)$$
That is,

$$\frac{\Delta}{2^r} \le L(x^\star). \qquad (4.12)$$
From (4.10), we achieve the approximation ratio as required:

$$L(\tilde{x}) \le L(x^\star) + l\delta_r = L(x^\star) + \epsilon\frac{\Delta}{2^r} \le (1 + \epsilon)L(x^\star). \qquad (4.13)$$
Algorithm 4 Hermes FPTAS for serial trees
1: procedure FPTAS_path(N)    . min. cost when task N finishes at devices 1, · · · , M within latencies 1, · · · , K
2:    for root i_l, l ∈ {1, · · · , n} do    . solve the conditional sub-problem for every tree
3:        for j ← 1, M do
4:            call FPTAS_tree(i_l) conditioning on j with the modification described in (4.14)
5:        end for
6:    end for
7:    for l ← 2, n do
8:        perform the combining step in (4.15) to solve C[i_l, j_l, k_l]
9:    end for
10: end procedure
Figure 4.4: A task graph of serial trees

4.3.2 Serial Trees

As a chain is a special case of a tree, Algorithm 3 also applies to the task assignment problem of serial tasks; instead of using the ILP solver to solve the formulation for serial tasks proposed previously in [10], we have therefore provided an FPTAS to solve it. Most applications start from a unique initial task, then split into multiple parallel tasks, and finally all the tasks are merged into one final task. Hence, the task graph is neither a chain nor a tree. In this section, we show that by calling Algorithm 3 a polynomial number of times, Hermes can solve a task graph that consists of serial trees.
The task graph in Fig. 4.4 can be decomposed into 3 trees connected serially, where the first tree (chain) terminates in task $i_1$ and the second tree terminates in task $i_2$. In order to find $C[i_3, j_3, k_3]$, we independently solve every tree, conditioning on where the root task of the former tree ends. For example, we can solve $C[i_2, j_2, k_2 | j_1]$, which is the strategy that minimizes the cost such that task $i_2$ ends at device $j_2$ within delay $k_2$, given that task $i_1$ ends at device $j_1$. Algorithm 3 can solve this sub-problem with the following modification for the leaves:
$$C[i, j, k | j_1] = \begin{cases} C_i^{(j)} + C_{i_1 i}^{(j_1 j)} & \forall k \ge q_\delta\!\left(T_i^{(j)} + T_{i_1 i}^{(j_1 j)}\right), \\ \infty & \text{otherwise.} \end{cases} \qquad (4.14)$$
To solve $C[i_2, j_2, k_2]$, the minimum cost up to task $i_2$, we perform the combining step

$$C[i_2, j_2, k_2] = \min_{j \in [M]} \min_{k_x + k_y = k_2} C[i_1, j, k_x] + C[i_2, j_2, k_y | j]. \qquad (4.15)$$
Similarly, combining $C[i_2, j_2, k_x]$ and $C[i_3, j_3, k_y | j_2]$ gives $C[i_3, j_3, k_3]$. Algorithm 4 summarizes the steps in solving the assignment strategy for serial trees. Solving each tree involves $M$ calls on different conditions. Further, the number of trees $n$ can be bounded by $N$. The latency of each tree is within $(1 + \epsilon)$ of optimal, which leads to the $(1 + \epsilon)$ approximation of the total latency. Hence, Algorithm 4 is also an FPTAS.
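A minimal sketch of the combining step (4.15), assuming $C[i_1, j, k]$ and the conditional table $C[i_2, j_2, k \mid j_1]$ have already been solved ($M$ devices, $K$ quantized latency levels; the names and array layout are illustrative):

```python
def combine_serial(C1, C2_cond, M, K):
    """C1[j][k] = C[i1, j, k]; C2_cond[j1][j2][k] = C[i2, j2, k | j1]."""
    INF = float('inf')
    C2 = [[INF] * (K + 1) for _ in range(M)]
    for j2 in range(M):
        for k2 in range(K + 1):
            best = INF
            for j in range(M):                 # condition on where tree 1 ends
                for kx in range(k2 + 1):       # split the latency: kx + ky = k2
                    cand = C1[j][kx] + C2_cond[j][j2][k2 - kx]
                    best = min(best, cand)
            C2[j2][k2] = best
    return C2
```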
4.3.3 Parallel Chains of Trees
We take a step further and extend Hermes to more complicated task graphs that can be viewed as parallel chains of trees, as shown in Fig. 4.1. Our approach is to solve each chain by calling FPTAS_path with a condition on the task where the chains split. For example, in Fig. 4.1 there are two chains that can be solved independently by conditioning on the split node. The combining procedure consists of two steps. First, solve $C[N, j, k | j_{split}]$ by (4.7) conditioned on the split node. Then $C[N, j, k]$ can be solved similarly by combining two serial blocks as in (4.15). By calling FPTAS_path at most $d_{in}$ times, this proposed algorithm is also an FPTAS.
4.3.4 More General Task Graph
The Hermes algorithm can in fact be applied to even more general graphs, albeit with weaker guarantees. In this section, we outline a general approach based on identifying the "split nodes", i.e., nodes in the task graph with more than one outgoing edge. In the three categories of task graphs we have considered so far, each split node is only involved in the local decision of two trees. That is, in the combining stage shown in (4.15), there is only one variable on the node that connects two serial trees. Hence, the decision on this device can be made locally. Our general approach is to decompose the task graph into chains of trees and call the polynomial-time procedure FPTAS_path to solve each of them. If a split node connects two trees from different chains, then we cannot resolve this condition variable and have to
keep it until we make the decision on the node where all of the involved chains merge. We use the task graph in Fig. 4.1 as an example: as the node (marked with split) splits over two chains, we have to keep it until we make decisions on the final task, where the two chains merge. On the other hand, some nodes split locally, which can be resolved within the FPTAS_path procedure. A node that splits across two different chains requires $O(M)$ calls of FPTAS_path. Hence, the overall complexity of Hermes on such graphs would be $O(M^S)$, where $S$ is the number of "global" split nodes. If the task graph contains cycles, a similar argument can be made as we classify them into local cycles and global cycles. A cycle is local if all of its nodes are contained in the same chain of trees, and is global otherwise. For a local cycle, we solve the block that contains it and condition on the node with the edge that enters it and the node with the edge that leaves it. However, if the cycle is global, more conditions have to be made on the global split nodes, and hence the complexity is not bounded by a polynomial.
The structure of a task graph depends on the granularity of the partition. If an application is partitioned into methods, many recursive loops are involved. If an application is partitioned into tasks, where a task is a block of code that consists of multiple methods, the structure is simpler. As we show in the following, Hermes can tractably handle practical applications whose graph structures are similar to the benchmarks in [11].
4.4 Numerical Evaluation

Figure 4.5: Hermes converges to the optimum as ε decreases

Figure 4.6: Hermes over 200 different application profiles
Figure 4.7: Hermes in dynamic environments

First, we verify that Hermes provides a near-optimal solution with tractable complexity and a performance guarantee. Then, we use the real data set of several benchmark profiles to evaluate the performance of Hermes and compare it with the heuristic Odessa approach proposed in [11].
4.4.1 Algorithm Performance
From our analysis in Section 4.3, the Hermes algorithm runs in $O(d_{in} N M^2 \frac{l}{\epsilon} \log_2 \Delta)$ time with approximation ratio $(1 + \epsilon)$. In the following, we provide numerical results to show the trade-off between complexity and accuracy. Given the
Figure 4.8: Hermes: CPU time measurement
task graph shown in Fig. 4.1 and $M = 3$, the performance of Hermes versus different values of $\epsilon$ is shown in Fig. 4.5. When $\epsilon = 0.4$, the performance converges to the minimum latency. Fig. 4.5 also shows the bound on the worst-case performance as a dashed line. The actual performance is much better than the $(1 + \epsilon)$ bound.
We examine the performance of Hermes on different problem instances. Fig. 4.6 shows the performance of Hermes on 200 different application profiles. Each profile is selected independently and uniformly from the application pool with different task workloads and data communications. The result shows that for every instance we have considered, the performance is much better than the $(1 + \epsilon)$ bound and converges to the optimum as $\epsilon$ decreases.
If the means of these stochastic processes are known, Hermes can solve for the best strategy based on these means. Fig. 4.7 shows how the strategies suggested by Hermes perform under a dynamic environment. The average performance is taken over 10000 samples. From Fig. 4.7, the solution converges to the optimal one as $\epsilon$ decreases, which minimizes the expected latency and satisfies the expected cost constraint.
4.4.2 CPU Time Evaluation
Fig. 4.8 shows the CPU time for Hermes to solve for the optimal strategy as the problem size scales. We use an Apple MacBook Pro equipped with a 2.4 GHz dual-core Intel Core i5 processor and 3 MB cache as our testbed and use the Java management package for CPU time measurement. For each problem size, we measure Hermes' CPU time over 100 different problem instances and show the average, with vertical bars as the standard deviation. As the number of tasks ($N$) increases in a serial task graph, the CPU time needed by the brute-force algorithm grows exponentially, while Hermes scales well and still provides a near-optimal solution ($\epsilon = 0.01$). From our complexity analysis, for a serial task graph $l = N$, $d_{in} = 1$, and since we fix $M = 3$, the CPU time of Hermes can be bounded by $O(N^2)$.
4.4.3 Benchmark Evaluation
In [11], Ra et al. present several benchmarks of perception applications for mobile devices and propose a heuristic approach, called Odessa, to improve both makespan and throughput with the help of a cloud-connected server. They refer to each edge and node in the task graph as a stage and record timestamps on each of them. To improve the performance, for each data frame, Odessa first identifies the bottleneck, evaluates each strategy with simple metrics, and finally selects the potentially best one to mitigate the load on the bottleneck. However, this greedy heuristic does not offer any theoretical performance guarantee; as shown in Fig. 4.9, Hermes can improve the performance by 36% for the task graph in Fig. 4.1. Hence, we further choose two of the benchmarks, face recognition and pose recognition, to compare the performance of Hermes and Odessa. Taking the timestamps of every stage and the corresponding statistics measured in real executions provided in [11], we emulate the executions of these benchmarks and evaluate the performance. In dynamic resource scenarios, as Hermes' complexity is not as light as the greedy heuristic (86.87 ms on average) and its near-optimal strategy need not be updated from frame to frame under similar resource conditions, we propose the following online update policy: similar to Odessa, we record the timestamps for online profiling. Whenever the latency difference between the current frame and the last frame goes beyond a threshold, we run Hermes based on the current profiling to update the strategy. By doing so, Hermes always gives the near-optimal strategy for the current
resource scenario and enhances the performance at the cost of reasonable CPU time overhead due to re-solving the strategy. As Hermes provides better performance in latency but larger CPU time overhead when updating, we define two metrics for comparison. Let Latency(t) be the normalized latency advantage of Hermes over Odessa up to frame number t. On the other hand, let CPU(t) be the normalized CPU advantage of Odessa over Hermes up to frame number t. That is,

$$\text{Latency}(t) = \frac{1}{t}\sum_{i=1}^{t} \left( L_O(i) - L_H(i) \right), \qquad (4.16)$$

$$\text{CPU}(t) = \frac{1}{t}\left( \sum_{i=1}^{C(t)} CPU_H(i) - \sum_{i=1}^{t} CPU_O(i) \right), \qquad (4.17)$$
where $L_O(i)$ and $CPU_O(i)$ are the latency and update time of frame $i$ given by Odessa; the notations for Hermes are similar, except that we use $C(t)$ to denote the number of times that Hermes updates the strategy up to frame $t$. To model the dynamic resource network, the latency of each stage is selected independently and uniformly from a distribution with its mean and standard deviation given by the statistics of the data set measured in real applications. In addition to small-scale variation, the link coherence time is 20 data frames. That is, for some periods, the link quality degrades significantly due to possible fading. Fig. 4.10 shows the performance of Hermes and Odessa for the face recognition application. Hermes improves the average latency of each data
frame by 10% compared to Odessa and increases CPU computing time by only 0.3% of the overall latency. That is, the latency advantage provided by Hermes well compensates for its CPU time overhead. Fig. 4.11 shows that Hermes improves the average latency of each data frame by 16% for the pose recognition application and increases CPU computing time by 0.4% of the overall latency. When the link quality degrades, Hermes updates the strategy to reduce data communication, while Odessa's sub-optimal strategy results in significant extra latency. Considering that CPU processing speed keeps increasing under Moore's law while network conditions do not improve as fast, Hermes provides a promising approach to trade more CPU time for less network consumption cost.
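For concreteness, the online update policy described above amounts to a simple control loop. The sketch below uses hypothetical helpers, not the dissertation's implementation: `profile(frame)` returns measured per-stage latencies, `hermes(profile)` re-solves the assignment, and `run(frame, strategy)` executes one frame and returns its end-to-end latency.

```python
def online_hermes(frames, threshold, profile, hermes, run):
    strategy, last_latency = None, None
    for frame in frames:
        if strategy is None:
            strategy = hermes(profile(frame))       # initial solve
        latency = run(frame, strategy)
        # re-solve only when the environment appears to have changed
        if last_latency is not None and abs(latency - last_latency) > threshold:
            strategy = hermes(profile(frame))
        last_latency = latency
```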
4.4.4 Discussion
We have formulated a task assignment problem and provided an FPTAS algorithm, Hermes, to solve for the optimal strategy that balances latency improvement against the energy consumption of battery-operated devices. Compared with previous formulations and algorithms, to the best of our knowledge, Hermes is the first polynomial-time algorithm to address the latency-resource trade-off problem with a provable performance guarantee. Moreover, Hermes is applicable to more sophisticated formulations of the latency metrics, considering more general task dependency constraints as well as multi-device scenarios. The CPU time measurement shows that Hermes scales well with problem size. We have further emulated
the application execution by using the real data set measured in several mobile benchmarks, and shown that our proposed online update policy, integrated with Hermes, is adaptive to dynamic network changes. Furthermore, the strategy suggested by Hermes performs much better than the greedy heuristic, so that the CPU overhead of Hermes is well compensated.
Figure 4.9: Hermes: 36% improvement for the example task graph

Figure 4.10: Hermes: 10% improvement for the face recognition application

Figure 4.11: Hermes: 16% improvement for the pose recognition application
Chapter 5
Deterministic Optimization with Multiple Constraints
In this study, we follow the assumption of known and deterministic profiles and formulate an optimization problem to find the best task assignment that minimizes the overall latency, subject to individual constraints on each device. This multi-constrained formulation clearly attributes the cost to each device separately. Hence, we can avoid assignments that rely mostly on a single device and would drain its battery. We show that our formulation is NP-hard, propose two polynomial-time approximation algorithms with provable performance guarantees with respect to the optimum, and verify our analysis through numerical simulation. This chapter is based on our work in [34].
Figure 5.1: Collaborative computing on face recognition application

Figure 5.2: Graphical illustration of MCTA
5.1 Problem Formulation
Fig. 5.1 takes the face recognition application from [11] as an example¹, which consists of several major stages (tasks) like face detection and classifiers. The system processes each incoming image and outputs the result as a set of names, which significantly reduces the data size compared to the raw data. Formally, suppose a data processing application consists of N stages (tasks), where each data frame goes through the N stages in order to be processed. There are M available devices in the network. Our goal is to find the optimal task assignment over these devices. That is, for each task i, find a device j to execute it such that the overall latency when finishing the N tasks is minimized. Furthermore, each device has an individual cost constraint. We can view this problem as finding the optimal path from the 1st stage to the Nth stage of the trellis diagram shown in Fig. 5.2.

For cleaner presentation, each edge on the trellis diagram is denoted by a 3-tuple (i, j, k). Our decision variable is a binary variable $x_{ijk} \in \{0, 1\}$, where being 1 denotes the assignment of task i on device j and task i + 1 on device k. If edge (i, j, k) is selected, a latency is induced, denoted by $T_{ijk}$, which is the sum of the execution latency of task i on device j and the potential data transmission latency from device j to k if $j \ne k$. Furthermore, let $C_{ij}$ denote the cost of executing task i on device j, let $C^e_{ijk}$ be the data emission cost of transmitting the intermediate result of task i from device j to k, and similarly let $C^r_{ijk}$ be the data receiving cost induced on device k. Moreover, we use the notation [N] to denote the set $\{1, 2, \cdots, N\}$. The multi-constrained task assignment (MCTA) problem can be formulated as the following integer linear program.

¹We neglect some control signals exchanged between stages, as they are relatively small compared to the data frame.
$$\begin{aligned}
\text{MCTA}: \quad \min \quad & \sum_{i \in [N]} \sum_{j,k \in [M]} x_{ijk} T_{ijk} \\
\text{s.t.} \quad & \sum_{i \in [N]} \sum_{k \in [M]} x_{ijk}\,(C_{ij} + C^e_{ijk}) + x_{ikj}\,C^r_{ikj} \leq B_j, \ \forall j \in [M] & (5.1)\\
& \sum_{j \in [M]} x_{ijk} = \sum_{l \in [M]} x_{i+1,kl}, \ \forall i \in [N-1],\, k \in [M] & (5.2)\\
& \sum_{j,k \in [M]} x_{ijk} = 1, \ \forall i \in [N] & (5.3)\\
& x_{ijk} \in \{0, 1\} & (5.4)
\end{aligned}$$
Since the application ends at the $N$th task, we have $C^e_{Njk} = C^r_{Njk} = 0$. Eq. (5.1) is the cost constraint for each device. Constraint (5.2) implements the rule that if we assign task $i+1$ to device $k$, then for the next task we have to pick an edge starting from $k$. Constraint (5.3) implies that we have to pick exactly one edge for each stage in the trellis diagram.
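To make the formulation concrete, the sketch below evaluates the objective and the per-device costs of a candidate assignment, i.e., of one path through the trellis diagram of Fig. 5.2. The instance is hypothetical (profiles drawn at random from intervals mirroring Table 5.1); only the bookkeeping follows the formulation above.

```python
import numpy as np

# Hypothetical MCTA instance: N tasks, M devices, random profiles standing in
# for T_ijk, C_ij, C^e_ijk, C^r_ijk and the per-device budgets B_j.
rng = np.random.default_rng(0)
N, M = 10, 3
T = rng.uniform(0, 18, size=(N, M, M))    # latency T[i, j, k] of edge (i, j, k)
C = rng.uniform(0, 12, size=(N, M))       # execution cost C[i, j]
Ce = rng.uniform(4, 8, size=(N, M, M))    # emission cost C^e[i, j, k]
Cr = rng.uniform(4, 8, size=(N, M, M))    # reception cost C^r[i, j, k]
Ce[N - 1] = Cr[N - 1] = 0.0               # the application ends at task N
B = rng.uniform(50, 100, size=M)          # budgets B_j

def evaluate(path):
    """Overall latency and per-device cost of one assignment.

    path[i] is the device executing task i, so edge (i, j, k) is selected
    with j = path[i] and k = path[i + 1] (k = j at the last stage)."""
    latency, cost = 0.0, np.zeros(M)
    for i in range(N):
        j = path[i]
        k = path[i + 1] if i + 1 < N else j
        latency += T[i, j, k]
        cost[j] += C[i, j] + Ce[i, j, k]  # execution + emission charged to j
        cost[k] += Cr[i, j, k]            # reception charged to k
    return latency, cost

lat, cost = evaluate(rng.integers(0, M, size=N))
print(lat, np.all(cost <= B))             # feasible iff constraint (5.1) holds
```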
5.1.1 Hardness of MCTA
We first reduce an NP-hard problem, the generalized assignment problem (GAP) [51], to MCTA. That is, MCTA is at least as hard as GAP; therefore, MCTA is also NP-hard. In GAP, there are $N$ items and $M$ bins. The reward of packing item $i$ into bin $j$ is $p_{ij}$. The goal is to pack each item into exactly one bin such that the total reward is maximized, subject to the cost constraints on each bin. The integer programming formulation of GAP is as follows.
$$\begin{aligned}
\text{GAP}: \quad \max \quad & \sum_{i \in [N]} \sum_{j \in [M]} x_{ij} p_{ij} \\
\text{s.t.} \quad & \sum_{i \in [N]} x_{ij} w_{ij} \leq B_j, \ \forall j \in [M] & (5.5)\\
& \sum_{j \in [M]} x_{ij} = 1, \ \forall i \in [N] & (5.6)\\
& x_{ij} \in \{0, 1\} & (5.7)
\end{aligned}$$
Given an instance of GAP, we map the instance to the input of MCTA and transfer the corresponding solution back to a solution of GAP. An instance of MCTA can be described as
$$\{T_{ijk}, C_{ij}, C^e_{ijk}, C^r_{ijk}, B'_j, N', M'\}.$$
Solving the GAP instance $\{p_{ij}, w_{ij}, B_j, N, M\}$ is equivalent to solving MCTA under the following mapping:
$$\begin{aligned}
N' &= N, \quad M' = M, \quad B'_j = B_j, \ \forall j \in [M] \\
T_{ijk} &= P_{max} - p_{ij}, \ \forall i \in [N],\, j,k \in [M] & (5.8)\\
C_{ij} &= w_{ij}, \ \forall i \in [N],\, j \in [M] & (5.9)\\
C^e_{ijk} &= C^r_{ijk} = 0 & (5.10)
\end{aligned}$$
By defining $P_{max} = \max_{i,j} p_{ij}$, Eq. (5.8) transfers the maximization objective of GAP to the minimization objective of MCTA. Eq. (5.9) maps the cost of packing item $i$ into bin $j$ to the execution cost of task $i$ on device $j$ in MCTA. Given the solution $\{x_{ijk}\}$ of MCTA, the solution of GAP is to pack item $i$ into bin $j$ if $x_{ijk} = 1$. Since $\{x_{ijk}\}$ satisfies constraint (5.3), for each $i$ there exists one and only one tuple $(j, k)$ for which $x_{ijk} = 1$. That is, for each item $i$, the solution of MCTA suggests exactly one bin $j$ in which to pack the item ($k$ is irrelevant). By these mappings of input parameters and solution, we show that GAP can be reduced to MCTA. Hence, MCTA is NP-hard.
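For concreteness, a minimal sketch of the mapping (5.8)-(5.10); the function name and the NumPy array layout are assumptions of this illustration.

```python
import numpy as np

def gap_to_mcta(p, w, B):
    """Map a GAP instance (rewards p[i, j], weights w[i, j], budgets B[j])
    to an MCTA instance following Eqs. (5.8)-(5.10)."""
    N, M = p.shape
    T = np.broadcast_to((p.max() - p)[:, :, None], (N, M, M)).copy()  # (5.8)
    C = w.copy()                                                      # (5.9)
    Ce = np.zeros((N, M, M))                                          # (5.10)
    Cr = np.zeros((N, M, M))
    return T, C, Ce, Cr, B.copy()

# Recovering a GAP solution: if x[i, j, k] = 1 in the MCTA solution,
# pack item i into bin j (the index k is irrelevant).
```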
We further verify an upper bound on the hardness of MCTA. We notice from Fig. 5.2 that solving MCTA is equivalent to finding an optimal and feasible path on the trellis diagram. Hence, we relate MCTA to the multi-constrained path selection problem (MCP) [52], where one wants to find an optimal and feasible path on a directed acyclic graph (DAG) given a starting node and a destination. MCTA is thus a special case of the MCP family of formulations. We omit the details of the reduction from MCTA to MCP and refer the reader to the literature on MCP problems [53]. To summarize, we can bound the hardness of MCTA by
$$\text{GAP} \leq_p \text{MCTA} \leq_p \text{MCP}. \qquad (5.11)$$
The relation $A \leq_p B$ states that problem $A$ is polynomial-time reducible to problem $B$; that is, $A$ can be solved by calling a solver of $B$ a polynomial number of times. Since these problems are NP-hard, researchers aim to find polynomial-time algorithms that approximately solve them and give performance guarantees on the sub-optimal solution. There have been several approximation results for GAP and MCP. Fleischer et al. [51] propose an LP rounding algorithm that approximates GAP within a factor $(1 - \frac{1}{e})$. For the MCP problem, if the number of constraints is a constant, then Xue et al. [54] propose a fully polynomial time approximation scheme (FPTAS [9]), which approximates MCP within $(1 + \epsilon)$ with complexity bounded by a polynomial in $\frac{1}{\epsilon}$ and the problem size. However, in MCTA the number of constraints grows with the network size $M$; hence, directly applying the FPTAS to MCTA results in exponential complexity. The authors in [55] propose an algorithm that admits an $M$-approximation ratio, which grows with the number of constraints. Having justified the hardness of MCTA, we propose two polynomial-time algorithms with provable performance guarantees.

Algorithm 5 Sequential Randomized Rounding Algorithm
1: procedure SARA($\{x^\star_{ijk}\}$)
2:   Choose $(v_1, v_2) = (j, k)$ w.p. $x^\star_{1jk}$
3:   for $i \leftarrow 2, \cdots, N-1$ do
4:     Given $v_i$, choose $v_{i+1} = k$ w.p. $x^\star_{iv_ik} / \sum_l x^\star_{iv_il}$
5:   end for
6:   Return $v_1, \cdots, v_N$
7: end procedure
5.2 Sequential Randomized Rounding Algorithm
In this section, we solve the LP-relaxation of MCTA and design a randomized rounding algorithm based on the LP solution. For each data frame, our algorithm sequentially assigns each task to a device in order. Unlike most LP rounding algorithms, which independently round the fractional solution to integers, our algorithm rounds the assignment of task $i$ depending on the rounding result of task $i-1$. Hence, we call it the Sequential rAndomized Rounding Algorithm (SARA). If we relax (5.4) to allow each variable to fall in the interval $[0, 1]$, we obtain the LP-relaxation of MCTA. We first solve the LP-relaxation and then design a simple rounding algorithm to round the fractional values to integers, either 0 or 1. Let $\{x^\star_{ijk}\}$ denote the optimal solution of the LP-relaxation. We propose SARA, shown in Algorithm 5. SARA takes $\{x^\star_{ijk}\}$ as input and outputs a sequence of vertices $V_1, \cdots, V_N$, which defines the selected path on the trellis diagram in Fig. 5.2. The corresponding task assignment assigns task $i$ to device $V_i$ for each $i$. Since SARA is a randomized algorithm, we use capital letters to denote the output as random variables; on the other hand, we use small letters $v_1, \cdots, v_N$ to denote constant values. Furthermore, we use $E_i = (i, V_i, V_{i+1})$ to denote the random edge selected in the $i$th stage of Algorithm 5.
SARA makes the task assignment as follows. First, select an edge on the trellis diagram, representing the assignment of tasks 1 and 2, with the distribution implied by $x^\star_{1jk}$. Starting from task 3, select the device based on the conditional probability given the assignment of the previous task,
$$P\{V_{i+1} = v \mid V_i = u\} = \frac{x^\star_{iuv}}{\sum_{l \in [M]} x^\star_{iul}}. \qquad (5.12)$$
Theorem 2. The expected performance of Algorithm 5 is
$$\sum_{i \in [N]} \sum_{j,k \in [M]} x^\star_{ijk} T_{ijk}, \qquad (5.13)$$
which is the minimum objective of the LP-relaxation. Furthermore, the expected cost on each device $j$ is
$$\sum_{i \in [N]} \sum_{k \in [M]} x^\star_{ijk} (C_{ij} + C^e_{ijk}) + x^\star_{ikj} C^r_{ikj}. \qquad (5.14)$$
We first prove the following lemma.

Lemma 1. For all $i \in \{2, \cdots, N\}$ and $v \in \{1, \cdots, M\}$, implementing Algorithm 5 results in
$$P\{V_i = v\} = \sum_{u \in [M]} x^\star_{i-1,uv}. \qquad (5.15)$$

Proof. We prove this lemma by induction. For $i = 2$, we have
$$P\{V_2 = v\} = \sum_u P\{V_1 = u, V_2 = v\} = \sum_u x^\star_{1uv}. \qquad (5.16)$$
Assume the case $i = n$ is true. When $i = n+1$,
$$\begin{aligned}
P\{V_{n+1} = v\} &= \sum_u P\{V_n = u, V_{n+1} = v\} & (5.17)\\
&= \sum_u P\{V_n = u\}\, P\{V_{n+1} = v \mid V_n = u\} & (5.18)\\
&= \sum_u \sum_w x^\star_{n-1,wu} \frac{x^\star_{nuv}}{\sum_s x^\star_{nus}} & (5.19)\\
&= \sum_u x^\star_{nuv}. & (5.20)
\end{aligned}$$
Eq. (5.19) uses the fact that the optimal solution $\{x^\star_{ijk}\}$ satisfies constraint (5.2); that is, $\sum_w x^\star_{n-1,wu} = \sum_s x^\star_{nus}$. Hence, we get the result as required.
5.2.1 Proof of Theorem 2
Let $T_i$ be the latency induced by selecting edge $(i, V_i, V_{i+1})$, which is a random variable depending on $V_i$ and $V_{i+1}$. The expected objective value given by Alg. 5 can be written as
$$E\Big\{\sum_{i \in [N]} T_i\Big\} = \sum_{i \in [N]} E\{T_i\}. \qquad (5.21)$$
For $i = 1$, we have $E\{T_1\} = \sum_{j,k} x^\star_{1jk} T_{1jk}$, implied by Alg. 5. For $i = 2, \cdots, N-1$, we have
$$\begin{aligned}
E\{T_i\} &= \sum_j P\{V_i = j\}\, E_{V_{i+1}}\{T_i \mid V_i = j\} & (5.22)\\
&= \sum_j \sum_u x^\star_{i-1,uj} \sum_k \frac{x^\star_{ijk} T_{ijk}}{\sum_l x^\star_{ijl}} & (5.23)\\
&= \sum_{j,k} x^\star_{ijk} T_{ijk}. & (5.24)
\end{aligned}$$
The step from (5.23) to (5.24) again uses the fact that $\{x^\star_{ijk}\}$ satisfies constraint (5.2). Summing up the stage-wise expected values, we achieve the expected performance given in Theorem 2.
The expected cost on each device can be derived in a similar way. Let $D_{ij}$ be the cost induced on device $j$ by selecting edge $(i, V_i, V_{i+1})$. That is,
$$D_{ij} = \begin{cases}
C_{ij} + C^e_{ijk} & \text{if } V_i = j, V_{i+1} = k \ (k \neq j)\\
C^r_{ikj} & \text{if } V_i = k, V_{i+1} = j \ (k \neq j)\\
C_{ij} + C^e_{ijj} + C^r_{ijj} & \text{if } V_i = j, V_{i+1} = j\\
0 & \text{otherwise.}
\end{cases} \qquad (5.25)$$
The expected cost on device $j$ can be written as
$$E\Big\{\sum_{i \in [N]} D_{ij}\Big\} = \sum_{i \in [N]} E\{D_{ij}\}. \qquad (5.26)$$
For $i = 1$, $E\{D_{1j}\} = \sum_k x^\star_{1jk}(C_{1j} + C^e_{1jk}) + x^\star_{1kj} C^r_{1kj}$. For $i = 2, \cdots, N-1$, we have
$$\begin{aligned}
E\{D_{ij}\} &= \sum_{u,v \in [M]} P\{V_i = u\}\, P\{V_{i+1} = v \mid V_i = u\}\, E\{D_{ij} \mid V_i = u, V_{i+1} = v\} & (5.27)\\
&= \sum_{u,v} \sum_w x^\star_{i-1,wu} \frac{x^\star_{iuv}}{\sum_l x^\star_{iul}}\, E\{D_{ij} \mid V_i = u, V_{i+1} = v\} & (5.28)\\
&= \sum_{v \neq j} x^\star_{ijv}(C_{ij} + C^e_{ijv}) + \sum_{u \neq j} x^\star_{iuj} C^r_{iuj} + x^\star_{ijj}(C_{ij} + C^e_{ijj} + C^r_{ijj}) & (5.29)\\
&= \sum_k x^\star_{ijk}(C_{ij} + C^e_{ijk}) + x^\star_{ikj} C^r_{ikj}. & (5.30)
\end{aligned}$$
Theorem 2 implies that SARA achieves the optimum of the LP and induces a feasible cost on each device in expectation. That is, if we run SARA for each data frame, then in the long run the average latency per frame converges to the optimum of the LP by the strong law of large numbers. Since the optimal solution of MCTA is feasible to its LP-relaxation, SARA achieves the optimal performance on average. The LP-relaxation of MCTA has $NM^2$ variables and Algorithm 5 runs in $O(N)$ time; hence SARA, including solving the LP, runs in polynomial time. On the other hand, we could naively formulate an ILP where, for each assignment strategy $i$, a binary variable $y_i$ denotes whether the corresponding assignment is selected. Similarly, we could solve the LP-relaxation and design the rounding algorithm to select assignment $i$ with probability $y^\star_i$. This naive algorithm also achieves the optimal performance in expectation. However, there are $M^N$ assignment strategies, which results in solving an LP whose number of variables grows exponentially with the problem size and sampling over exponentially many assignment strategies ($O(M^N)$). Hence, we propose SARA, which runs efficiently and achieves the same performance guarantee.
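The rounding step of SARA is only a few lines once the LP-relaxation has been solved. The sketch below assumes x_star is the N×M×M array of fractional values $\{x^\star_{ijk}\}$ returned by an external LP solver and satisfying (5.2)-(5.3); it draws one device sequence per data frame, as in Algorithm 5.

```python
import numpy as np

def sara(x_star, rng=None):
    """One run of SARA: round a fractional LP solution x_star[i, j, k]
    (shape N x M x M, satisfying (5.2)-(5.3)) to a device sequence v_1..v_N."""
    rng = np.random.default_rng() if rng is None else rng
    N, M, _ = x_star.shape
    # Stage 1: pick edge (1, v1, v2) with probability x*_{1jk}.
    flat = x_star[0].ravel() / x_star[0].sum()
    idx = rng.choice(M * M, p=flat)
    v = [idx // M, idx % M]
    # Stages 2..N-1: pick v_{i+1} given v_i via the conditional rule (5.12).
    for i in range(1, N - 1):
        row = x_star[i, v[-1]]
        v.append(rng.choice(M, p=row / row.sum()))
    return v  # assign task i to device v[i - 1]
```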
5.3 A Bicriteria Approximation Algorithm for MCTA with Bounded Communication Costs

(This section is joint work with Dr. Rajgopal Kannan, University of Southern California.)

In this section, we assume that device communication costs are bounded; specifically, we assume $\forall (i,j,k): C^e_{ijk}, C^r_{ijk} \in [C, \beta C]$. We make no assumptions about any overall bounds on task latencies or execution costs, and let $C_{ij}, T_{ijk} \in \mathbb{R}^+$. Note that $T_{ijk}$ combines two different metrics, task computing latency and data transmission latency, and therefore we can consider separating these two components. More specifically, define $F_{ikl}$ as the latency involved in forwarding the results of task $i$, received by device $k$, to a device $l$. It is reasonable to assume that packet forwarding can be handled directly by the network interface of a device with low overhead, and therefore we assume that the packet forwarding latency satisfies $F_{ikl} \leq \theta \hat{T}$ for all tasks and devices, where $\hat{T} = \min_{i,j,k} T_{ijk}$ and $\theta > 1$ is a small constant. It is reasonable to assume $\theta$ is small, since $\hat{T}$ involves both task computing and data transmission latencies while $F$ involves only data forwarding latency. We then define and solve a modified LP-relaxation of the original MCTA integer program by separating the functionalities of task execution and data forwarding, and convert the resultant fractional assignment of tasks to devices into an integral assignment using a technique similar to one used in minimum-cost makespan scheduling [56]. We show that the resultant assignment is a $(\theta+1, 2\beta+2)$-approximation to the optimal integral solution: the total latency of our assignment is within a $(\theta+1)$ factor of the optimal latency, while all device energy costs are within a $(2\beta+2)$ factor of their original budgets.
Consider the following modified version of the original MCTA integer program:
$$\begin{aligned}
\text{MCTA2}: \quad \min \quad & \sum_{i \in [N]} \sum_{j,k \in [M]} x_{ijk} T_{ijk} \\
\text{s.t.} \quad & \sum_{i \in [N]} \sum_{k \in [M]} x_{ijk}(C_{ij} + C^e_{ijk} + C^r_{i-1,*j}) + x_{ikj}(C^r_{ikj} + C^e_{ij*}) \leq (\beta+1)B_j, \ \forall j \in [M] & (5.31)\\
& \sum_{j,k \in [M]} x_{ijk} = 1, \ \forall i \in [N] & (5.32)\\
& x_{ijk} \in \{0, 1\}, \ \forall i \in [N], \forall (j,k) \in [M] & (5.33)
\end{aligned}$$
Here $C^r_{i-1,*j}$ and $C^e_{ij*}$ in (5.31) represent the maximum reception and emission costs of tasks $i-1$ and $i$ on device $j$, respectively. The integer program MCTA2 above selects the best transmit-receive pair of devices $(j, k)$ for each task $i \in [N]$ that minimizes the total latency over all $N$ tasks while satisfying $(\beta+1)$-upscaled device budget constraints. Once the results of a task are received at a device, they are forwarded to the device executing the next task. Let $\mathcal{M}_0 = \{(i, j^0_i, k^0_i)\}$ denote the optimal solution to MCTA2. Under this notation, task $i$ is executed at device $j^0_i$ and the emitted results are received by device $k^0_i$, which then forwards them to $j^0_{i+1}$, the computing device for task $i+1$ (only necessary if $k^0_i \neq j^0_{i+1}$). Similarly, let $\mathcal{M}^* = \{(i, j^*_i, k^*_i)\}$ denote the optimal solution to the original MCTA problem, where task $i$ is executed at device $j^*_i$ and the results are transmitted to $k^*_i$. Under $\mathcal{M}^*$, we have $k^*_i = j^*_{i+1}$, $\forall i \in [N-1]$. Then we have:
Lemma 2. The optimal solution $\mathcal{M}^*$ to the original MCTA problem is a feasible solution for the MCTA2 problem.

Proof. Consider task assignments $(i-1, j^*_{i-1}, j^*_i)$ and $(i, j^*_i, k^*_i)$ from $\mathcal{M}^*$. Using (5.31), the cost to device $j^*_i$ in MCTA2 of this assignment is
$$\begin{aligned}
&(C_{ij^*_i} + C^e_{ij^*_ik^*_i} + C^r_{i-1,*j^*_i}) + (C^r_{i-1,j^*_{i-1}j^*_i} + C^e_{ij^*_i*}) & (5.34)\\
&\leq (\beta+1)\left(C_{ij^*_i} + C^e_{ij^*_ik^*_i} + C^r_{i-1,j^*_{i-1}j^*_i}\right), & (5.35)
\end{aligned}$$
where (5.35) follows from the bounds on communication costs. The last term of (5.35) represents the energy cost to device $j^*_i$ for implementing task assignments $(i-1, j^*_{i-1}, j^*_i)$ and $(i, j^*_i, k^*_i)$ in the optimal MCTA solution $\mathcal{M}^*$. When summed over all tasks in $\mathcal{M}^*$, this is at most $B_j$ since $\mathcal{M}^*$ is feasible for MCTA, and the result follows.
Lemma 3. The total latency of the optimal solution $\mathcal{M}_0$ is at most $\theta+1$ times the optimal latency of $\mathcal{M}^*$.

Proof. The solution derived from $\mathcal{M}_0$ requires forwarding for all tasks $i$ such that $k^0_i \neq j^0_{i+1}$, with associated cost $F_{ik^0_ij^0_{i+1}}$. Thus the latency of this solution is at most
$$\begin{aligned}
\sum_{i \in [N]} \left(T_{ij^0_ik^0_i} + F_{ik^0_ij^0_{i+1}}\right) &\leq \sum_{i \in [N]} \left(T_{ij^0_ik^0_i} + \theta \hat{T}\right) & (5.36)\\
&\leq \sum_{i \in [N]} (\theta+1)\, T_{ij^0_ik^0_i} \leq (\theta+1) \sum_{i \in [N]} T_{ij^*_ik^*_i}, & (5.37)
\end{aligned}$$
where the first part of (5.37) holds because $\hat{T}$ is the minimum latency over all tasks and devices, and the second part follows from Lemma 2.
Now consider the LP-relaxation of MCTA2 with the additional constraints
$$\begin{aligned}
& x_{ijk} = 0 \ \text{ if } \ (C_{ij} + C^e_{ijk} > B_j \ \text{ or } \ C^r_{ijk} > B_k) & (5.38)\\
& x_{ijk} \geq 0, \ \forall i \in [N], \forall (j,k) \in [M] & (5.39)
\end{aligned}$$
Constraint (5.38) ensures that in the LP-relaxation there is no fractional assignment of task $i$ to device $j$ with emission to device $k$ if either the combined execution and data emission costs exceed the budget of device $j$ or the data reception costs exceed the budget of device $k$. Let $\{\tilde{x}_{ijk}\}$ and $\tilde{T}$ denote the optimal solution and the optimal objective function value obtained from the LP-relaxation defined above. Note that $\tilde{T} \leq T^0 = \sum_{i \in [N]} T_{ij^0_ik^0_i}$. We convert this fractional optimal assignment to an integral multi-constrained task assignment as follows. For device $j \in [M]$, let $n_j = \lceil \sum_{i \in [N]} \sum_{k \in [M]} \tilde{x}_{ijk} \rceil$ represent the net integral weight of tasks assigned for execution at device $j$. We define a bipartite graph $G = (U, V, E)$ as follows: for each task $i$, add a node $v_i$ to $V$, and for each device $j$, add the vertex set $U_j = \{u^1_j, u^2_j, \ldots, u^{n_j}_j\}$ to $U$. Next, for each device $j$, define a logical vertex set $V_j = \{v_{ijk}\}$ consisting of one logical vertex for each task-device index $ijk$ such that $\tilde{x}_{ijk} > 0$. This logical set represents all tasks $i \in [N]$ with a positive fractional assignment $\tilde{x}_{ijk} > 0$ to device $j$, with results emitted to device $k \in [M]$. Each logical vertex $v_{ijk}$ is mapped to an actual task vertex $v_i \in V$ and has the following attributes: a communication weight defined as $C_{ij} + C^e_{ijk}$ and an assignment weight defined as $\tilde{x}_{ijk}$.
Sort the logical set $V_j$ in non-increasing order of communication weights, and let $L_j = [v_{l_1}, v_{l_2}, \ldots]$ be this sorted list of vertices. Note that each actual vertex $v_i \in V$ may appear several times in this list. For notational convenience henceforth, let $b_t$ denote the assignment weight of a typical vertex $v_{l_t} \in L_j$, $t = 1, 2, \ldots$. Divide $L_j$ into $n_j$ groups of consecutive vertices as follows. The first group $G^1_j$ consists of vertices $(v_{l_1}, v_{l_2}, \ldots, v_{l_r})$ where $\sum_{t=1}^{r-1} b_t < 1$ and we can split $b_r = b'_r + \hat{b}_r$, with $\hat{b}_r \geq 0$, such that $\sum_{t=1}^{r-1} b_t + b'_r = 1$. If $\hat{b}_r > 0$, then the second group $G^2_j$ consists of $(v_{l_r}, v_{l_{r+1}}, \ldots, v_{l_{r+k}})$; again, for the last vertex in the group, split $b_{r+k} = b'_{r+k} + \hat{b}_{r+k}$, with $\hat{b}_{r+k} \geq 0$, such that $\hat{b}_r + b'_{r+k} + \sum_{t=1}^{k-1} b_{r+t} = 1$. However, if $\hat{b}_r = 0$, then the second group $G^2_j$ does not include $v_{l_r}$ and instead consists of $(v_{l_{r+1}}, \ldots, v_{l_{r+k}})$; again, for the last vertex in the group, split $b_{r+k} = b'_{r+k} + \hat{b}_{r+k}$, with $\hat{b}_{r+k} \geq 0$, such that $b'_{r+k} + \sum_{t=1}^{k-1} b_{r+t} = 1$. This process is repeated for each of the $n_j$ groups. In general, if a vertex $v_{l_t}$ appears in two consecutive groups $q$ and $q+1$ as the last and first vertex respectively, then its assignment weight is split as $b'_t$ and $\hat{b}_t$ (as defined above) among groups $q$ and $q+1$. WLOG, we use the notation $b'$ to denote the (possibly split) fractional contribution of the last vertex to its group, and $\hat{b}$ to denote the (possibly split) fractional contribution of the first vertex to its group. Note that, with the possible exception of the last group $n_j$, the (split) assignment weights of all vertices in a group sum to 1.
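The grouping step is essentially a greedy scan that fills unit-capacity groups with (possibly split) assignment weights. A minimal sketch, assuming the logical vertices of $V_j$ arrive already sorted by non-increasing communication weight (floating-point tolerance issues are ignored):

```python
def split_into_unit_groups(items):
    """Split (label, assignment_weight) pairs, already sorted by
    non-increasing communication weight, into groups whose (possibly split)
    weights each sum to 1, mirroring the construction of the groups G^q_j.
    Split vertices appear in two consecutive groups, carrying b'_t and b̂_t."""
    groups, current, room = [], [], 1.0
    for label, b in items:
        while b > 0:
            take = min(b, room)        # b'_t stays in this group ...
            current.append((label, take))
            b -= take                  # ... and b̂_t spills into the next one
            room -= take
            if room == 0:              # group filled up to total weight 1
                groups.append(current)
                current, room = [], 1.0
    if current:                        # the last group may sum to less than 1
        groups.append(current)
    return groups
```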
Once the groups have been formed, draw weighted edges in $E$ between the vertices in $L_j$ and $U_j$ as follows: from each vertex $v_{l_t}$ in group $G^q_j$, draw an edge to the single vertex $u^q_j \in U_j$, $1 \leq q \leq n_j$, with edge weight $T_{ijk}$, where $ijk$ is the label of this logical vertex. (Again, note that logical vertex $v_{l_t}$ corresponds to some actual vertex $v_i \in V$ from which we draw the edge.) The degree of each vertex in $U_j$ is at least 1. Each task $i$ appears exactly once as vertex $v_i$ in $V$ but is mapped to several logical vertices in $L_j$; for each group $G^q_j$ to which it is mapped, we have an edge $(v_i, u^q_j) \in E$. Consider a minimum weighted matching $\mathcal{M}$ on the bipartite graph $G$ such that every vertex $v_i \in V$ (task $i$) is matched to exactly one vertex in $U$ (some device $j$); further, every vertex $u^q_j \in U_j$ is matched to at most one vertex (task) in group $G^q_j$, $1 \leq q \leq n_j$. We convert this matching to an assignment of tasks to devices $\mathcal{M} = \{(i, j, k)\}$, where $(i, j, k) \in \mathcal{M}$ if logical vertex $v_{ijk}$ is included in the matching.
The objective function for this matching can be expressed as $\min \sum_{v_i \in V \mid v_{ijk} \to v_i} T_{ijk} z_{ijk}$, where $v_{ijk} \to v_i$ is true if logical vertex $v_{ijk} \in V_j$ is mapped to $v_i \in V$, and $z_{ijk} \in \{0, 1\}$ is the (integer) indicator variable for matching vertex $v_i$ to the corresponding vertex in $U_j$ with edge weight $T_{ijk}$. Hence, the minimum weighted matching can be interpreted as tuning the fractional weights $\{\tilde{x}_{ijk}\}$ given by the LP-relaxation of MCTA2 without being limited by the budget constraints on each device. Therefore, its resulting minimum value is at most $\tilde{T}$ and hence at most $T^0$, the optimum of MCTA2.
For any two consecutive task assignments $(i, j, k) \in \mathcal{M}$ and $(i+1, l, m) \in \mathcal{M}$, task $i$ is executed at $j$, and the result is emitted to $k$ and forwarded to $l$. We now show that this violates the device budget constraints by at most a factor of $2\beta + 2$.

Lemma 4. For each device $j \in [M]$, the energy cost of the solution obtained through matching $\mathcal{M}$ is at most $(2\beta+2)B_j$.
Proof. For each group $G^q_j$, $1 \leq q \leq n_j$, let $v_{fq,jb}$ and $v_{lq,jd}$ denote the first and last (logical) vertices, with $fq, lq \in [N]$ and $b, d \in [M]$. The energy cost to $j$ for task execution and emission in matching $\mathcal{M}$ is therefore bounded by
$$\begin{aligned}
\tilde{B}^{\mathcal{M}}_j &\leq \sum_{q=1}^{n_j} \left(C_{fq,j} + C^e_{fq,jb} + C^r_{fq-1,*j}\right) \leq (\beta+1) \sum_{q=1}^{n_j} \left(C_{fq,j} + C^e_{fq,jb}\right) & (5.40)\\
&\leq (\beta+1)\Big(B_j + \sum_{q=1}^{n_j-1} \left(C_{lq,j} + C^e_{lq,jd}\right)\Big) & (5.41)\\
&\leq (\beta+1)\Big(B_j + \sum_{q=1}^{n_j-1} \Big[\sum_{v_{ijk} \in G^q_j \setminus \{v_{fq,jb}, v_{lq,jd}\}} (C_{ij} + C^e_{ijk})\, b_i + (C_{fq,j} + C^e_{fq,jb})\, \hat{b}_{fq} + (C_{lq,j} + C^e_{lq,jd})\, b'_{lq}\Big]\Big) & (5.42)\\
&\leq (\beta+1)\Big(B_j + \sum_{v_{ijk} \in V_j} (C_{ij} + C^e_{ijk})\, \tilde{x}_{ijk}\Big). & (5.43)
\end{aligned}$$
Eqs. (5.40) and (5.41) are due to the facts that at most one task vertex from a group $G^q_j$ gets matched; that vertices are sorted in non-increasing order of weights; that $C_{f1,j} + C^e_{f1,jb} \leq B_j$; and that $C^r_{fq-1,*j} \leq \beta C^e_{fq,jb}$ by our assumption of bounded communication costs. Eq. (5.42) follows because spreading the fractional weights $(b_t, b'_t, \hat{b}_t)$ over the sorted list of vertices induces at least as much cost as putting the whole integral weight on the vertex with least cost in each group. Eq. (5.43) holds since $b_t$ corresponds to $\tilde{x}_{ijk}$, as does the sum of $b'_t$ and $\hat{b}_t$ in the splitting case.
Device $j$ also incurs cost in receiving and forwarding the results of task $i$ in some task assignment $(i, k, j)$, for some $i \in [N]$ and some $k \in [M]$. These energy costs from matching $\mathcal{M}$ can be simply bounded as
$$\sum_{k \in [M]} \sum_{q=1}^{n_k} \left(C^r_{**j} + C^e_{*j*}\right) \leq (\beta+1) \sum_{v_{ikj} \in V} \left(C^r_{ikj} + C^e_{*j*}\right) \tilde{x}_{ikj}. \qquad (5.44)$$
Adding (5.43) and (5.44) and using the device energy constraint (5.31) of MCTA2 gives the result.
Putting Lemmas 3 and 4 together, we get the main result:

Theorem 3. $\mathcal{M}$ is a $(\theta+1, 2\beta+2)$-approximation to the optimal multi-constrained task assignment (MCTA) problem.

If we add dummy nodes and edges with infinite weights to the bipartite graph to make it balanced and complete, then the minimum weighted matching is equivalent to finding a minimum-weight perfect matching in the complete weighted bipartite graph, which can be solved in polynomial time [57]. Hence, our proposed bicriteria approximation (BiApp), including solving an LP-relaxation with $NM^2$ variables and a minimum weighted bipartite matching with $O(N)$ nodes, runs in polynomial time and provides performance guarantees on both the objective and the constraints.
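As a sketch of this last step, the minimum-weight matching can be computed with the Hungarian method as implemented in SciPy, padding with a large constant to emulate the dummy nodes and infinite-weight edges; the matrix layout (tasks × slots $u^q_j$) and the helper name are assumptions of this example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def min_weight_matching(W, big=1e9):
    """Minimum-weight matching that matches every task vertex. W[i, s] is
    the latency T_ijk of the logical vertex linking task i to slot s = u^q_j
    (np.inf marks absent edges); padding makes the matrix square so that
    dummy rows/columns are never preferred."""
    Wp = np.where(np.isfinite(W), W, big)
    n = max(Wp.shape)
    padded = np.full((n, n), big)
    padded[:Wp.shape[0], :Wp.shape[1]] = Wp   # pad to a complete square matrix
    rows, cols = linear_sum_assignment(padded)
    # Keep only real task-slot pairs (drop dummy rows/columns and big edges).
    return [(r, c) for r, c in zip(rows, cols)
            if r < W.shape[0] and c < W.shape[1] and padded[r, c] < big]
```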
Table 5.1: SARA/BiApp: Simulation Profile

Parameter                   Interval
N/M                         10/3 (fixed)
$\theta$                    3 (fixed)
$T_{ijk}$                   Uniform ~ [0, 18]
$C_{ij}$                    Uniform ~ [0, 12]
$C^e_{ijk}$, $C^r_{ijk}$    Uniform ~ [4, 8] ($\beta$ = 2)
$B_j$                       Uniform ~ [50, 100]
Figure 5.3: SARA: optimal performance in expectation (running averages of the overall latency and of the cost on each of the three devices over 200 frames, compared against the fractional optimum and the expected costs)
5.4 Numerical Evaluation
We simulate the performance of SARA and BiApp on comprehensive problem instances. Table 5.1 summarizes our simulation profile. The application task graph consists of 10 stages and the resource network contains 3 devices. For each problem instance, the single-stage costs and latencies are drawn uniformly and independently from the prescribed intervals, as are the individual cost constraints $B_j$.

Figure 5.4: BiApp: $(\theta+1, 2\beta+2)$-approximation ($\theta = 3$, $\beta = 2$) (ratios of BiApp's latency to the optimum and of the induced cost to the budget on each device, over 50 instances, with SARA shown for comparison)
Fig. 5.3 shows the performance of SARA. For each data frame, SARA implements the randomized rounding algorithm based on the LP-relaxation solution $\{x^\star_{ijk}\}$ and gives the task assignment for processing that data frame. Hence, we present the simulation result as running averages. Let $\text{Latency}(t)$ denote the overall latency of frame $t$. The running average at $t = T$ is
$$\text{avg}(T) = \frac{1}{T} \sum_{t=1}^{T} \text{Latency}(t). \qquad (5.45)$$
In Fig. 5.3, the red lines show the performance of the optimal fractional solution; that is, we set the variables in MCTA to $\{x^\star_{ijk}\}$ and compute the overall latency and the costs induced on each device. We can see that the average performance of SARA converges to these numbers. Furthermore, since the minimum objective of an LP-relaxation is always smaller than the minimum objective of its original IP, SARA achieves the optimal performance and induces a feasible cost on each device asymptotically on average. Fig. 5.4 shows the performance of BiApp over 50 randomly selected problem instances. Given an instance, we present the performance of BiApp as the ratio of BiApp's induced latency to the optimum; similarly, we present the ratio of BiApp's induced cost to the budget $B_j$ on each device. We can see that BiApp achieves its $(\theta+1, 2\beta+2)$-approximation guarantee: the overall latency is no more than $(\theta+1)$ times the optimum, and the induced cost on each device is no more than $(2\beta+2)$ times the budget. In our simulation setting, $\theta = 3$ and $\beta = 2$; hence, BiApp is a $(4, 6)$-approximation algorithm. We can see that the performance is much better than our theoretical worst-case bound. Note that for some instances, BiApp's performance is even better than the optimum. This is because BiApp solves the relaxed LP, where there is no constraint enforcing consistency between the device that receives the result of task $i-1$ and the device that executes task $i$ (constraint (5.2) in MCTA); hence, there is possibly a more economical strategy that first transmits the result of task $i-1$ to another device and then forwards the data to the device that executes task $i$. We examine the same instances on SARA. Unlike BiApp, which uses a fixed assignment for a given instance, SARA varies the assignment across data frames. Hence, we run SARA and average the performance over 200 data frames, with the vertical bars showing the standard deviation. Fig. 5.4 shows that SARA achieves the optimal performance on average and incurs tolerable variance for the instances we have considered.
5.5 Conclusion
Given an application that is partitioned into multiple tasks, we have formulated MCTA, which aims to minimize the execution latency while satisfying the budget constraint on each device, considering the on-device resources and the potential data communication overhead over the network. Compared to SCTA, presented in Chapter 4, MCTA considers both system performance and cost balancing; hence, it avoids assigning most of the tasks to a single device, which could drain its battery. We have proved that our formulation is NP-hard and proposed two algorithms with provable guarantees with respect to the optimum, SARA and BiApp. SARA is an LP rounding algorithm that achieves the optimal performance in expectation. BiApp is a bicriteria approximation with provable performance guarantees on both the objective and the constraints. Simulation results have shown SARA's optimality and justified BiApp's performance guarantees. In particular, for the comprehensive problem instances we have considered, BiApp performs much better than its theoretical worst-case bound.
Chapter 6
Stochastic Optimization with Single Constraint
In this study, we assume that the task execution latencies and channel transmission latencies are i.i.d. stochastic processes with known distributions. We formulate a stochastic optimization problem that partitions the tasks into two sets: the tasks to be executed at the remote server and the ones that remain at the local device. Our task partition provides a probabilistic QoS guarantee; that is, our task assignment strategy guarantees that at least $p\%$ of the time the application latency is no more than $t$, where $p$ and $t$ are arbitrary numbers. This chapter is based on our work in [35].
6.1 Problem Formulation
Consider a tree-structured task graph $G(V, E)$ with $N$ nodes as tasks and edges specifying data dependencies. For task $i$, we use a binary variable $x_i$ such that task $i$ is either sent to the remote server ($x_i = 1$) or remains at the local device ($x_i = 0$).
We use the same set of notations as in Chapter 4. However, since we only have two devices, a local device and a remote server, we use $l$ and $r$ instead of an enumeration. For example, let $T^l_i$ denote the latency of executing task $i$ on the local device, and $T^r_i$ the latency of executing task $i$ on the remote device. We model the data transmission latency between two tasks (if necessary) as
$$T^{lr}_{mn} = T^{rl}_{mn} = d_{mn} T^c, \qquad (6.1)$$
where $T^c$ is the latency of transmitting a unit amount of data between the local device and the remote server. WLOG, for illustrative purposes, we assume that the channel is symmetric. If task $m$ and task $n$ are executed at the same place, we have $T^{ll}_{mn} = T^{rr}_{mn} = 0$ for all $(m, n) \in E$, as there is no data transmission between them.
We focus on the energy cost on the local device; that is, we assume $C^r_i = 0$ for all $i \in [N]$. Furthermore, we model the energy cost as
$$\begin{aligned}
C^l_i &= p^c T^l_i, & (6.2)\\
C^{lr}_{mn} = C^{rl}_{mn} &= p^{RF} T^{lr}_{mn}, & (6.3)
\end{aligned}$$
where $p^c$ is the computing power of the local device and $p^{RF}$ is the power consumption of the RF components.
Let $D(i)$ denote the accumulated latency when finishing task $i$, which depends on its child tasks and can be described by the recursive relation
$$D(i) = \max_{m \in \mathcal{C}(i)} \left\{ D(m) + T^{x_m x_i}_{mi} \right\} + T^{x_i}_i, \qquad (6.4)$$
where $\mathcal{C}(i)$ denotes the set of child tasks of $i$.
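For one deterministic sample of the latencies, the recursion (6.4) is straightforward to evaluate; the sketch below does so with hypothetical container arguments (children lists, per-edge data amounts, and latency tables).

```python
def finish_time(i, x, children, d, T_local, T_remote, T_c):
    """Evaluate the recursion (6.4) for one deterministic sample: the finish
    time D(i) of task i under partition x (x[i] = 1 means remote execution).
    children[i] lists the child tasks of i and d[(m, i)] is the data amount
    on edge (m, i); the chapter treats these latencies as random, so this is
    a single-realization sketch."""
    exec_time = T_remote[i] if x[i] else T_local[i]
    if not children[i]:
        return exec_time
    ready = max(
        finish_time(m, x, children, d, T_local, T_remote, T_c)
        + (d[(m, i)] * T_c if x[m] != x[i] else 0.0)  # transfer only across sides
        for m in children[i]
    )
    return ready + exec_time
```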
In the dynamic environment, we consider $T^l_i$, $T^r_i$ and $T^c$ to be i.i.d. stochastic processes with known distributions. We aim to find the task partition that minimizes the expected cost subject to the probabilistic QoS constraint. That is,
$$\begin{aligned}
\text{PTP}: \quad \min \quad & E\Big\{\sum_{i \in [N]} C^{x_i}_i + \sum_{(m,n) \in E} C^{x_m x_n}_{mn}\Big\} \\
\text{s.t.} \quad & P\{D(N) \leq t_{max}\} > p_{obj} & (6.5)\\
& x_N = 0 & (6.6)\\
& x_i \in \{0, 1\}, \ \forall i \in [N]. & (6.7)
\end{aligned}$$
We assume the root (last) task always remains at the local device. Eq. (6.5) specifies the QoS guarantee that at least a fraction $p_{obj}$ of the time the latency is no more than $t_{max}$.
Algorithm 6 Probabilistic delay constrained Task Partitioning (PTP)
1: procedure PTP($N, p_{obj}, t_{max}$)  ▷ min. cost from $N$ s.t. $P\{D(N) \leq t_{max}\} > p_{obj}$
2:   $q \leftarrow$ BFS($G, N$)  ▷ run BFS from node $N$ and store visited nodes in order in $q$
3:   for $n \leftarrow q.\text{end}, \cdots, q.\text{start}$ do  ▷ start from the last element in $q$
4:     if $n$ is a leaf then  ▷ initialize OPT values of leaves
5:       $OPT^l[n, p_k] \leftarrow p^c \bar{T}^l_n$ if $p_k = q(F_{T^l_n}(t_{max}))$, and $\infty$ otherwise
6:       $OPT^r[n, p_k] \leftarrow 0$ if $p_k = q(F_{T^r_n}(t_{max}))$, and $\infty$ otherwise
7:     else
8:       for all combinations $(OPT^*[1, p_{k_1}], \cdots, OPT^*[d, p_{k_d}])$ do
9:         link to $OPT^l[n, p^*]$ if
$$p^* = q\Big(\int_0^{t_{max}} \prod_{m \in \mathcal{C}(n)} F_{D(m)}(t - x_m d_{mn} T^c)\, f_{T^l_n}(t_{max} - t)\, dt\Big) \qquad (6.8)$$
10:        link to $OPT^r[n, p^*]$ if
$$p^* = q\Big(\int_0^{t_{max}} \prod_{m \in \mathcal{C}(n)} F_{D(m)}(t - (1 - x_m) d_{mn} T^c)\, f_{T^r_n}(t_{max} - t)\, dt\Big) \qquad (6.9)$$
11:      end for
12:      for $k \leftarrow 1, \cdots, K$ do
13:        Calculate $OPT^l[n, p_k]$ and $OPT^r[n, p_k]$ by choosing the minimum over their links
14:      end for
15:    end if
16:  end for
17:  Trace back the optimal decision from $\min_{k \in \{1, \cdots, K\}} OPT^l[N, p_k]$
18: end procedure
Compared with a constraint on the average latency, this probabilistic constraint is stronger, especially in highly variable environments. In the following, we propose an efficient algorithm that gives a near-optimal solution to this problem.
6.2 PTP: Probabilistic Delay Constrained Task Partitioning
We adopt a dynamic programming approach and use quantization to bound the number of sub-problems. As with Hermes in Chapter 4, we solve the sub-problems from the leaves to the root. We define $OPT^l[i, p_k]$ as the minimum cost of finishing task $i$ locally under the constraint
$$q(P\{D(i) \leq t_{max}\}) = p_k, \qquad (6.10)$$
where $p_k$ is the $k$th quantization step in $[p_{obj}, 1]$ and the quantizer is defined as
$$q(x) = p_k, \ \forall x \in (p_k, p_{k+1}], \ \forall k \in \{1, \cdots, K\}. \qquad (6.11)$$
Since the latency accumulates as we solve the sub-problems from leaves to root, it is sufficient to deal only with the interval $[p_{obj}, 1]$. However, for an edge $(m, n)$, instead of simply excluding the delays induced after node $m$ as in the deterministic analysis, the links between $OPT^*[m, p_m]$ and $OPT^*[n, p_n]$ ($*$ may be $l$ or $r$) are not obvious for arbitrary $p_m$ and $p_n$. To find $OPT^l[n, p_k]$, for illustrative purposes, we assume that node $n$ has two children, $m_1$ and $m_2$, and we derive the case when both of them are executed at the local device. In this example, we have to identify all possible pairs $(OPT^l[m_1, p_{k_1}], OPT^l[m_2, p_{k_2}])$ satisfying
$$\begin{aligned}
q(P\{D(m_1) \leq t_{max}\}) &= p_{k_1}, & (6.12)\\
q(P\{D(m_2) \leq t_{max}\}) &= p_{k_2}, & (6.13)\\
q(P\{D(n) \leq t_{max}\}) &= p_k. & (6.14)
\end{aligned}$$
Let $F_*$ represent the cumulative distribution function (CDF) and $f_*$ the probability density function (PDF). Since $D(n)$ depends on $D(m_1)$ and $D(m_2)$ as shown in (6.4), given $F_{D(m_1)}$ and $F_{D(m_2)}$ calculated from the optimal solutions of the sub-problems, we can find $F_{D(n)}(x)$ as follows:
$$\begin{aligned}
P\{D(n) \leq x\} &= P\big\{\max\{D(m_1), D(m_2)\} + T^l_n \leq x\big\} & (6.15)\\
&= \int_0^x f_{T^l_n}(t)\, P\big\{\max\{D(m_1), D(m_2)\} \leq x - t\big\}\, dt & (6.16)\\
&= \int_0^x F_{D(m_1)}(x - t)\, F_{D(m_2)}(x - t)\, f_{T^l_n}(t)\, dt. & (6.17)
\end{aligned}$$
As $D(m_1)$ and $D(m_2)$ are independent, the CDF of their maximum is the product of the individual CDFs; hence, Eq. (6.17) extends to multiple children. By calculating the integral in (6.17), we can find $(p_{k_1}, p_{k_2})$ such that the latencies $D(m_1)$ and $D(m_2)$ satisfy (6.14).
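Numerically, Eq. (6.17) reduces to a discrete sum on a latency grid: the product of the children's CDFs is convolved against the density of the node's execution latency. A discretized sketch, assuming all functions are sampled on the same uniform grid of step dt:

```python
import numpy as np

def cdf_of_parent(F_children, pdf_exec, dt):
    """Numerically evaluate Eq. (6.17) on a uniform grid with step dt:
    F_{D(n)}(x) = integral over t in [0, x] of
                  prod_m F_{D(m)}(x - t) * f_{T_n}(t) dt.
    F_children is a list of arrays F_{D(m)} sampled on the grid; pdf_exec is
    the execution-latency density of node n on the same grid."""
    L = len(pdf_exec)
    F_parent = np.zeros(L)
    prod = np.prod(F_children, axis=0)   # CDF of the max of independent children
    for ix in range(L):                  # x = ix * dt
        t = np.arange(ix + 1)            # integrate over t in [0, x]
        F_parent[ix] = np.sum(prod[ix - t] * pdf_exec[t]) * dt
    return F_parent
```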
We now consider the variation of the channel state $T^c$, the transmission latency per unit of data. As the total latency is additive over the tasks on each branch, the derivation of the CDF also involves convolutions. For illustrative purposes, we present our equations for the case when $T^c$ is constant, in which Eq. (6.17) involves shifts of the CDFs when a constant data transmission delay is induced. Suppose node $n$ has $d$ children; we have to consider all possible assignments of its children and all possible $p_k$'s. We write the set of all possible combinations as
$$(OPT^*[1, p_{k_1}], OPT^*[2, p_{k_2}], \cdots, OPT^*[d, p_{k_d}]). \qquad (6.18)$$
Each $*$ can be chosen independently from $\{l, r\}$ and each $k_m$ can be chosen independently from $\{1, \cdots, K\}$ for all $m \in \{1, \cdots, d\}$. Hence, creating the links for a node $n$ involves, in the worst case, $(2K)^d$ integrals of the form (6.17). We summarize our algorithm, PTP, in Algorithm 6.
PTP runs in $O(NK^{d_{in}})$ time, where $d_{in}$ is the maximum in-degree of the task graph. We further investigate the value of $K$ required to guarantee that PTP runs in polynomial time. Let the smallest precision of probability measurement be $\epsilon$; that is, the system is no longer sensitive to any confidence difference less than $\epsilon$. For example, the partition that gives 90% confidence may make no difference in performance compared to the partition that gives 90.1% confidence. Let $\delta$ be the quantization step size, i.e., $K = \frac{1 - p_{obj}}{\delta}$. Since the quantizer defined in (6.11) always underestimates the probability, a solution error happens when the quantization error is large enough that PTP judges the optimal solution to be infeasible. Let $l$ be the longest path from a leaf node to the root. Since the quantization error accumulates over the tasks on each path, the maximum quantization error is at most $l\delta$. As the probability constraint in our optimization problem is a strict inequality, the probability of the optimal solution is at least $\epsilon$ away from $p_{obj}$ if we neglect fractions below $\epsilon$. Hence, we choose the quantization step size to be $\frac{\epsilon}{l}$ to guarantee that, for any instance, the optimal solution is always considered feasible by PTP. In other words, $K$ can be bounded by $O(\frac{l}{\epsilon})$. In the worst case, when the task graph is a chain ($l = N$), PTP runs in $O(N^{d_{in}+1}(\frac{1}{\epsilon})^{d_{in}})$ time. In the next section, numerical results show that PTP does not need as small a $\delta$ as our theoretical analysis suggests, but provides the optimal solution most of the time.

Table 6.1: PTP: Simulation Profile

Notation       Value
$T^l_i$        Exp, $\lambda^{-1}$ = Uniform ~ [10, 100] ms
$T^r_i$        Exp, $\lambda^{-1}$ = Uniform ~ [1, 10] ms
$T^c$          100 ms
$d_{mn}$       Uniform ~ [0.1, 10] MB
$p^c$          0.3 W
$p^{RF}$       0.7 W
6.3 Numerical Evaluation
We verify the accuracy of PTP on comprehensive problem instances. Our simulation is done as follows. For every problem instance, we fix the task graph as a perfect binary tree while the simulation profile varies. Table 6.1 summarizes how we choose the simulation profile. To model the dynamic resource network, $T^l_i$ and $T^r_i$ are independently exponentially distributed with means drawn uniformly from the given intervals. Similarly, the amount of data exchanged, $d_{mn}$, is drawn uniformly from its interval. The other parameters remain constant, as shown in Table 6.1. Given the chosen profile, we solve for the optimal partition by PTP and compare the solution with the one given by a brute-force algorithm, which simply checks all partitions and chooses the optimal feasible one.

Figure 6.1: PTP: error probability (probabilities of a wrong solution and of a missing solution vs. the number of quantization levels $K$, for task graphs of depth $d = 3$)

Fig. 6.1 shows the performance of PTP when solving the stochastic optimization problem for a task graph of depth 3. In general, we classify the solution errors into three types: first, PTP may provide a feasible solution that is not the optimal one; second, PTP may not find any feasible solution when at least one exists;
third, PTP may provide an infeasible solution. Since the quantizer in (6.11) always underestimates the probability that the total latency is less than $t_{max}$, PTP will never give an infeasible solution, but may miss some feasible solutions for small $K$ (when the quantization error is significant). As shown in Fig. 6.1, the solid line represents the overall error probability, which covers both the event of giving a sub-optimal solution and the event of not finding any feasible solution when at least one exists; the dashed line represents the probability of the latter. Since $K$ is related to the size of the interval $[p_{obj}, 1]$, where $p_{obj}$ is chosen close to 1 for a stronger QoS guarantee, our simulation results show that $K = 10$ (with corresponding step size 0.01 and $p_{obj} = 0.9$) provides good performance, with error probability 0.009.
6.4 Discussion
We have formulated a task partition problem, Probabilistic delay constrained Task Partitioning, and have provided an algorithm, PTP, that solves it when the application task graph is a tree. Our partition provides the strong performance guarantee that at least $p\%$ of the time the application latency is less than $t$ in a dynamic environment, where $p$ and $t$ are arbitrary numbers. Furthermore, instead of relying on an integer programming formulation that may not be solvable in polynomial time for all problem instances, we have shown that PTP runs in time polynomial in the problem size and provides the optimal solution most of the time.
Chapter 7
Online Learning in Stationary Environments
In this study, we consider the scenario where the devices' computation abilities and the channel qualities are unknown and dynamic. We formulate the task assignment as a multi-armed bandit problem, where the performance of each device and channel is modeled as a stationary process that evolves i.i.d. over time. We propose an online algorithm that learns the unknown environment and makes competitive task assignments over the network. This chapter is based on our work in [36].
7.1 Why Online Learning?
Given an application that consists of multiple tasks, we want to assign them to multiple devices, taking resource availability into account so that the system performance can be improved. The resources that are accessible over wireless connections form a resource network, which is subject to frequent topology changes and has the following features:

Dynamic device behavior: The quantity of released resources varies across devices and may also depend on the local processes that are running. Moreover, some devices may carry microprocessors that are specialized for a subset of tasks. Hence, the performance of each device varies highly over time and across tasks, and is hard to model as a known and stationary stochastic process.

Heterogeneous network with intermittent connections: Device mobility makes the connections intermittent, and their quality changes drastically within short time periods. Furthermore, different devices may use different protocols to communicate with each other. Hence, the performance of the links between devices is also highly dynamic and variable, and hard to model as a stationary process.

Since the resource network is subject to drastic changes over time and is hard to model with stationary stochastic processes, we need an algorithm that applies to all possible scenarios, learns the environment at run time, and adapts to changes. Existing works focus on solving optimization problems given a known deterministic profile or known stochastic distributions [58, 10]. These problems are hard to solve. More importantly, algorithms that lack learning ability can be harmed badly by statistical changes or by a mismatch between the (offline-trained) profile and the run-time environment. Hence, we use an online learning approach, which takes into account the performance during the learning phase, and aim to learn the environment quickly and adapt to changes. We start from stronger assumptions on the environment; that is, the resource network is characterized by stationary processes that evolve i.i.d. over time. In Chapter 8, we relax these assumptions so that the results apply to more general scenarios.
7.2 Models and Formulation
An application profile can be described by a task graph, where node $i$ with weight $m_i$ specifies the execution complexity of task $i$, and edge $(m, n)$ with weight $d_{mn}$ specifies the amount of data exchanged between tasks $m$ and $n$. An example of a weighted task graph is shown in Fig. 4.1. Although the task complexities and data communication amounts are fixed, the task execution latency and data transmission latency are non-deterministic, because the on-device resources (like CPU cycles) and the channel bandwidths vary with time. The dynamic environment (resource network) is described by the devices' computation performance and the channels' bandwidth. We use the same set of notations as in Chapter 4. Let $T^{(j)}$ be the latency of executing a unit task on device $j$, and $T^{(jk)}$ the latency of transmitting a unit amount of data from device $j$ to device $k$. Hence, the task execution latency and data transmission latency are
$$\begin{aligned}
T^{(j)}_i &= m_i T^{(j)}, & (7.1)\\
T^{(jk)}_{mn} &= d_{mn} T^{(jk)}. & (7.2)
\end{aligned}$$
We assume that $T^{(j)}$ and $T^{(jk)}$ are i.i.d. processes with unknown means $\theta^{(j)}$ and $\theta^{(jk)}$, respectively. For some real applications, like the ones considered in [11], a stream of video frames comes as input to be processed frame by frame; for example, a video-processing application takes a continuous stream of image frames, where each image arrives and goes through all the processing tasks. Since the performance of each device and channel is unknown, an online algorithm aims to learn the devices and channels by making different task assignments for each data frame (exploration), and to make use of the ones with better performance (exploitation). How it balances exploration and exploitation significantly affects an algorithm's competitiveness [44].
7.3 The Algorithm: Hermes with DSEE
We adapt the sampling method of deterministic sequencing of exploration and exploitation (DSEE) [45] to learn the unknown environment, and we derive the performance bound. The DSEE algorithm consists of two phases, exploration and exploitation. During the exploration phase, DSEE follows a fixed order to probe (sample) the unknown distributions thoroughly. Then, in the exploitation phase, DSEE exploits the best strategy based on the probing result. In [45], learning the unknown environment is modeled as a multi-armed bandit (MAB) problem, where at each time an agent chooses among a set of "arms", gets the payoff from the selected arm, and tries to learn statistical information from sensing it, which will be considered in future decisions. The goal is to identify the best arm through exploration and exploit it later on. However, exploration comes at a price due to the mismatch between the payoffs of the explored arm and the best one. Hence, we have to explore the environment efficiently and compare the performance with the optimal strategy (always choosing the best arm). The authors in [45] prove that the performance gap compared to the optimal strategy is bounded by a logarithmic function of the number of trials as long as each arm is sampled logarithmically often. That is, if we get enough samples from each arm ($O(\ln V)$ compared to the total number of trials $V$), we can make good enough decisions that the performance loss flattens out with time, which implies we can learn and exploit the best arm without losing noticeable payoff in the end.

We adapt DSEE to sample all devices and channels thoroughly in the exploration phase, calculate the sample means, and apply Hermes to solve for the optimal assignment based on the sample means. During the exploration phase, we design a fixed assignment strategy to get samples from devices and channels. For example, if task $n$ follows the execution of task $m$, then by assigning task $m$ to device $j$ and task $n$ to device $k$, we get one sample each of $T^{(j)}$, $T^{(k)}$ and $T^{(jk)}$. Since sampling all $M^2$ channels implies that all devices have been sampled $M$ times, we focus on sampling all channels using as few executions of the application as possible. That is, we would like to know, for each frame (an execution of the application), the maximum number of different channels we can sample. This number depends on the structure of the task graph and is, in fact, lower-bounded by the matching number of the graph. A matching on a graph is a set of edges no two of which share a node [59]; the matching number of a graph is then the maximum size of such a set. Taking an edge from the set, which connects two tasks in the task graph, we can assign these two tasks arbitrarily to get a sample of data transmission over our desired channel. Fig. 7.1 illustrates how we design the task assignment to sample as many channels as possible in one execution. First, we treat every directed edge as undirected and find that the graph has matching number 3; that is, we can sample at least 3 channels (AB, CA, BC) in one execution. Some tasks are left blank; we can assign them to other devices to get more samples.

Figure 7.1: The task graph with 3 edges as maximum matching
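A maximum matching of the (undirected) task graph can be computed with an off-the-shelf routine to design the per-frame sampling assignment; the sketch below uses NetworkX and a hypothetical edge list.

```python
import networkx as nx

def sampling_edges(task_edges):
    """Pick a maximum set of task-graph edges sharing no task (a maximum
    matching), so that one execution of the application can sample that many
    distinct channels. Directions are ignored, as in the discussion above."""
    G = nx.Graph()
    G.add_edges_from(task_edges)   # treat directed edges as undirected
    matching = nx.max_weight_matching(G, maxcardinality=True)
    return list(matching)          # each pair (m, n) is assigned to the pair
                                   # of devices whose channel is to be probed
```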
Algorithm 7 Hermes with DSEE
1: procedure HermesDSEE($w$)
2:   $r \leftarrow \lceil \frac{M^2 d_{max}}{|E|} \rceil$
3:   $A(0) \leftarrow \emptyset$  ▷ $A(v)$ defines the set of exploration epochs up to $v$
4:   for $v \leftarrow 1, \cdots, V$ do
5:     if $|A(v-1)| < \lceil w \ln v \rceil$ then  ▷ exploration phase
6:       for $t \leftarrow 1, \cdots, r$ do  ▷ each epoch contains $r$ frames
7:         Sample the channels with strategy $\hat{x}$
8:       end for
9:       Calculate the sample means $\bar{\theta}^{(j)}(v)$ and $\bar{\theta}^{(jk)}(v)$ for all $j, k \in [M]$
10:      $A(v) \leftarrow A(v-1) \cup \{v\}$
11:    else  ▷ exploitation phase
12:      Solve for the best strategy $\tilde{x}(v)$ with input $T^{(j)}_i = m_i \bar{\theta}^{(j)}(v)$, $T^{(jk)}_{mn} = d_{mn} \bar{\theta}^{(jk)}(v)$
13:      for $t \leftarrow 1, \cdots, r$ do
14:        Exploit the assignment strategy $\tilde{x}(v)$
15:      end for
16:    end if
17:  end for
18: end procedure
|E| , dmax
where dmax is the maximum degree of a
node [60]. For example, the matching number of the graph in Fig. 7.1 is lower bounded by
10 5
= 2. Hence, to sample each channel at least once, we require at 2
M most r = d dmax e frames. |E|
Algorithm 7 summarizes how we adapt Hermes to dynamic environment. We separate the time (frame) horizon into epoches, where each of them contains r 126
frames. Let A(v − 1) ⊆ {1, · · · , v − 1} be the set of exploration epoches prior to v. At epoch v, if the number of exploration epoches is below the threshold (|A(v − 1)| < dw ln ve), then epoch v is an exploration epoch. Algorithm 7 uses a ˆ to get samples. After r frames have been processed, fixed assignment strategy x Algorithm 7 gets at least one new sample from each channel and device, and updates the sample means. At an exploitation epoch, Algorithm 7 calls Hermes to solve for ˜ (v) based on current sample means, and uses this the best assignment strategy x assignment strategy for the frames in this epoch. In the following, we derive the performance guarantee of Algorithm 7. First, we present a lemma from [45], which specifies the probability bound on the deviation of sample mean.
Lemma 5. Let {X(t)}∞ t=1 be i.i.d. random variables drawn from a light-tailed distribution, that is, there exists u0 > 0 such that E[exp(uX)] < ∞ for all u ∈ ¯s = [−u0 , u0 ]. Let X
Ps
t=1
s
X(t)
and θ = E[X(1)]. We have, given ζ > 0, for all
η ∈ [0, ζu0 ], a ∈ (0, 2ζ1 ], ¯ s − θ| ≥ η} ≤ 2 exp(−aη 2 s). P{|X
(7.3)
Lemma 5 implies that the more samples we get, the less likely the sample mean is to deviate from the actual mean. From (4.2), the overall latency is the sum of the single-stage latencies ($T^{(j)}_i$ and $T^{(jk)}_{mn}$) along the slowest branch. Hence, we would like to use Lemma 5 to get a bound on the deviation of the total latency. Let $\beta$ be the maximum latency solved by Algorithm 2 with the input instance
$$\begin{aligned}
T^{(j)}_i &= m_i, \ \forall i \in [N],\, j \in [M],\\
T^{(jk)}_{mn} &= d_{mn}, \ \forall (m,n) \in E,\, j,k \in [M].
\end{aligned}$$
Hence, if every single-stage sample mean deviates by no more than $\eta$ from its actual mean, then the overall latency deviates by no more than $\beta\eta$. In order to prove the performance guarantee of Algorithm 7, we identify an event and bound its probability in the following lemma.

Lemma 6. Assume that $T^{(j)}$, $T^{(jk)}$ are independent random variables drawn from unknown light-tailed distributions with means $\theta^{(j)}$ and $\theta^{(jk)}$, for all $j, k \in [M]$. Let $a, \eta$ be numbers that satisfy Lemma 5. For each assignment strategy $x$, let $\bar{\theta}(x, v)$ be the total latency accumulated over the sample means calculated at epoch $v$, and $\theta(x)$ the actual expected total latency. Then, for each $v$,
$$P\{\exists x \in [M]^N : |\bar{\theta}(x, v) - \theta(x)| > \beta\eta\} \leq \sum_{n \in [M^2+M]} \binom{M^2+M}{n} (-1)(-2)^n e^{-na\eta^2 |A(v-1)|}. \qquad (7.4)$$
Proof. We want to bound the probability that there exists a strategy whose total deviation (accumulated over sample means) is greater than $\beta\eta$. We work on its complement, the event that the total deviation of every strategy is at most $\beta\eta$. That is,
$$P\{\exists x \in [M]^N : |\bar{\theta}(x, v) - \theta(x)| > \beta\eta\} = 1 - P\{|\bar{\theta}(x, v) - \theta(x)| \leq \beta\eta \ \forall x \in [M]^N\}. \qquad (7.5)$$
We further use the fact that if every single-stage deviation is at most $\eta$, then the total deviation is at most $\beta\eta$ for every strategy $x \in [M]^N$. Hence,
$$\begin{aligned}
1 - P\{|\bar{\theta}(x, v) - \theta(x)| &\leq \beta\eta \ \forall x \in [M]^N\} & (7.6)\\
&\leq 1 - P\Big\{\Big(\bigcap_{j \in [M]} |\bar{\theta}^{(j)} - \theta^{(j)}| \leq \eta\Big) \cap \Big(\bigcap_{j,k \in [M]} |\bar{\theta}^{(jk)} - \theta^{(jk)}| \leq \eta\Big)\Big\} & (7.7)\\
&= 1 - \prod_{j \in [M]} P\{|\bar{\theta}^{(j)} - \theta^{(j)}| \leq \eta\} \cdot \prod_{j,k \in [M]} P\{|\bar{\theta}^{(jk)} - \theta^{(jk)}| \leq \eta\} & (7.8)\\
&\leq 1 - \left[1 - 2e^{-a\eta^2 |A(v-1)|}\right]^{M^2+M} & (7.9)\\
&\leq \sum_{n \in [M^2+M]} \binom{M^2+M}{n} (-1)(-2)^n e^{-na\eta^2 |A(v-1)|}. & (7.10)
\end{aligned}$$
Leveraging the independence of all the random variables and Lemma 5, and noting that at epoch $v$ we have at least $|A(v-1)|$ samples of each unknown distribution, we arrive at (7.9). Finally, we use the binomial expansion to obtain the bound in (7.10).
In the following, we compare the performance of Algorithm 7 with the optimal strategy, which is obtained by solving Problem SCTA with the input instance
$$\begin{aligned}
T^{(j)}_i &= m_i \theta^{(j)}, \ \forall i \in [N],\, j \in [M],\\
T^{(jk)}_{mn} &= d_{mn} \theta^{(jk)}, \ \forall (m,n) \in E,\, j,k \in [M].
\end{aligned}$$
Theorem 4. Let $\eta = \frac{c}{2\beta}$, where $c$ is the smallest precision such that for any two assignment strategies $x$ and $y$, we have $|\theta(x) - \theta(y)| > c$ whenever $\theta(x) \neq \theta(y)$. Let $R_V$ be the expected performance gap accumulated up to epoch $V$. It can be bounded by
$$R_V \leq r\Delta(w \ln V + 1) + r\Delta \sum_{n \in [M^2+M]} \binom{M^2+M}{n} (-1)(-2)^n \Big(1 + \frac{1}{na\eta^2 w - 1}\Big). \qquad (7.11)$$

Proof. The expected performance gap consists of two parts: the expected loss due to the use of the fixed strategy during exploration ($R^{fix}_V$), and the expected loss due to the mismatch of strategies during exploitation ($R^{mis}_V$). During the exploration phase, the expected loss of each frame can be bounded by $\Delta$, which can be obtained by Algorithm 2, presented in Chapter 4, with $m_i \theta^{(j)}$ and $d_{mn} \theta^{(jk)}$ as the input instance. Since the number of exploration epochs $|A(v)|$ never exceeds $(w \ln V + 1)$, we have
$$R^{fix}_V \leq r\Delta(w \ln V + 1). \qquad (7.12)$$
On the other hand, $R^{mis}_V$ accumulates during the exploitation phase whenever the best strategy given by the sample means differs from the optimal strategy; this loss can also be bounded by $\Delta$ per frame. That is,
$$\begin{aligned}
R^{mis}_V &\leq E\Big\{\sum_{v \notin A(v)} r\Delta\, \mathbb{I}(\tilde{x}(v) \neq x^\star)\Big\} = r\Delta \sum_{v \notin A(v)} P\{\tilde{x}(v) \neq x^\star\} & (7.13)\\
&\leq r\Delta \sum_{v \notin A(v)} P\{\exists x \in [M]^N : |\bar{\theta}(x, v) - \theta(x)| > \beta\eta\} & (7.14)\\
&\leq r\Delta \sum_{v \notin A(v)} \sum_{n \in [M^2+M]} \binom{M^2+M}{n} (-1)(-2)^n e^{-na\eta^2 |A(v-1)|} & (7.15)\\
&\leq r\Delta \sum_{n \in [M^2+M]} \binom{M^2+M}{n} (-1)(-2)^n \sum_{v=1}^{\infty} v^{-na\eta^2 w} & (7.16)\\
&\leq r\Delta \sum_{n \in [M^2+M]} \binom{M^2+M}{n} (-1)(-2)^n \Big(1 + \frac{1}{na\eta^2 w - 1}\Big). & (7.17)
\end{aligned}$$
In (7.14), we bound the probability that the best strategy based on the sample means is not the optimal strategy. We identify the event that there exists a strategy $x$ whose deviation is greater than $\beta\eta$. If this event does not happen, then in the worst case the difference between any two strategies deviates by at most $2\beta\eta = c$; hence $\bar{\theta}(x^\star, v)$ is still the minimum, which implies Algorithm 7 still outputs the optimal strategy. We then use Lemma 6 in (7.15), and obtain (7.16) from the fact that epoch $v$ being in the exploitation phase implies $|A(v-1)| \geq w \ln v$. Finally, selecting $w$ large enough that $a\eta^2 w > 1$ guarantees the result in (7.17).
Theorem 4 shows that the performance gap consists of two parts, one of which grows logarithmically with $V$ while the other remains constant as $V$ increases. Hence, the growth of the performance gap becomes negligible for large $V$, which implies that Algorithm 7, although it starts out knowing nothing about the environment, learns the optimal strategy as time goes on. Furthermore, Theorem 4 provides an upper bound on the performance loss based on worst-case analysis, in which $w$ is a parameter left to the user of Algorithm 7. A smaller $w$ leads to less probing (exploration) and hence reduces the accumulated loss during exploration, but may increase the chance of missing the optimal strategy during exploitation. In the next section, we compare Algorithm 7 with other algorithms by simulation.
7.4 Numerical Evaluation
To measure the performance of Algorithm 7 in a dynamic environment, we simulate an application that processes a stream of data frames. The resource network consists of 3 devices with unit processing time $T^{(j)}$ on device $j$. The devices form a mesh network with unit data transmission time $T^{(jk)}$ over the channel between devices $j$ and $k$. We model $T^{(j)}$ and $T^{(jk)}$ as uniformly-distributed stochastic processes with given means that evolve i.i.d. over time. Hence, for each frame, we draw samples from the corresponding uniform distributions and obtain the single-stage latencies by (7.1) and (7.2).
Figure 7.2: The performance of Hermes using DSEE in a dynamic environment (per-frame latency and cost with their running averages against the optimum, and the cumulative performance gap with its bound, over 700 frames)
Figure 7.3: Comparison of Hermes using DSEE with other algorithms (average latency over 700 frames for Hermes updated frame by frame, Hermes with random exploration, Hermes with DSEE, and the optimum)
We run Algorithm 7 to probe the devices and channels and exploit the strategy that is best according to the sample means. Fig. 7.2 shows the performance of Hermes using DSEE as the sampling method. We see that the average latency per frame converges to the minimum, which implies Algorithm 7 learns the optimal strategy and exploits it most of the time. On the other hand, during the exploration phase Algorithm 7 uses a strategy that costs less but performs worse than the optimal one; hence, the average cost per frame is slightly lower than the cost induced by the optimal strategy. Finally, we measure the performance gap, i.e., the extra latency caused by sub-optimal strategies accumulated over frames. The gap flattens out in the end, which implies the increase in extra latency becomes negligible.

We compare Algorithm 7 with two other algorithms in Fig. 7.3. First, we propose a randomized sampling method as a baseline. During the exploration phase, Algorithm 7 uses a fixed strategy to sample the devices and channels thoroughly, whereas the baseline randomly selects an assignment strategy and gathers the samples; the resulting biased sample means cause significant performance loss during the exploitation phase. We also consider an algorithm that re-solves for the best strategy every frame; that is, at the end of each frame, it updates the sample means and runs Hermes to solve for the best strategy for the next frame. We can see that by updating the strategy every frame, the performance is slightly better than Algorithm 7. However, Algorithm 7 only runs Hermes at the beginning of each exploitation epoch, which adds only a tolerable amount of CPU load while providing competitive performance.
7.5 Discussion
We have formulated a multi-armed bandit problem for task assignment in an unknown, dynamic environment, where the performance of each device and channel is modeled as a stationary process that evolves i.i.d. over time. We have proposed an algorithm that combines Hermes, presented in Chapter 4, with the DSEE sampling method. Our algorithm uses a fixed sampling sequence to probe all the devices and channels thoroughly during the exploration phase, and runs Hermes to make the optimal task assignment based on the sampling results during the exploitation phase. Simulation results have validated our performance analysis and shown that our proposed algorithm makes competitive task assignments compared to the optimal strategy.
Chapter 8
Online Learning in Non-stationary Environments
In this study, we follow the online learning scenario considered in Chapter 7 but make no stochastic assumptions about the bandit processes. Instead, we adopt the adversarial multi-armed bandit formulation, where each arm's payoff is given by an arbitrary but bounded sequence. Hence, our proposed algorithm applies to any dynamic environment, including non-stationary ones. This chapter is based on our work in [37].
8.1 Problem Formulation
Suppose a data processing application consists of N tasks, whose dependencies are described by a directed acyclic graph (DAG) G = (V, E). There is an incoming data stream to be processed (T data frames in total), where each data frame t is required to go through all the tasks and leave afterwards. There are M available devices. The assignment strategy for data frame t is denoted by a vector $x^t = (x_1^t, \cdots, x_N^t)$, where $x_i^t$ denotes the device that executes task i. Given an assignment strategy, stage-wise costs apply to each node (task) for computation and to each edge for communication. The cost can correspond to the resource consumption for a device to complete a task, for example, energy consumption. In the following formulation we follow the tradition in the MAB literature and focus on maximizing a positive reward instead of minimizing the total cost; these are of course mathematically equivalent, e.g., by setting reward = maxCost − cost. That is, instead of minimizing the total latency, we have an equivalent formulation that maximizes the total time saved compared to the worst case. When processing data frame t, let $R_i^{(j)}(t)$ be the reward of executing task i on device j, and let $R_{mn}^{(jk)}(t)$ be the reward of transmitting the data of edge (m, n) from device j to device k. The reward sequences are unknown but bounded between 0 and 1. Our goal is to choose the assignment strategy for each data frame based on the previously observed samples, and to compare the performance with a genie that uses the best single assignment strategy for all data frames. That is,

$$R_{total}^{max} = \max_{x \in \mathcal{F}} \sum_{t=1}^{T} \left[ \sum_{i=1}^{N} R_i^{(x_i)}(t) + \sum_{(m,n)\in E} R_{mn}^{(x_m x_n)}(t) \right], \qquad (8.1)$$

where $\mathcal{F}$ represents the set of feasible solutions. The genie who knows all the reward sequences can find the best assignment strategy; our proposed online algorithm, not knowing these sequences in advance, aims to learn this best strategy and remain competitive in overall performance.
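To make the objective in (8.1) concrete, the following sketch evaluates its inner sum for one assignment x on one frame. The reward tables here are hypothetical stand-ins for the unknown sequences.

```python
# Hypothetical per-frame reward tables for illustration:
# R_node[i][j] is R_i^{(j)}(t); R_edge[(m, n)][(j, k)] is R_mn^{(jk)}(t).
R_node = {0: {0: 0.9, 1: 0.4}, 1: {0: 0.3, 1: 0.8}}
R_edge = {(0, 1): {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 1.0}}

def frame_reward(x, R_node, R_edge):
    """Reward collected by assignment x = (x_1, ..., x_N) on one frame:
    node rewards plus edge rewards, i.e., the summand of (8.1)."""
    node_part = sum(R_node[i][x[i]] for i in R_node)
    edge_part = sum(R_edge[(m, n)][(x[m], x[n])] for (m, n) in R_edge)
    return node_part + edge_part

# Device 0 for task 0, device 1 for task 1:
print(frame_reward((0, 1), R_node, R_edge))
```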
Algorithm 8 MABSTA
1: procedure MABSTA(γ, α)
2:   $w_y(1) \leftarrow 1\ \forall y \in \mathcal{F}$
3:   for t ← 1, 2, · · · , T do
4:     $W_t \leftarrow \sum_{y\in\mathcal{F}} w_y(t)$
5:     Draw $x^t$ from the distribution
        $$p_y(t) = (1-\gamma)\frac{w_y(t)}{W_t} + \frac{\gamma}{|\mathcal{F}|}. \qquad (8.2)$$
6:     Get rewards $\{R_i^{(x_i^t)}(t)\}_{i=1}^N$, $\{R_{mn}^{(x_m^t x_n^t)}(t)\}_{(m,n)\in E}$.
7:     $C_{ex}^i \leftarrow \{z \in \mathcal{F} \mid z_i = x_i^t\},\ \forall i$
8:     $C_{tx}^{mn} \leftarrow \{z \in \mathcal{F} \mid z_m = x_m^t,\ z_n = x_n^t\},\ \forall (m,n)$
9:     for $\forall j \in [M],\ \forall i \in [N]$ do
        $$\hat{R}_i^{(j)}(t) = \begin{cases} R_i^{(j)}(t) \big/ \sum_{z\in C_{ex}^i} p_z(t) & \text{if } x_i^t = j, \\ 0 & \text{otherwise.} \end{cases} \qquad (8.3)$$
10:    end for
11:    for $\forall j,k \in [M],\ \forall (m,n) \in E$ do
        $$\hat{R}_{mn}^{(jk)}(t) = \begin{cases} R_{mn}^{(jk)}(t) \big/ \sum_{z\in C_{tx}^{mn}} p_z(t) & \text{if } x_m^t = j,\ x_n^t = k, \\ 0 & \text{otherwise.} \end{cases} \qquad (8.4)$$
12:    end for
13:    Update for all y:
        $$\hat{R}_y(t) = \sum_{i=1}^N \hat{R}_i^{(y_i)}(t) + \sum_{(m,n)\in E} \hat{R}_{mn}^{(y_m y_n)}(t), \qquad (8.5)$$
        $$w_y(t+1) = w_y(t)\exp\left(\alpha \hat{R}_y(t)\right). \qquad (8.6)$$
14:   end for
15: end procedure

8.2 MABSTA Algorithm
We propose MABSTA (Multi-Armed Bandit based Systematic Task Assignment), summarized in Algorithm 8, which learns the environment and makes task assignments at run time. For each data frame t, MABSTA randomly selects a feasible assignment (arm $x \in \mathcal{F}$) from a probability distribution that depends on the weights of the arms ($w_y(t)$). It then updates the weights based on the reward samples. From (8.2), MABSTA randomly switches between two phases: exploitation (with probability 1 − γ) and exploration (with probability γ). In the exploitation phase, MABSTA selects an arm based on its weight; hence, arms with higher reward samples are more likely to be chosen. In the exploration phase, MABSTA selects an arm uniformly, without considering its performance. The fact that MABSTA keeps probing every arm makes it adaptive to changes in the environment, in contrast to a static strategy that always plays the previously best arm without noticing that other arms may now perform better.
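A minimal, brute-force sketch of Algorithm 8 follows: it enumerates all $M^N$ arms explicitly, so it illustrates (8.2)-(8.6) rather than the efficient implementation of Section 8.4, and the reward oracle `get_rewards` is a hypothetical placeholder.

```python
import itertools, math, random

def mabsta(N, M, edges, get_rewards, T, gamma, alpha):
    """Explicit-enumeration sketch of Algorithm 8: weights over all M^N arms.
    Exponential in N, so illustrative only; Section 8.4 gives the efficient version."""
    arms = list(itertools.product(range(M), repeat=N))
    w = dict.fromkeys(arms, 1.0)                                  # w_y(1) = 1
    for t in range(T):
        W = sum(w.values())
        p = {y: (1 - gamma) * w[y] / W + gamma / len(arms) for y in arms}
        x = random.choices(arms, weights=[p[y] for y in arms])[0]  # draw per (8.2)
        r_node, r_edge = get_rewards(t, x)                         # observed rewards
        # Denominators of (8.3) and (8.4): probability mass of agreeing arms.
        d_node = [sum(p[z] for z in arms if z[i] == x[i]) for i in range(N)]
        d_edge = {e: sum(p[z] for z in arms
                         if z[e[0]] == x[e[0]] and z[e[1]] == x[e[1]])
                  for e in edges}
        for y in arms:
            r_hat = sum(r_node[i] / d_node[i] for i in range(N) if y[i] == x[i])
            r_hat += sum(r_edge[e] / d_edge[e] for e in edges
                         if y[e[0]] == x[e[0]] and y[e[1]] == x[e[1]])
            w[y] *= math.exp(alpha * r_hat)                        # (8.5)-(8.6)
    return w

# Example: 2 tasks, 2 devices, one edge; hypothetical rewards in [0, 1].
final_w = mabsta(N=2, M=2, edges=[(0, 1)], T=200, gamma=0.1, alpha=0.05,
                 get_rewards=lambda t, x: ([random.random() for _ in range(2)],
                                           {(0, 1): random.random()}))
```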
The commonly used performance measure for an MAB algorithm is its regret. In our case it is defined as the difference between the accumulated reward ($\hat{R}_{total}$) and that of a genie that knows all the rewards and selects the single best strategy for all data frames ($R_{total}^{max}$ in (8.1)). Auer et al. [30] propose Exp3 for the adversarial MAB problem. However, if we apply Exp3 directly to our online task assignment problem, the regret bound grows exponentially, since we have an exponential number of arms ($M^N$). The following theorem shows that MABSTA guarantees a regret bound that is polynomial in the problem size and $O(\sqrt{T})$.
Theorem 5. Assume all the reward sequences are bounded between 0 and 1. Let $\hat{R}_{total}$ be the total reward achieved by Algorithm 8. For any $\gamma \in (0,1)$, letting $\alpha = \frac{\gamma}{M(N+|E|M)}$, we have

$$R_{total}^{max} - \mathbb{E}\{\hat{R}_{total}\} \le (e-1)\gamma R_{total}^{max} + \frac{M(N+|E|M)\ln M^N}{\gamma}. \qquad (8.7)$$
Here, N is the number of nodes (tasks) and |E| is the number of edges in the task graph. We defer the proof of Theorem 5 to Section 8.3. By choosing the appropriate value of γ and using the upper bound $R_{total}^{max} \le (N+|E|)T$, we obtain the following corollary.
Corollary 1. Let $\gamma = \min\left\{1, \sqrt{\frac{M(N+|E|M)\ln M^N}{(e-1)(N+|E|)T}}\right\}$; then

$$R_{total}^{max} - \mathbb{E}\{\hat{R}_{total}\} \le 2.63\sqrt{(N+|E|)(N+|E|M)MNT\ln M}. \qquad (8.8)$$
Consider the worst case, where $|E| = O(N^2)$. The regret can then be bounded by $O(N^{2.5}MT^{0.5})$. Since the bound is a concave function of T, we define the learning time $T_0$ as the time when its slope falls below a constant c. That is,

$$T_0 = \frac{1.73}{c^2}(N+|E|)(N+|E|M)MN\ln M. \qquad (8.9)$$

This learning time is a significant improvement over applying Exp3 to our problem, for which $T_0 = O(M^N)$. As we will show in the numerical results, MABSTA performs significantly better than Exp3 in the trace-data emulation.
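As a quick sanity check on how these bounds scale, the snippet below simply plugs numbers into (8.8) and (8.9) for a few problem sizes; the dense task graph with $|E| = N(N-1)/2$ is an illustrative choice.

```python
import math

def regret_bound(N, M, E, T):
    """Right-hand side of (8.8)."""
    return 2.63 * math.sqrt((N + E) * (N + E * M) * M * N * T * math.log(M))

def learning_time(N, M, E, c):
    """Learning time T0 from (8.9)."""
    return 1.73 / c**2 * (N + E) * (N + E * M) * M * N * math.log(M)

for (N, M) in [(5, 3), (5, 5), (10, 5)]:
    E = N * (N - 1) // 2  # a dense task graph, |E| = O(N^2)
    print(N, M, round(regret_bound(N, M, E, T=10**5)),
          round(learning_time(N, M, E, c=1.0)))
```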
8.3 Proof of Theorem 5

We first prove the following lemmas. We use condensed notations like $\hat{R}_i^{(y_i)}$ for $\hat{R}_i^{(y_i)}(t)$ and $\hat{R}_{mn}^{(y_m y_n)}$ for $\hat{R}_{mn}^{(y_m y_n)}(t)$ in the proofs wherever the result holds for each t.
8.3.1 Proof of Lemmas
Lemma 7. For all t = 1, · · · , T, we have

$$\sum_{y\in\mathcal{F}} p_y(t)\hat{R}_y(t) = \sum_{i=1}^N R_i^{(x_i^t)}(t) + \sum_{(m,n)\in E} R_{mn}^{(x_m^t x_n^t)}(t). \qquad (8.10)$$

Proof.

$$\sum_{y\in\mathcal{F}} p_y(t)\hat{R}_y(t) = \sum_{y\in\mathcal{F}} p_y\left[\sum_{i=1}^N \hat{R}_i^{(y_i)} + \sum_{(m,n)\in E} \hat{R}_{mn}^{(y_m y_n)}\right] = \sum_i\sum_y p_y\hat{R}_i^{(y_i)} + \sum_{(m,n)}\sum_y p_y\hat{R}_{mn}^{(y_m y_n)}, \qquad (8.11)$$

where

$$\sum_y p_y\hat{R}_i^{(y_i)} = \sum_{y\in C_{ex}^i} p_y\frac{R_i^{(x_i^t)}}{\sum_{z\in C_{ex}^i} p_z} = R_i^{(x_i^t)}, \qquad (8.12)$$

and similarly,

$$\sum_y p_y\hat{R}_{mn}^{(y_m y_n)} = R_{mn}^{(x_m^t x_n^t)}. \qquad (8.13)$$

Applying these results to (8.11) completes the proof.

Lemma 8. For all $y \in \mathcal{F}$, we have

$$\mathbb{E}\{\hat{R}_y(t)\} = \sum_{i=1}^N R_i^{(y_i)}(t) + \sum_{(m,n)\in E} R_{mn}^{(y_m y_n)}(t). \qquad (8.14)$$

Proof.

$$\mathbb{E}\{\hat{R}_y(t)\} = \sum_{i=1}^N \mathbb{E}\{\hat{R}_i^{(y_i)}\} + \sum_{(m,n)\in E} \mathbb{E}\{\hat{R}_{mn}^{(y_m y_n)}\}, \qquad (8.15)$$

where

$$\mathbb{E}\{\hat{R}_i^{(y_i)}\} = \mathbb{P}\{x_i^t = y_i\}\,\frac{R_i^{(y_i)}}{\sum_{z\in C_{ex}^i} p_z} = R_i^{(y_i)}, \qquad (8.16)$$

and similarly,

$$\mathbb{E}\{\hat{R}_{mn}^{(y_m y_n)}\} = R_{mn}^{(y_m y_n)}. \qquad (8.17)$$
Lemma 9. If $\mathcal{F} = \{x \in [M]^N\}$, then for $M \ge 3$ and $|E| \ge 3$,

$$\sum_{y\in\mathcal{F}} p_y(t)\hat{R}_y(t)^2 \le \frac{|E|}{M^{N-2}}\sum_{y\in\mathcal{F}} \hat{R}_y(t). \qquad (8.18)$$

Proof. We first expand the left-hand side of the inequality as

$$\sum_{y\in\mathcal{F}} p_y(t)\hat{R}_y(t)^2 = \sum_{y\in\mathcal{F}} p_y\left[\sum_{i=1}^N \hat{R}_i^{(y_i)} + \sum_{(m,n)\in E} \hat{R}_{mn}^{(y_m y_n)}\right]^2 \qquad (8.19)$$
$$= \sum_{y\in\mathcal{F}} p_y\left[\sum_{i,j}\hat{R}_i^{(y_i)}\hat{R}_j^{(y_j)} + \sum_{(m,n),(u,v)}\hat{R}_{mn}^{(y_m y_n)}\hat{R}_{uv}^{(y_u y_v)} + 2\sum_i\sum_{(m,n)}\hat{R}_i^{(y_i)}\hat{R}_{mn}^{(y_m y_n)}\right]. \qquad (8.20)$$

In the following, we derive an upper bound for each term in (8.20), for all $i \in [N]$ and $(m,n) \in E$:

$$\sum_y p_y\hat{R}_i^{(y_i)}\hat{R}_j^{(y_j)} = \sum_{y\in C_{ex}^i\cap C_{ex}^j} p_y\,\frac{R_i^{(x_i^t)}}{\sum_{z\in C_{ex}^i} p_z}\cdot\frac{R_j^{(x_j^t)}}{\sum_{z\in C_{ex}^j} p_z} \qquad (8.21)$$
$$\le R_j^{(x_j^t)}\hat{R}_i^{(x_i^t)} \le \frac{1}{M^{N-1}}\sum_y \hat{R}_i^{(y_i)}. \qquad (8.22)$$

The first inequality in (8.22) follows because $C_{ex}^i\cap C_{ex}^j$ is a subset of $C_{ex}^j$, and the last inequality follows because $\hat{R}_i^{(y_i)} = \hat{R}_i^{(x_i^t)}$ for all y in $C_{ex}^i$. Hence,

$$\sum_{i,j}\sum_y p_y\hat{R}_i^{(y_i)}\hat{R}_j^{(y_j)} \le \frac{1}{M^{N-2}}\sum_y\sum_i \hat{R}_i^{(y_i)}. \qquad (8.23)$$

Similarly,

$$\sum_{(m,n),(u,v)}\sum_y p_y\hat{R}_{mn}^{(y_m y_n)}\hat{R}_{uv}^{(y_u y_v)} \le \frac{|E|}{M^{N-2}}\sum_y\sum_{(m,n)} \hat{R}_{mn}^{(y_m y_n)}. \qquad (8.24)$$

For the last term in (8.20), a similar argument gives

$$\sum_y p_y\hat{R}_i^{(y_i)}\hat{R}_{mn}^{(y_m y_n)} = \sum_{y\in C_{ex}^i\cap C_{tx}^{mn}} p_y\,\frac{R_i^{(x_i^t)}}{\sum_{z\in C_{ex}^i} p_z}\cdot\frac{R_{mn}^{(x_m^t x_n^t)}}{\sum_{z\in C_{tx}^{mn}} p_z} \qquad (8.25)$$
$$\le R_{mn}^{(x_m^t x_n^t)}\hat{R}_i^{(x_i^t)} \le \frac{1}{M^{N-1}}\sum_y \hat{R}_i^{(y_i)}. \qquad (8.26)$$

Hence,

$$\sum_i\sum_{(m,n)}\sum_y p_y\hat{R}_i^{(y_i)}\hat{R}_{mn}^{(y_m y_n)} \le \frac{|E|}{M^{N-1}}\sum_y\sum_i \hat{R}_i^{(y_i)}. \qquad (8.27)$$

Applying (8.23), (8.24), and (8.27) to (8.20) gives

$$\sum_{y\in\mathcal{F}} p_y(t)\hat{R}_y(t)^2 \le \sum_{y\in\mathcal{F}}\left[\left(\frac{1}{M^{N-2}}+\frac{2|E|}{M^{N-1}}\right)\sum_i \hat{R}_i^{(y_i)} + \frac{|E|}{M^{N-2}}\sum_{(m,n)} \hat{R}_{mn}^{(y_m y_n)}\right] \qquad (8.28)$$
$$\le \frac{|E|}{M^{N-2}}\sum_{y\in\mathcal{F}} \hat{R}_y(t). \qquad (8.29)$$

The last inequality follows from the fact that $\frac{1}{M^{N-2}}+\frac{2|E|}{M^{N-1}} \le \frac{|E|}{M^{N-2}}$ for $M \ge 3$ and $|E| \ge 3$. For M = 2, we have

$$\sum_{y\in\mathcal{F}} p_y(t)\hat{R}_y(t)^2 \le \frac{M+2|E|}{M^{N-1}}\sum_{y\in\mathcal{F}} \hat{R}_y(t). \qquad (8.30)$$

Since we are interested in the regime where (8.29) holds, we use that result in our proof of Theorem 5.

Lemma 10. Let $\alpha = \frac{\gamma}{M(N+|E|M)}$. If $\mathcal{F} = \{x \in [M]^N\}$, then for all $y \in \mathcal{F}$ and all t = 1, · · · , T, we have $\alpha\hat{R}_y(t) \le 1$.

Proof. Since $|C_{ex}^i| \ge M^{N-1}$ and $|C_{tx}^{mn}| \ge M^{N-2}$ for all $i \in [N]$ and $(m,n) \in E$, and $p_z \ge \gamma/M^N$ for all z, each term in $\hat{R}_y(t)$ can be upper bounded as

$$\hat{R}_i^{(y_i)} \le \frac{R_i^{(y_i)}}{\sum_{z\in C_{ex}^i} p_z} \le \frac{1}{M^{N-1}\frac{\gamma}{M^N}} = \frac{M}{\gamma}, \qquad (8.31)$$
$$\hat{R}_{mn}^{(y_m y_n)} \le \frac{R_{mn}^{(y_m y_n)}}{\sum_{z\in C_{tx}^{mn}} p_z} \le \frac{1}{M^{N-2}\frac{\gamma}{M^N}} = \frac{M^2}{\gamma}. \qquad (8.32)$$

Hence, we have

$$\hat{R}_y(t) = \sum_{i=1}^N \hat{R}_i^{(y_i)} + \sum_{(m,n)\in E} \hat{R}_{mn}^{(y_m y_n)} \le N\frac{M}{\gamma} + |E|\frac{M^2}{\gamma} = \frac{M(N+|E|M)}{\gamma}. \qquad (8.33)$$

Letting $\alpha = \frac{\gamma}{M(N+|E|M)}$, we achieve the result.
8.3.2 Proof of Theorem 5

Proof. Let $W_t = \sum_{y\in\mathcal{F}} w_y(t)$. We denote the sequence of decisions drawn at each frame as $x = [x^1, \cdots, x^T]$, where $x^t \in \mathcal{F}$ denotes the arm drawn at step t. Then for every data frame t,

$$\frac{W_{t+1}}{W_t} = \sum_{y\in\mathcal{F}}\frac{w_y(t)}{W_t}\exp\left(\alpha\hat{R}_y(t)\right) \qquad (8.34)$$
$$= \sum_{y\in\mathcal{F}}\frac{p_y(t)-\frac{\gamma}{|\mathcal{F}|}}{1-\gamma}\exp\left(\alpha\hat{R}_y(t)\right) \qquad (8.35)$$
$$\le \sum_{y\in\mathcal{F}}\frac{p_y(t)-\frac{\gamma}{|\mathcal{F}|}}{1-\gamma}\left[1+\alpha\hat{R}_y(t)+(e-2)\alpha^2\hat{R}_y(t)^2\right] \qquad (8.36)$$
$$\le 1+\frac{\alpha}{1-\gamma}\left[\sum_{i=1}^N R_i^{(x_i^t)}(t)+\sum_{(m,n)\in E} R_{mn}^{(x_m^t x_n^t)}(t)\right]+\frac{(e-2)\alpha^2}{1-\gamma}\frac{|E|}{M^{N-2}}\sum_{y\in\mathcal{F}}\hat{R}_y(t). \qquad (8.37)$$

Eq. (8.36) follows from the fact that $e^x \le 1+x+(e-2)x^2$ for $x \le 1$, which applies here because Lemma 10 guarantees $\alpha\hat{R}_y(t) \le 1$. Applying Lemma 7 and Lemma 9, we arrive at (8.37). Using $1+x \le e^x$ and taking logarithms on both sides,

$$\ln\frac{W_{t+1}}{W_t} \le \frac{\alpha}{1-\gamma}\left[\sum_{i=1}^N R_i^{(x_i^t)}(t)+\sum_{(m,n)\in E} R_{mn}^{(x_m^t x_n^t)}(t)\right]+\frac{(e-2)\alpha^2}{1-\gamma}\frac{|E|}{M^{N-2}}\sum_{y\in\mathcal{F}}\hat{R}_y(t). \qquad (8.38)$$

Summing from t = 1 to T gives

$$\ln\frac{W_{T+1}}{W_1} \le \frac{\alpha}{1-\gamma}\hat{R}_{total}+\frac{(e-2)\alpha^2}{1-\gamma}\frac{|E|}{M^{N-2}}\sum_{t=1}^T\sum_{y\in\mathcal{F}}\hat{R}_y(t). \qquad (8.39)$$

On the other hand,

$$\ln\frac{W_{T+1}}{W_1} \ge \ln\frac{w_z(T+1)}{W_1} = \alpha\sum_{t=1}^T\hat{R}_z(t)-\ln M^N, \quad \forall z\in\mathcal{F}. \qquad (8.40)$$

Combining (8.39) and (8.40) gives

$$\hat{R}_{total} \ge (1-\gamma)\sum_{t=1}^T\hat{R}_z(t)-(e-2)\alpha\frac{|E|}{M^{N-2}}\sum_{t=1}^T\sum_{y\in\mathcal{F}}\hat{R}_y(t)-\frac{\ln M^N}{\alpha}. \qquad (8.41)$$

Eq. (8.41) holds for all $z \in \mathcal{F}$. Choose $x^\star$ to be the assignment strategy that maximizes the objective in (8.1). We now take expectations on both sides over $x^1, \cdots, x^T$ and use Lemma 8. That is,

$$\sum_{t=1}^T\mathbb{E}\{\hat{R}_{x^\star}(t)\} = \sum_{t=1}^T\left[\sum_{i=1}^N R_i^{(x_i^\star)}(t)+\sum_{(m,n)\in E} R_{mn}^{(x_m^\star x_n^\star)}(t)\right] = R_{total}^{max}, \qquad (8.42)$$

and

$$\sum_{t=1}^T\sum_{y\in\mathcal{F}}\mathbb{E}\{\hat{R}_y(t)\} = \sum_{t=1}^T\sum_{y\in\mathcal{F}}\left[\sum_{i=1}^N R_i^{(y_i)}(t)+\sum_{(m,n)\in E} R_{mn}^{(y_m y_n)}(t)\right] \le M^N R_{total}^{max}. \qquad (8.43)$$

Applying these results to (8.41) gives

$$\mathbb{E}\{\hat{R}_{total}\} \ge (1-\gamma)R_{total}^{max}-|E|M^2(e-2)\alpha R_{total}^{max}-\frac{\ln M^N}{\alpha}. \qquad (8.44)$$

Letting $\alpha = \frac{\gamma}{M(N+|E|M)}$, we arrive at

$$R_{total}^{max}-\mathbb{E}\{\hat{R}_{total}\} \le (e-1)\gamma R_{total}^{max}+\frac{M(N+|E|M)\ln M^N}{\gamma}. \qquad (8.45)$$

8.4 Polynomial Time MABSTA
In Algorithm 8, since there are exponentially many arms, a naive implementation requires exponential storage and complexity. In the following, however, we propose an equivalent but efficient implementation. We show that when the task graph belongs to a subset of DAGs that appear in practical applications (namely, parallel chains of trees), Algorithm 8 can run in polynomial time with polynomial storage. We observe that in (8.5), $\hat{R}_y(t)$ relies on the estimates of each node and each edge. Hence, we rewrite (8.6) as

$$w_y(t+1) = \exp\left(\alpha\sum_{\tau=1}^t \hat{R}_y(\tau)\right) \qquad (8.46)$$
$$= \exp\left(\alpha\sum_{i=1}^N \tilde{R}_i^{(y_i)}(t)+\alpha\sum_{(m,n)\in E} \tilde{R}_{mn}^{(y_m y_n)}(t)\right), \qquad (8.47)$$

where

$$\tilde{R}_i^{(y_i)}(t) = \sum_{\tau=1}^t \hat{R}_i^{(y_i)}(\tau), \qquad \tilde{R}_{mn}^{(y_m y_n)}(t) = \sum_{\tau=1}^t \hat{R}_{mn}^{(y_m y_n)}(\tau). \qquad (8.48)$$
Algorithm 9 Calculate $\omega_N^{(j)}$ for a tree-structured task graph
1: procedure Ω(N, M, G)
2:   q ← BFS(G, N)                       ▷ run BFS from N and store visited nodes in order
3:   for i ← q.end, · · · , q.start do   ▷ start from the last element
4:     if i is a leaf then               ▷ initialize ω values of leaves
5:       $\omega_i^{(j)} \leftarrow e_i^{(j)}$
6:     else
7:       $\omega_i^{(j)} \leftarrow e_i^{(j)}\prod_{m\in\mathcal{N}_i}\sum_{y_m\in[M]} e_{mi}^{(y_m j)}\omega_m^{(y_m)}$
8:     end if
9:   end for
10: end procedure
To calculate $w_y(t)$, it suffices to store $\tilde{R}_i^{(j)}(t)$ and $\tilde{R}_{mn}^{(jk)}(t)$ for all $i \in [N]$, $(m,n) \in E$, and $j,k \in [M]$, which costs $(NM+|E|M^2)$ storage. Eqs. (8.3) and (8.4) require knowledge of the marginal probabilities $\mathbb{P}\{x_i^t = j\}$ and $\mathbb{P}\{x_m^t = j, x_n^t = k\}$. Next, we propose a polynomial time algorithm to calculate them. From (8.2), the marginal probability can be written as

$$\mathbb{P}\{x_i^t = j\} = (1-\gamma)\frac{1}{W_t}\sum_{y:\,y_i = j} w_y(t)+\frac{\gamma}{M}. \qquad (8.49)$$

Hence, without calculating $W_t$, we have

$$\left(\mathbb{P}\{x_i^t = j\}-\frac{\gamma}{M}\right) : \left(\mathbb{P}\{x_i^t = k\}-\frac{\gamma}{M}\right) = \sum_{y:\,y_i=j} w_y(t) : \sum_{y:\,y_i=k} w_y(t). \qquad (8.50)$$
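A sketch of the bookkeeping this implies is shown below: the running estimates $\tilde{R}$ are stored in $O(NM+|E|M^2)$ arrays and updated once per frame. The update follows the estimate rules of (8.3)-(8.4), and the marginal probabilities `p_node` and `p_edge` are assumed to be supplied by the marginal-probability routine described next.

```python
# Running estimate tables: R_tilde_node[i][j] and R_tilde_edge[(m,n)][j][k],
# i.e., (N*M + |E|*M^2) floats in total.
def make_tables(N, M, edges):
    node = [[0.0] * M for _ in range(N)]
    edge = {e: [[0.0] * M for _ in range(M)] for e in edges}
    return node, edge

def update_tables(node, edge, x, r_node, r_edge, p_node, p_edge):
    """One frame of updates. p_node[i] = P{x_i^t = x_i} and
    p_edge[(m, n)] = P{x_m^t = x_m, x_n^t = x_n}, the marginals of (8.49)."""
    for i, j in enumerate(x):
        node[i][j] += r_node[i] / p_node[i]                 # per (8.3)
    for (m, n), r in r_edge.items():
        edge[(m, n)][x[m]][x[n]] += r / p_edge[(m, n)]      # per (8.4)
```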
Figure 8.1: The dependency of weights on a tree-structure task graph. (Task 6 is the root, with children tasks 4 and 5; the weight $\omega_6^{(j)}$ depends on $e_6^{(j)}$, $e_{46}^{(kj)}\omega_4^{(k)}$, and $e_{56}^{(lj)}\omega_5^{(l)}$.)
8.4.1 Tree-structure Task Graph

Now we focus on how to calculate the sum of weights in (8.50) efficiently. We start from tree-structure task graphs and solve more general graphs by calling the proposed algorithm for trees a polynomial number of times. We drop the time index t in our derivation whenever the result holds for all time steps $t \in \{1, \cdots, T\}$; for example, $\tilde{R}_i^{(j)} \equiv \tilde{R}_i^{(j)}(t)$. We assume that the task graph is a tree with N nodes, where the Nth node is the root (final task). Let $e_i^{(j)} = \exp(\alpha\tilde{R}_i^{(j)})$ and $e_{mn}^{(jk)} = \exp(\alpha\tilde{R}_{mn}^{(jk)})$. Hence, the weight in (8.47) can be written as a product of $e_i^{(j)}$ and $e_{mn}^{(jk)}$ factors. That is,

$$\sum_y w_y(t) = \sum_y\prod_{i=1}^N e_i^{(y_i)}\prod_{(m,n)\in E} e_{mn}^{(y_m y_n)}. \qquad (8.51)$$
For a node v, we use $D_v$ to denote the set of its descendants, and let the set $E_v$ denote the edges connecting its descendants. Formally,

$$E_v = \{(m,n) \in E \mid m \in D_v,\ n \in D_v\cup\{v\}\}. \qquad (8.52)$$

The set of $|D_v|$-dimensional vectors, $\{y_m\}_{m\in D_v}$, denotes all the possible assignments on its descendants. Finally, we define the sub-problem $\omega_i^{(j)}$, which calculates the sum of weights over all possible assignments on task i's descendants, given that task i is assigned to device j. That is,

$$\omega_i^{(j)} = e_i^{(j)}\sum_{\{y_m\}_{m\in D_i}}\prod_{m\in D_i} e_m^{(y_m)}\prod_{(m,n)\in E_i} e_{mn}^{(y_m y_n)}. \qquad (8.53)$$

Figure 8.1 shows an example of a tree-structure task graph. Tasks 4 and 5 are the children of task 6, where we have

$$D_6 = \{1,2,3,4,5\}, \quad E_6 = \{(1,4),(2,4),(3,5),(4,6),(5,6)\}.$$

From (8.53), if we have $\omega_4^{(k)}$ and $\omega_5^{(l)}$ for all k and l, then $\omega_6^{(j)}$ can be solved by

$$\omega_6^{(j)} = e_6^{(j)}\sum_{k,l} e_{46}^{(kj)}\omega_4^{(k)} e_{56}^{(lj)}\omega_5^{(l)}. \qquad (8.54)$$

In general, the relation between the weights of task i and those of its children $m \in \mathcal{N}_i$ is given by

$$\omega_i^{(j)} = e_i^{(j)}\sum_{\{y_m\}_{m\in\mathcal{N}_i}}\prod_{m\in\mathcal{N}_i} e_{mi}^{(y_m j)}\omega_m^{(y_m)} = e_i^{(j)}\prod_{m\in\mathcal{N}_i}\sum_{y_m\in[M]} e_{mi}^{(y_m j)}\omega_m^{(y_m)}. \qquad (8.55)$$
Figure 8.2: The dependency of weights on a serial-tree task graph

Algorithm 9 summarizes our approach to calculating the sum of weights for a tree-structure task graph. We first run breadth-first search (BFS) from the root node. We then solve the sub-problems starting from the last visited node, so that when solving task i, all of its child tasks are guaranteed to have been solved. Let $d_{in}$ denote the maximum in-degree of G (i.e., the maximum number of incoming edges of a node). Running BFS takes polynomial time. Each sub-problem involves at most $d_{in}$ products of summations over M terms. In total, Algorithm 9 solves NM sub-problems. Hence, Algorithm 9 runs in $\Theta(d_{in}NM^2)$ time.
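A compact sketch of this bottom-up computation is given below. The tree is represented by child lists, and `e_node` / `e_edge` stand for the exponentiated estimates $e_i^{(j)}$ and $e_{mi}^{(kj)}$; the recursion visits children before parents, which plays the same role as the reverse-BFS order of Algorithm 9 and likewise runs in $\Theta(d_{in}NM^2)$ time.

```python
def tree_weights(children, root, M, e_node, e_edge):
    """Compute omega[i][j] for every task i in a tree, per (8.55).
    children[i]: list of child tasks of i; e_node[i][j] = e_i^{(j)};
    e_edge[(m, i)][k][j] = e_mi^{(kj)}."""
    omega = {}
    def solve(i):                      # post-order: children before parent
        for m in children[i]:
            solve(m)
        omega[i] = []
        for j in range(M):
            w = e_node[i][j]
            for m in children[i]:      # product over children, sum over y_m
                w *= sum(e_edge[(m, i)][k][j] * omega[m][k] for k in range(M))
            omega[i].append(w)
    solve(root)
    return omega

# Tiny check: leaves 0, 1 feed root 2; with all factors 1, omega at the root
# counts the M^{N-1} = 4 assignments of the descendants.
children = {0: [], 1: [], 2: [0, 1]}
e_node = {i: [1.0, 1.0] for i in range(3)}
e_edge = {(0, 2): [[1.0] * 2 for _ in range(2)], (1, 2): [[1.0] * 2 for _ in range(2)]}
print(tree_weights(children, root=2, M=2, e_node=e_node, e_edge=e_edge)[2])  # [4.0, 4.0]
```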
8.4.2 More General Task Graphs
All nodes in a tree-structure task graph have only one outgoing edge. For task graphs in which some node has multiple outgoing edges, we decompose the task graph into multiple trees, solve them separately, and combine the solutions in the end. In the following, we use an example of a task graph consisting of serial trees to illustrate our approach.

Figure 8.2 shows a task graph that has two trees, rooted at tasks $i_1$ and $i_2$, respectively. Let the sub-problem $\omega_{i_2|i_1}^{(j_2|j_1)}$ denote the sum of weights given that $i_2$ is assigned to $j_2$ and $i_1$ is assigned to $j_1$. To find $\omega_{i_2|i_1}^{(j_2|j_1)}$, we follow Algorithm 9 but account for the assignment of task $i_1$ when solving the sub-problems on each leaf m. That is,

$$\omega_{m|i_1}^{(j_m|j_1)} = e_{i_1 m}^{(j_1 j_m)} e_m^{(j_m)}. \qquad (8.56)$$

The sub-problem $\omega_{i_2}^{(j_2)}$ now becomes the sum of weights over all possible assignments on task $i_2$'s descendants, including task $i_1$'s descendants, and is given by

$$\omega_{i_2}^{(j_2)} = \sum_{j_1\in[M]} \omega_{i_2|i_1}^{(j_2|j_1)}\,\omega_{i_1}^{(j_1)}. \qquad (8.57)$$
For a task graph that consists of serial trees rooted at $i_1, \cdots, i_n$ in order, we can solve $\omega_{i_r}^{(j_r)}$ given the previously solved $\omega_{i_r|i_{r-1}}^{(j_r|j_{r-1})}$ and $\omega_{i_{r-1}}^{(j_{r-1})}$. From (8.57), to solve $\omega_{i_2}^{(j_2)}$ we have to solve $\omega_{i_2|i_1}^{(j_2|j_1)}$ for every $j_1 \in \{1, \cdots, M\}$. Hence, it takes $O(d_{in}n_1M^2) + O(M\,d_{in}n_2M^2)$ time, where $n_1$ (resp. $n_2$) is the number of nodes in the tree rooted at $i_1$ (resp. $i_2$). Therefore, solving a serial-tree task graph takes $O(d_{in}NM^3)$ time.
Our approach generalizes to more complicated DAGs, such as those containing parallel chains of trees (a parallel connection of the structure in Figure 8.2), in which we solve each chain independently and combine the solutions at the common root N. Most real applications can be described by these families of DAGs, for which we have thus obtained a polynomial-time MABSTA. For example, the three benchmarks in [11] fall into the category of parallel chains of trees, and in wireless sensor networks an application typically has a tree-structured workflow [61].
8.4.3 Marginal Probability
From (8.50), we can calculate the marginal probability $\mathbb{P}\{x_i^t = j\}$ if we can compute the sum of weights over all possible assignments given that task i is assigned to device j. If task i is the root (node N), then Algorithm 9 solves $\omega_i^{(j)} = \sum_{y:\,y_i=j} w_y(t)$ exactly. If task i is not the root, we can still run Algorithm 9 to solve $[\omega_p^{(j')}]_{y_i=j}$, which fixes the assignment of task i to device j when solving from i's parent p. That is,

$$[\omega_p^{(j')}]_{y_i=j} = e_p^{(j')} e_{ip}^{(jj')}\omega_i^{(j)}\prod_{m\in\mathcal{N}_p\setminus\{i\}}\sum_{y_m} e_{mp}^{(y_m j')}\omega_m^{(y_m)}. \qquad (8.58)$$

Hence, in the end, we can solve $[\omega_N^{(j')}]_{y_i=j}$ from the root, and

$$\sum_{y:\,y_i=j} w_y(t) = \sum_{j'\in[M]} [\omega_N^{(j')}]_{y_i=j}. \qquad (8.59)$$

Similarly, $\mathbb{P}\{x_m^t = j, x_n^t = k\}$ can be obtained by solving the conditional sub-problems on both tasks m and n.
8.4.4 Sampling
Since we can calculate the marginal probabilities efficiently, we propose an efficient sampling policy, summarized in Algorithm 10. Algorithm 10 first selects a random number s between 0 and 1. If s is less than γ, it enters the exploration phase, where MABSTA simply selects an arm uniformly. Otherwise, MABSTA selects an arm based on the probability distribution $p_y(t)$, which can be written as

$$p_y(t) = \mathbb{P}\{x_1^t = y_1\}\cdot\mathbb{P}\{x_2^t = y_2 \mid x_1^t = y_1\}\cdots\mathbb{P}\{x_N^t = y_N \mid x_1^t = y_1, \cdots, x_{N-1}^t = y_{N-1}\}. \qquad (8.60)$$

Algorithm 10 Efficient Sampling Algorithm
1: procedure Sampling(γ)
2:   s ← rand()                        ▷ get a random number between 0 and 1
3:   if s < γ then
4:     pick an $x \in [M]^N$ uniformly
5:   else
6:     for i ← 1, · · · , N do
7:       $[\omega_i^{(j)}]_{x_1^t,\cdots,x_{i-1}^t} \leftarrow \Omega(N,M,G)_{x_1^t,\cdots,x_{i-1}^t}$
8:       $\mathbb{P}\{x_i^t = j \mid x_1^t,\cdots,x_{i-1}^t\} \propto [\omega_i^{(j)}]_{x_1^t,\cdots,x_{i-1}^t}$
9:     end for
10:   end if
11: end procedure

Hence, MABSTA assigns each task in order, based on the conditional probability given the assignments of the previous tasks. For each task i, the conditional probability can be calculated efficiently by running Algorithm 9 with fixed assignments on tasks 1, · · · , i − 1.
8.5 Numerical Evaluation
In this section, we first examine how MABSTA adapts to a dynamic environment. Then, we perform trace-data emulation to verify MABSTA's performance guarantee and compare it with other algorithms.
8.5.1 MABSTA's Adaptivity
Here we examine MABSTA’s adaptivity to dynamic environment and compare it to the optimal strategy that relies on the existing profile. We use a two-device setup, where the task execution costs of the two devices are characterized by two different Markov processes. We neglect the channel communication cost so that the optimal strategy is the myopic strategy. That is, assigning the tasks to the device with the highest belief that it is in “good” state [62]. We run our experiment with an application that consists of 10 tasks and processes the in-coming data frames one by one. The environment changes at the 100th frame, where the transition matrices of two Markov processes swap with each other. From Figure 8.3, there exists an optimal assignment (dashed line) so that the performance remains as good as it was before the 100th frame. However, myopic strategy, with the wrong information of the transition matrices, fails to adapt to the changes. From (8.2), MABSTA not only relies on the result of previous samples but also keeps exploring uniformly (with probability
γ MN
for each arm). Hence, when the performance of 156
Figure 8.3: MABSTA has better adaptivity to the changes than a myopic algorithm. (Cost vs. frame number for the myopic strategy, MABSTA, and the offline optimum.)

Table 8.1: Parameters used in trace-data measurement

Device ID   # of iterations     Device ID   # of iterations
18          U(14031, 32989)     28          U(10839, 58526)
21          U(37259, 54186)     31          U(10868, 28770)
22          U(23669, 65500)     36          U(41467, 64191)
24          U(61773, 65500)     38          U(12386, 27992)
26          U(19475, 44902)     41          U(15447, 32423)
Hence, when the performance of one device degrades at the 100th frame, this randomness enables MABSTA to explore the other device and learn the change.
8.5.2 Trace-data Emulation

(This section is joint work with Mr. Kwame Wright, University of Southern California.)
To obtain trace data representative of a realistic environment, we run simulations on a large-scale wireless sensor network / IoT testbed. We create a network of 10 IEEE 802.15.4-based wireless embedded devices and conduct a set of experiments to measure the two performance characteristics utilized by MABSTA, namely channel conditions and computational resource availability. To assess the channel conditions, the time it takes to transfer 500 bytes of data between every pair of motes is measured.
Figure 8.4: Snapshots of measurement result. (Latency (ms) vs. frame number for device 18 (avg = 1881, std = 472), device 28 (avg = 2760, std = 1122), and channel 21 -> 28 (avg = 1798, std = 2093).)
Figure 8.5: MABSTA's performance with upper bounds provided by Corollary 1. (Regret vs. T for problem sizes (N, M) = (10, 5), (10, 3), (5, 5), and (5, 3), each plotted against its bound.)
Figure 8.6: MABSTA compared with other algorithms for a 5-device network (application with N = 5, M = 5). (Top: regret vs. frame number for Randomized, Exp3, MABSTA with fixed γ, and MABSTA with varying γ; bottom: ratio to the offline optimum.)
Figure 8.7: MABSTA compared with other algorithms for a 10-device network (application with N = 5, M = 10). (Top: regret vs. frame number; bottom: ratio to the offline optimum.)
To assess the resource availability of each device, we measure the amount of time it takes to run a simulated task for a uniformly distributed number of iterations; the parameters of the distributions are shown in Table 8.1. Since latency is positively correlated with a device's energy consumption, and the radio transmission power is kept constant in these experiments, latency can also serve as an index for energy cost. We use these samples as the reward sequences in the following emulation.
We present our evaluation as the regret compared to the offline optimal solution in (8.1). For real applications, the regret can be the extra energy consumption over all nodes, or the extra processing latency over all data frames. Figure 8.5 validates MABSTA's performance guarantee for different problem sizes. In the cases we have considered, MABSTA's regret scales with $O(N^{1.5}M)$.
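For reference, the regret curves in these figures can be computed from the recorded reward traces as sketched below; for the small problem sizes here, the offline optimum is found by brute-force enumeration over the $M^N$ arms (e.g., `arms = list(itertools.product(range(M), repeat=N))`).

```python
def regret_curve(reward_of, arms, T, chosen):
    """reward_of(t, x): realized reward of arm x at frame t (from the traces);
    chosen[t]: the arm the online algorithm played. Returns the cumulative
    regret per frame against the single best fixed arm, as in (8.1)."""
    totals = {x: sum(reward_of(t, x) for t in range(T)) for x in arms}
    best = max(totals, key=totals.get)
    regret, cum_best, cum_alg = [], 0.0, 0.0
    for t in range(T):
        cum_best += reward_of(t, best)
        cum_alg += reward_of(t, chosen[t])
        regret.append(cum_best - cum_alg)
    return regret
```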
We further compare MABSTA with two other algorithms, as shown in Figure 8.6 and Figure 8.7. Exp3 was proposed for the adversarial MAB problem in [30]. The randomized baseline simply selects an arm uniformly at random for each data frame. Applying Exp3 to our task assignment problem results in a learning time that grows exponentially, as $O(M^N)$; hence Exp3 is not competitive in our setting, and its regret grows nearly linearly with T, as the randomized baseline's does.
In addition to the original MABSTA, we propose a more aggressive scheme that tunes the γ used by MABSTA: for each frame t, we set

$$\gamma_t = \min\left\{1,\ \sqrt{\frac{M(N+|E|M)\ln M^N}{(e-1)(N+|E|)t}}\right\}. \qquad (8.62)$$
From (8.2), the larger γ is, the more MABSTA explores. Hence, by exploring more aggressively at the beginning and exploiting the best arm as γ decreases with t, MABSTA with varying γ learns the environment even faster and remains competitive with the offline optimal solution, with the ratio reaching 0.9 at an early stage; after the first 5000 frames, MABSTA already achieves at least 90% of the optimal performance. In sum, these empirical trace-based evaluations show that MABSTA scales well and outperforms the state of the art in adversarial online learning algorithms (Exp3). Moreover, in practice it typically does significantly better than its theoretical performance guarantee.
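The varying-γ schedule of (8.62) is straightforward to evaluate; the following snippet simply plugs into the formula (using $\ln M^N = N\ln M$) and prints the decay for an illustrative problem size.

```python
import math

def gamma_t(t, N, M, E):
    """Exploration rate of (8.62): aggressive early, decaying as 1/sqrt(t)."""
    val = M * (N + E * M) * N * math.log(M) / ((math.e - 1) * (N + E) * t)
    return min(1.0, math.sqrt(val))

print([round(gamma_t(t, N=5, M=5, E=4), 3) for t in (1, 100, 5000, 10**5)])
```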
8.6 Discussion
With an increasing number of devices capable of computing and communicating, the concept of collaborative computing enables complex applications that no single device could support on its own. However, intermittent and heterogeneous connections and diverse device behavior make the performance highly variant over time. In this study, we have proposed a new online learning formulation that makes no stationary stochastic assumptions about channels and devices. We have presented MABSTA and proved that it can be implemented efficiently and provides a performance guarantee for all dynamic environments. The trace-data emulation has shown that MABSTA is competitive with the optimal offline strategy and adapts to changes in the environment. A more general framework than multi-armed bandit problems, reinforcement learning, considers learning how to map situations (environments) to actions [63]. An interesting and essential research problem there is to speed up the learning process, so that we not only learn from the actions we take but also from the actions not taken [64, 65, 66].
Chapter 9
Conclusion
As more and more intelligent devices become connected in the era of the Internet of Things (IoT), there is an abundance of computing resources distributed over the network. Hence, it is promising to perform collaborative computing over multiple devices to jointly support a complex application that a single device cannot support individually. On one hand, multiple devices can run tasks in parallel to speed up processing. On the other hand, spreading the tasks over multiple devices from a cost-balancing perspective extends the battery lifetime of each device. However, collaborative computing over the network incurs communication overhead: the extra data transmission costs both energy and latency. Because of the energy, computation, and bandwidth constraints on smart things and other edge devices, we have to leverage the resources efficiently to optimize system performance, taking into account device availability and the costs and latencies associated with computation and communication over the network.
In this thesis, we have proposed a step-by-step approach to optimizing task assignment over multiple devices in realistic scenarios, so as to make the most efficient use of available resources while satisfying QoS constraints such as energy consumption and application latency. We have partitioned an application into multiple tasks with dependencies described by a directed acyclic graph (DAG), and studied the best task assignment strategy in different environments. We started by assuming that the amount and the cost of resources, such as CPU cycles and channel bandwidth, are known and deterministic. We solved for an optimal task assignment that minimizes the application latency subject to a single cost constraint. Then, considering that each device may have its own cost budget, we formulated an optimization problem subject to individual constraints on each device. Taking a step further, in the scenario where the amount and the cost of resources vary with time, we modeled them as stochastic processes with known statistics and solved a stochastic optimization problem. Finally, considering that the resource states may be unknown and highly variant at application run time, we proposed online learning algorithms that learn the unknown statistics and make competitive task assignments that adapt to changes in dynamic environments. Specifically, we have proposed the following formulations.

• Deterministic Optimization with Single Constraint
• Deterministic Optimization with Multiple Constraints
• Stochastic Optimization with Single Constraint
• Online Learning in Stationary Environments
• Online Learning in Non-stationary Environments

We have focused on designing computation-light algorithms for making task assignment decisions, so that they do not incur considerable CPU overhead. For the optimization formulations, we have shown that these problems are NP-hard and proposed polynomial-time approximation algorithms with provable performance guarantees (approximation ratios). For the online learning formulations, we have proposed polynomial-time algorithms that make competitive task assignments compared with the optimal strategy. We have performed comprehensive simulations, including trace-data simulations, to validate our analysis of the algorithms' performance and complexity.

We envision that a future cyber foraging system will be able to take requests, explore the environment, assign tasks to heterogeneous computing devices, and satisfy the specified QoS requirements. Furthermore, the concept of macroprogramming enables application developers to code by describing high-level, abstracted functionality that is independent of platforms. An interpreter plays an important role here, translating the high-level functional description into machine code for different devices. In particular, one crucial component closely tied to system performance is how to partition an application into tasks and assign them to suitable devices with awareness of resource availability at run time. As we have seen several potential system prototypes that demonstrate the benefits of collaborative computing, we are confident that these innovative algorithms can be incorporated into real systems to optimize the performance of collaborative computing.
Reference List

[1] O. Vermesan and P. Friess, Internet of things: converging technologies for smart environments and integrated ecosystems. River Publishers, 2013.
[2] “50 sensor applications for a smarter world.” http://www.libelium.com/top_50_iot_sensor_applications_ranking/. Accessed: 2015-12-08.
[3] M. Satyanarayanan, “Pervasive computing: Vision and challenges,” Personal Communications, IEEE, vol. 8, no. 4, pp. 10–17, 2001.
[4] L. Atzori, A. Iera, and G. Morabito, “The internet of things: A survey,” Computer Networks, vol. 54, no. 15, pp. 2787–2805, 2010.
[5] M. A. M. Vieira, C. N. Coelho Jr, D. da Silva, and J. M. da Mata, “Survey on wireless sensor network devices,” in Emerging Technologies and Factory Automation, 2003. Proceedings. ETFA’03. IEEE Conference, vol. 1, pp. 537–544, IEEE, 2003.
[6] R. Balan, J. Flinn, M. Satyanarayanan, S. Sinnamohideen, and H.-I. Yang, “The case for cyber foraging,” in Proceedings of the 10th Workshop on ACM SIGOPS European Workshop, pp. 87–92, ACM, 2002.
[7] L. Mottola and G. P. Picco, “Programming wireless sensor networks: Fundamental concepts and state of the art,” ACM Computing Surveys (CSUR), vol. 43, no. 3, p. 19, 2011.
[8] C. H. Papadimitriou and K. Steiglitz, Combinatorial optimization: algorithms and complexity. Courier Corporation, 1998.
[9] G. Ausiello, Complexity and approximation: Combinatorial optimization problems and their approximability properties. Springer, 1999.
[10] E. Cuervo, A. Balasubramanian, D.-k. Cho, A. Wolman, S. Saroiu, R. Chandra, and P. Bahl, “Maui: making smartphones last longer with code offload,” in ACM MobiSys, pp. 49–62, ACM, 2010.
[11] M.-R. Ra, A. Sheth, L. Mummert, P. Pillai, D. Wetherall, and R. Govindan, “Odessa: enabling interactive perception applications on mobile devices,” in ACM MobiSys, pp. 43–56, ACM, 2011.
[12] J. Gittins, K. Glazebrook, and R. Weber, Multi-armed bandit allocation indices. John Wiley & Sons, 2011.
[13] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” arXiv preprint arXiv:1204.5721, 2012.
[14] “Tutornet: A low power wireless IoT testbed.” http://anrg.usc.edu/www/tutornet/. Accessed: 2015-12-08.
[15] R. M. Karp, Reducibility among combinatorial problems. Springer, 1972.
[16] L. Lovász, “On the ratio of optimal integral and fractional covers,” Discrete Mathematics, vol. 13, no. 4, pp. 383–390, 1975.
[17] D. P. Williamson and D. B. Shmoys, The design of approximation algorithms. Cambridge University Press, 2011.
[18] J. Canny, “Some algebraic and geometric computations in PSPACE,” in Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing, pp. 460–467, ACM, 1988.
[19] S. Arora and B. Barak, Computational complexity: a modern approach. Cambridge University Press, 2009.
[20] G. L. Nemhauser and L. A. Wolsey, Integer and combinatorial optimization, vol. 18. Wiley New York, 1988.
[21] G. Nemhauser and L. Wolsey, “Polynomial-time algorithms for linear programming,” Integer and Combinatorial Optimization, pp. 146–181.
[22] E. V. Denardo, Dynamic programming: models and applications. Courier Corporation, 2012.
[23] K. Dudziński and S. Walukiewicz, “Exact methods for the knapsack problem and its generalizations,” European Journal of Operational Research, vol. 28, no. 1, pp. 3–21, 1987.
[24] O. H. Ibarra and C. E. Kim, “Fast approximation algorithms for the knapsack and sum of subset problems,” Journal of the ACM (JACM), vol. 22, no. 4, pp. 463–468, 1975.
[25] G. J. Woeginger, “When does a dynamic programming formulation guarantee the existence of a fully polynomial time approximation scheme (FPTAS)?,” INFORMS Journal on Computing, vol. 12, no. 1, pp. 57–74, 2000.
[26] H. Robbins, “Some aspects of the sequential design of experiments,” in Herbert Robbins Selected Papers, pp. 169–177, Springer, 1985.
[27] J. Vermorel and M. Mohri, “Multi-armed bandit algorithms and empirical evaluation,” in Machine Learning: ECML 2005, pp. 437–448, Springer, 2005.
[28] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, no. 2-3, pp. 235–256, 2002.
[29] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “Gambling in a rigged casino: The adversarial multi-armed bandit problem,” in Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pp. 322–331, IEEE, 1995.
[30] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.
[31] K. Kumar, J. Liu, Y.-H. Lu, and B. Bhargava, “A survey of computation offloading for mobile systems,” Mobile Networks and Applications, vol. 18, no. 1, pp. 129–140, 2013.
[32] C. Wang and Z. Li, “Parametric analysis for adaptive computation offloading,” ACM SIGPLAN, vol. 39, no. 6, pp. 119–130, 2004.
[33] Y.-H. Kao, B. Krishnamachari, M.-R. Ra, and F. Bai, “Hermes: Latency optimal task assignment for resource-constrained mobile computing,” in IEEE INFOCOM, pp. 1894–1902, IEEE, 2015.
[34] Y.-H. Kao, R. Kannan, and B. Krishnamachari, “Flexible in-network data processing on IoT devices,” submitted to IEEE SECON 2016.
[35] Y.-H. Kao and B. Krishnamachari, “Optimizing mobile computational offloading with delay constraints,” in IEEE GLOBECOM, IEEE, 2014.
[36] Y.-H. Kao, B. Krishnamachari, M.-R. Ra, and F. Bai, “Hermes: Latency optimal task assignment for resource-constrained mobile computing,” submitted to IEEE Transactions on Mobile Computing.
[37] Y.-H. Kao, B. Krishnamachari, F. Bai, and K. Wright, “Online learning for wireless distributed computing,” in preparation.
[38] B.-G. Chun, S. Ihm, P. Maniatis, M. Naik, and A. Patti, “Clonecloud: elastic execution between mobile device and cloud,” in ACM Computer Systems, pp. 301–314, ACM, 2011.
[39] C. Shi, K. Habak, P. Pandurangan, M. Ammar, M. Naik, and E. Zegura, “Cosmos: computation offloading as a service for mobile devices,” in ACM MobiHoc, pp. 287–296, ACM, 2014.
[40] O. Goldschmidt and D. S. Hochbaum, “A polynomial algorithm for the k-cut problem for fixed k,” Mathematics of Operations Research, vol. 19, no. 1, pp. 24–37, 1994.
[41] A. Gerasoulis and T. Yang, “On the granularity and clustering of directed acyclic task graphs,” Parallel and Distributed Systems, IEEE Transactions on, vol. 4, no. 6, pp. 686–701, 1993.
[42] W. Dai, Y. Gai, and B. Krishnamachari, “Online learning for multi-channel opportunistic access over unknown Markovian channels,” in IEEE SECON, pp. 64–71, IEEE, 2014.
[43] K. Liu and Q. Zhao, “Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5547–5567, 2010.
[44] J. D. Cohen, S. M. McClure, and J. Y. Angela, “Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 362, no. 1481, pp. 933–942, 2007.
[45] S. Vakili, K. Liu, and Q. Zhao, “Deterministic sequencing of exploration and exploitation for multi-armed bandit problems,” Selected Topics in Signal Processing, IEEE Journal of, vol. 7, no. 5, pp. 759–767, 2013.
[46] R. Ortner, D. Ryabko, P. Auer, and R. Munos, “Regret bounds for restless Markov bandits,” in Algorithmic Learning Theory, pp. 214–228, Springer, 2012.
[47] C. Shi, V. Lakafosis, M. H. Ammar, and E. W. Zegura, “Serendipity: enabling remote computing among intermittently connected mobile devices,” in ACM MobiHoc, pp. 145–154, ACM, 2012.
[48] M. Y. Arslan, I. Singh, S. Singh, H. V. Madhyastha, K. Sundaresan, and S. V. Krishnamurthy, “CWC: A distributed computing infrastructure using smartphones,” Mobile Computing, IEEE Transactions on, 2014.
[49] M. R. Rahimi, N. Venkatasubramanian, S. Mehrotra, and A. V. Vasilakos, “MAPCloud: mobile applications on an elastic and scalable 2-tier cloud architecture,” in IEEE/ACM UCC, pp. 83–90, IEEE, 2012.
[50] M. Satyanarayanan, “Cloudlets: at the leading edge of cloud-mobile convergence,” in ACM SIGSOFT, pp. 1–2, ACM, 2013.
[51] L. Fleischer, M. X. Goemans, V. S. Mirrokni, and M. Sviridenko, “Tight approximation algorithms for maximum general assignment problems,” in ACM-SIAM Symposium on Discrete Algorithms, pp. 611–620, Society for Industrial and Applied Mathematics, 2006.
[52] T. Korkmaz and M. Krunz, “Multi-constrained optimal path selection,” in IEEE INFOCOM, vol. 2, pp. 834–843, IEEE, 2001.
[53] F. Kuipers, P. Van Mieghem, T. Korkmaz, and M. Krunz, “An overview of constraint-based path selection algorithms for QoS routing,” IEEE Communications Magazine, 40 (12), 2002.
[54] G. Xue, W. Zhang, J. Tang, and K. Thulasiraman, “Polynomial time approximation algorithms for multi-constrained QoS routing,” IEEE/ACM Transactions on Networking (ToN), vol. 16, no. 3, pp. 656–669, 2008.
[55] G. Xue, A. Sen, W. Zhang, J. Tang, and K. Thulasiraman, “Finding a path subject to many additive QoS constraints,” IEEE/ACM Transactions on Networking (TON), vol. 15, no. 1, pp. 201–211, 2007.
[56] V. V. Vazirani, Approximation Algorithms. New York, NY, USA: Springer-Verlag New York, Inc., 2001.
[57] J. B. Orlin, “A faster strongly polynomial minimum cost flow algorithm,” Operations Research, vol. 41, no. 2, pp. 338–350, 1993.
[58] X. Chen, S. Hasan, T. Bose, and J. H. Reed, “Cross-layer resource allocation for wireless distributed computing networks,” in RWS, IEEE, pp. 605–608, IEEE, 2010.
[59] H. N. Gabow, “An efficient implementation of Edmonds’ algorithm for maximum matching on graphs,” Journal of the ACM (JACM), vol. 23, no. 2, pp. 221–234, 1976.
[60] Y. Han, “Tight bound for matching,” Journal of Combinatorial Optimization, vol. 23, no. 3, pp. 322–330, 2012.
[61] H. Viswanathan, E. K. Lee, and D. Pompili, “Enabling real-time in-situ processing of ubiquitous mobile-application workflows,” in IEEE MASS, pp. 324–332, IEEE, 2013.
[62] Y. M. Dirickx and L. P. Jennergren, “On the optimality of myopic policies in sequential decision problems,” Management Science, vol. 21, no. 5, pp. 550–556, 1975.
[63] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 1998.
[64] K. Tumer and N. Khani, “Learning from actions not taken in multiagent systems,” Advances in Complex Systems, vol. 12, no. 04n05, pp. 455–473, 2009.
[65] N. Khani and K. Tumer, “Learning from actions not taken: a multiagent learning algorithm,” in Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pp. 1277–1278, International Foundation for Autonomous Agents and Multiagent Systems, 2009.
[66] N. Khani and K. Tumer, “Fast multiagent learning: Cashing in on team knowledge,” Intel. Engr. Systems Through Artificial Neural Nets, vol. 18, pp. 3–11, 2008.