Adaptive Task Allocation for Mobile Edge Learning Umair Mohammad and Sameh Sorour
arXiv:1811.03748v1 [cs.DC] 9 Nov 2018
Department of Electrical and Computer Engineering, University of Idaho, Moscow, ID, USA
[email protected],
[email protected]
Abstract—This paper aims to establish a new optimization paradigm for implementing realistic distributed learning algorithms, with performance guarantees, on wireless edge nodes with heterogeneous computing and communication capacities. We will refer to this new paradigm as "Mobile Edge Learning (MEL)". This paper considers the problem of dynamic task allocation for MEL, with the aim of maximizing the learning accuracy while guaranteeing that the total times of data distribution/aggregation over heterogeneous channels, and of local computing iterations at the heterogeneous nodes, are bounded by a preset duration. The problem is first formulated as a quadratically-constrained integer linear program. Being NP-hard, the problem is relaxed into a non-convex problem over real variables. We then propose two solutions: the first derives analytical upper bounds on the optimal solution of this relaxed problem using Lagrangian analysis and KKT conditions, and the second applies suggest-and-improve steps starting from an equal batch allocation. The merits of these proposed solutions are exhibited by comparing their performance to both numerical approaches and the equal task allocation approach.
I. INTRODUCTION

A. Motivation and Background

The accelerated migration towards the era of smart cities mandates the deployment of a large number of Internet-of-Things (IoT) devices, generating exponentially increasing amounts of data at the edge of the network. This data is currently sent to cloud servers for analytics and decision-making that improve the performance of a wide range of systems and services. However, it is expected that the rate and nature of this generated data will prohibit such a centralized processing and analytics option. Indeed, the size of the data will surpass the capabilities of current and even future wireless networks and the internet backbone to transfer it to cloud datacenters [1]. In addition, the nature of this data and the time-criticality of its processing/analytics will force 90% of it to be processed locally at edge servers and/or at the edge (mostly mobile) nodes themselves (e.g., smartphones, laptops, monitoring cams, drones, connected and autonomous vehicles, etc.) [2].

The above options of edge processing are supported by the latest advancements in the area of mobile edge computing (MEC) [3] in general, especially collaborative MEC [4]–[6] and hierarchical MEC (H-MEC) [7], [8]. While the former enables edge nodes to decide whether to perform their computing tasks locally or offload them to the edge servers, the latter also involves offloading tasks among the edge nodes themselves, and possibly the edge servers. Such task offloading decisions
are usually made while respecting the heterogeneous spare computing resources and communication capacities of the edge network's nodes and links, respectively, so as to minimize task completion delays and/or energy consumption.

While almost all works on MEC and H-MEC focused on managing simple independent computing/processing tasks (i.e., tasks initiated separately by each of the edge nodes), H-MEC can easily be extended to enable collaboration among edge nodes, and possibly with remote edge or even cloud servers, in performing distributed processing of the same tasks initiated by one or multiple orchestrating nodes. In this setting, the orchestrating node(s) will distribute computing tasks to the processing edge nodes for local processing and then collect their results to conclude the process. The decisions on the task size allocated to each node can again be made, given the heterogeneous communication capacities and spare computing resources in the edge network, so as to achieve delay guarantees and energy savings.

In addition, most of the previous works on MEC and H-MEC were limited to basic processing tasks, which clearly contradicts the increasing trend of using machine learning (ML) tools for data analytics. ML includes a wide range of techniques, from simple regression to deep neural networks. These techniques give highly accurate results but are computationally expensive and require large amounts of data. Sometimes, the models may even undergo online retraining to search for the optimal parameters. Hence, a lot of research has been done on how to parallelize these algorithms by distributing the data over multiple nodes. However, most of these works have considered this procedure at the cloud level and over wired distributed computing environments. Extending the above distributed learning paradigm to resource-constrained edge servers and nodes was not explored until very recently [9]–[11].
The work of [9] explores the possibility of connecting mobile Android devices in a distributed manner to train on collected images in order to improve accuracy. The objective is to be able to train adaptive deep learning models based on the collected images. The authors of [10] propose an exit-strategy model for a distributed deep learning network: based on an accuracy measure, the learning is done collaboratively between users, at the edge, or on the cloud. A new deep learning algorithm is designed in [11] to recognize food for dietary needs using a mobile device and a server, by splitting the processing between the device and the edge. As one can observe, the objective of these approaches is to
train networks in a distributed manner in order to improve training accuracy. The work of [10] does aim to restrict transmissions to the edge or cloud by defining an acceptable level of accuracy. However, these works do not tackle distributing the machine learning task in wireless MECs by simultaneously optimizing resource utilization and maintaining an acceptable level of accuracy. More recently, Tuor et al. [12] aimed to unify the number of local learning iterations in resource-constrained edge environments in order to maximize accuracy. The proposed approach jointly optimized the number of local learning and global update cycles at the learning edge servers/nodes (or learners for short) and the orchestrator, respectively. To deal with the problem of "deviating gradients", a newer approach was suggested in [13] to improve the frequency of global updates. However, all the above works investigating distributed ML on resource-constrained wireless devices assumed equal distribution of task sizes to all edge nodes/servers, thus ignoring the typical heterogeneity in the computing and communication capacities of different nodes and links, respectively. The implications of such computing and communication heterogeneity on optimizing the task allocation to different learners, selecting learning models, improving learning accuracy, minimizing local and global cycle times, and/or minimizing energy consumption are clearly game-changing, yet have never been investigated.

B. Contribution

To the best of the authors' knowledge, this work is the first attempt to develop realistic distributed learning algorithms, with performance guarantees, on cloudlet(s) of heterogeneous wireless edge nodes. We will refer to this new paradigm as "Mobile Edge Learning (MEL)".
To achieve the above MEL end-goal, new research tracks need to be developed to explore the interplay and joint optimization of learning model selection, learning accuracy, task allocation, resource provisioning, node selection/arrangement, local/global cycle times, and energy consumption. This paper inaugurates this MEL research by considering the problem of dynamic task allocation for distributed learning over heterogeneous wireless edge learners (i.e., edge nodes with heterogeneous computing capabilities and heterogeneous wireless links to the orchestrator). This task allocation is conducted so as to maximize the learning accuracy, while guaranteeing that the total times of data distribution/aggregation over heterogeneous channels, and of local computing iterations at the heterogeneous nodes, are bounded by a duration preset by the orchestrator. The maximization of the learning accuracy is achieved by maximizing the number of local learning iterations per global update cycle [12]. To this end, the problem is first formulated as a quadratically-constrained integer linear problem. Being an NP-hard problem, the paper relaxes it to a non-convex problem over real variables. Analytical upper bounds on the optimal solution of this relaxed problem are derived using Lagrangian analysis and
KKT conditions. The proposed algorithm thus starts from these computed bounds, and then runs suggest-and-improve steps to reach a feasible integer solution. For a large number of learners, we also propose a heuristic solution based on implementing suggest-and-improve steps starting from an equal batch allocation. The merits of these proposed algorithms are finally exhibited through extensive simulations, comparing their performance to both numerical solutions and the equal task allocation approach of [12], [13].

II. SYSTEM MODEL FOR MEL

A. Distributed Learning Background

Distributed learning is defined by the operation of running one machine learning task on one global dataset D over a system of K learners. Learner k, k ∈ {1, 2, . . . , K}, trains its local learning model on a subset D_k of the global dataset, where ∪_{k=1}^{K} D_k = D. We will refer to each of these dataset subsets as a batch. The number of samples in each batch D_k is denoted by d_k = |D_k|, and the size of the global dataset is denoted by d = d_1 + · · · + d_K.

In machine learning (ML), the loss function is typically defined by f(w, x_j, y_j), where j ∈ {1, . . . , d_k} indexes one sample of batch D_k, x_j is the vector of features or observations, y_j is the associated label or class to which sample j belongs, and w is the parameter matrix of the employed learning approach. For most ML algorithms, the parameter matrix consists of weight vectors; for instance, it consists of the weights and biases in neural networks. Since the observation/feature vectors and labels are not variables, the loss function can be concisely denoted by f_j(w). Given the above notation, the local loss function at each learner k is given by:

F_k(w) = (1/d_k) Σ_{j=1}^{d_k} f_j(w)   (1)
The global loss function at the orchestrator is thus computed by aggregating all learners' loss functions as follows:

F(w) = (1/d) Σ_{k=1}^{K} Σ_{j=1}^{d_k} f_j(w) = (1/d) Σ_{k=1}^{K} d_k F_k(w)   (2)
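The identity in (2) — the global loss is the d_k-weighted average of the local losses — can be checked numerically. This is an illustrative sketch only; the squared-error loss below is a stand-in for f_j(w), not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_feat = 3, 4
w = rng.normal(size=n_feat)              # some parameter vector

# Batches D_k of sizes d_k, jointly forming the global dataset D.
sizes = [5, 8, 7]
batches = [(rng.normal(size=(dk, n_feat)), rng.normal(size=dk)) for dk in sizes]
d = sum(sizes)

def local_loss(w, X, y):
    # F_k(w) = (1/d_k) sum_j f_j(w), eq. (1), with f_j(w) = (x_j . w - y_j)^2
    return np.mean((X @ w - y) ** 2)

F_k = [local_loss(w, X, y) for X, y in batches]

# Global loss two ways: directly over D, and via the weighted sum in eq. (2).
X_all = np.vstack([X for X, _ in batches])
y_all = np.concatenate([y for _, y in batches])
F_direct = np.mean((X_all @ w - y_all) ** 2)
F_weighted = sum(dk * Fk for dk, Fk in zip(sizes, F_k)) / d
assert np.isclose(F_direct, F_weighted)
```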
The objective of the orchestrator in any distributed learning setting is to minimize the global loss function over the parameter matrix, which can be expressed as:

w* = arg min_w F(w)   (3)
The optimal solution for this type of problem is generally difficult to obtain analytically. Consequently, such problems are solved using gradient descent (GD) or stochastic gradient descent (SGD) methods, depending on the employed batch distribution approach among learners. By applying the GD or SGD
method at the k-th learner, the local parameter matrix is updated at the i-th local iteration as follows:

w̃_k[i] = w̃_k[i − 1] − δ ∇F_k(w̃_k[i − 1])   (4)
where δ is the learning step size. Clearly, the local parameter matrices w̃_k[i] ∀ k differ from one another during the local update iterations, because they are computed from different batches. After τ local iterations, the learners send these local parameter matrices to the orchestrator, which then re-computes an updated unique parameter matrix at the j-th global update cycle with some kind of averaging, such as:

w[j] = (1/d) Σ_{k=1}^{K} d_k w̃_k[j]   (5)
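A minimal numerical sketch of one global cycle — the local rule (4) followed by the aggregation (5) — is shown below. The linear least-squares model, step size δ, and data are illustrative stand-ins, not the paper's learning setup:

```python
import numpy as np

rng = np.random.default_rng(1)
K, n_feat, tau, delta = 3, 4, 10, 0.01   # K learners, tau local steps, step size delta
sizes = [20, 30, 50]                     # batch sizes d_k
d = sum(sizes)
w_true = rng.normal(size=n_feat)

# Batches D_k: noisy linear observations y = X w_true + noise (stand-in data).
batches = []
for dk in sizes:
    X = rng.normal(size=(dk, n_feat))
    batches.append((X, X @ w_true + 0.01 * rng.normal(size=dk)))

def grad_Fk(w, X, y):
    # Gradient of the local loss F_k(w) = (1/d_k) sum_j (x_j . w - y_j)^2
    return 2.0 / len(y) * (X.T @ (X @ w - y))

def global_loss(w):
    # F(w) from eq. (2): size-weighted average of the local losses
    return sum(len(y) * np.mean((X @ w - y) ** 2) for X, y in batches) / d

w = np.zeros(n_feat)                     # parameter matrix broadcast by the orchestrator
w_locals = []
for X, y in batches:                     # each learner runs independently in MEL
    wk = w.copy()                        # local copy: w~_k <- w
    for _ in range(tau):                 # tau local iterations, eq. (4)
        wk -= delta * grad_Fk(wk, X, y)
    w_locals.append(wk)

# Orchestrator aggregation, eq. (5): w[j] = (1/d) sum_k d_k w~_k[j]
w_new = sum(dk * wk for dk, wk in zip(sizes, w_locals)) / d
assert global_loss(w_new) < global_loss(np.zeros(n_feat))
```

One global cycle of local training plus weighted averaging already reduces the global loss relative to the broadcast initialization, which is the behavior the update rules (4)–(5) are designed to produce.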
B. Transition to MEL System Model

The modelling and formulation presented above [12] do not lend themselves well to wireless or heterogeneous edge-learner environments, where learners have different computing capabilities and different channel qualities to the orchestrator. To migrate the general distributed learning model to the MEL paradigm, we will redefine it in an MEC/H-MEC context. Consider an edge orchestrator o (e.g., an edge server or even one of the edge nodes) that wants to perform a distributed learning task on a specific dataset over K heterogeneous wireless edge learners. In each global cycle, it sends to each learner k a batch D_k of random samples¹ from this dataset and the initial parameter matrix w. The batch D_k allocated to learner k is assumed to have a size of B_k^data bits, which can be computed as follows:

B_k^data = d_k F P_d   (6)
where F is the number of features in the dataset, and P_d is the data bit precision. For example, the MNIST dataset has 60,000 images of size 28 × 28 stored as unsigned integers, and thus B_k^data = 376.32 Mbits when the whole dataset is allocated as one batch. On the other hand, the parameter matrix is assumed to be of size B_k^model bits, which can be expressed for learner k as:

B_k^model = P_m (d_k S_d + S_m)   (7)
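The size expressions (6)–(7) can be checked against the concrete figures quoted elsewhere in the paper (8-bit unsigned pixels for MNIST, and the 32-bit pedestrian model of the simulation section); treat this as an illustrative sanity check:

```python
# Eq. (6): data batch size in bits, checked against the MNIST figure in the text.
d_k, F_feat, P_d = 60_000, 784, 8        # full MNIST as one batch, 8-bit pixels
B_data = d_k * F_feat * P_d
assert B_data == 376_320_000             # 376.32 Mbits, as stated

# Eq. (7): model size in bits, checked against the pedestrian model of Section V
# (300x648 and 300x2 weight matrices -> S_m = 195,000 coefficients, S_d = 0).
P_m, S_d, S_m = 32, 0, 195_000
B_model = P_m * (d_k * S_d + S_m)
assert B_model == 6_240_000              # 6.24 Mbits, fixed for all nodes
```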
As mentioned above, the orchestrator sends the concatenation of the aforementioned data and model bits with power P_ko over a wireless channel with bandwidth W and complex channel power gain h_ko. Once this information is received by learner k, it sets its local parameter matrix w̃_k = w. It then performs τ local update cycles on this local parameter matrix using its allocated batch D_k. In each iteration of typical ML training, the algorithm sequentially goes over all features of each data sample once. Consequently, the number X_k of computations required per iteration is equal to:

X_k = d_k C_m   (8)
which clearly depends on the number of data samples d_k assigned to each node and on the computational complexity C_m of the model. Although the data may be stored in other types, the operations themselves are typically floating-point operations. The updated w̃_k at each learner k is then transmitted back to the orchestrator with the same power P_ko on the same channel. The server then updates the global parameter matrix w as described in (5). Once done, it sends back this updated matrix with a batch of new random samples from the dataset to each learner, and the process repeats.

To this end, define the time T as the duration initiated when the orchestrator starts allocating and sending batches to the learners in each global cycle, until the orchestrator starts this cycle's global update operation. We will refer to the time T as the global cycle clock (to differentiate it from the total global cycle duration, consisting of the global cycle clock plus the global update processing time). Clearly, this duration should encompass the time needed by the three processes of the distributed processing in each global cycle:

1) The time t_k^S needed to send the allocated batch D_k and the global parameter matrix w to learner k, ∀ k. Defining the transmission rate between the orchestrator and learner k as R_k, the time t_k^S can be expressed for learner k as:

t_k^S = (B_k^data + B_k^model)/R_k = (d_k F P_d + P_m (d_k S_d + S_m)) / (W log_2(1 + P_ko h_ko / N_0))   (9)
where P_m is the model bit precision (typically floating-point precision) and N_0 is the noise power spectral density. As shown in (7) and (9), the parameter matrix size consists of two parts: one depending on the batch size (represented by the term d_k S_d, where S_d is the number of model coefficients related to each sample of the batch), and the other related to the constant size of the employed ML model (denoted by S_m).

2) τ times the time t_k^C needed to perform one local update iteration at learner k, ∀ k. Defining f_k as learner k's local processor frequency dedicated to updating the parameter matrix w̃_k, the time t_k^C can be expressed for learner k as:

t_k^C = X_k / f_k = d_k C_m / f_k   (10)
¹This model assumes randomized batch allocation to the learners in each global cycle, and an SGD update approach to compute the gradients. This choice is justified by the generality of this approach (i.e., the deterministic GD approach does not need to send data in every cycle) and its proven superior learning accuracy [14].
3) The time t_k^R needed to receive the updated local parameter matrix w̃_k from learner k, ∀ k. Assuming the channel between each learner and the orchestrator is reciprocal and does not change during one global update cycle, the time t_k^R can be computed for learner k as:

t_k^R = B_k^model / R_k = P_m (d_k S_d + S_m) / (W log_2(1 + P_ko h_ko / N_0))   (11)
Thus, the total time t_k taken by learner k to complete the above three processes is equal to:

t_k = t_k^S + τ t_k^C + t_k^R = (d_k F P_d + 2 P_m (d_k S_d + S_m)) / (W log_2(1 + P_ko h_ko / N_0)) + τ d_k C_m / f_k   (12)
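The three components (9)–(11) combine into (12); a small helper makes the dependence on d_k and τ explicit. The numeric values in the example call are illustrative assumptions (the model sizes mimic the pedestrian model of Section V), and `snr` stands for the paper's P_ko h_ko / N_0 term:

```python
import math

def round_trip_time(d_k, tau, *, F, P_d, P_m, S_d, S_m, C_m, f_k, W, snr):
    """Round-trip duration t_k of eq. (12); snr = P_ko * h_ko / N_0."""
    rate = W * math.log2(1 + snr)                            # R_k
    t_S = (d_k * F * P_d + P_m * (d_k * S_d + S_m)) / rate   # eq. (9): send batch + model
    t_C = d_k * C_m / f_k                                    # eq. (10): one local iteration
    t_R = P_m * (d_k * S_d + S_m) / rate                     # eq. (11): return model only
    return t_S + tau * t_C + t_R

# Illustrative parameters: a pedestrian-like model on a 2.4 GHz node.
params = dict(F=648, P_d=8, P_m=32, S_d=0, S_m=195_000,
              C_m=781_208, f_k=2.4e9, W=5e6, snr=1e4)
t1 = round_trip_time(500, 10, **params)
t2 = round_trip_time(1000, 10, **params)
assert 0 < t1 < t2          # t_k grows with the allocated batch size d_k
```

As expected from (12), t_k grows with both the allocated batch size d_k and the number of local iterations τ, which is exactly the coupling the task allocation problem of the next section must manage.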
III. PROBLEM FORMULATION

As mentioned in Section I, the objective of this first paper on MEL is to optimize the task allocation (i.e., the batch size d_k distributed to each learner k) so as to maximize the accuracy of the distributed learning process in each global cycle (and thus eventually the accuracy of the entire learning process), within a global cycle clock T preset by the orchestrator. It is well established in the literature that the loss function in general ML, and the local and global loss functions in distributed learning (expressed in (1) and (2)), using GD or SGD, are minimized by increasing the number of learning iterations [15]. For distributed learning, this is equivalent to maximizing the number of local iterations τ in each global cycle [16]. Thus, maximizing the MEL accuracy is achieved by maximizing τ.

Given the above model and facts, our objective can thus be re-worded as optimizing the batch size assigned to each learner so as to maximize the number of local iterations τ per global update cycle, while bounding the round-trip distributed processing duration t_k ∀ k by the preset global cycle clock T. The optimization variables in this problem are thus τ and d_k, k ∈ {1, . . . , K}. We can thus re-write the expression of t_k in (12) as a function of the optimization variables as follows:

t_k = C_k^2 τ d_k + C_k^1 d_k + C_k^0   (13)

where C_k^2, C_k^1, and C_k^0 represent the quadratic, linear, and constant coefficients of learner k in terms of the optimization variables τ and d_k, expressed as:

C_k^2 = C_m / f_k   (14)

C_k^1 = (F P_d + 2 P_m S_d) / (W log_2(1 + P_ko h_ko / N_0))   (15)

C_k^0 = 2 P_m S_m / (W log_2(1 + P_ko h_ko / N_0))   (16)

We will refer to the time t_k of learner k as its round-trip distributed processing duration. Clearly, t_k ∀ k must be smaller than or equal to the global cycle clock T for the orchestrator to have all the information needed to perform its global cycle update processing in time.

Clearly, the relationship between t_k and the optimization variables d_k and τ is quadratic. Furthermore, the optimization variables τ ∈ Z+ and d_k ∈ Z+ ∀ k are all non-zero integers. Consequently, the problem of interest in this paper can be formulated as an integer linear program with quadratic and linear constraints as follows:

max_{τ, d_k ∀ k}  τ   (17a)

s.t.  C_k^2 τ d_k + C_k^1 d_k + C_k^0 ≤ T,  k = 1, . . . , K   (17b)

Σ_{k=1}^{K} d_k = d   (17c)

τ ∈ Z+   (17d)

d_k ∈ Z+,  k = 1, . . . , K   (17e)
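The mapping from the physical parameters onto the coefficients (14)–(16) can be sanity-checked against (12) and (13) directly; all numeric values below are illustrative assumptions:

```python
import math

F_feat, P_d, P_m, S_d, S_m = 648, 8, 32, 0, 195_000   # pedestrian-like sizes
C_m, f_k = 781_208, 2.4e9                             # model complexity, CPU frequency
rate = 5e6 * math.log2(1 + 1e4)                       # W log2(1 + P_ko h_ko / N_0)

C2 = C_m / f_k                                        # eq. (14)
C1 = (F_feat * P_d + 2 * P_m * S_d) / rate            # eq. (15)
C0 = 2 * P_m * S_m / rate                             # eq. (16)

tau, d_k = 10, 1000
t_coeff = C2 * tau * d_k + C1 * d_k + C0              # eq. (13)
t_direct = ((d_k * F_feat * P_d + 2 * P_m * (d_k * S_d + S_m)) / rate
            + tau * d_k * C_m / f_k)                  # eq. (12)
assert math.isclose(t_coeff, t_direct)
```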
Constraint (17b) makes sure that the round-trip distributed processing times are not larger than T. Constraint (17c) guarantees that the sizes of the batches assigned to all learners sum to the entire dataset size that the orchestrator needs to analyze. Constraints (17d) and (17e) restrict the number of local iterations and the assigned batch sizes to positive integers. The above problem is an integer linear program with quadratic constraints (ILPQC), which is well-known to be NP-hard [17]. We thus propose a simpler solution to it through relaxation of the integer constraints in the next section.

IV. PROPOSED SOLUTION

A. Problem Relaxation

As shown in the previous section, the problem of interest in this paper is NP-hard due to its integer decision variables. We thus propose to simplify the problem by relaxing the integer constraints in (17d) and (17e), solving the relaxed problem, and then rounding the obtained real results back into integers. The relaxed problem can thus be given by:

max_{τ, d_k ∀ k}  τ   (18a)

s.t.  C_k^2 τ d_k + C_k^1 d_k + C_k^0 ≤ T,  k = 1, . . . , K   (18b)

Σ_{k=1}^{K} d_k = d   (18c)

τ ≥ 0   (18d)

d_k ≥ 0,  k = 1, . . . , K   (18e)
where the cases for τ and/or all dk ’s being zero represent scenarios where MEL will not be feasible, and thus the orchestrator must send the learning tasks to the edge or cloud server. The above resulting program becomes a linear program with quadratic constraints. This problem can be solved by using interior-point or ADMM methods, and there are efficient solvers (such as OPTI) that implement these approaches [18].
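The paper's numerical baseline uses the MATLAB OPTI toolbox. Purely as an illustration of handing the relaxed program (18) to a generic NLP solver, a SciPy SLSQP sketch is shown below; the coefficient values are toy assumptions, not the simulation parameters of Section V:

```python
import numpy as np
from scipy.optimize import minimize

K, d, T = 3, 900.0, 30.0                        # toy instance
C2 = np.array([3e-4, 4e-4, 5e-4])               # illustrative coefficients (14)-(16)
C1 = np.array([1.0e-4, 1.2e-4, 0.9e-4])
C0 = np.array([0.5, 0.6, 0.4])

def neg_tau(x):                                 # x = [tau, d_1, ..., d_K]
    return -x[0]                                # maximize tau, eq. (18a)

def time_slack(x):                              # T - t_k >= 0 for each k, eq. (18b)
    tau, dk = x[0], x[1:]
    return T - (C2 * tau * dk + C1 * dk + C0)

x0 = np.concatenate([[1.0], np.full(K, d / K)]) # equal-allocation starting point
res = minimize(neg_tau, x0, method="SLSQP",
               bounds=[(0, None)] * (K + 1),    # eqs. (18d)-(18e)
               constraints=[{"type": "ineq", "fun": time_slack},
                            {"type": "eq", "fun": lambda x: x[1:].sum() - d}])
tau_star, d_star = res.x[0], res.x[1:]
assert res.success and tau_star > 0
assert abs(d_star.sum() - d) < 1e-4             # eq. (18c) holds
```

Since the problem is non-convex, a generic local solver like this only guarantees a stationary point, which is exactly why the analytical bounds derived next are valuable as certificates and starting points.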
On the other hand, the matrix associated with each of the quadratic constraints in the relaxed problem can be written in symmetric form. However, each of these matrices has two non-zero values that are positive and equal to each other. Their eigenvalues thus sum to zero, which means these matrices are not positive semi-definite, and hence the relaxed problem is non-convex. Consequently, we cannot derive the optimal solution of this problem analytically. Yet, we can still derive upper bounds on the optimal variables and the optimal solution using Lagrangian relaxation and the KKT conditions.
The philosophy of our proposed solution is thus to calculate these upper-bound values on the optimal variables, and then implement suggest-and-improve steps until a feasible integer solution is reached. The next subsection derives the upper bounds on the optimal variables by applying Lagrangian analysis and the KKT conditions to the relaxed non-convex problem.

B. Upper Bounds using the KKT Conditions

The Lagrangian function of the relaxed problem is expressed as:

L(x, λ, ν, α) = −τ + Σ_{k=1}^{K} λ_k (C_k^2 τ d_k + C_k^1 d_k + C_k^0 − T) + ν_1 (Σ_{k=1}^{K} d_k − d) − ν_2 (Σ_{k=1}^{K} d_k − d) − α_0 τ − Σ_{k=1}^{K} α_k d_k   (19)

where x collects the primal variables τ and d_k ∀ k, and the λ_k, k ∈ {1, . . . , K}, ν_1/ν_2, and α_0/α_k, k ∈ {1, . . . , K}, are the Lagrangian multipliers associated with the time constraints of the K learners in (18b), the total batch size constraint in (18c), and the non-negativity constraints on the optimization variables in (18d) and (18e), respectively. Note that the equality constraint in (18c) can be represented by the two inequality constraints Σ_{k=1}^{K} d_k − d ≤ 0 and −(Σ_{k=1}^{K} d_k − d) ≤ 0, and is thus associated with the two Lagrangian multipliers ν_1 and ν_2. Using the well-known KKT conditions, the following theorem introduces upper bounds on the optimal variables of the relaxed problem.

Theorem 1: The optimal values d_k* of the batch sizes allocated to the different learners in the relaxed problem satisfy the following bound:

d_k* ≤ (T − C_k^0) / (τ C_k^2 + C_k^1)  ∀ k   (20)

Moreover, the analytical upper bound on the optimization variable τ belongs to the solution set of the polynomial given by:

d ∏_{k=1}^{K} (τ* + b_k) − Σ_{k=1}^{K} a_k ∏_{l=1, l≠k}^{K} (τ* + b_l) = 0   (21)

where r_k^0 = C_k^0 − T, a_k = −r_k^0 / C_k^2, and b_k = C_k^1 / C_k^2.

Proof: From the KKT optimality conditions, we have the following relations:

C_k^2 τ d_k + C_k^1 d_k + C_k^0 − T ≤ 0,  k = 1, . . . , K   (22)

Σ_{k=1}^{K} d_k − d = 0   (23)

λ_k (C_k^2 τ d_k + C_k^1 d_k + C_k^0 − T) = 0,  k = 1, . . . , K   (24)

ν_i (Σ_{k=1}^{K} d_k − d) = 0,  i = 1, 2   (25)

α_0 τ = 0   (26)

α_k d_k = 0,  k = 1, . . . , K   (27)

−∇τ + Σ_{k=1}^{K} λ_k ∇(C_k^2 τ d_k + C_k^1 d_k + C_k^0 − T) + ν_1 ∇(Σ_{k=1}^{K} d_k − d) − ν_2 ∇(Σ_{k=1}^{K} d_k − d) − α_0 ∇τ − Σ_{k=1}^{K} α_k ∇d_k = 0   (28)

From the conditions in (22), we can see that the batch size at learner k must satisfy (20). Moreover, it can be inferred from (24) that the bound in (20) holds with equality for any k having λ_k ≠ 0. In addition, it is clear from (25) that either ν_1 = ν_2 = 0 (which means that there is no feasible MEL solution and the orchestrator must offload the entire task to the edge/cloud servers) or Σ_{k=1}^{K} d_k = d (which we get if the problem is feasible). By re-writing the bound on d_k in (20) as an equality and taking the sum over all k, we have the following relation:

d = Σ_{k=1}^{K} d_k = Σ_{k=1}^{K} (T − C_k^0) / (τ* C_k^2 + C_k^1) = Σ_{k=1}^{K} a_k / (τ* + b_k)   (29)

The expression on the right-most hand-side has the form of a partial fraction expansion of a rational polynomial function of τ. Therefore, we can expand it over the common denominator as follows:

a_1/(τ + b_1) + a_2/(τ + b_2) + · · · + a_K/(τ + b_K) = [a_1 (τ + b_2)(τ + b_3) · · · (τ + b_K) + a_2 (τ + b_1)(τ + b_3) · · · (τ + b_K) + · · · + a_K (τ + b_1)(τ + b_2) · · · (τ + b_{K−1})] / [(τ + b_1)(τ + b_2) · · · (τ + b_K)]   (30)

Finally, the expanded form can be cleaned up into a rational function of τ that is equal to the total dataset size d:

d = [Σ_{k=1}^{K} a_k ∏_{l=1, l≠k}^{K} (τ* + b_l)] / [∏_{k=1}^{K} (τ* + b_k)]   (31)

Please note that the degrees of the numerator and the denominator are K − 1 and K, respectively. Furthermore, the poles of this rational function are at −b_k and, since b_k ≥ 0, the system is stable. Moreover, τ = −b_k is not a feasible solution for the problem, because it is eliminated by the τ ≥ 0 constraint. Therefore, we can re-write (31) as shown in (21). By solving this polynomial, we obtain a set of solutions for τ, one of which is feasible. The problem being non-convex, this feasible solution τ* constitutes the upper bound on the solution of the relaxed problem.

Though it was expected that the above bounds for d_k* ∀ k in (20) and the solution for τ* from (21) would need suggest-and-improve steps to reach a feasible solution, we have found in our extensive simulations presented in Section V that these expressions were always already feasible. This means that no suggest-and-improve steps were needed, and that the d_k* expressions can be directly used for batch allocation to achieve the optimal τ* of the relaxed problem.
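A numerical sketch of Theorem 1: construct the polynomial (21) from a_k and b_k, take its unique positive root as τ*, and recover the batch sizes from (20) with equality. The coefficient values are toy assumptions:

```python
import numpy as np

d, T = 900.0, 30.0                       # toy instance
C2 = np.array([3e-4, 4e-4, 5e-4])
C1 = np.array([1.0e-4, 1.2e-4, 0.9e-4])
C0 = np.array([0.5, 0.6, 0.4])
K = len(C2)

r0 = C0 - T                              # r_k^0 = C_k^0 - T
a = -r0 / C2                             # a_k = (T - C_k^0)/C_k^2
b = C1 / C2                              # b_k = C_k^1/C_k^2

# Coefficients of d*prod_k(tau + b_k) - sum_k a_k*prod_{l!=k}(tau + b_l), eq. (21).
poly = d * np.poly(-b)                   # np.poly builds the monic polynomial from its roots
for k in range(K):
    poly[1:] -= a[k] * np.poly(-np.delete(b, k))   # degree K-1 terms align with the tail

roots = np.roots(poly)
tau_star = max(r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0)

# Batch sizes from (20) with equality; by construction of (21) they sum to d.
d_star = (T - C0) / (tau_star * C2 + C1)
assert np.isclose(d_star.sum(), d)
```

Since Σ_k a_k/(τ + b_k) is strictly decreasing for τ ≥ 0, the positive root is unique, and the recovered batch sizes automatically satisfy the total-size constraint (18c).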
C. Heuristic Method for Large K

Obtaining the above solution for τ* in (21) requires solving a K-th order polynomial, which may be computationally expensive for large K. An alternative for such large settings can be derived as follows. We can easily infer that the two extremes of batch allocation in the original problem are:

1) K − 1 nodes are allocated one data sample each, and the remaining node takes d − (K − 1) samples. In this case, the sum of the reciprocals of the batch sizes will be K − 1 plus a negligible value. As the number of nodes increases (K large), τ will be unbounded.

2) Equal batch allocation, where each node processes d/K samples. In this case, the sum of the reciprocals of the batch sizes will be K²/d.

From the above cases, one option for large K is thus to start the solution from an equal batch allocation setting, and use the proposed suggest-and-improve steps from this starting point. By setting d_k = d/K, summing the reciprocals of d_k in (20), and solving for τ, we can thus use the following starting point for τ:

τ = [K²/d − Σ_{k=1}^{K} C_k^1/(T − C_k^0)] / [Σ_{k=1}^{K} C_k^2/(T − C_k^0)]   (32)

V. SIMULATION RESULTS

In this section, we test our proposed adaptive task allocation solutions in MEL scenarios emulating realistic learning and edge-node environments. More specifically, the OPTI-based solution to the relaxed version of the ILPQC formulation, the analytical results from (20) and (21) (UB-Analytical), and the heuristic upper-bound/suggest-and-improve (UB-SAI) solution are tested. We also show the merits of these two solutions compared to the equal task allocation (ETA) scheme employed in [12], [13]. We will first introduce the simulation environment, and then present the testing results.

A. Simulation Environment

A typical MEC setting consists of a cloudlet of devices that are heterogeneous both channel-wise and computing-wise. In our simulations, the edge nodes are assumed to be located in an area of 50 m radius. Half of the considered nodes emulate the capacity of a typical fixed/portable computing device (e.g., laptops, tablets, roadside units, etc.), and the other half emulate the capacity of commercial micro-controllers (e.g., Raspberry Pi) that can be attached to different indoor or outdoor systems (e.g., smart meters, traffic cameras). The setting thus emulates an edge environment that can be located either indoors or outdoors. The employed channel model between these devices is summarized in Table I, which emulates 802.11-type links between the edge nodes.

TABLE I
LIST OF SIMULATION PARAMETERS

Parameter                      | Value
-------------------------------|---------------------------
Attenuation Model              | 7 + 2.1 log(R) dB [19]
System Bandwidth B             | 100 MHz
Node Bandwidth W               | 5 MHz
Device proximity R             | 50 m
Transmission Power P_k         | 23 dBm
Noise Power Density N_0        | -174 dBm/Hz
Computation Capability f_k     | 2.4 GHz and 700 MHz
Pedestrian Dataset size d      | 9,000 images
Pedestrian Dataset Features F  | 648 (18 × 36) pixels
MNIST Dataset size d           | 60,000 images
MNIST Dataset Features F       | 784 (28 × 28) pixels

Two datasets are considered in our simulations, namely the pedestrian [20] and MNIST [21] datasets. The pedestrian dataset has 9,000 training images with 648 features (18 × 36 pixels). The ML model used for this dataset is a single-layer neural network with 300 neurons in the hidden layer. For this model, the weight matrix w = [w1, w2] is the concatenation of two sub-matrices, where w1 is 300 × 648 and w2 is 300 × 2, neither of which depends on the batch size (S_d = 0). Thus, the size of the model is 6,240,000 bits, which is fixed for all edge nodes. The forward and backward passes require 781,208 floating point operations [22]. On the other hand, the MNIST dataset consists of 60,000 28 × 28 images (784 features). The employed ML model
350
140 UB-SAI, T = 60s UB-Analytical, T = 60s OPTI-based, T = 60s ETA, T = 60s UB-SAI, T = 30s UB-Analytical, T = 30s OPTI-based, T = 30s ETA, T = 30s
Number of Updates
250
200
120
100
Number of Updates
300
UB-SAI, K = 20 nodes UB-Analytical, K = 20 nodes OPTI-based, K = 20 nodes ETA, K = 20 nodes UB-SAI, K = 10 nodes UB-Analytical, K = 10 nodes OPTI-based, K = 10 nodes ETA, K = 10 nodes UB-SAI, K = 5 nodes UB-Analytical, K = 5 nodes OPTI-based, K = 5 nodes ETA, K = 5 nodes
150
100
50
80
60
40
20
0
0 0
5
10
15
20
25
30
35
40
45
50
0
10
20
Number of Nodes (K)
B. Simulation Results for Pedestrian Dataset Fig. 1 shows the number of local iterations τ achieved by all tested approaches versus the number of edge nodes, for T = 30 and T = 60 seconds. We can first notice τ increases as the number of edge nodes increases. This is trivial as less batch sizes will be allocated to each of the nodes as their number increase. We can also see that the performance of the OPTI-based, UB-Analytical, and UB-SAI solutions are identical for all simulated number of edge nodes and global update clocks. For both global cycle clock cases, the figures show that both adaptive task allocate approaches result in much more local iterations than the ETA approach. For example, for T = 30 seconds, 50 edge nodes can perform only 36 iterations each, whereas 162 can be achieved with our proposed solutions, a gain of 450%. Another interesting result is that the performance of ETA scheme for T = 60 seconds is actually much lower than the performance of our proposed solutions for T = 30s. In other words, our scheme can achieve a better level of accuracy as the ETA scheme in half the time.
50
60
80
Number of Updates
for this data is a 3-layer neural network with the following configuration [784, 300, 124, 60, 10].
40
Fig. 2. Performance comparison of all schemes for K = 5, 10 and 20 vs T
UB-SAI, T = 120s UB-Ana, T = 120s OPTI- based, T = 120s ETA, T = 120s UB-SAI, T = 60s UB-Ana, T = 60s OPTI- based, T = 60s ETA, T = 60s
60 40 20 0 0
5
10
15
20
25
30
35
40
45
50
Number of Nodes (K) (a) 30
Number of Updates
Fig. 1. Performance comparison of all schemes for T = 30 and 60 seconds vs K
30
Global Cycle Time T (seconds)
UB-SAI, K = 20 nodes UB-Ana, K = 20 nodes OPTI-based, K = 20 nodes ETA, K = 20 nodes UB-SAI, K = 10 nodes UB-Ana, K = 10 nodes OPTI-based, K = 10 nodes ETA, K = 10 nodes
20
10
0 0
20
40
60
80
100
120
Global Cycle Time T (seconds) (b)
Fig. 3. MNIST results (a) Performance comparison of all schemes for T = 30 and 60 seconds vs K (b) Performance comparison of all schemes for K = 10 and 20 vs T
Fig. 2 illustrates the number of local iterations τ versus the global cycle clock, for K = 5, 10, and 20 edge nodes. Clearly, increasing T gives all learners more time to perform more local updates. Again, the OPTI-based, UB-Analytical, and UB-SAI approaches perform identically in all simulated scenarios. Comparing this performance to ETA, the figure shows that, when T = 20s, our solutions can perform around 28 local iterations on 20 edge learners, versus only 42 iterations using ETA, a gain of 420%. As the global cycle clock increases to 60 seconds, our scheme can reach up to 138 iterations, whereas the ETA scheme achieves only 30 updates, less than the number achieved by our scheme with the 20-second cycle clock.

C. Simulation Results for MNIST Dataset
Fig. 3 shows the results for training on the MNIST dataset with a deep neural network model. As in the case of the pedestrian dataset, the performances of the OPTI-based, UB-Analytical, and UB-SAI solutions are identical, and all outperform the ETA scheme. In general, fewer updates are possible than with the smaller pedestrian dataset and model. However, the optimized approach makes it possible to perform more than 30 updates on 20 nodes with a cycle time of 60 seconds. When K = 10 and T = 120s, the OPTI-based approach for adaptive batch allocation gives
τ = 12 updates, whereas only 3 updates are possible with ETA, a gain of 400%.

VI. CONCLUSION

This paper inaugurates the research efforts towards establishing the novel MEL paradigm, enabling the design of performance-guaranteed distributed learning solutions for wireless edge nodes with heterogeneous computing and communication capacities. As a first MEL problem of interest, the paper focused on exploring dynamic task allocation solutions that maximize the number of local learning iterations on distributed learners (and thus improve the learning accuracy), while abiding by the global cycle clock of the orchestrator. The problem was formulated as an NP-hard ILPQC problem, which was then relaxed into a non-convex problem over real variables. Analytical upper bounds on the relaxed problem's optimal solution were then derived, and were found to solve it optimally in all simulated scenarios. For large K, we proposed a heuristic solution based on implementing suggest-and-improve steps starting from equal batch allocation. Through extensive simulations using well-known datasets, the proposed solutions were shown to both achieve the same performance as the numerical solvers of the ILPQC problem and to significantly outperform the ETA approach.

REFERENCES

[1] M. Chiang and T. Zhang, "Fog and IoT: An Overview of Research Opportunities," IEEE Internet of Things Journal, vol. 3, no. 6, pp. 854–864, Dec. 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7498684/
[2] R. Kelly, "Internet of Things Data To Top 1.6 Zettabytes by 2020," Campus Technology, 2015. [Online]. Available: https://campustechnology.com/articles/2015/04/15/internet-of-things-data-to-top-1-6-zettabytes-by-2020.aspx
[3] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, "A Survey on Mobile Edge Computing: The Communication Perspective," IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2322–2358, 2017. [Online]. Available: http://arxiv.org/abs/1701.01090
[4] C. You and K. Huang, "Mobile Cooperative Computing: Energy-Efficient Peer-to-Peer Computation Offloading," 2017. [Online]. Available: http://arxiv.org/abs/1704.04595
[5] M. Liu and Y. Liu, "Price-Based Distributed Offloading for Mobile-Edge Computing with Computation Capacity Constraints," IEEE Wireless Communications Letters, pp. 1–4, 2017.
[6] Y. Li, L. Sun, and W. Wang, "Exploring device-to-device communication for mobile cloud computing," in 2014 IEEE International Conference on Communications (ICC), 2014, pp. 2239–2244. [Online]. Available: http://dx.doi.org/10.1109/ICC.2014.6883656
[7] X. Cao, F. Wang, J. Xu, R. Zhang, and S. Cui, "Joint Computation and Communication Cooperation for Mobile Edge Computing," in 16th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), 2018. [Online]. Available: http://arxiv.org/abs/1704.06777
[8] U. Y. Mohammad and S. Sorour, "Multi-Objective Resource Optimization for Hierarchical Mobile Edge Computing," in 2018 IEEE Global Communications Conference: Mobile and Wireless Networks (Globecom2018 MWN), Abu Dhabi, United Arab Emirates, Dec. 2018.
[9] D. Li, T. Salonidis, N. V. Desai, and M. C. Chuah, "DeepCham: Collaborative edge-mediated adaptive deep learning for mobile object recognition," in Proceedings of the 1st IEEE/ACM Symposium on Edge Computing (SEC), 2016, pp. 64–76.
[10] S. Teerapittayanon, B. McDanel, and H. T. Kung, "Distributed Deep Neural Networks over the Cloud, the Edge and End Devices," in Proceedings of the International Conference on Distributed Computing Systems (ICDCS), 2017, pp. 328–339.
[11] C. Liu, Y. Cao, Y. Luo, G. Chen, V. Vokkarane, M. Yunsheng, S. Chen, and P. Hou, "A New Deep Learning-Based Food Recognition System for Dietary Assessment on An Edge Computing Service Infrastructure," IEEE Transactions on Services Computing, vol. 11, no. 2, pp. 249–261, 2018.
[12] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "When Edge Meets Learning: Adaptive Control for Resource-Constrained Distributed Machine Learning," in IEEE INFOCOM, 2018. [Online]. Available: https://researcher.watson.ibm.com/researcher/files/uswangshiq/SW INFOCOM2018.pdf
[13] T. Tuor, S. Wang, T. Salonidis, B. J. Ko, and K. K. Leung, "Demo abstract: Distributed machine learning at resource-limited edge nodes," in IEEE INFOCOM 2018 Workshops, 2018, pp. 1–2.
[14] L. Bottou and O. Bousquet, "The Tradeoffs of Large Scale Learning," in Advances in Neural Information Processing Systems, J. C. Platt, D. Koller, Y. Singer, and S. Roweis, Eds., vol. 20, 2008, pp. 161–168. [Online]. Available: http://leon.bottou.org/papers/bottou-bousquet-2008
[15] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. A. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large Scale Distributed Deep Networks," in Advances in Neural Information Processing Systems, vol. 25, 2012.
[16] W. Zhang, S. Gupta, X. Lian, and J. Liu, "Staleness-Aware Async-SGD for Distributed Deep Learning," in IJCAI International Joint Conference on Artificial Intelligence, 2016, pp. 2350–2356.
[17] A. D. Pia, S. S. Dey, and M. Molinaro, "Mixed-integer Quadratic Programming is in NP," 2014, pp. 1–10.
[18] J. Currie and D. I. Wilson, "OPTI: Lowering the Barrier Between Open Source Optimizers and the Industrial MATLAB User," in Foundations of Computer-Aided Process Operations, N. Sahinidis and J. Pinto, Eds., Savannah, Georgia, USA, 2012.
[19] S. Cebula, A. Ahmad, J. M. Graham, C. V. Hinds, L. A. Wahsheh, A. T. Williams, and S. J. DeLoatch, "Empirical channel model for 2.4 GHz IEEE 802.11 WLAN," in Proceedings of the 2011 International Conference on Wireless Networks, 2011.
[20] S. Munder and D. M. Gavrila, "An experimental study on pedestrian classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1863–1868, 2006. [Online]. Available: https://ieeexplore.ieee.org/document/1704841
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[22] B. Shillingford, "What is the time complexity of backpropagation algorithm for training artificial neural networks?" Quora, 2016. [Online]. Available: https://www.quora.com/What-is-the-time-complexity-of-backpropagation-algorithm-for-training-artificial-neural-networks