The Virtual Savant: Automatic Generation of Parallel Solvers

Frédéric Pinel^a, Bernabé Dorronsoro^b, Pascal Bouvry^c

a Faculty of Science, Technology and Communication, University of Luxembourg, Luxembourg, e-mail: [email protected]
b School of Engineering, University of Cádiz, Spain, e-mail: [email protected]
c Faculty of Science, Technology and Communication, University of Luxembourg, Luxembourg, e-mail: [email protected]
Abstract

We present the Virtual Savant, a novel method to automatically generate new parallel solvers for optimization problems. It applies machine learning to model a reference algorithm (treated as a black box) from its solutions to a given problem, and is afterwards able to efficiently and accurately reproduce those solutions on new, unseen problem instances. Additionally, the generated parallel algorithm scales to different problem dimensions. We analyze the performance and accuracy of our novel technique on a scheduling problem, using instances of different characteristics and sizes. Virtual Savant was used to learn from an exact algorithm, a well-known heuristic, and a specialized parallel metaheuristic. During our experiments, our method solved to optimality up to 95.5% of the 200 unseen instances tested. For larger instances, it is not only highly competitive with the reference algorithms, but even outperforms them in some cases. Another outstanding result is that Virtual Savant improved its accuracy with respect to the original algorithm when scaling the problem size, without additional training.

Keywords: Optimization, Supervised Machine Learning, Support Vector Machines, Scheduling, Metaheuristics
1. Introduction

Solving complex optimization problems is a difficult task. There is a plethora of optimization methods for complex problems in the literature. Most of them target a generic design, in order to maximize their applicability to different optimization problems [26]. This implies that the programmer must master the problem to solve in order to appropriately use the existing generic methods, which makes the task more difficult. The issue is further complicated by the difficulty of parallel computing, among many other factors, which is becoming widespread in all kinds of architectures [25], from small mobile devices to powerful computing servers. The reason is the well-known “power wall”: physical limits no longer allow chip designers to increase the clock frequency of processors as they used to do. Therefore, the current trend to improve performance is to package replicas of components that can execute different instructions in parallel, as is the case of current computers, which come with multiple multi-core processors and implement technologies to increase their parallel capabilities (e.g., hyper-threading).

The emergence of these new parallel computing architectures prompts the use of concurrent programs, which are now broadly required, and not only under specific constraints (usually performance) as in the past. Developers need to implement parallel programs to make an efficient, full use of current architectures, and existing software must be redesigned for parallelism, a need that is expected to grow in the future.

This work presents a new way of producing parallel programs, in which computers learn by themselves how to perform the required tasks to produce the desired results. This is done with the proposed Virtual Savant (VS) framework, which makes use of machine learning techniques to infer the behavior of a given algorithm and reproduce it on parallel architectures. As sketched in Fig. 1, using the method we propose requires two steps. The first one is a learning process, where VS infers the rules that generated the observed solutions from the input data. This is a costly process that is done only once to train the model. The second step is the execution of VS on parallel architectures to get highly accurate results on unseen data (that may even belong to a different problem dimension). The framework is designed to produce parallel programs from a predefined template that efficiently makes full use of the available parallel resources; the generated programs are composed of a large number of quasi-independent and lightweight tasks.
Figure 1: The Virtual Savant learns the output produced by a given program (or any observations) for the different inputs (left-hand side), generating a completely different massively parallel algorithm that reproduces its behavior (right-hand side). (Step 1, learning: input data and the original software's observations feed the Virtual Savant's learning process. Step 2, execution: the trained Virtual Savant runs on unseen data in a parallel environment to produce the solution.)
Unlike our approach, other existing works in the literature typically focus on the parallelization of a sequential code by applying transformations [7, 21] that make some instructions, or sequences of them, parallel. Other works focus on the analysis of the code in order to find independent tasks that can be executed in parallel [15]. Finally, Genetic Programming (GP) has also been applied to parallelize code [53, 56], with the handicap that, when using this technique, it cannot be ensured that the transformed parallel program keeps the same semantics as the original one. A common feature of these three approaches is that they offer limited parallel scalability and performance, as well as uncertain parallelism.

The VS cannot be used to learn any kind of program, because it does not aim to preserve the original algorithm (as mentioned, it works on observations, so it does not use the original source code). However, we envision it can be a highly suitable technique for the generation of some non-deterministic programs, such as approximate optimization algorithms. These algorithms deal with highly complex optimization problems with normally huge search spaces, making parallelism a desirable feature in order to find accurate results in reasonable time. Additionally, these algorithms are guided by the fitness of solutions, which is a quantification of their quality. This feature can be used by the VS to analyze the accuracy of its results with respect to the reference algorithm/observations, or even to improve its performance.

In this work, we propose a new design of our preliminary version of the VS framework [14, 41], which is improved here by making it simpler, easier to use (thanks to its new simplified training process), scalable on the problem size, and more efficient. We focus on a complex and well-known optimization problem, the independent tasks scheduling problem, and we use our framework to automatically generate parallel programs from the observations produced by (i) a well-known heuristic, (ii) an efficient parallel state-of-the-art genetic algorithm (GA) specialized for the problem, as well as (iii) the optimal solutions (in the case of the smaller instances). Although we focus on a specific problem, we do not make use of any knowledge of the problem for better performance, unlike the reference algorithms, making it possible to further improve the results of our framework. Our motivation is to present and evaluate a generic method that can be applied to a variety of different problems, rather than finding state-of-the-art solutions. In fact, it is a matter of our future work to both investigate the application of VS to other optimization problems in order to strengthen our confidence in the proposed method, and incorporate problem knowledge into VS to look for highly competitive results with the state of the art (as we did in [14] for our previous prototype). The main contribution of this paper is the design of a novel, scalable framework for the automatic generation of accurate and efficient parallel programs.
With respect to our previous approach [14, 41], the enhanced solution we propose here provides two new major features, representing a definite step forward towards its consolidation as an automatic programming tool for parallel optimizers:

• It requires training only one predictor (i.e., the Machine Learning (ML) technique that infers the behavior of the original algorithm), instead of the hundreds or thousands of them used in our previous work [41] (as many as the number of variables in the problem).

• It is scalable to any problem size with highly accurate results, and without additional training. This allows
training on small problem sizes and later addressing considerably bigger problems. The combination of these two features drastically simplifies and accelerates the training process, because a significantly reduced number of reference solutions are needed (up to 512 times fewer solutions in this work: from 16,000 to only 32), and these solutions can be obtained from small problem instances.

In addition, as other contributions of this work, we conducted extensive experiments to thoroughly evaluate the performance of VS from different perspectives on a problem from the scheduling domain (an NP-complete problem), when learning observations generated from advanced optimization algorithms of different nature (a heuristic algorithm, an accurate metaheuristic, and an exact approach). The VS framework was evaluated in different ways: (i) the capability to learn the behavior of the reference algorithms, being able to generate similar solutions (over 80% similarity was obtained); (ii) its computing efficiency versus the reference algorithms on the same architecture (it is about 100 times faster than the compared GA); and (iii) its capability to find more accurate solutions to the problem, outperforming the reference algorithms in some cases (still, further improvements are possible if incorporating problem knowledge into the automatically generated program). Finally, because VS outperforms the learned algorithms in many cases, we perform a comparison against some state-of-the-art algorithms, even though the presented VS is a generic algorithm that does not incorporate problem knowledge, which could boost its accuracy.

The value of our proposal is not limited to the automatic generation of parallel programs to solve optimization problems:

• it is scalable on the problem size, meaning that the framework can be trained for small-scale problems and applied to any larger problem size without any additional work;

• it works on observations, so we can generate the code without the need of a sequential program or, alternatively, it allows creating an efficient and accurate parallel solver from a simple, inefficient, draft implementation;

• it produces massively parallel code that can be executed on different parallel architectures, such as multi-cores, clusters, General Purpose GPUs (GPGPU), Intel Xeon Phi, or Arrays of Wimpy Nodes (AWN)^1;

• it is designed as a large set of efficient, lightweight and independent components that can be executed in parallel, matching the Map-Reduce parallel programming model;

• its accuracy can be further improved by adding problem knowledge, as is commonly done in metaheuristics, while keeping the parallel and scalability properties;

• the performance profile of all the generated parallel programs can also be further improved by changing the predefined template, to generate different object code, parallel model, etc.

The structure of the paper is as follows. Next, we describe the problem addressed in this work. Section 3 presents an overview of the main existing solvers to this problem from the literature (both sequential and parallel), and summarizes the most important works dealing with the automatic parallelization of software programs. After that, Section 4 gives a brief description of our inspiration, the savant syndrome, and presents our proposal for the generation of parallel solvers.
An extensive analysis of the performance of our solution is presented in Section 5, focusing on the required training process, the learning capability of our VS, the quality of solutions it finds, its performance, and its scalability. Finally, we present our main conclusions and lines for future research in Section 6.

2. The Independent Tasks Scheduling Problem

The independent tasks scheduling problem was chosen to analyze the performance of our VS framework in this paper. It is a well-known and broadly studied NP-complete combinatorial optimization problem [18]. Algorithms must provide an efficient schedule for the received batch of tasks in a very short time, in order to maximize the response and performance of the system, for any batch size. Consequently, there is a plethora of efficient techniques to solve the problem, from simple heuristics to parallel metaheuristics, as well as exact approaches.

^1 Parallel architectures composed of a set of low-power, low-performance, and low-cost processors.
The problem considers a number of independent computing tasks and a set of heterogeneous resources (i.e., machines) where they must be processed. Each task can be processed by a single machine only (it cannot be split across machines). A problem instance is defined by an Expected Time to Compute (ETC) matrix, which lists the duration of each task on every possible machine. The ETC is assumed given [6, 16, 24], and the instances we study are randomly generated according to the procedure described in [3]. In the present study, the tasks are chosen as highly heterogeneous. The machines are consistent (a machine cannot be slower than another for one task, and faster for another task), and machine sets of both high and low heterogeneity are chosen for the evaluation.

The optimization problem consists of finding the task-to-machine assignments that minimize the makespan. Makespan is a well-known measure of the productivity of distributed systems, and it is defined as the finishing time of the last task:

max { completion[m] | m ∈ Machines } ,   (1)

where completion[m] is the time when machine m will finish processing the previously assigned tasks, as well as those already planned. The completion time of a machine m is defined as:

completion[m] = ready_m + ∑_{t ∈ S(m)} ETC[t][m] ,   (2)

where ready_m represents the time when machine m will finish all the previously assigned tasks (we consider ready_m = 0 for all machines in this work), and S(m) is the set of tasks assigned to machine m. Problem solutions are represented by a vector, where solution[t] = m denotes that the task of index t is assigned to the machine of index m. The ETC and solution representations are needed to understand how the VS approach is applied to the problem.
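To make Eqs. (1)–(2) and this representation concrete, the following minimal Python sketch (our own illustration, not code from the VS implementation) evaluates a candidate solution under the conventions above, assuming ready_m = 0 and the ETC indexed as etc[t][m]:

```python
from typing import Sequence

def makespan(etc: Sequence[Sequence[float]], solution: Sequence[int]) -> float:
    """Makespan (Eq. 1) of a task-to-machine assignment.

    etc[t][m]  : expected time to compute task t on machine m
    solution[t]: index of the machine that task t is assigned to
    """
    num_machines = len(etc[0])
    completion = [0.0] * num_machines          # ready_m = 0 for all machines
    for task, machine in enumerate(solution):  # Eq. 2: sum ETC over S(m)
        completion[machine] += etc[task][machine]
    return max(completion)                     # Eq. 1: finishing time of the last machine

# Example: 3 tasks on 2 machines; tasks 0 and 2 on machine 0, task 1 on machine 1.
print(makespan([[2.0, 5.0], [4.0, 1.0], [3.0, 6.0]], [0, 1, 0]))  # 5.0
```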
The makespan minimization problem is NP-complete [18], which confines candidate solver algorithms to heuristic and metaheuristic approaches. Such approaches require significant computation to find good solutions, which suggests the design of parallel solvers. The next two sections review some of the most important techniques proposed for this problem, both sequential and parallel.

3. Related Works

We present in this section some relevant works in the field. We first review in Section 3.1 the most important algorithms that have been proposed for the independent tasks scheduling problem, while some recent parallel solutions are surveyed in Section 3.2. Finally, we survey the main existing approaches targeting the automatic parallelization of programs in Section 3.3.

3.1. Sequential Solvers

List scheduling algorithms are among the most relevant heuristics applied to solve this problem in the literature. The reason is their low complexity and high accuracy. They are deterministic tools that assign priorities to tasks, based on the recommendations of some heuristic [29]. Some of the most well-known heuristics for the independent tasks mapping problem are Min-Min, its variant Max-Min, and Sufferage [20, 31, 51]. Min-Min starts with the whole set of tasks and iteratively selects the task that can be completed the soonest, taking into account the current machine assignments, removing it from the set of unassigned candidate tasks. Pinel et al. propose in [44] a heuristic that works in two phases: it first runs Min-Min and then improves its result with a local search heuristic in a second step. The complexity of these heuristics grows with the problem size, and they are not a good option for relatively large problems. Besides these heuristics, metaheuristics have been successfully applied to this problem too [17]. Some examples are Genetic Algorithms (GA) [6, 8], Ant Colony Optimization [45], and other hybrid algorithms [45, 59]. They generally provide better quality solutions than the mentioned heuristics, but at the cost of considerably longer execution times. Because of the need to report accurate solutions in short times, there has been a trend in recent years towards the design of parallel schedulers for the problem. The next section reviews some of them.
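Before moving on, the Min-Min heuristic just described can be summarized with a short Python sketch; this is our own illustration of the classic greedy minimum-completion-time idea, not the exact implementation evaluated in this paper.

```python
def min_min(etc):
    """Min-Min list scheduling: repeatedly pick the (task, machine) pair with the
    smallest completion time among unassigned tasks, then commit that assignment.

    etc[t][m] is the expected time to compute task t on machine m.
    Returns solution[t] = machine assigned to task t.
    """
    num_tasks, num_machines = len(etc), len(etc[0])
    ready = [0.0] * num_machines             # current completion time per machine
    solution = [-1] * num_tasks
    unassigned = set(range(num_tasks))
    while unassigned:
        best_task, best_machine, best_time = None, None, float("inf")
        for t in unassigned:                 # soonest-finishing task, over all machines
            for m in range(num_machines):
                completion = ready[m] + etc[t][m]
                if completion < best_time:
                    best_task, best_machine, best_time = t, m, completion
        solution[best_task] = best_machine   # commit it and update the machine's load
        ready[best_machine] = best_time
        unassigned.remove(best_task)
    return solution
```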
3.2. Parallel Solvers

One of the first massively parallel schedulers in the literature was the Differential Evolution (DE) algorithm implementation for GPUs presented in [27], providing speedups of up to 21 with respect to the equivalent CPU implementation. The first parallel design of Min-Min for GPU was presented in Pinel et al. [43], where a parallel Cellular Genetic Algorithm (CGA) for GPU was presented too. The work reports speedups of up to 538 with respect to the sequential Min-Min. Two years later, Iturriaga et al. [23] proposed GPU implementations of two scheduling heuristics, reporting a maximum speedup of about 52 with respect to the Min-Min heuristic, and enhancing the accuracy of the solution reported by Min-Min by around 8%. Abraham et al. propose a different way to solve the problem using multiple threads and Min-Min [1]. In their work, they split the number of tasks and processors into different groups, and Min-Min is independently executed in every group, in parallel. The authors report significant speedups of the proposed approach, but the quality of the solutions found was not reported. Iturriaga et al. [22] presented a multithreaded local search algorithm for a multi-objective version of the independent tasks scheduling problem, considering makespan and energy consumption. The algorithm outperforms a set of two-phase heuristics based on Min-Min. Because of the high problem complexity and the prompt answer required, some parallel metaheuristics have been proposed in the literature. Pinel et al. designed a multi-threaded Parallel Asynchronous CGA (PA-CGA) for the problem [42], and Nesmachnow et al. presented a parallel CHC algorithm [38]. There are also several works proposing parallel algorithms for this problem for GPGPU co-processors: in addition to our previously described CGA [43], Solomon et al. [49] presented a Particle Swarm Optimization algorithm that provides high speedups (up to 37 times faster), but it is outperformed by the compared heuristics in terms of results.

3.3. Automatic Parallelization of Programs

Parallelism was considered as a step in the automated build of program executables from source code [4]. It is usually achieved by applying transformations to the code [7, 21], such as changing data access patterns or unrolling loops. The tools rely on carefully inspecting all the possible data dependencies, in order to identify all concurrent methods in the source program. Fonseca et al. [15] propose a method to decompose the program into a number of tasks that can be executed in parallel, ensuring that the identified dependencies between tasks are respected. They achieved up to around 11 times speedup on a sample code on a 12-core machine. The method applies transformations that do not alter the semantics of the original program, but the parallel efficiency of the generated software is limited because most of the source code is preserved. In addition, the approach does not support distributed parallelism. The verification may rely on formal reasoning [28]. Other authors apply Artificial Intelligence (AI) techniques such as GAs to select the transformations to be applied to the code, as well as their sequence of operation [39, 46, 47, 58]. We do not further survey contributions in this field because our approach does not aim at modifying the implementation while preserving the algorithm, but at producing a completely new algorithm (and implementation). A combined GP and source-to-source transformation technique was introduced in [56].
Paragen searches the constituent statements of the program in order to use the identified building blocks to rebuild it in a parallel form, with a combination of a number of available functions, both parallel and serial. The method does not respect the semantics of the program, and its functional equivalence needs to be tested. The algorithm solves a multi-objective problem, where correctness and parallelism of the code are the metrics to be maximized. However, defining parallelism as an objective in a multi-objective search greatly increases the solution space, thus the parallelism obtained is unpredictable and often limited. The method assumes an execution platform in which all cores operate at the same clock. This unusual assumption allows avoiding concurrent access to the shared memory. We do not make such an assumption in our approach, which is, in addition, suited to a network of independent processors. In [53], the authors present an original experiment to parallelize a program, using an evolutionary search similar to GP. The original program is self-replicated, being randomly altered during the replication (possibly introducing parallel instructions). All newly generated programs undergo the same process. The evolutionary search is guided by the fact that programs that replicate faster survive, while the oldest ones are removed. Programs first optimize their sequential code, given that this accelerates self-replication. After that, the process starts finding parallel versions, using up to 32 threads. We consider that the method is limited in effectiveness: despite the considerable computational effort needed, only small programs were evolved (having O(100) assembly instructions).
Figure 2: Overview of the Virtual Savant parallel algorithm. (Input: problem instance and allowed variable values. Step 1, prediction: a parallel predictor, with one predictor P per problem variable, outputs assignment probabilities. Step 2, refinement: several refinement algorithms run in parallel and build candidate solutions; the solution with the best fitness value is returned.)
In our previous work [14, 41], a preliminary version of our VS framework was introduced to learn observations produced by the Min-Min algorithm on instances of up to 512 tasks and 16 machines. The VS framework was able to reproduce the solutions from Min-Min with high accuracy (up to 82% for the 128 × 4 problems), outperforming it in terms of quality of solutions. In the current work, we propose an enhanced design of our framework that scales to any number of tasks, speeds up the learning process, and considerably decreases the number of observations required for learning. The performance of the new VS design is thoroughly evaluated, considering several different perspectives, and it is used to learn from a heuristic, an advanced metaheuristic, and an exact approach. It is described in the next section.

4. The Virtual Savant

This section presents our novel method to automatically parallelize programs. We first present our source of inspiration in Section 4.1. After that, Section 4.2 describes the design of VS.

4.1. The Savant Syndrome

The proposed VS is inspired by the Savant syndrome [11, 19]. The starting point for our investigation was past occurrences where a massively parallel machine, i.e., the human brain, was able to solve sequential problems in a short time, even when the problems are small in size. Our motivation was to ensure the method employed relied on parallelism, in order to adapt to the current trend in computing architectures (even those composed of low-power –also called “wimpy”– nodes, quickly expanding in the market). The essential features of the Savant syndrome are translated into a general algorithmic approach for automatic parallelization, presented later in Section 4.2. People displaying symptoms of the Savant syndrome can compute small sequential tasks, such as calendar computation (given a date, finding which day of the week it falls on), almost instantaneously (in around 700 ms) [19], with largely unknown methods [36]. It is very interesting to us that the increase of the savants’ response time with respect to the problem size is lower than that of any known algorithm. In addition, savants can perform other date computations considered more time-consuming for a computer algorithm [36] with similar performance, such as finding all years that match a given weekday, date and month. Although not fully understood, it is believed that savants extract rules that capture perceived regularities in the data, such as those found in calendars and prime numbers, and then apply these patterns (in parallel) on the problem input data,
to solve their problems [19, 34]. The savant’s perception of data is reported to be different from normal brain operation: it is more precise (“high resolution data” [57]) than our ordinary higher-level, conceptual memory [11, 34, 35]. Some descriptions of the savant’s internal perception of numbers [52] display synesthesia [30]. These findings might help explain how they can perform complex calendar computation and prime number enumeration, without even knowing the complicated details of calendars or even what prime numbers are. The descriptions that some savants have provided (they are often autistic and thus cannot explain their methods) incline us to believe that their pattern-recognition learning method is supervised [52].

4.2. The Virtual Savant Framework

In this section, we exploit the Savant syndrome analogy to automatically generate parallel solvers. Our proposed approach, called the Virtual Savant (VS), is able to automatically generate a massively parallel program from a set of observations, not requiring any kind of source code as an input. For that, VS relies on ML to learn the behavior of a given reference algorithm to solve some optimization problem, using as observations a number of problem instances and their solutions, provided by the reference algorithm (as shown in the learning step in Fig. 1). Therefore, given an algorithm, VS is able to automatically generate a new and completely different program that reproduces its behavior, but in a highly efficient massively parallel way, thanks to its parallel pattern recognition engine.

An overall view of our VS template is depicted in Fig. 2. Basically, VS is composed of two steps. The first one is the prediction step. By analogy with the savant syndrome, VS implements a parallel predictor to generate a solution, based on parallel pattern recognitions. Each predictor is composed of a pattern recognition engine for the assignment of a value to the corresponding variable (represented in the figure as grey boxes, labeled ‘P’). They only require as input the relevant information to make the decision on one variable, and not the whole problem instance (therefore, the problem instance should be partitioned). A complete solution to the problem is built by executing one predictor for every variable, in parallel. Please note that predictors are independent processes that do not share any information, so the model builds the solution in a massively parallel way, and it can make use of as many parallel processes as the number of variables the problem has. This number can be multiplied in case predictors are parallel too. In our design in this paper, predictors provide the probabilities of assigning each allowed value to every variable. This provides VS with more information to correct possible prediction inaccuracies in the second step: the refinement step. It consists of the execution of a number of search algorithms in parallel, in order to find an accurate solution to the problem. The starting solutions are generated according to the value assignment probabilities computed in the prediction step. The best solution found in the refinement step is reported as the output of VS.

We carefully describe our method in sections 4.2.1 to 4.2.5, paying special attention to how it can be applied to the considered independent tasks scheduling problem (defined in Section 2), the case study chosen for this work. Section 4.2.1 describes the steps that need to be followed when facing a new problem with VS.
After that, we explain in Section 4.2.2 how VS can be trained to learn the behavior of an algorithm. These two procedures are done only once, before execution. Then, sections 4.2.3 and 4.2.4 explain the two steps followed by VS during the execution to find a result to a given problem instance: the mentioned prediction and refinement steps, respectively. Finally, Section 4.2.5 discusses the parallel capabilities of VS.

4.2.1. Problem Decomposition

In order to apply VS to solve a given problem, we first need to decide its solution representation. In the case of the considered independent tasks scheduling problem, solutions will be represented as an array of integers: the length of the array is the number of tasks (each index of the array is assigned to one task), and the values in the array denote the machine the corresponding task is assigned to. This is a widely adopted representation in the literature for this problem [23, 27, 37, 43, 60]. The model uses one predictor for every task, and it will be used to compute the machine the task must be assigned to. Therefore, the number of classes of the predictor is fixed to the number of machines. The output of VS is a solution to the problem, following the described representation.

Once the problem representation is designed, we need to define how the input problem instance can be partitioned. The idea behind partitioning the problem instance is to provide predictors with the minimum number of features required to accurately compute the machine the task should be assigned to. This way, we may greatly reduce the number of features the predictors must learn from. In addition, the input data partitioning allows VS to be scalable to the problem size, and more efficient for parallelism, since the different processes require less input data, and the data is not shared between them.
Figure 3: The training process of VS is made for every task, and it requires the processing time of the task on all machines, and the machine it is assigned to by the reference algorithm. (The predictor, an SVM, is trained from an instance and its solution: (1) the execution time of every task on all machines, taken from the problem instance (ETC file), and (2) the task assignment computed by the reference algorithm. One instance and solution pair provides “Tasks” observations.)
These are important features of VS, differentiating it from other approaches in the literature, such as Artificial Neural Networks (ANNs) [50, 55] and ensemble or hybrid models [12, 48]. These methods work on the whole problem instance, (i) requiring a large number of observations to be used in the learning process, and (ii) strongly limiting their possibilities for efficient parallelism. Additionally, (iii) they can only be used to solve problem instances of exactly the same size as the instances used for training. In contrast, partitioning the problem instance provides VS with outstanding features, as (i) it allows significantly reducing both the number of observations (because there are fewer features to learn from) and the required solutions from the reference algorithm, because a single problem solution provides VS with as many observations as the number of tasks, (ii) VS is massively parallel, composed of a large number of independent tasks with low input and output data transfer demand, so it can efficiently perform in both shared- and distributed-memory environments, and (iii) VS can scale to any problem dimension (in terms of the number of variables) without requiring any modifications.

The input to VS is an instance of the problem, defined in our case as an ETC matrix (tasks in columns, machines in rows), containing the expected time to compute each task on every machine. As mentioned before, concurrency is first introduced by decomposing the task-to-machine assignments into multiple, independent processes (called the predictors) in the parallel predictor. As can be seen in Fig. 2, the input of each predictor is not the whole problem instance, but a part of it. In our case, each predictor assigns the machine to one task, based on the time the task takes on all machines. Therefore, the ETC matrix is divided by tasks, the values of a given task being the input of one predictor. The approach is data-parallel, and each predictor operates independently of the others, avoiding any communication. The proposed decomposition approach allows exploiting the parallelism of the target architecture even for small problem instances. Section 4.2.2 describes how these predictors are implemented as classifiers to provide the probabilities of assigning every machine to the considered task (the classes are the different machines). The section also carefully explains the training process for the predictors.

4.2.2. The Learning Step

In the implementation of the VS proposed in this paper, predictors are implemented as a Support Vector Machine (SVM) multi-class classifier (other classifiers could have been adopted, and this is a matter of our future work). Every task assignment is made by an SVM instance, and the classes of the SVM are the machines in the problem instance. The SVM is trained with solved problem instances, obtained from an existing solver, which is to be parallelized. The characteristics used to train the SVM are emphasized with red rectangles in Fig. 3 and described next:

1. Information from the problem instance: the expected time to compute the corresponding task on every machine (i.e., the values in the array of the corresponding task in the ETC matrix).

2. Information from the reference algorithm: the task assignment in the solution computed by the reference algorithm for the specific problem instance.
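As an illustration of this per-task decomposition and of the two kinds of information listed above (see also Fig. 3), the following Python sketch builds one (features, label) observation per task from an instance/solution pair. It is our own reading of the described procedure, not code from the VS implementation, and it assumes the ETC is stored as etc[t][m].

```python
import numpy as np

def per_task_features(etc):
    """Partition a problem instance: each predictor only sees one task's row,
    i.e., the expected time of that task on every machine."""
    etc = np.asarray(etc, dtype=float)        # shape: (num_tasks, num_machines)
    return [etc[t] for t in range(etc.shape[0])]

def observations(instances, reference_solutions):
    """One solved instance yields as many observations as it has tasks:
    features = the task's ETC row, label = machine chosen by the reference algorithm."""
    X, y = [], []
    for etc, solution in zip(instances, reference_solutions):
        for task_row, machine in zip(per_task_features(etc), solution):
            X.append(task_row)
            y.append(machine)
    return np.array(X), np.array(y)
```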
In our previous VS design [14, 41], each SVM is trained for a single task, independently of the others, and learns the assignment rules of the original solver for that task. Despite the promising results reported, this model is restricted
to problem instances having the same number of tasks as those used for training. To overcome this limitation, only one SVM is trained to learn the assignments for all tasks in this work. Therefore, the trained VS can be applied to instances with any number of tasks, provided the hardware settings are kept the same, as is desirable in a real system. In addition, this model speeds up the learning process with respect to our previous one, because only one SVM needs to be trained (before, as many SVMs as the number of tasks had to be trained). Finally, the number of required observations in the learning process is drastically reduced: now, one solution provides n observations, n being the number of tasks. For example, 32,000 observations can be obtained with only 64 solved problem instances of size 512 × 16, instead of the 32,000 solved instances required if tasks are distinguished.

4.2.3. The Prediction Step

After the SVMs are trained in the learning step, they can be used to predict the behavior of the original algorithm on unseen problem instances. The prediction phase builds a tentative solution in a massively parallel way, using as many independent concurrent processes –the predictors– as the number of tasks (typically several hundreds of thousands), thanks to the parallel predictor. In our proposed design, predictors are SVM classifiers. The input of each SVM is the array of the corresponding task in the ETC matrix, containing the estimated time to compute the task on every machine. The output of each SVM is not a definite assignment for a task, but an assignment probabilities vector (Figure 2). The probabilities vector contains the probabilities to assign the task to every machine, according to the behavior learned from the reference solver. We took this decision to have more information for the next step, the refinement phase (Section 4.2.4), in order to efficiently and effectively cope with possible inaccuracies in the predictions. However, the output of the predictor could be the machine assignment for the task, and this way the prediction step would compute one tentative solution to the problem. In the presented design, communications are almost negligible, since every process requires as input the expected computing times of a task on all machines, namely, a vector of double values whose length is the number of machines (maximum size of 16 in the experiments in this paper), and its output is a vector with the assignment probabilities of the task to every machine (again, a vector of double values, with the same size as the input vector). No more communications are needed in the prediction phase. In addition, predictors could be parallel themselves, increasing the parallelism of the model [33].

4.2.4. The Refinement Step

The results from the predictors cannot be used as is. The reason is that the output of the SVM classifiers is a vector of probabilities, indicating how likely the assignment of each machine to every task is, as mentioned in Section 4.2.3. In addition, there is a need for a method to repair possible inaccuracies in the assignment of some of the tasks, given that in combinatorial optimization problems (as in the scheduling problem we address in this work) there is a strong dependency among the values of the different variables in a solution, while the proposed decomposition assigns each task independently. However, Section 5 confirms the accurate results provided by the SVMs when choosing the machine with the highest probability for every task.
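For illustration, the prediction step can be sketched as applying the single trained probabilistic classifier to every task row of an unseen ETC instance and collecting the per-task probability vectors into a matrix. The sketch below assumes a scikit-learn-style classifier exposing predict_proba and machine labels 0..M−1; it is not the framework's actual code.

```python
import numpy as np

def prediction_step(classifier, etc):
    """Apply the trained predictor independently to every task of an unseen instance.

    classifier: trained multi-class model exposing predict_proba (classes = machines 0..M-1)
    etc       : array of shape (num_tasks, num_machines), etc[t][m] = time of task t on machine m
    Returns probs of shape (num_tasks, num_machines), where probs[t][m] is the probability
    of assigning task t to machine m.
    """
    etc = np.asarray(etc, dtype=float)
    # Each row is one independent observation; in VS these calls could run in parallel,
    # one process per task, since no information is shared between predictors.
    return classifier.predict_proba(etc)

def most_likely_solution(probs):
    """Tentative solution built by taking the highest-probability machine per task."""
    return np.argmax(probs, axis=1)
```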
The prediction step builds the probabilities vector using a number of completely independent processes (which can themselves be parallel too). The refinement step (the “Refinement Algorithm” boxes in Figure 2) closes the gap between the independent task assignments and the combinatorial optimization nature of the problem, because it operates on full solutions making use of the fitness function. Every refiner starts by generating a complete solution to the optimization problem, with a randomized method that makes use of the probabilities matrix generated in the prediction step. The refiners are not limited to the assembly of the solution, but also search for fit solutions (with low makespan). In this paper, each refiner performs a random search in the solution space, storing the best solution found, which is output at the end of the process. This is done for the purpose of generality. Obviously, some problem knowledge could be used by the refiners to further improve the results, but that is outside the scope of the present study, where we present VS as a generic method. In essence, because random search is not efficient, the refiner’s search is guided by the predictors’ output. The search performed by the refiners is malleable [5], meaning that the computation effort needed is a function of the number of iterations made (solution evaluations). Its two components, the random generation and the fitness evaluation, can be executed on any number of nodes, because they are stateless. For instance, running 10,000 iterations of the refiner in a single thread provides the same result as two refiners performing 5,000 iterations each, in parallel. The fitness evaluation can, in general, require significant computation (although not in the case of the makespan), and could itself be a parallel computation.
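One possible reading of this refinement step, written as a Python sketch (our illustration, not the paper's implementation), is a probability-guided random search: candidate solutions are sampled from the predictors' probability matrix and only improvements in makespan are kept. It reuses the makespan helper from the sketch in Section 2.

```python
import numpy as np

def refine(probs, etc, iterations=10_000, rng=None):
    """Probability-guided random search (one refiner).

    probs[t][m]: assignment probabilities produced by the prediction step
    etc[t][m]  : expected time to compute task t on machine m
    Returns (best_solution, best_makespan) after the given number of iterations.
    """
    rng = rng or np.random.default_rng()
    num_tasks, num_machines = probs.shape
    # Initial complete solution: sample each task's machine from its probability vector.
    best = np.array([rng.choice(num_machines, p=probs[t]) for t in range(num_tasks)])
    best_fit = makespan(etc, best)
    for _ in range(iterations):
        candidate = best.copy()
        t = rng.integers(num_tasks)                          # pick a random task...
        candidate[t] = rng.choice(num_machines, p=probs[t])  # ...and redraw its machine
        fit = makespan(etc, candidate)
        if fit < best_fit:                                   # keep only improvements
            best, best_fit = candidate, fit
    return best, best_fit
```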
4.2.5. Discussion on the Parallel Design of VS

In Section 3.3, we mentioned that the GP approaches considered parallelism as a function to optimize, leading to uncertain parallelism results. In contrast, the proposed approach to parallelism specifies a target parallel template for all the generated programs, composed of a large number of quasi-independent and lightweight tasks. Therefore, the approach only produces algorithms that exhibit a suitable form of parallelism. The parallel template is designed to efficiently fit different modern architectures, such as massively parallel machines (e.g., GPGPUs or Intel Xeon Phi), distributed architectures and supercomputers (e.g., clusters, Grids, the Cloud), multi-core processors, and even low-power architectures (called the AWN). It displays the following characteristics:

• limited computation on each node,
• limited inter-node communication,
• scalability with the number of nodes,
• massive parallelism, exploiting hundreds of cores even for small problems.

The chosen algorithm template fits the Map-Reduce parallel model definition [13]. This is a well-known model that is implemented in a number of different frameworks, for computational architectures such as clusters of computers, multi-core servers, or GPGPUs. Additionally, Map-Reduce has been shown to be equivalent to other well-studied parallel models such as Bulk Synchronous Parallel (BSP) and Parallel Random Access Machine (PRAM) [40]. In the chosen template, the mappers are the machine-learning-based predictors that process the input problem instance in parallel and independently. The results are then passed to the reducers, the refinement algorithms, to produce the proposed solution to the problem, also in an independent and parallel fashion. The designed algorithmic template allows a high number of predictors and refiners to be run independently in parallel, even on small problems. Moreover, it allows scaling with respect to the problem size, by running more predictors and refiners, and with respect to the hardware resources, meaning that both predictors and refiners can themselves be parallel.

In summary, we target in this work the automatic generation of a parallel version of an existing algorithm. The original algorithm is treated as a black box, and the VS is presented as massively parallel software that learns its behavior just by making use of supervised learning. Predictors assign the probabilities for the different possible values of every variable. Refiners are then used to exploit the probabilistic information of the predictors to sharpen solutions for the considered combinatorial problem, based upon the fitness function.
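To make the Map-Reduce analogy concrete, the following Python sketch wires the earlier sketches together with a process pool: the "map" phase runs the per-task predictors and the "reduce" phase runs several independent refiners, keeping the best makespan. This is our own schematic illustration using Python's standard concurrent.futures, not the framework's actual code; it reuses prediction_step and refine from the previous sketches.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial
import numpy as np

def run_refiner(seed, probs, etc, iterations):
    """One independent 'reducer': a refiner with its own random stream."""
    return refine(probs, etc, iterations, rng=np.random.default_rng(seed))

def virtual_savant(classifier, etc, num_refiners=8, iterations=10_000):
    """Map-Reduce style execution: predictors map the instance to probabilities,
    refiners independently reduce them to candidate solutions; return the best one.
    (On some platforms, call this under an `if __name__ == "__main__":` guard.)"""
    probs = prediction_step(classifier, etc)           # 'map' phase (per-task predictors)
    worker = partial(run_refiner, probs=probs, etc=etc, iterations=iterations)
    with ProcessPoolExecutor() as pool:                # 'reduce' phase (independent refiners)
        results = list(pool.map(worker, range(num_refiners)))
    return min(results, key=lambda pair: pair[1])      # solution with the lowest makespan
```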
5. Experimental Results

This section presents the experiments performed, and analyzes the results obtained. The configuration we used in the experiments for the algorithms and the problem instances is detailed in Section 5.1. After that, the process followed to train VS is described in Section 5.2. Then, we analyze the accuracy of the predictions made by VS in Section 5.3. Section 5.4 examines the quality of solutions found by our method, while Section 5.5 discusses its runtime performance. Finally, the scalability of VS when increasing the number of problem variables is studied in Section 5.7.

5.1. Configuration of experiments

The problem instances for the selected scheduling combinatorial optimization problem are the ETC matrices. They are randomly generated following probability distributions that capture realistic scenarios, as presented in [6]. We study ETCs of high and low machine heterogeneity. With the aim of realistically reflecting current High Performance Computing (HPC) systems, the tasks composing the problem instances are chosen to be highly heterogeneous. Regarding the machines, we consider two different scenarios: one with high heterogeneity, called hihi (high task and machine heterogeneity), and the other one with low heterogeneity, referred to as hilo. Five problem sizes are simulated, problems
of 12 tasks to map onto 4 machines (denoted 12 × 4), 128 tasks onto 4 machines (denoted 128 × 4), 512 tasks onto 16 machines (denoted 512 × 16), and 4096 tasks onto 16 machines (denoted 4096 × 16).

The algorithms parallelized by the VS are an exhaustive search, the well-known Min-Min heuristic (described in Section 3.1), and a specialized metaheuristic (i.e., a GA). The exhaustive search is used to find the optimal solutions for the smallest problems (12 × 4) only. The metaheuristic algorithm we consider is our PA-CGA [42]. It is an asynchronous parallel cellular genetic algorithm [2] specialized to the independent task scheduling problem: the random initial population is seeded with one solution from the Min-Min heuristic, and a local search (H2LL) is executed on each new solution. The same parameters originally proposed for PA-CGA are adopted.

As mentioned, the predictors are SVM classifiers (Section 4.2.3). We adopted the implementation in libSVM [10]. The supervised training procedure follows the recommendations of the implementation guide:

• a subset of the available observations is used for training; the subset is chosen so as to maintain the class distribution (stratification),
• the data is scaled linearly to [−1, 1] (features range from 10^-1 to 10^3 for highly heterogeneous scheduling instances),
• the chosen kernel is the default radial basis function (RBF), because we have few features,
• a grid search is used to find the best parameters C and γ,
• v-fold cross-validation is used to evaluate the accuracy of the model.
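The paper uses libSVM directly; purely as an illustration of the configuration listed above (scaling to [−1, 1], RBF kernel, grid search over C and γ, cross-validation), an equivalent scikit-learn sketch could look as follows. The parameter grids are placeholders, not the values used in the paper.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def train_predictor(X, y):
    """Train the single multi-class predictor: features are per-task ETC rows,
    classes are machines (labels taken from the reference algorithm)."""
    pipeline = make_pipeline(
        MinMaxScaler(feature_range=(-1, 1)),     # linear scaling to [-1, 1]
        SVC(kernel="rbf", probability=True),     # RBF kernel, probability outputs
    )
    grid = {                                     # placeholder grid for C and gamma
        "svc__C": [2**k for k in range(-5, 16, 4)],
        "svc__gamma": [2**k for k in range(-15, 4, 4)],
    }
    # Cross-validation on a classifier is stratified by default in scikit-learn.
    search = GridSearchCV(pipeline, grid, cv=5)
    search.fit(X, y)
    return search.best_estimator_
```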
5.2. Training process

Before we can evaluate the performance of VS, we must select the amount of training data for each problem size. The amount of training data impacts:

• the accuracy of the SVM predictors (presented in the next paragraph),
• the runtime performance of the SVM, which decreases with more data. For example, for 512 × 16 problem instances, models trained with 32,000 observations are about:
  – ×3 slower than those trained with 1,600, and
  – ×2 slower than those trained with 16,000.

Therefore, we wish to employ the fewest training observations that nevertheless provide good results. Figures 4 to 8 present the impact of the amount of training observations on the VS performance. These figures are representative of the results we got for the other instances. The results are presented as boxplots, and notches indicate statistical significance: there is a statistically significant difference between boxplots with non-overlapping notches, with 95% confidence [9]. The boxplots represent the quality (in terms of makespan) of the solutions found by the VS when trained with different amounts of observations, for 200 unseen test ETC instances. Figures are presented in pairs: (a) results from the predictors (the SVM classification), and (b) VS results after 10,000 refinement iterations (i.e., 10,000 random changes). The quality presented is relative: it is the error of the makespan of the VS solution with respect to a reference solution:

• For the VS predictors, the quality measure is the most likely SVM classification (i.e., the solution built by assigning tasks to those machines with the highest probability), relative to a reference solution.
• For VS with a refiner, the quality measure is the VS’s median solution over 30 runs, relative to a reference solution (because the refiner is stochastic).

When the reference algorithm is PA-CGA, we compare against its best solution after 10 independent runs (we consider this enough given the variance of PA-CGA results, as shown in the next section), and not against its median solution. The reader should take into account that this comparison is unfair for the VS, but we do it because the best solution of PA-CGA is the one used in the training process of VS.
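To fix notation, the relative quality plotted in all following boxplots can be expressed as a one-line helper; this is our own formalization of the definition above (the error of the VS makespan with respect to the reference makespan, in percent).

```python
def relative_quality(vs_makespan: float, reference_makespan: float) -> float:
    """Relative error (%) of a VS solution with respect to the reference solution.

    0 means the reference quality was matched; negative values mean VS improved on it.
    """
    return 100.0 * (vs_makespan - reference_makespan) / reference_makespan
```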
Figure 4: Influence of training size on the performance of VS learning from the optimal solutions for 12 × 4 hihi instances. (a) After the VS predictor; (b) after 10K iterations of the VS refiner. Axes: quality (median vs. reference, in %) against training set size (400 and 4,000 observations).

Figure 5: Influence of training size on the performance of VS learning from Min-Min for 128 × 4 hihi instances. (a) After the VS predictor; (b) after 10K iterations of the VS refiner.

Figure 6: Influence of training size on the performance of VS learning from PA-CGA for 128 × 4 hilo instances. (a) After the VS predictor; (b) after 10K iterations of the VS refiner.
For 12 × 4 problems, Fig. 4a shows the low influence of the training set size on the performance of the VS predictors, when trained with optimal solutions. However, despite the mentioned negligible difference between the outputs of the predictors, the results of VS after performing 10,000 iterations of the refiner are considerably improved (with statistical confidence) when using the SVMs trained with 4,000 observations. These results can be seen in Fig. 4b. Regarding the quality of results, we can see that the solutions naïvely built directly from the predictors are around 55% worse than the optimum.
Figure 7: Influence of training size on the performance of VS learning from Min-Min for 512 × 16 hihi instances. (a) After the VS predictor; (b) after 10K iterations of the VS refiner. Axes: quality (median vs. reference, in %) against training set size (1,600, 16,000, and 32,000 observations).

Figure 8: Influence of training size on the performance of VS learning from PA-CGA for 512 × 16 hilo instances. (a) After the VS predictor; (b) after 10K iterations of the VS refiner.
However, after 10,000 iterations of the refiners (implementing a random search) the results become highly accurate: less than 1% worse, on average, for the SVMs trained with the large set. In fact, VS found the reference optimum in 20.17% of the runs (considering 30 runs per ETC), and it found it in at least one run for 76.5% of the 200 unseen ETCs studied. Additional experiments showed that, after allowing 120,000 iterations of the refiners, VS finds the reference optimum in 46.43% of the runs, and it is found for 88.5% of the ETCs in at least one out of the 30 independent runs. These numbers can be considerably improved when considering any optimal solution, and not the reference one, as will be shown in Section 5.4. In fact, the generic refiners implemented in VS target any optimal solution.

We now discuss two representative cases for problems of size 128 × 4. They are the VS learning from Min-Min on hihi instances (Fig. 5) and the VS reproducing the results of PA-CGA on hilo instances (Fig. 6). The two cases show how the VS predictors’ results do not reveal important differences in training set size. However, as for the small instances, VS after 10,000 iterations of the refiner shows that using a training set size of 4,000 leads to better quality solutions with statistical significance, both for VS trained with Min-Min and with PA-CGA solutions. Regarding the quality of solutions, the predictors’ output is around 10% worse than Min-Min, while the refinement step can significantly outperform the Min-Min reference algorithm, by almost 15% in some cases. The VS learning from PA-CGA on the hilo instances is about 20% worse than the reference algorithm, while the execution of the refiners allows VS to find solutions as accurate as those of PA-CGA. We would like to remark at this point that the values used to generate the boxplots are the median values after 30 independent runs of VS versus the best result of PA-CGA after 10 runs, meaning that for some instances, the median of the results is as accurate as the best result found by the reference PA-CGA algorithm in 10 independent runs. We analyzed the performance of VS according to the best solutions found, too; the study is made in Section 5.4.
Figure 9: Similarity of VS prediction, per task index and on average. (a) To the PA-CGA solution for 128 × 4 hihi problems; (b) to the Min-Min solution for 512 × 16 hihi problems. Axes: similarity (%) against task index.
Finally, we present some results for 512 × 16 problems in figures 7 and 8. Because in this case the SVMs implemented in the predictors are required to classify into 16 categories (instead of 4, as in the case of the previously commented instances), larger training sizes were required in the learning process. Sizes of 1,600, 16,000, and 32,000 were considered in our study. We can see in the figures that the VS predictors can now benefit from additional training data. In the case of the hihi instance –Fig. 7a–, the performance of the predictors is worse for the small training set size, while there are negligible differences in the other two cases. For the hilo problem –Fig. 8a–, we can appreciate significant differences among the three training set sizes, and the bigger the size the better the results. After the refinement step (Figs. 7b and 8b), we can see that more training data helps (both for VS from Min-Min and from PA-CGA results). However, we did not obtain statistical significance in the comparison of the performance of VS trained with the two largest sets. In the remainder of this paper, we select the smallest training set sizes that provide good results, according to the study performed. The reason is that less training data yields fewer support vectors in the models, and this implies shorter execution times. From our experiments we conclude that an appropriate size for the training data is 4,000 observations for the 12 × 4 and 128 × 4 problems and 16,000 for the 512 × 16 problems.

5.3. Accuracy of the prediction step

We compare in this section the solutions found by the VS predictors to the ones found by Min-Min and PA-CGA, on unseen problem instances. The solutions are compared literally (i.e., in terms of the task-to-machine assignments), and not via their quality (makespan), which will be analyzed later in Section 5.4. For the comparison, we do not perform the refinement step. Although the goal of the VS is the quality of solutions, the accuracy of the predictors is essential to the behavior of the algorithm. The performance of the predictors is reported with two metrics: similarity and probability to solution.

We define the similarity score as the proportion of correct task assignments across the 200 evaluation (unseen) problem instances, for each task. The choice of the predictor is the most probable machine assignment in the probabilities vector. Figure 9 shows some representative similarity results for the Min-Min and PA-CGA algorithms on some of the studied problems (tasks are ordered from the smallest –number 1– to the biggest one). As can be seen, the average similarity across tasks is over 80% and 60% for the 128 × 4 and 512 × 16 problems, respectively. We consider these high values an interesting result, because the VS assignments are independently made for all tasks, taking into account only the 4 and 16 ETC values used by each predictor, for the two problem sizes (with search spaces of 2^256 and 2^2,048 solutions), respectively. On the contrary, Min-Min and PA-CGA build their solutions based on more information: they use the ETC values of all tasks, and they also take into account the assignments of the other tasks when making decisions.
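The two metrics can be stated precisely with a short sketch; this is our own formalization of the definitions above, assuming the predictions and probability matrices come from the prediction-step sketch of Section 4.2.3.

```python
import numpy as np

def similarity(predicted, reference):
    """Per-task similarity (%): share of evaluation instances in which the most
    probable machine matches the reference algorithm's assignment.

    predicted, reference: arrays of shape (num_instances, num_tasks).
    """
    predicted = np.asarray(predicted)
    reference = np.asarray(reference)
    return 100.0 * (predicted == reference).mean(axis=0)

def probability_to_solution(prob_matrices, reference):
    """Per-task average probability (%) that the predictor assigns to the machine
    actually chosen by the reference algorithm (even when it is not the arg-max)."""
    scores = [probs[np.arange(probs.shape[0]), ref]   # probability of the reference class
              for probs, ref in zip(prob_matrices, reference)]
    return 100.0 * np.mean(scores, axis=0)
```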
Figure 10: Probability to solution of VS prediction (average accuracy per task, for Min-Min and PA-CGA). (a) 128 × 4 hihi problems; (b) 512 × 16 hihi problems. Axes: accuracy (%) against task index.
It can be seen that the similarity values are lower than average for the smaller tasks (those with lower index values). However, the error on these small tasks has a low impact on the solution quality, and they are easy to reallocate for a better makespan. These inaccuracies are also observed for the biggest tasks in the case of the VS for Min-Min. In any case, the error is never higher than 60%. We find that the wrong assignments for the largest tasks are understandable, because Min-Min first assigns the smaller tasks, and the larger ones are scheduled towards the end of the execution of the algorithm, being strongly biased by the previous task assignments, information that the VS predictor does not have. In addition, the approximation we perform in our model in the sorting of tasks (Section 4.2.3), determining the predictor model to apply for a task, could also be another reason for the observed mismatch between the VS predictor and Min-Min.

We define the probability to solution as the probability to choose the correct task assignment (i.e., matching the assignment of the reference algorithm) according to the probabilities vector generated by the predictors. The difference with similarity is that the latter adopts the assignment defined by the highest probability for every variable, whereas the probability to solution measures how likely it is that the classifier makes the correct choice. The probability to solution is a very useful indicator for the evaluation of VS, because even in the case when the greatest probability points to a wrong assignment, the probability the predictor gives to the correct assignment influences the refiner of the VS. It reflects both ambiguous choices, when more than one assignment has similar probability values, and strong preferences, meaning that some assignment has a high probability. Figure 10 shows the probability to build the solution of Min-Min and PA-CGA for two sample problems, detailed for every task. The result is shown as the average accuracy (i.e., the average probability to perform the assignment of the reference algorithm for every task) on 200 unseen problem instances. For smaller problem instances, Fig. 10a shows highly accurate prediction values of the SVM, in general: over 60%. Because of this high accuracy, similarity and probability to solution are very similar. For the 512 × 16 problems, Fig. 10b shows lower probabilities than the similarity of Fig. 9b, because the probability of machine assignments is spread across more machines/classes. Also, the predictors are less accurate than for smaller problem instances. However, because there are more than 2 machines, an accuracy value higher than 50% (dashed line in the plots) implies that the correct option is the one reported with the highest probability by the SVMs. Therefore, the highest-probability strategy leads to a correct assignment for all those tasks whose accuracy level is over the 50% dashed line.

5.4. Precision of Virtual Savant

In this section, we investigate how fit the solutions found by our automatically generated parallel VS algorithm are, for the considered combinatorial problem. We call this the precision of the VS algorithm, and we provide some representative results in Figs. 11 to 13, covering the different studied problem sizes. For each problem size, we show the median and best solutions of VS over 30 runs on 200 unseen test instances, in separate boxplots.
5.4. Precision of Virtual Savant

In this section, we investigate how good the solutions found by our automatically generated parallel VS algorithm are for the considered combinatorial problem. We call this the precision of the VS algorithm, and we provide some representative results in Figs. 11 to 13, covering the different studied problem sizes.

Figure 11: VS solutions for 12 × 4 hihi ETC. (a) Median results; (b) best results. Boxplots show solution quality (%) relative to the optimum versus iterations of the refiner (10K to 120K), for VS trained from the optimum, from Min-Min, and from PA-CGA; the rightmost box shows PA-CGA.
For each problem size, we show the median and best solutions of VS over 30 runs on 200 unseen test instances, in separate boxplots. The plotted results are relative to a reference solution for each ETC, which is the best known result: the optimum for the 12 × 4 problems, and the best result found by PA-CGA after 10 independent runs for the other problem sizes. The three figures show the quality of the solutions found by:
• the VS trained with optimal solutions (for the 12 × 4 problems),
• the VS trained with Min-Min solutions, and
• the VS trained with the best PA-CGA solutions.
We analyze the performance of the three VS versions for several iteration counts of the refiner: from 10,000 to 10,000 × tasks, and all the intermediate values that are a power of ten. The size of the problem is therefore reflected in the number of refinement iterations. The chosen numbers of iterations show convergence and capture the major improvements in solution quality.

Some sample results obtained for 12 × 4 hihi instances are presented in Fig. 11, in terms of both the median and best results found. As mentioned, the quality of the solutions is relative to a reference solution quality for each ETC, which is an optimal solution in the case of this small problem size. This explains why there are no points below 0%. In the figure, the box on the far right presents the PA-CGA results, while the results of Min-Min are not included because they are about 40% worse than the optimum, and adding them would reduce the clarity of the figure. We notice that the PA-CGA finds near-optimal solutions for these small problems.

Figure 11a shows how, overall, the VS versions obtain highly accurate solutions quickly (similar results were obtained for hilo instances). The median of the VS solutions is less than 2.5% worse than the optimum after 10,000 iterations of the refinement step, in all cases, and it sometimes reaches the optimum. The solutions found by the VS trained with Min-Min greatly improve on those of Min-Min itself, reducing the gap to the optimum from about 40% to a median of 2.5%. The small solutions (12 tasks) allow the refiners to quickly improve the results. The efficiency of the refiners is also due to the small ETC sizes, which allow the predictor to be more accurate, as the features of each SVM include 1/12 of the ETC for a 4-way classification.

Figure 11b compares the best solutions of the VS versions with the optimal solution for each ETC. As in the study of the median results just discussed, the VS trained with Min-Min finds slightly worse solutions than when trained with the optimal or PA-CGA solutions. This shows the influence of the accuracy of the predictors on the overall VS capability. The VS found an optimum in 41.30% of the 30 runs for each ETC, on average, after 10,000 iterations of the refiners, and in 80.01% of them after 120,000 refinement iterations. In addition, the VS found an optimum in at least one run for 86.5% and 95.5% of the 200 instances after 10,000 and 120,000 refinement iterations, respectively (and it solved to optimality in all 30 runs for 12% and 54% of the 200 instances, respectively).

Figure 12 presents the VS results for the 128 × 4 hilo problems (both median and best ones). For this problem size, the median solution found by PA-CGA over 10 runs is chosen as the reference solution to normalize the presented results. The reason is that it is better than the Min-Min solution, and the optimal solutions for this problem size are not known.
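As a reading aid, the relative quality values plotted in Figs. 11 to 13 can be interpreted as the percentage gap of a makespan with respect to the reference makespan of each instance. The one-liner below is a sketch of this normalization under that assumption; the exact expression used to produce the plots may differ slightly.

```python
def relative_quality(makespan: float, reference_makespan: float) -> float:
    """Percentage gap to the reference solution of an instance:
    0% means matching the reference, positive values are worse."""
    return 100.0 * (makespan - reference_makespan) / reference_makespan

# Example: a schedule with makespan 10,250 against a reference of 10,000
# is reported as 2.5% worse than the reference.
print(relative_quality(10_250.0, 10_000.0))  # 2.5
```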
Figure 12: VS solutions for 128 × 4 hilo ETC. (a) Median results; (b) best results. Boxplots show solution quality (%) relative to the PA-CGA reference versus iterations of the refiner (Min-Min, 10K, 100K, 1M, 1.28M, PA-CGA), for VS trained from Min-Min and from PA-CGA.
Figure 13: VS solutions for 512 × 16 hihi ETC. (a) Median results; (b) best results. Boxplots show solution quality (%) relative to the PA-CGA reference versus iterations of the refiner (Min-Min, 10K, 100K, 1M, 5.12M, PA-CGA), for VS trained from Min-Min and from PA-CGA.
Therefore, the PA-CGA results are near 0%. The leftmost box presents the Min-Min solutions. As for the small problems, the VS versions find highly accurate solutions quickly. The VS trained with Min-Min clearly outperforms its reference Min-Min algorithm. The VS even finds better solutions than the PA-CGA in a number of cases (even for the median results). For example, for hilo instances, after 1,280,000 random changes, VS finds better solutions than PA-CGA in around 100 out of the 200 studied instances. As a comparison, PA-CGA performs around 6,000,000 evaluations of the fitness function in its 30 s run for this problem size. This is a consequence of the accurate classification of the predictors, which effectively guides the random search of the refiners towards new, fit solutions. In addition, we expect that implementing a problem-specific local search, such as the one used in PA-CGA, instead of the random search would further improve the VS results. However, as we aim to provide a generic method in this paper, this study is left for future work. Results for hihi instances were similar to the hilo ones shown, and we found that the difference in machine heterogeneity has less impact on makespan with 128 tasks than with 12 tasks.

Finally, Fig. 13 presents the median and best VS results for the 512 × 16 problems. The reference solutions used to normalize the VS results are chosen in the same way as for the 128 × 4 problems. We can see in Fig. 13a that the median results of VS are similar in quality to the Min-Min solutions, but they require more iterations than what was observed on the smaller problems. Another difference with respect to the previous problem sizes is that VS trained with Min-Min performs slightly better than when trained with PA-CGA observations, although the differences are not always statistically significant (at a 95% confidence level). The VS median solutions are within 7.5% of the best PA-CGA solutions. If we compare the VS median and best results, we observe only a small improvement of the best results with respect to the median ones; this small gap is a strong indicator of the reliability of the proposed VS method. The VS performance compared to the PA-CGA is not as good as for the previous, smaller, problem sizes. The increased problem size, especially in the number of machines, decreases the accuracy of the predictors, which in turn requires more iterations from the refiners. In addition, more observations could be necessary to train VS from the PA-CGA in order to improve the accuracy of the predictors.
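The refiner used throughout these experiments is a guided random search: it repeatedly perturbs the assignment of a task and keeps the change if the makespan improves, with the perturbation biased by the probability vectors produced by the predictors. The sketch below is a minimal illustration of this idea under our own simplifying assumptions (uniform task choice, machine sampled from the predictor probabilities, ETC given as a tasks x machines matrix); it is not the exact implementation evaluated in the paper.

```python
import numpy as np

def makespan(assignment, etc):
    """Completion time of the most loaded machine.
    etc[t, m] is the expected time to compute task t on machine m."""
    loads = np.zeros(etc.shape[1])
    for task, machine in enumerate(assignment):
        loads[machine] += etc[task, machine]
    return loads.max()

def guided_random_refiner(assignment, prob, etc, iterations, rng):
    """Random search over single-task reassignments, guided by the
    predictor probabilities; only improving moves are kept."""
    best = assignment.copy()
    best_ms = makespan(best, etc)
    n_tasks, n_machines = etc.shape
    for _ in range(iterations):
        task = rng.integers(n_tasks)                    # pick a task at random
        machine = rng.choice(n_machines, p=prob[task])  # biased by the predictor
        candidate = best.copy()
        candidate[task] = machine
        candidate_ms = makespan(candidate, etc)
        if candidate_ms < best_ms:                      # keep improving moves only
            best, best_ms = candidate, candidate_ms
    return best, best_ms

# Toy usage: 4 tasks on 2 machines.
rng = np.random.default_rng(0)
etc = np.array([[3.0, 5.0], [2.0, 2.5], [4.0, 1.0], [6.0, 6.5]])
prob = np.array([[0.8, 0.2], [0.6, 0.4], [0.1, 0.9], [0.5, 0.5]])
start = prob.argmax(axis=1)  # highest-probability assignment from the predictors
print(guided_random_refiner(start, prob, etc, iterations=1_000, rng=rng))
```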
Table 1: Runtime performance model for 512 × 16 problems

Savant component                 | unit of work       | median wall-time
Predictor (SVM)                  | one classification | 247 ms
Refiner (guided random search)   | 1K solutions       | 13 ms

Figure 14: VS accuracy and similarity results for 4096 × 16 hihi problems, scaled from the predictor trained for 512 × 16 hihi problems. (a) Similarity (%) of the VS-from-Min-Min predictor per task index (per-task values and average). (b) VS predictor probability to solution (accuracy, %) per task index on hihi problems, for Min-Min and PA-CGA.
5.5. Performance considerations for Virtual Savant

The VS assignment of each task of a problem instance is independent of all other tasks, so it can be programmed concurrently and executed in parallel, on up to one processor per task. In addition, the VS refiner is state-free: each solution generation is independent of all the others, so it can also be programmed concurrently and executed in parallel, on up to one processor per solution generation (O(10^6) in our experiments). It is not within the scope of this paper to thoroughly evaluate the parallel performance of VS, but to present and analyze the accuracy of the method. However, we give here some insights on the potential performance of VS when executed in parallel. Table 1 presents the median runtime (wall-time, in ms) of each component of VS, when executed for 200 instances of the 512 × 16 problems on one processor with the following features: Intel Xeon X5670 @ 2.93 GHz with 1,920 MB of RAM. The implementation is not optimized for speed, as the standalone SVM programs are used (e.g., all files are read from the file system each time). A single SVM execution takes only 247 ms on our server, while the refiner step takes about 13 ms for every 1,000 iterations. This means that, with enough resources, we can obtain the assignment probabilities needed to construct the solution to the problem in around 250 ms (putting communication costs aside), whatever the number of tasks is, keeping the number of machines fixed at 16. We consider this a valuable result, because the PA-CGA that VS learns from takes 30 seconds to find its solutions.
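To make the figures in Table 1 easier to interpret, the sketch below estimates the wall-time of a fully parallel VS run from the two measured component times, assuming one worker per task for the predictions and an even split of the refinement iterations across workers; communication and start-up costs are ignored, as in the discussion above. The function and parameter names are ours, not part of the VS implementation.

```python
def estimated_walltime_ms(refiner_iterations: int,
                          refiner_workers: int,
                          svm_ms: float = 247.0,        # one classification (Table 1)
                          refine_ms_per_1k: float = 13.0) -> float:
    """Rough parallel wall-time: all per-task SVM classifications run
    concurrently (one worker per task), then the independent refiner
    iterations are spread evenly over refiner_workers."""
    prediction_ms = svm_ms  # independent of the number of tasks
    refine_ms = (refiner_iterations / 1000.0) * refine_ms_per_1k / refiner_workers
    return prediction_ms + refine_ms

# Example: 5,120,000 refinement iterations (512 x 16 problems) on 1,000 workers.
print(estimated_walltime_ms(5_120_000, refiner_workers=1_000))  # ~313.6 ms
```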
5.6. Scalability of Virtual Savant

In this section, we analyze the behavior of VS when scaling the problem size. For that, we use the models trained for the 512 × 16 problems presented in this paper to solve 4096 × 16 instances. We analyze the performance of VS in the same terms as in the previously presented study: (i) analyzing the similarity of the result to the one from the original algorithm (Fig. 14a); (ii) showing the probability of the predictors to find the reference solution (Fig. 14b); and (iii) analyzing the median and best quality solutions of VS (Fig. 15).

Figure 15: VS solutions for 4096 × 16 hilo ETC, scaled from the predictor trained for 512 × 16 ETC. (a) Median results; (b) best results. Boxplots show solution quality (%) relative to the PA-CGA reference versus iterations of the refiner (Min-Min, 10K, 100K, 1M, PA-CGA), for VS trained from Min-Min and from PA-CGA.
First, we can see in Fig. 14a how close the solutions generated by the VS predictors are to the reference solutions, for the predictor trained from Min-Min, as an example. On average, the highest-probability assignments of the predictors generate a solution that is slightly over 80% similar to the reference solution, over all tasks in all the tested instances (we observed the same behavior for the VS trained from PA-CGA). The most difficult assignments for the predictors are those of the smallest tasks, ranging from around 30% to 60% similarity (for the 256 smallest tasks). However, as discussed earlier in the paper, the impact of a mistake in the assignment of these tasks is low, and consequently they are the easiest tasks to reallocate; the refinement step can therefore quickly accommodate them on the right machines. The similarity also decreases slightly for the biggest tasks, but its value is rarely below 60%. If we compare these results against those obtained for the 512 × 16 problems (shown in Fig. 9b), for which the predictor was trained, we can see that the scalability of the model is highly satisfactory in terms of similarity. The plots for the two problem sizes confirm that the similarity values rise from around 40% (20% in the case of the VS trained from PA-CGA) for the smallest tasks and then decrease to around 60% for the biggest ones. We must emphasize at this point that the similarity of the predictor trained for 512 × 16 considerably improved when scaling to the 4096 × 16 problems, since the average similarity rose, roughly, from 60% to 80%.

Second, the probability that the predictor finds the reference solution (referred to as its accuracy) is shown in Fig. 14b for hihi problems, detailed for every task. We plot the accuracy curves of VS trained from both the Min-Min heuristic and the PA-CGA metaheuristic, with respect to the reference solutions provided by the original algorithms. The accuracy rises from a poor 10% or so for the smallest tasks to around 80%, and then decreases slightly to just over 60% for the biggest tasks. The model trained from Min-Min shows slightly better accuracy values than the one trained from PA-CGA, a non-deterministic algorithm. This plot is similar to the one obtained by VS for the 512 × 16 problems (shown in Fig. 10b). We emphasize again that any percentage over 50% ensures that VS assigned the highest probability to the correct option.

Third, the scalability of VS is also analyzed according to its performance with respect to the original algorithms in terms of the median and best results reported (Fig. 15). We first focus on the median solutions obtained, shown in Fig. 15a. The quality of the results is around 5% worse than the results of the PA-CGA for hilo problems (and around 7.5% worse for hihi). These results improve on those obtained for the 512 × 16 problems, for which VS was trained. In that case (see Fig. 13a for reference), VS could find solutions with less than 7.5% error (in median) in only one case for hihi problems (and in two cases for hilo), whereas the error (in median) of the scaled VS ranged roughly between 4% and 5.5% for hilo problems and was below 7.5% in most cases for hihi. Similar conclusions can be drawn from the best results shown in Fig. 15b. For hilo problems, all results are between 3% and 5% error in median with respect to the best-known solution. As before, we obtained slightly worse results for hihi instances, with errors ranging between 6% and 7.5%. Again, the performance of VS was better when scaling to the 4096 × 16 problem size than for the 512 × 16 problems for which it was trained.
Figure 16: Comparison of VS with state-of-the-art techniques for 4096 × 16 problems (VS is trained for 512 × 16 problems). (a) hihi instances; (b) hilo instances. Boxplots show the makespan obtained by GA, CHC, PA-CGA, VS-PA-CGA, VS-MinMin, Sufferage, Max-Min, and Min-Min.
We observe that, for hilo problems, the learning metrics used (accuracy and probability to solution) indicate a worse performance of VS than for hihi instances. However, the quality of the VS solutions is better for hilo problems. The reason is that the low heterogeneity of the tasks makes learning the right assignment more difficult for the predictors, but the impact of a mistaken assignment is lower than in hihi instances because the distribution of task durations is smoother.

5.7. Comparison against the state of the art

Performing a thorough comparison of our technique against state-of-the-art methods is out of the scope of this work, in which we focus on the capacity of VS to learn and accurately reproduce the behavior of different exact, deterministic, and non-deterministic optimization algorithms. To make VS highly competitive with the state of the art, we should make use of problem knowledge in the refinement step, instead of the random search implemented in this work.

We present in Fig. 16 a comparison of the performance of VS against some remarkable techniques from the state of the art. We included in the study two heuristics and two advanced metaheuristics. The heuristics are Max-Min [20] and Sufferage [32], two variants of the Min-Min heuristic (a compact sketch of both variants is given below). Like Min-Min, they work with a pool of unassigned tasks (composed of all tasks at the beginning) and, for every task in the pool, they find the machine that can finish it the soonest. Among all these candidate assignments, Max-Min takes the task that finishes the latest, while Sufferage selects the task that would "suffer" the most if it were not assigned to the machine that executes it the fastest. The selected task is removed from the pool of tasks to assign, and the algorithm keeps iterating until all tasks are assigned. Regarding the metaheuristics, we compare against the GA presented in [38] and the CHC-based scheduler from [37]. The results reported for these two metaheuristics were computed with their original implementations and parameterizations.

The study was made for the biggest instances considered in this work, composed of 4096 tasks to map onto 16 machines. Note that VS was not trained for this problem size, but for 512 tasks. The results for all metaheuristics, as well as the two VS versions, are obtained after 450,000 fitness function evaluations. They are computed on 100 different instances, and 10 independent runs were made for every problem instance in the case of the non-deterministic algorithms; therefore, the comparisons are made over 1,000 independent runs.

It can be seen in the figure that the performance of the two VS versions, learning either from Min-Min or from PA-CGA, is similar, with very small differences between them in the two problem classes studied. They are strongly competitive with the state-of-the-art algorithms for both hihi and hilo problems, even though they do not incorporate any problem knowledge. The two VS versions outperform Max-Min and Sufferage for the hihi problems, and Max-Min for the hilo ones. When comparing against the best algorithm for every problem family, the VS median results are only around 4% worse than those of GA for hihi problems, and around 6% worse than those of PA-CGA for hilo ones.
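The sketch below illustrates the Min-Min-style selection loop described above, with the Max-Min and Sufferage selection rules as we understand them from the literature. It is a simplified illustration (ETC given as a tasks x machines matrix, ties broken arbitrarily, no special handling of machine conflicts within a round), not the reference implementations used in the comparison.

```python
import numpy as np

def list_schedule(etc, rule):
    """Min-Min-style list scheduler. etc[t, m]: expected time to compute
    task t on machine m. rule: 'max-min' or 'sufferage'."""
    n_tasks, n_machines = etc.shape
    ready = np.zeros(n_machines)            # current completion time of each machine
    assignment = np.full(n_tasks, -1)
    unassigned = set(range(n_tasks))
    while unassigned:
        best_task, best_machine, best_key = None, None, None
        for t in unassigned:
            completion = ready + etc[t]                  # completion time on every machine
            m = int(completion.argmin())                 # machine finishing task t the soonest
            if rule == "max-min":
                key = completion[m]                      # pick the task finishing latest
            else:                                        # sufferage
                second = np.partition(completion, 1)[1]  # second-best completion time
                key = second - completion[m]             # how much task t would "suffer"
            if best_key is None or key > best_key:
                best_task, best_machine, best_key = t, m, key
        assignment[best_task] = best_machine
        ready[best_machine] += etc[best_task, best_machine]
        unassigned.remove(best_task)
    return assignment

# Toy usage: 3 tasks on 2 machines.
etc = np.array([[4.0, 6.0], [2.0, 1.0], [5.0, 5.5]])
print(list_schedule(etc, "max-min"))
print(list_schedule(etc, "sufferage"))
```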
The overall best performing algorithm is PA-CGA: it is significantly better than all the other compared algorithms for hilo problems, and it offers a similar performance (no significant differences were found) for hihi.

6. Conclusions and Future Work

We present in this paper the Virtual Savant (VS), a new method to automatically generate parallel programs to solve optimization problems. The inputs of the method are a number of problem instances and their solutions, provided by a reference algorithm. Using this information, VS automatically generates a completely new parallel algorithm that behaves similarly to the reference one. To do so, VS employs machine learning to abstract the rules followed by the reference algorithm to generate the solutions. Due to its massively parallel design, the generated algorithm can be implemented using the Map-Reduce programming model, because it is composed of a large number of independent, lightly demanding processes (although this is left for future work).

We present a complete experimental study to analyze the performance of VS, measured from different perspectives to help us understand the behavior and operation of the new method. Our main result is that VS is able to successfully learn the behavior of the three algorithms used as test cases, sometimes even outperforming them.

An immediate line of future work is the use of problem knowledge in VS, since it was presented here as a generic method. Additionally, the performance of VS on other problems is still an open question, and we find it especially interesting to study how VS can tackle problems with constraints. Its application to other problems is of high interest to strengthen our confidence in the proposed method, and this is being addressed in our next line of research. This implies identifying interesting problems, determining how to extract the features to learn from, and setting an appropriate training set size, among other tasks. Other interesting lines of research may consider the enhancement of VS by improving its design and considering other machine learning techniques (such as other modern classifiers or ensemble methods) to improve the learning accuracy, especially when the number of classes (i.e., the values the variables can take) increases. Finally, we are also working on the evaluation of the parallel performance of VS on different architectures.

Acknowledgment

This work is partially funded by the Spanish MINECO and FEDER under contracts TIN2014-60844-R (the SAVANT project) and RYC-2013-13355. We would like to thank Prof. Nesmachnow, who kindly computed the results of his algorithms GA [38] and CHC [37] for our experimental comparison. We used the HPC facilities of the University of Luxembourg [54] to perform our experiments (see http://hpc.uni.lu).

References

[1] Abraham, G. T., James, A., Yaacob, N., 2015. Group-based parallel multi-scheduler for grid computing. Future Generation Computer Systems 50, 140–153.
[2] Alba, E., Dorronsoro, B., 2008. Cellular Genetic Algorithms. Springer.
[3] Ali, S., Siegel, H. J., Maheswaran, M., Hensgen, D., Ali, S., 2000. Representing task and machine heterogeneities for heterogeneous computing systems. Journal of Science and Engineering 3, 195–207.
[4] Banerjee, U., Eigenmann, R., Nicolau, A., Padua, D. A., 1993. Automatic program parallelization. Proceedings of the IEEE 81 (2), 211–243.
[5] Blazewicz, J., Kovalyov, M. Y., Machowiak, M., Trystram, D., Weglarz, J., 2006. Preemptable malleable task scheduling problem. IEEE Transactions on Computers 55 (4), 486–490.
[6] Braun, T. D., Siegel, H. J., Beck, N., Bölöni, L. L., Maheswaran, M., Reuther, A. I., Robertson, J. P., Theys, M. D., Yao, B., Hensgen, D., et al., 2001. A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. Journal of Parallel and Distributed Computing 61 (6), 810–837.
[7] Callahan, C. D., Cooper, K. D., Hood, R. T., Kennedy, K., Torczon, L., 1988. ParaScope: A parallel programming environment. Int. J. of High Performance Computing Applications 2 (4), 84–99.
[8] Carretero, J., Xhafa, F., 2006. Using genetic algorithms for scheduling jobs in large scale grid applications. Journal of Technological and Economic Development 12 (1), 11–17.
[9] Chambers, J. M., Cleveland, W. S., Kleiner, B., Tukey, P. A., 1983. Graphical Methods for Data Analysis. Wadsworth International Group, Belmont, California.
[10] Chang, C.-C., Lin, C.-J., 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[11] Darius, H., 2007. Savant syndrome: theories and empirical findings. Ph.D. thesis, University of Skövde, Sweden.
[12] de Araújo Padilha, C. A., Couto Barone, D. A., Dória Neto, A. D., 2016. A multi-level approach using genetic algorithms in an ensemble of least squares support vector machines. Knowledge-Based Systems 106, 85–95.
[13] Dean, J., Ghemawat, S., 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51 (1), 107–113.
[14] Dorronsoro, B., Pinel, F., 2017. Combining machine learning and genetic algorithms to solve the independent tasks scheduling problem. In: The 3rd IEEE International Conference on Cybernetics (CYBCONF-2017). IEEE, pp. 1–8.
[15] Fonseca, A., Cabral, B., Rafael, J., Correia, I., 2016. Automatic parallelization: Executing sequential programs on a task-based parallel runtime. International Journal of Parallel Programming, 1–22.
[16] Ghafoor, A., Yang, J., 1992. Distributed heterogeneous supercomputing management system. ECE Technical Reports, 270.
[17] Guzek, M., Bouvry, P., Talbi, E. G., 2015. A survey of evolutionary computation for resource management of processing in cloud computing. IEEE Comp. Intelligence Magazine 10 (2), 53–67.
[18] Horowitz, E., Sahni, S., 1976. Exact and approximate algorithms for scheduling nonidentical processors. Journal of the ACM 23, 317–327.
[19] Hughes, J. R., 2010. A review of savant syndrome and its possible relationship to epilepsy. Epilepsy & Behavior 17 (2), 147–152.
[20] Ibarra, O. H., Kim, C. E., 1977. Heuristic algorithms for scheduling independent tasks on nonidentical processors. Journal of the ACM 24 (2), 280–289.
[21] Irigoin, F., Jouvelot, P., Triolet, R., 1991. Semantical interprocedural parallelization: An overview of the PIPS project. In: Int. Conference on Supercomputing. ACM, pp. 244–251.
[22] Iturriaga, S., Nesmachnow, S., Dorronsoro, B., Bouvry, P., 2013. Energy efficient scheduling in heterogeneous systems with a parallel multiobjective local search. Comput. Inform. 32 (2), 273–294.
[23] Iturriaga, S., Nesmachnow, S., Luna, F., Alba, E., 2015. A parallel local search in CPU/GPU for scheduling independent tasks on large heterogeneous computing systems. The Journal of Supercomputing 71 (2), 648–672.
[24] Kafil, M., Ahmad, I., 1998. Optimal task assignment in heterogeneous distributed computing systems. IEEE Concurrency 6 (3), 42–50.
[25] Keckler, S., Olukotun, K., Hofstee, H. (Eds.), 2009. Multicore Processors and Systems. Springer US.
[26] Koziel, S., Yang, X. (Eds.), 2011. Computational Optimization, Methods and Algorithms. Springer Berlin Heidelberg.
[27] Kroemer, P., Platos, J., Snasel, V., Abraham, A., 2011. An implementation of differential evolution for independent tasks scheduling on GPU. In: 6th Int. Conf. on Hybrid Artificial Intelligent Systems (HAIS). Vol. 6678 of Lecture Notes in Artificial Intelligence. Springer, pp. 372–379.
[28] Leroy, X., 2006. Formal certification of a compiler back-end or: programming a compiler with a proof assistant. In: 33rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. Vol. 41. ACM, pp. 42–54.
[29] Luo, P., Lü, K., Shi, Z., 2007. A revisit of fast greedy heuristics for mapping a class of independent tasks onto heterogeneous computing systems. Journal of Parallel and Distributed Computing 67 (6), 695–714.
[30] Luria, A. R., 1968. The mind of a mnemonist: A little book about a vast memory. Harvard University Press.
[31] Maheswaran, M., Ali, S., Siegel, H., Hensgen, D., Freund, R. F., 1999. Dynamic matching and scheduling of a class of independent tasks onto heterogeneous computing systems. In: Proceedings of the Heterogeneous Computing Workshop. pp. 30–44.
[32] Maheswaran, M., Ali, S., Siegel, H. J., Hensgen, D., Freund, R. F., 1999. Dynamic mapping of a class of independent tasks onto heterogeneous computing systems. Journal of Parallel and Distributed Computing 59 (2), 107–121.
[33] Massobrio, R., Nesmachnow, S., Dorronsoro, B., 2017. Support vector machine acceleration for Intel Xeon Phi manycore processors. In: High Performance Computing Conference Latin America (CARLA). Buenos Aires (Argentina) & Colonia (Uruguay), pp. 1–14.
[34] Mottron, L., Bouvet, L., Bonnel, A., Samson, F., Burack, J. A., Dawson, M., Heaton, P., 2012. Veridical mapping in the development of exceptional autistic abilities. Neurosci. Biobehav. Rev. 37 (3), 209–228.
[35] Mottron, L., Dawson, M., Soulieres, I., Hubert, B., Burack, J., 2006. Enhanced perceptual functioning in autism: An update, and eight principles of autistic perception. Journal of Autism and Developmental Disorders 36 (1), 27–43.
[36] Mottron, L., Lemmens, K., Gagnon, L., Seron, X., 2006. Non-algorithmic access to calendar information in a calendar calculator with autism. Journal of Autism and Developmental Disorders 36 (2), 239–247.
[37] Nesmachnow, S., Alba, E., Cancela, H., 2012. Scheduling in heterogeneous computing and grid environments using a parallel CHC evolutionary algorithm. Computational Intelligence 28 (2), 131–155.
[38] Nesmachnow, S., Cancela, H., Alba, E., 2010. Heterogeneous computing scheduling with evolutionary algorithms. Soft Computing 15 (4), 685–701.
[39] Nisbet, A., 1998. GAPS: A compiler framework for genetic algorithm (GA) optimised parallelisation. In: High-Performance Computing and Networking. Springer, pp. 987–989.
[40] Pace, M. F., 2012. BSP vs MapReduce. Procedia Computer Science 9, 246–255.
[41] Pinel, F., Dorronsoro, B., 2014. Savant: Automatic generation of a parallel scheduling heuristic for map-reduce. The International Journal of Hybrid Intelligent Systems 11 (4), 287–302.
[42] Pinel, F., Dorronsoro, B., Bouvry, P., 2010. A new parallel asynchronous cellular genetic algorithm for scheduling in grids. In: IEEE Int. Symp. on Parallel & Distributed Processing. IEEE, p. 206b.
[43] Pinel, F., Dorronsoro, B., Bouvry, P., 2013. Solving very large instances of the scheduling of independent tasks problem on the GPU. Journal of Parallel and Distributed Computing 73 (1), 101–110.
[44] Pinel, F., Dorronsoro, B., Pecero, J. E., Bouvry, P., Khan, S. U., 2013. A two-phase heuristic for the energy-efficient scheduling of independent tasks on computational grids. Cluster Computing 6 (3), 421–433.
[45] Ritchie, G., Levine, J., 2004. A hybrid ant algorithm for scheduling independent jobs in heterogeneous computing environments. In: Planning and Scheduling Special Interest Group Workshop (PLANSIG 2004). pp. 178–183.
[46] Ryan, C., van Roermund, A. H., Verhoeven, C. J. M., 2000. Automatic re-engineering of software using genetic programming. Kluwer.
[47] Ryan, C., Walsh, P., 1997. Paragen II: evolving parallel transformation rules. In: Comp. Int. Theory and App. Springer, pp. 573–573.
[48] Shen, X.-N., Minku, L. L., Marturi, N., Guo, Y.-N., Han, Y., 2018. A Q-learning-based memetic algorithm for multi-objective dynamic software project scheduling. Information Sciences 428, 1–29.
[49] Solomon, S., Thulasiraman, P., Thulasiram, R., 2011. Collaborative multi-swarm PSO for task matching using graphics processing units. In: Conf. on Genetic and Evol. Comp. (GECCO). ACM, pp. 1563–1570.
[50] Steinhaus, M., 2015. The application of the self organizing map to the vehicle routing problem. Open Access Dissertations, Paper 383, University of Rhode Island.
[51] Tabak, E., Cambazoglu, B., Aykanat, C., 2014. Improving the performance of independent task assignment heuristics MinMin, MaxMin and Sufferage. IEEE Tr. Par. Distr. Syst. 25 (5), 1244–1256.
[52] Tammet, D., 2007. Born on a blue day: Inside the extraordinary mind of an autistic savant. Simon and Schuster.
[53] Thearling, K., Ray, T. S., 1996. Evolving parallel computation. Complex Systems 10 (3), 229.
[54] Varrette, S., Bouvry, P., Cartiaux, H., Georgatos, F., July 2014. Management of an academic HPC cluster: The UL experience. In: Proc. of the 2014 Intl. Conf. on High Performance Computing & Simulation (HPCS 2014). IEEE, Bologna, Italy, pp. 959–967.
[55] Villarrubia, G., De Paz, J. F., Chamoso, P., De la Prieta, F., 2018. Artificial neural networks used in optimization problems. Neurocomputing 272, 10–16.
[56] Walsh, P., Ryan, C., 1996. Paragen: a novel technique for the autoparallelisation of sequential programs using GP. In: Annual Conference on Genetic Programming. MIT Press, pp. 406–409.
[57] Welling, H., 1994. Prime number identification in idiots savants: Can they calculate them? Journal of Autism and Developmental Disorders 24 (2), 199–207.
[58] Williams, K. P., 1998. Evolutionary algorithms for automatic parallelization. Ph.D. thesis, University of Reading.
[59] Xhafa, F., Alba, E., Dorronsoro, B., Duran, B., 2008. Efficient batch job scheduling in grids using cellular memetic algorithms. Journal of Mathematical Modelling and Algorithms 7 (2), 217–236.
[60] Xhafa, F., Alba, E., Dorronsoro, B., Duran, B., Abraham, A., 2008. Efficient batch job scheduling in grids using cellular memetic algorithms. In: Metaheuristics for Scheduling in Distributed Computing Environments. Springer-Verlag, Heidelberg, pp. 273–299.
Appendix A. Acronyms

AI: Artificial Intelligence
AWN: Array of Wimpy Nodes
ANNs: Artificial Neural Networks
BSP: Bulk Synchronous Parallel
CGA: Cellular Genetic Algorithm
DE: Differential Evolution
ETC: Expected Time to Compute
GA: Genetic Algorithm
GP: Genetic Programming
GPGPU: General Purpose GPU
HPC: High Performance Computing
ML: Machine Learning
PA-CGA: Parallel Asynchronous CGA
PRAM: Parallel Random Access Machine
SVM: Support Vector Machine
VS: Virtual Savant