Parallel Algorithms for Distributed Control: A Petri Net Based Approach
Zdeněk Hanzálek
Department of Control Engineering, Czech Technical University in Prague
Karlovo nám. 13, 121 35 Prague 2, Czech Republic
[email protected]
January 6, 2003
Life can only be understood going backwards, but it must be lived going forwards. Kierkegaard
For my parents
Acknowledgments

This research has been conducted at LAAS-CNRS Toulouse and at DCE CTU Prague as part of the research projects: New control system structures for production machines (supported by CTU grant GACR 102/95/0926), INCO COPERNICUS - TRAFICC (supported by the Commission of the European Communities), Trnka Laboratory for Automatic Control (supported by the Ministry of Education of the Czech Republic under VS97/034) and a French research project (supported by Ambassade de France à Prague).

I would like to express my gratitude and appreciation to all the people whose efforts contributed to this thesis:
Gérard Authié for his correctness and patience,
Jiří Bayer for his encouragement and confidence,
Robert Valette for his ability to simplify complicated problems,
Jan Bílek for his valuable suggestions,
Dominique de Geest and Vincent Garric for their friendship,
Patrick Danes for the address 'mon meilleur ami',
Christophe Calmettes for his touchy humour,
Frédéric Viader for his rough humour,
Jean Philippe Marin for his good nature,
and many others.

I also express my gratitude to the chairman, the reviewers and the members of my dissertation committee: Vladimír Kučera, Hassane Alla, Milan Češka, Jiří Kadlec, Guy Juanole and Branislav Hrúz.
Preface

In order to better understand the principal problems arising from the nature of parallel processing, we introduce basic concepts including modelling by data dependence graphs, program transformations, partitioning and scheduling. Static scheduling of nonpreemptive tasks on identical processors is surveyed as a mathematical background reflecting the complexity of the problem. Time complexity measures and global communication primitives are given to introduce the principal terminology originating from computer science.

The importance of global communication primitives is well illustrated by one example of a parallel algorithm - gradient training of feedforward neural networks. A message-passing architecture is presented to simulate multilayer neural networks, adjusting the weights for each pair consisting of an input vector and a desired output vector. Such an algorithm comprises many data dependencies, so it is very difficult to parallelize. The implementation of a neuron, split into the synapse and body, is then proposed by arranging virtual processors in a cascaded torus topology. Mapping virtual processors onto node processors is done with the intention of minimizing external communication. Then internal communication is reduced and an implementation on a physical message-passing architecture is given. A time complexity analysis arises from the algorithm specification and some simplifying assumptions. Theoretical results are compared with experimental ones measured on a transputer based machine. Finally the algorithm based on the splitting operation is compared with a classical one. The example shows one possible way to make an efficient implementation of a parallel algorithm. This approach demands a deep knowledge of the given application and considerable experience in parallel processing. This difficulty is probably one of the major reasons why parallel computers are more popular among theoreticians than in applications.

Another possibility is to transform sequential programs into equivalent parallel programs. Even if the complete application cannot be translated automatically, the aim is to facilitate the task of the programmer by translating some sections of the code and performing operations exploiting parallelism and detecting global data movements. Such an approach is a very complex task, and that is why we do not focus only on the problem solution but concentrate namely on the problem formalisation and analysis in order to better understand the nature of the problem.

Petri Nets are the formalism adopted in this thesis, because they make it possible to model and visualize behaviours comprising concurrency, synchronization and resource sharing. As such they are a convenient tool to model parallel algorithms. The first objective is to study structural properties of Petri Nets, which offer a profound mathematical background originating namely from linear algebra and graph theory. It is argued that namely positive invariants are of interest when analyzing structural net properties, and the notion of a generator is introduced. The importance of generators lies in their usefulness for analyzing net properties. Then three existing algorithms finding a set of generators are implemented. It is argued that the number of generators is nonpolynomial, and an original algorithm comprising structural information of fork-join components is proposed. The usefulness of generators consists in their ability to express structural properties of Petri Nets. That is why the set of generators, in combination with the initial marking, is often used to prove properties like liveness, or it serves as input data for scheduling algorithms.

The second objective is to model algorithms via Petri Nets. It is stated that the model can be based either on the problem analysis or on the sequential algorithm. The problems are modelled as noniterative or iterative ones and reduction rules preserving model properties are studied. An attempt is made to put DDGs and Petri Nets on the same platform when removing antidependencies and output dependencies.

The third objective is to schedule nonpreemptive tasks with precedence constraints given by event graphs on an unlimited, but possibly minimal, number of identical processors. Importance is paid to loop scheduling, which is essential when designing efficient compilers for parallel architectures.

The fourth objective studied in this thesis is the detection of global communication primitives. In the presence of communication, the above scheduling problem has been found to be much more difficult than the classical scheduling problem, where communication is ignored. Global data movements such as broadcasting or gathering occur quite frequently in certain classes of algorithms. If such data movement patterns can be identified from the PN model of the algorithm, then the calls to communication routines can be issued without detailed knowledge of the target machine, while the communication routines themselves are optimized for the specific target machine.

This thesis has a strong interdisciplinary character. An attempt is made to bring the knowledge of various scientific branches onto the same theoretical platform. We try to adopt a method developed in one of them and to elaborate it in another scientific branch. Such scientific branches include computer science, Petri Nets, linear algebra and graph theory.
Contents

1 Introduction

2 Basic concepts of parallel processing
  2.1 Architecture
  2.2 Parallelization
    2.2.1 Modelling and parallelism detection
    2.2.2 Partitioning
    2.2.3 Scheduling
    2.2.4 Survey on static scheduling of nonpreemptive tasks on identical processors
  2.3 Time complexity measures
  2.4 Communication
    2.4.1 Communication model
    2.4.2 Network topologies
    2.4.3 Global communication primitives
  2.5 Conclusions

3 An example of parallel algorithm
  3.1 Algorithms for neural networks
  3.2 Neural network algorithm specification
  3.3 Simple mapping
  3.4 Cascaded torus topology of virtual processors
  3.5 Mapping virtual processors onto processors
  3.6 Data distribution
  3.7 Time complexity analysis
  3.8 Some experimental results
  3.9 Comparison with a classical algorithm
  3.10 Conclusion

4 Structural analysis of Petri nets
  4.1 Basic notion
  4.2 Linear invariants
  4.3 Finding positive P-invariants
    4.3.1 An algorithm based on Gauss Elimination Method
    4.3.2 An algorithm based on combinations of all input and all output places
    4.3.3 An algorithm finding a set of generators from a suitable Z basis
    4.3.4 Time complexity measures and problem reduction
    4.3.5 Conclusion

5 Petri net based algorithm modelling
  5.1 Additional basic notions
    5.1.1 Implicit place
    5.1.2 FIFO event graph
    5.1.3 Uniform graph
  5.2 Model based on the problem analysis
    5.2.1 Noniterative problems
    5.2.2 Iterative problems
    5.2.3 Model reduction
  5.3 Model based on the sequential algorithm
    5.3.1 Acyclic algorithms
    5.3.2 Cyclic algorithms
    5.3.3 Detection and removal of antidependencies and output dependencies in a PN model
  5.4 Conclusion

6 Parallelization
  6.1 Basic principles
    6.1.1 Data parallelism
    6.1.2 Structural parallelism
    6.1.3 Noniterative versus iterative scheduling problems
  6.2 Cyclic scheduling
    6.2.1 Additional terminology
    6.2.2 Structural approach to scheduling
    6.2.3 Quasi-dynamic scheduling
  6.3 Communications
    6.3.1 The problem complexity
    6.3.2 Finding global communications
    6.3.3 Relation to automatic parallelization
  6.4 Conclusion

7 Conclusion
List of Tables

2.1 Earliest, latest, and feasible task execution time
3.1 Numerical values for neural network with 30-150-150-30 neurons
4.1 Generator computational levels
List of Figures

2.1 Data dependence graph
2.2 Data dependence graph
2.3 Partitioning a computational graph: (a) fine grain; (b) coarse grain
2.4 Directed acyclic graph
2.5 An instance of a scheduling problem
2.6 An instance of a scheduling problem with the critical subgraph
2.7 Earliest schedule for the instance from Figure 2.5
2.8 Some specific topologies
2.9 Hypercube interconnection
2.10 Hierarchy and duality of the basic communication problems
2.11 Hierarchy example
2.12 Duality example
3.1 Artificial neuron j in layer l
3.2 Example of a multilayer neural network (NN 2-4-4-2)
3.3 The activation in the second hidden layer (4 neurons in both hidden layers mapped on 4 NPs) and its Petri net representation
3.4 Cascaded torus topology of VPs (for NN 2-4-4-2)
3.5 VPs simulating NN with 16-16-16-16 neurons mapped on 4 × 4 NPs
3.6 Realization on an array of 17 transputers
3.7 Separate parts of the execution time for NN with 64-64-64-64 neurons
3.8 Theoretical execution time of NN algorithm
3.9 Comparison of theoretical and experimental results
3.10 Experimental results achieved on a T-node machine
4.1 An example of Petri Net
4.2 A Petri Net (the billiard balls in [22])
4.3 A Petri Net with two positive P-invariants and one T-invariant
4.4 A Petri net with one negative P-invariant
4.5 Event graph with two generators
4.6 Subspace of positive linear invariants
4.7 A Petri Net with four generators
4.8 Subnet of the PN representation of a Neural network algorithm
4.9 Example of Petri Net with exponential number of generators
5.1 Implicit place
5.2 Expressive power of different modelling methods
5.3 Representation of data flow by means of PNs and DAG
5.4 Matrix[3,3]-vector[3] multiplication
5.5 Discrete time linear system
5.6 PN model of the discrete time linear system of second order with PD controller shown in Figure 5.5
5.7 NN learning algorithm represented by Petri Net
5.8 Simplified PN model
5.9 PN model after reduction
5.10 Representation of the algorithm from Example 5.1
5.11 PN representation of Example 5.2
5.12 Detection (a) and removal (b) of antidependence
5.13 Antidependence in a cyclic algorithm
5.14 Output dependence
5.15 Two antidependencies and one output dependence
5.16 The six possible instances of an output dependence
5.17 Problem with two competitors and two destinations
5.18 Rough comparison of the two modelling approaches
6.1 Two PN models of the same vector operation
6.2 PN model of cyclic algorithm from Figure 5.11 after removal of IP-dependencies
6.3 Graph corresponding to the parallel matrix Π of the algorithm given in Figure 6.2
6.4 Cyclic version of matrix[3,3] vector[3] multiplication
6.5 A simple instance for structural scheduling
6.6 Underlying directed graph for Figure 6.5 and its reduction
6.7 Schedule of the cyclic version of the matrix[3,3]-vector[3] multiplication (from Figure 6.4 by elimination of implicit places)
6.8 Global communication primitives of matrix-vector multiplication
6.9 Automatic parallelization
Chapter 1

Introduction

This thesis presents an original method for algorithm parallelization using Petri Nets.
General view

Two basic approaches are used for designing parallel algorithms: a) direct design, b) use of already existing sequential algorithms which are parallelized automatically.

The approach a), shown in Chapter 3, requires a deep understanding of the problem for which we write a parallel algorithm. The first step that one may take is to understand the nature of the computations. The second step is to design a parallel algorithm. The third is to map the parallel algorithm onto a suitable computer architecture. This design is often tied to a given architecture, and its implementation on other machines can lead to serious problems.

In the approach b), studied in Chapters 5 and 6, we investigate the possibility of transforming sequential programs automatically into equivalent parallel programs. This task is very complex and it is not possible to find a solution for the general case. That is why we look for basic principles in order to do as much work as possible automatically. Even if the complete application cannot be translated automatically, the aim is to facilitate the task of the programmer by translating some sections of the code and performing operations exploiting parallelism and detecting global data movements. There are several surveys on automatic parallelization [1], [53] and several descriptions of experimental systems [54], [80]. Banerjee et al. [5] presented an overview of techniques for a class of translators whose objective is to transform sequential programs into equivalent parallel programs. These studies are usually based on data dependence graphs.

The methodology adopted in this thesis is based on Petri Nets and their
structural properties, described in Chapter 4. We try to clarify the contribution of Petri Nets in the domain of automatic parallelization. The contribution of this thesis does not lie in the theory of Petri Nets; rather, it studies why and how they can be used in a general methodology for program parallelization. This thesis has a strong interdisciplinary character. An attempt is made to bring the knowledge of various scientific branches onto the same theoretical platform. We try to adopt a method developed in one of them and to elaborate it in another scientific branch. Motivated by problems arising from parallel processing, we look for solutions in domains like Petri Nets and graph theory. In general we can say that the methods adopted in this thesis fall into applied mathematics. We do not concentrate only on the problem solution but focus our interest namely on the problem analysis in order to better understand the nature of the problem.
Organization

The thesis is divided into five principal chapters.

Chapter 2 is an introduction to parallel processing. Parallel processing is a fast growing technology that covers many areas of computer science and control engineering. It is natural that the concurrency inherent in physical systems starts to be reflected in computer systems. Parallelism brings higher speed and both hardware and software distribution, but it raises a new set of complex problems to solve. Most of these problems are surveyed in Chapter 2. The chapter contains basic computer models, modelling by data dependence graphs, partitioning, scheduling, performance measures and global communication primitives. It is a basic chapter introducing the concept of dependencies and techniques for parallelism analysis in algorithms. Simple examples introducing the notions of antidependencies and output dependencies are given.

Chapter 3 is an example of a parallel algorithm implementation. This chapter presents a usual approach to parallel processing where parallelization is not done automatically. The example chosen is a message-passing architecture simulating multilayer neural networks, adjusting the weights for each pair consisting of an input vector and a desired output vector. It is shown why such an algorithm is difficult to parallelize. A solution based on fine-grain partitioning is studied, implemented and compared with a classical one demanding more communication. A time complexity analysis is given and theoretical results are compared with experimental ones measured on a physical machine.

Petri Nets make it possible to model and visualize behaviours comprising concurrency, synchronization and resource sharing. As such they are a convenient tool to model parallel algorithms. The objective of Chapter 4 is to study structural properties of Petri Nets, which offer a profound mathematical background originating namely from linear algebra and graph theory.

Chapter 5 uses Petri Nets to formalize algorithm data dependencies. It is
stated that a model can be based either on the problem analysis or on a sequential algorithm. Noniterative as well as iterative problems are under consideration. An attempt is made to put data dependence graphs and Petri Nets on the same platform when removing antidependencies and output dependencies. A Petri Net based data flow model is defined and the difficulties arising from the algorithm representation are clarified.

After modelling issues and structural analysis we focus our interest on the scheduling problem without communication delays. A scheduling problem of cyclic algorithms is studied on an unlimited, but possibly minimal, number of processor resources in Chapter 6. Then data movement patterns are identified from the Petri Net model of the algorithm and the calls to communication routines are issued without any detailed knowledge of the target machine.

The field of parallel processing is expanding rapidly and new, improved results become available every year. That is why current parallel supercomputers and detailed implementation issues of parallel algorithms are described only in brief.
Chapter 2

Basic concepts of parallel processing

The objective of this chapter is to introduce the basic terminology and some elementary methods from the field of parallel processing. It has evolved from class notes for a graduate level course in parallel processing originated by the author at the Department of Control Engineering, CTU Prague. Parallel processing comprises algorithms, computer architectures, programming, scheduling and performance analysis. There is a strong interaction among these aspects, and it becomes even more important when we want to implement systems. A global understanding of these aspects allows programmers to make trade-offs in order to increase the overall efficiency.

This chapter is organized in four parts. First a systematic architectural classification is given. Then the principal problems of program parallelization, namely modelling, partitioning and scheduling, are classified. The third part gives the basic terminology of time complexity measures and the fourth presents communication aspects of parallel processing. The chapter emphasizes the crucial problems of parallel processing - modelling, scheduling and communications. Difficult concepts are introduced via simple motivating examples in order to facilitate understanding.
2.1 Architecture
A multitude of architectures has been developed for multiprocessor systems. The first systematic architectural classification by Flynn [31] is not unique in every respect, but it is still widely used. Flynn classifies computing systems according to the hardware provided to service the instruction and data streams. A classical, purely serial monoprocessor executes a single stream of instructions; the execution is sequential. A system that includes a single processor of this type is classified as SISD - Single Instruction, Single Data stream
machine. A certain level of parallelism exists even in SISD machines, but it is limited to the overlapping of instruction cycles (on RISC processors) and to the internal, register level parallelism in hardware. A classical form of parallelism exists in vector processors that support massive pipelining and in array processors that operate on multiple data streams. These computers are denoted SIMD - Single Instruction, Multiple Data stream machines and are generally used for numerical computation. Finally there is a large group of computing systems that include more than one processor able to process Multiple Instruction over Multiple Data streams - MIMD. These computers are suitable for a much larger class of computations than SIMD computers because they are inherently more flexible. This flexibility is achieved at the cost of a considerably more difficult mode of operation. Only MIMD computers will be considered further in this thesis.

Two distinct classes of MIMD machines exist regarding the way in which processors communicate:
(1) shared-memory communication, in which processors communicate via a common memory,
(2) message passing communication, in which processors communicate via communication links.

Running parallel algorithms on these two distinct classes of parallel computers leads to quite distinct kinds of problems (e.g. memory access management in shared-memory computers or message routing in message passing computers). Only message passing communication will be considered further in this thesis. In such systems each processor uses its own local memory for storing some problem data and intermediate algorithmic results, and exchanges information with other processors in groups of bits usually called packets using the communication links of the network.
2.2 Parallelization
First of all we introduce the term SIMD parallelization, where the algorithm is of such a kind that no code decomposition is needed because we run the same code on different processors working on different data. First the data are distributed to the processors, then the computation is performed in each processing element without communication with the others, and finally the results are collected in one processor. This type of parallelization is very close to the SIMD computers, where one instruction operates on several operands simultaneously. Algorithm partitioning (see Section 2.2.2) and scheduling (see Section 2.2.3) are easier problems in data parallel algorithms. In this thesis we will talk about so-called functional parallelization, where processors run different codes and communicate with each other during the processing. In such a case the term parallelization covers the following problems that
have to be solved when mapping a program onto a MIMD computer:
1) algorithm modelling and parallelism detection
2) partitioning the program into sequential tasks
3) scheduling the tasks onto processors

The parallelism in a program depends on the nature of the problem and the algorithm used by the programmer. The parallelism analysis is usually independent of the target machine. On the other hand, partitioning and scheduling are designed to minimize the parallel execution time of the program on a target machine, and depend on parameters such as the number of processors, processor performance, communication time, scheduling overhead etc.
2.2.1 Modelling and parallelism detection
There are several ways to model the dynamic behaviour of algorithms; the most common modelling techniques are data dependence graphs (DDG), directed acyclic graphs (DAG), and Petri nets. The following paragraph explains in brief a modelling technique based on DDGs. DAGs and Petri nets will be analyzed later. Parallelism detection involves finding sets of computations that can be performed simultaneously. The approach to parallelism is based on the study of data dependencies. The presence of a data dependence between two computations implies that they cannot be performed in parallel; the fewer the dependencies, the greater the parallelism. An important problem is determining the way to express dependencies.

Data dependence graph

The data dependence graph (DDG) is a directed graph G(V, E) with vertices V corresponding to statements in a program, and edges E representing data dependencies of three kinds:
1) Data-flow dependencies (indicated by the symbol −→) express that the variable produced in a statement will be used in a subsequent statement.
2) Data antidependencies (indicated by the symbol −→/ ) express that the value produced in one statement has been previously used in another statement. If this dependency is violated (by parallel execution of the two statements), it is possible to overwrite some variables before they are used.
3) Data-output dependencies (indicated by the symbol −→◦ ) express that both statements overwrite the same memory location. When executed in parallel it is not determined which statement writes first.
Example 2.1: consider a simple sequence of statements:

S1: A = B + C
S2: B = A + E
S3: A = B
An analysis of this example reveals many dependencies. The data dependence graph is shown in Figure 2.1(a). Statement S1 produces the variable A that is used in statement S2 (flow dependence d1), statement S2 produces the variable B that is used in statement S3 (flow dependence d2), and the previous value of B was used in statement S1 (antidependence d3); both statements S1 and S3 produce the variable A (output dependence d4); statement S3 produces the variable A, previously used in statement S2 (antidependence d5). The antidependencies and output dependencies can be eliminated at the cost of introducing new redundant variables. Some techniques for this elimination, proposed in [70], are variable renaming, scalar expansion and node splitting.
Figure 2.1: Data dependence graph
The following demonstrates variable renaming, where later occurrences of old variables are replaced with new variables. The program from the previous example does not change if statements S2 and S3 are replaced by S2' and S3' respectively:

S1: A = B + C
S2': BB = A + E
S3': AA = BB
If this change is made, the antidependence and output dependence arcs are removed (see Figure 2.1(b)). Example 2.2: consider the following loop program:
2.2. PARALLELIZATION
9
FOR I=1,20
S1: A(I) = X(I) - 3
S2: B(I+1) = A(I) * C(I+1)
S3: C(I+4) = B(I) + A(I+1)
The dependence graph is shown in Figure 2.2(a). There is a data dependence cycle S1,S2,S3, which indicates dependencies between loop iterations. However, one of the arcs in the cycle corresponds to an antidependence; if this arc is removed, the cycle will be broken. The antidependence relation can be removed from the cycle by node splitting:

FOR I=1,20
S0: AA(I) = A(I+1)
S1: A(I) = X(I) - 3
S2: B(I+1) = A(I) * C(I+1)
S3': C(I+4) = B(I) + AA(I)
The modified loop has the data dependence graph shown in Figure 2.2(b). The data dependence cycle S1,S2,S3 has been eliminated by splitting the node S3 in the DDG into S0 and S3’.
Figure 2.2: Data dependence graph
Statements S2 and S3' are connected in a data dependence cycle which cannot be removed. In principle a data dependence cycle consisting exclusively of data-flow dependencies cannot be removed once all cycles containing at least one antidependence or output dependence have been removed. It is possible to vectorize (to perform a single instruction on multiple data, in other words to make use of the data parallelism) the statements that are outside the cycle. Thus the previous loop may be partially vectorized, as follows:

S0: AA(1:20) = A(2:21)
S1: A(1:20) = X(1:20) - 3
FOR I=1,20
S2: B(I+1) = A(I) * C(I+1)
S3': C(I+4) = B(I) + AA(I)
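The same partial vectorization can be written down in an array language. The sketch below uses NumPy purely for illustration (the data and array sizes are hypothetical, not taken from the thesis): S0 and S1 become whole-array operations, while S2 and S3' remain in a sequential loop because of the data-flow cycle.

import numpy as np

# Hypothetical data, large enough for the references I+1 and I+4 (0-based below).
X = np.random.rand(25)
A = np.random.rand(25)
B = np.random.rand(25)
C = np.random.rand(25)

# S0 and S1: vectorized, no dependence between iterations.
# S0 must still be evaluated before S1 (antidependence from S0 to S1).
AA = A[1:21].copy()       # S0: AA(I) = A(I+1)
A[0:20] = X[0:20] - 3     # S1: A(I) = X(I) - 3

# S2 and S3': kept sequential because of the data-flow dependence cycle.
for i in range(20):
    B[i + 1] = A[i] * C[i + 1]      # S2
    C[i + 4] = B[i] + AA[i]         # S3'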
That means: we have used 20 processors to run the instructions given by S0 and S1 on 20 different data. These processors do not perform any communication among themselves. Simultaneous execution of S0 and S1 is impossible due to the antidependence relation from S0 to S1.

Data dependence analysis done in the DDG, and the removal of antidependencies and output dependencies, in fact lead to a simple directed graph plus information about index shifts given in the dependence vector (see [65]). The dependence vectors indicate the number of iterations between the generated variables and the used variables. In Example 2.1 there are no indices, and the value of the dependence vectors is zero. In Example 2.2 there is one iteration index I and four dependencies, so the four entries of the dependence vector are given as d1 = I − (I) = 0, d2 = I + 1 − (I) = 1, d3 = I + 4 − (I + 1) = 3 and d4 = I + 1 − (I) = 1.

Directed acyclic graph

A directed acyclic graph (DAG) is a directed graph that has no positive cycles, that is, no cycles consisting exclusively of directed paths (for additional terminology see 6.2.1). Let G = (V, E) be a DAG where V is a set of vertices (corresponding to statements in a program), and the edges E represent data-flow dependencies. In particular, an arc (i, j) ∈ E indicates the fact that the operation corresponding to vertex j uses the results of the operation corresponding to vertex i. This implies that operation j must be performed after operation i. That is why these graphs are acyclic. A comparison of DDGs, DAGs and PNs will be given further in Chapter 4.
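As an illustration only (this is not a tool used in the thesis), the dependence kinds and distances of Example 2.2 can be derived mechanically from the read and write references of each statement. The small Python sketch below assumes each reference is described by an array name and an offset relative to the loop index I; all names are made up, and intra-iteration statement order is ignored.

# Read/write references of Example 2.2; ("B", 1) stands for B(I+1).
statements = {
    "S1": {"writes": [("A", 0)], "reads": [("X", 0)]},
    "S2": {"writes": [("B", 1)], "reads": [("A", 0), ("C", 1)]},
    "S3": {"writes": [("C", 4)], "reads": [("B", 0), ("A", 1)]},
}

def loop_dependencies(stmts):
    """Return (kind, source, sink, distance) for every pair of accesses
    to the same array."""
    deps = []
    for src, s in stmts.items():
        for arr, w in s["writes"]:
            for dst, t in stmts.items():
                for arr2, r in t["reads"]:
                    if arr != arr2:
                        continue
                    if w >= r:   # value produced, used w - r iterations later
                        deps.append(("flow", src, dst, w - r))
                    else:        # value read, overwritten r - w iterations later
                        deps.append(("anti", dst, src, r - w))
    return deps

for dep in loop_dependencies(statements):
    print(dep)
# ('flow', 'S1', 'S2', 0), ('anti', 'S3', 'S1', 1),
# ('flow', 'S2', 'S3', 1), ('flow', 'S3', 'S2', 3)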
2.2.2 Partitioning
When talking about the operations (vertices) in DAGs and DDGs it was assumed that an operation could be elementary (e.g., an arithmetic or a binary operation), or it could be a high-level operation like the execution of a subroutine. Let us now introduce the term granularity - a measure of the quantity of processing performed by an individual process. In general we distinguish between fine-grain parallelism (army of ants approach) and coarse-grain parallelism (elephants approach). The partitioning of a program specifies the nonparallelized units of the algorithm. Let us refer to these units as processes. There are some important process properties:
1) the process sequential execution time, which is a measure of the process size
2) the process inputs and outputs, which produce communication overhead
3) the synchronization requirements given by the precedence constraints (a wrong process schedule can lead to processor busy waiting)
Figure 2.3: Partitioning a computational graph: (a) fine grain; (b) coarse grain

Execution time is influenced by the process granularity determining the three factors mentioned above. The ideal execution time (computation without communication overhead) increases with the process size due to the loss of parallelism, while the communication overhead decreases with the process size. So, working with small granularity increases the parallelism but also increases the amount of communication, in addition to increasing software complexity. Partitioning should be designed in such a way that it provides a process size for which the effective execution time is minimized. This minimization is not simple because two different partitionings lead to two different schedules with different precedence constraints. In addition, a continuous variation of process size is an oversimplification of the partitioning process. Real programs are discrete structures, and it may not be possible to partition a program into processes of equal size. Thus automatically finding the optimum partition for a real program is rather difficult.

A partitioning technique proposed by Sarkar [83] is to start with an initial fine-granularity partition and then iteratively merge some processes selected by heuristics until the coarsest partition is reached, as illustrated in Figure 2.3. For each iteration, compute a cost function and then select the partitioning that minimizes the cost function, which is a combination of two terms: the critical path and the communication overhead. Another general conclusion is that fine-grain parallelism is found in tightly coupled systems (fast communication), and as hardware becomes increasingly
loosely coupled (slow communication, big startup), the granularity of data and program increases.
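The following Python sketch conveys the flavour of such an iterative merging heuristic; it is not Sarkar's actual algorithm, and the task and communication weights are invented. The cost function crudely takes the heaviest block as a stand-in for the critical path term and adds the communication volume crossing block boundaries.

comp = {"a": 2, "b": 3, "c": 1, "d": 4}                              # process sizes
comm = {("a", "b"): 5, ("a", "c"): 1, ("b", "d"): 2, ("c", "d"): 4}  # edge volumes

def cost(blocks):
    """Crude cost: heaviest block (critical path stand-in) plus cut edges."""
    heaviest = max(sum(comp[v] for v in blk) for blk in blocks)
    cut = sum(c for (u, v), c in comm.items()
              if not any(u in blk and v in blk for blk in blocks))
    return heaviest + cut

blocks = [{v} for v in comp]          # start from the finest partition
while True:
    best_pair, best_cost = None, cost(blocks)
    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            merged = blocks[:i] + blocks[i + 1:j] + blocks[j + 1:] + [blocks[i] | blocks[j]]
            if cost(merged) < best_cost:
                best_pair, best_cost = (i, j), cost(merged)
    if best_pair is None:             # no merge improves the cost any more
        break
    i, j = best_pair
    blocks = blocks[:i] + blocks[i + 1:j] + blocks[j + 1:] + [blocks[i] | blocks[j]]

print(blocks, cost(blocks))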
2.2.3 Scheduling
There are many scheduling methods for quite different purposes, ranging from operating systems and parallel programming to manufacturing. Scheduling is defined as a function that assigns processes to processors. The goals of the scheduling, or task allocation, function are to spread the load over all processors as evenly as possible (so-called load balancing) in order to obtain processor efficiency (decreasing busy waiting time) and to minimize data communication, which will lead to a shorter overall processing time.

Allocation policies can be classified as static or dynamic. Under the static allocation policy, tasks are assigned to processors before run time either by the programmer or by the compiler. In some parallel languages, the programmer can specify the processor on which a task is to be performed, the communication channel used, and so on. There is no run-time overhead, and the allocation overhead is incurred only once even when the programs are run many times with different data. Under the dynamic allocation policy, tasks are assigned to processors at run time. This scheme offers better utilization of processors, but at the price of additional allocation time.

In addition, scheduling policies can be divided into preemptive and nonpreemptive. In a preemptive environment, tasks may be halted before completion by another task that requires service. This method requires a task to be interruptible, which is not always possible. In general, preemptive techniques can generate more efficient schedules than nonpreemptive ones. However, a penalty is also paid in the preemptive case. This penalty lies in the overhead of task switching, which includes discontinuous processing and the additional memory required to save the processor state. For example, one scheduling method coming from real-time control applications is the use of deadlines, or scheduled completion times established for individual processes. If there is some time associated with the completion of individual tasks, and this time is bounded, it is called a hard deadline or a hard real-time schedule.

When programming MIMD machines, the scheduling is usually based on the DAG model of the problem and the execution time of each process. Before presenting a short survey on static scheduling of nonpreemptive tasks on identical processors in Section 2.2.4, we first give a simple introductory example.
Example 2.3: consider the program having the dependence graph shown in Figure 2.4 and the adjacency matrix A (the adjacency matrix A of a directed graph G(V, E) is the asymmetric |V| × |V| matrix having element A(i, j) = 1 when there exists a directed edge from vertex i to vertex j).

A =
      1  2  3  4  5  6  7  8
  1   0  1  1  0  0  0  0  0
  2   0  0  0  1  1  0  0  0
  3   0  0  0  0  0  1  0  1
  4   0  0  0  0  0  0  1  0
  5   0  0  0  0  0  1  0  0
  6   0  0  0  0  0  0  1  0
  7   0  0  0  0  0  0  0  1
  8   0  0  0  0  0  0  0  0

Figure 2.4: Directed acyclic graph

For simplicity it is assumed that all processes, represented by the vertices of the graph, have the same execution time.
We wish to find the feasible execution time interval for each process. First we determine the earliest execution time for each process by the following procedure, repeated until the adjacency matrix disappears:
1) identify the processes whose columns contain only zeros
2) put these processes into a new set
3) eliminate the rows and columns corresponding to these processes
The following sets of nodes are obtained for the earliest execution times: {1}, {2,3}, {4,5}, {6}, {7}, {8}. Then we find the sets for the latest execution times using the same procedure, but with the transposed matrix A: {8}, {7}, {6,4}, {5,3}, {2}, {1}. The time interval in which each task may be scheduled without delaying the overall execution starts at the earliest execution time and ends at the latest execution time, as shown in Table 2.1. Concrete schedules would then be found with respect to the feasible execution times and other constraints, like network topology, communication requirements etc.

Process | Earliest time | Latest time | Permissible time
   1    |       1       |      1      |       1
   2    |       2       |      2      |       2
   3    |       2       |      3      |      2,3
   4    |       3       |      4      |      3,4
   5    |       3       |      3      |       3
   6    |       4       |      4      |       4
   7    |       5       |      5      |       5
   8    |       6       |      6      |       6

Table 2.1: Earliest, latest, and feasible task execution time
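The level-by-level procedure of Example 2.3 can be transcribed directly; the sketch below (Python with NumPy, assuming unit execution times as in the example) removes the vertices with all-zero columns round by round, and obtains the latest times by applying the same routine to the transposed matrix and reading the levels from the end.

import numpy as np

# Adjacency matrix of the DAG from Figure 2.4 (row i, column j: edge i+1 -> j+1).
A = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
])

def levels(adj):
    """Repeatedly remove the vertices whose columns contain only zeros."""
    remaining = list(range(len(adj)))
    adj = adj.copy()
    result = []
    while remaining:
        free = [v for k, v in enumerate(remaining) if adj[:, k].sum() == 0]
        result.append({v + 1 for v in free})          # 1-based process numbers
        keep = [k for k, v in enumerate(remaining) if v not in free]
        adj = adj[np.ix_(keep, keep)]
        remaining = [remaining[k] for k in keep]
    return result

earliest = levels(A)                   # k-th set = processes with earliest time k
latest = list(reversed(levels(A.T)))   # k-th set = processes with latest time k
print(earliest)                        # [{1}, {2, 3}, {4, 5}, {6}, {7}, {8}]
print(latest)                          # [{1}, {2}, {3, 5}, {4, 6}, {7}, {8}]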
2.2.4 Survey on static scheduling of nonpreemptive tasks on identical processors
Computational complexity theory

The theory of NP-completeness proves that a wide class of decision problems are equivalently hard, in the sense that if there is a polynomial-time algorithm for one NP-complete problem, then there is a polynomial-time algorithm for any NP-complete problem. When talking about an optimisation problem we will use the term 'NP-hard', and when talking about a decision problem (a problem solved by a simple YES or NO answer) we will use the term 'NP-complete'.

For some problems, for any arbitrarily small error tolerance, there exists a polynomial-time approximation algorithm that delivers a solution within that tolerance. We shall say that algorithm A is a ρ-approximation algorithm if, for each instance I, A(I) gives a solution within a factor of ρ of the optimal value, where ρ > 1. The main technique for proving that certain approximation algorithms are unlikely to exist is based on proving an NP-completeness result for the related decision problem. The following theorem shows that there is a minimum bound on ρ for a certain class of optimisation problems.

Theorem 2.1 Consider a combinatorial optimization problem for which all feasible solutions have non-negative integer objective function value. Let c be a fixed positive integer. Suppose that the problem of deciding if there exists a feasible solution of value at most c is NP-complete. Then for any ρ < (c + 1)/c, there does not exist a polynomial-time ρ-approximation algorithm A unless P=NP.
Proof: see page 4 in [16].

As stated by Gerasoulis and Yang [35], the objective of scheduling is to allocate tasks onto the processors and then order their execution so that every task dependence is satisfied and the length of the schedule, known as the parallel time, is minimized. The following text introduces the main results obtained so far in the field of scheduling nonpreemptive tasks on identical processors.

Independent tasks, no communication, limited number of processors, no duplication

Consider the problem of scheduling n independent tasks T1, ..., Tn on p identical processors (machines) M1, ..., Mp. Each task Tj where j = 1, ..., n is to be processed by exactly one processor, and requires processing time tj. We wish to find a schedule of minimum length. Even if we restrict attention to the special case of this problem in which there are two processors (p = 2), the problem of computing a minimum length schedule is NP-hard. Graham [36] analyzed a simple approximation algorithm, called "list scheduling", that finds a good schedule for this multiprocessor scheduling problem. He showed that if we list the tasks in any order, and whenever a processor becomes idle the next task from the list is assigned to it, then the length of the schedule produced is at most twice the optimum: in other words, it is a 2-approximation algorithm. Graham later refined this result, and showed that if the tasks are first sorted in order of nonincreasing processing times, then the length of the schedule produced is at most 4/3 times the optimum. Subsequently, a number of polynomial-time algorithms with improved performance guarantees were proposed.

Tasks with precedence constraints, no communication, limited number of processors, no duplication

Even if we introduce precedence relations among the tasks T1, ..., Tn (given e.g. by a DAG) it is possible to find a good approximation algorithm. Graham [36] showed that the following algorithm is a 2-approximation one: the tasks are listed in any order that is consistent with the precedence constraints, and whenever a processor becomes idle, the next task on the list with all of its predecessors completed is assigned to that processor; if no such task exists, then the processor is left idle until the next processor completes a task. Lenstra and Rinnooy Kan [59] showed that, even if each task Tj requires processing time tj = 1, deciding if there is a schedule of length 3 is NP-complete (they showed that the NP-complete clique problem [21] can be reduced to this scheduling problem). Thus, with respect to Theorem 2.1, for any ρ < 4/3 there does not exist a polynomial-time ρ-approximation algorithm, unless P=NP.
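A minimal sketch of the list-scheduling idea is given below; the instance is hypothetical and the code is a simplified variant (the earliest-idle processor takes the first listed task whose predecessors are already scheduled, waiting for them to finish if necessary), not Graham's analysis.

times = {"T1": 3, "T2": 2, "T3": 2, "T4": 4, "T5": 1}
preds = {"T1": [], "T2": [], "T3": ["T1"], "T4": ["T1", "T2"], "T5": ["T3"]}

def list_schedule(order, times, preds, p):
    """Simplified list scheduling on p identical processors."""
    idle = {m: 0 for m in range(p)}      # time at which each processor is free
    finish = {}                          # task -> completion time
    schedule = []
    remaining = list(order)
    while remaining:
        m = min(idle, key=idle.get)
        task = next(t for t in remaining if all(q in finish for q in preds[t]))
        start = max(idle[m], max((finish[q] for q in preds[task]), default=0))
        finish[task] = start + times[task]
        idle[m] = finish[task]
        schedule.append((task, m, start))
        remaining.remove(task)
    return schedule, max(finish.values())

schedule, makespan = list_schedule(list(times), times, preds, p=2)
print(schedule)
print("parallel time:", makespan)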
Tasks with precedence constraints, communication delays, limited number of processors, no duplication

Now consider the same scheduling problem with the following stronger precedence constraint: let Tj be a predecessor of Tk; if Tj and Tk are processed on different processors, then not only must Tk be processed after Tj completes, but it must be processed at least cjk time units afterwards. The special case in which each tj = 1 and each cjk = 1 was shown to be NP-complete by Hoogeveen et al. [47]; more precisely, they showed that deciding if there is a schedule of length at most 4 is NP-complete. Consequently, for any ρ < 5/4, no polynomial-time ρ-approximation algorithm exists for this problem unless P=NP.

Sarkar [83] has proposed a good approximation algorithm for scheduling with communication consisting of two steps:
1) Schedule the tasks on an unbounded number of processors of a completely connected architecture. The result of this step will be clusters of tasks, with the limitation that all tasks in a cluster must be executed on the same processor.
2) If the number of clusters is larger than the number of processors, then merge the clusters down to the number of processors, and also incorporate the network topology in the merging step.

The processor assignment part is also known as clustering. Using the same basis, a more efficient algorithm, called DSC (dominant sequence clustering), has been developed and analyzed by Gerasoulis and Yang [34] [35]. They introduce two types of clustering, nonlinear and linear. A clustering is nonlinear if two parallel tasks are mapped in the same cluster, otherwise it is linear. Linear clustering fully exploits the natural parallelism of a given DAG, while nonlinear clustering sequentialises independent tasks to reduce parallelism.

Tasks with precedence constraints, communication delays, unlimited number of processors, no duplication

Hoogeveen et al. [47] also considered the model including precedence constraints with communication delays as above, but with no limit on the number of identical processors that may be used. For the special case in which each tj = 1 and each cjk = 1 they gave a polynomial-time algorithm to decide if there is a schedule of length 5, and yet deciding if there is a schedule of length 6 is NP-complete. Hence, for any ρ < 7/6, no polynomial-time ρ-approximation algorithm exists for this problem unless P=NP. Despite these complexity results it is possible to find polynomial algorithms when a restriction on the precedence constraints is imposed (e.g. fork-join structure) or a limitation of the communication time is assumed. The SCT (small communication time) assumption means that the largest communication time is less than or equal to the smallest processing time. Such a situation occurs in application programs involving tasks with a large granularity. The SCT assumption is sometimes called the coarse-grain assumption (see the granularity theory - chapter 6 in [16] or paragraph 2.2.2 in this thesis).
Chrétienne [15] has developed an O(n) algorithm solving the special case when the SCT assumption is satisfied and the precedence constraints are given as an in-tree (a tree consisting of directed paths going from the leaves to the root) or an out-tree (a tree consisting of directed paths going from the root to the leaves). For additional terminology see paragraph 6.2.1. Valdes et al. [90] give a polynomial algorithm for fork-join graphs when the SCT assumption is satisfied. For the general case (when no assumption on the structure of the precedence constraints and no SCT assumption are satisfied) a good approximation algorithm using task clustering was proposed by Sarkar [83] and Gerasoulis and Yang [34] [35].

Tasks with precedence constraints, communication delays, unlimited number of processors, duplication

When duplication is not allowed, each task must be processed only once, so a schedule is entirely defined by assigning to each task Tj a starting time and a processor. When an unlimited number of processors is assumed and duplication is allowed, it can be faster to process the same task on several processors and eliminate communication delays.
Figure 2.5: An instance of a scheduling problem

A polynomial algorithm developed by Colin and Chrétienne [18] provides an optimal schedule when the SCT assumption is satisfied. An instance of this problem is shown in Figure 2.5 and will serve to illustrate how the algorithm works. The set of the immediate predecessors (respectively successors) of Tj is denoted by IN(j) (respectively OUT(j)). First a topological order of the tasks is used to compute b_i, the release time of each task T_i:
- if T_i has no predecessor (IN(i) = ∅) then b_i = 0
- if T_i has one predecessor T_s (IN(i) = {s}) then b_i = b_s + t_s
- if T_i has more predecessors then b_i = max{ b_s + t_s , max_{T_k ∈ IN(i)−{s}} { b_k + t_k + c_ki } }, where s is the index of a special predecessor task T_s, the predecessor which satisfies b_s + t_s + c_si = max_{T_k ∈ IN(i)} { b_k + t_k + c_ki }.

Figure 2.6: An instance of a scheduling problem with the critical subgraph

The release time b_i is found in topological order for all tasks (in Figure 2.6 indicated by the numbers above the vertices). For example, when calculating the release time b_7 of task T_7, the release times of its predecessors are already known (b_3 = 4 and b_5 = 3) because of the topological order. Then T_3 becomes the special predecessor (s = 3) because b_3 + t_3 + c_37 > b_5 + t_5 + c_57. Finally we calculate b_7 = max{ b_3 + t_3 , b_5 + t_5 + c_57 }.

An arc (T_i, T_j) of the precedence graph is said to be critical if b_i + t_i + c_ij > b_j. By removing all noncritical arcs from the precedence graph we get the so-called critical subgraph (in Figure 2.6 indicated by thick lines), in which each task has either:
- no input arc (a case where a special predecessor can be chosen among several possible ones), or
- one input arc.

So the critical subgraph is a spanning outforest. The optimal schedule is finally built by assigning one processor to each critical path (i.e. a path from a root to a leaf in the critical subgraph) and by processing all the corresponding copies at their release times. The optimal schedule is shown in Figure 2.7.

Figure 2.7: Earliest schedule for the instance from Figure 2.5

The algorithm provides an earliest schedule but, as shown by Figure 2.7, it does not necessarily minimize the number of processors. The example shows that the tasks T6 and T9 could be assigned to processor M2. However, it has been shown by Picouleau [78] that minimizing the number of processors is an NP-hard problem when the minimum makespan must be guaranteed. The above algorithm also works if, for any Tj, the largest communication time of an ingoing arc of Tj is at most the smallest processing time in IN(j) - a weaker assumption than SCT.

Let us consider the case when the communication times may be larger than the processing times. Papadimitriou and Yannakakis [72] have shown that the special case tj = 1, cij > 1 is NP-hard and have proposed a sophisticated polynomial 2-approximation algorithm.
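As a sketch of the two formulas above, the release times and critical arcs can be computed in a few lines; the instance used here is a small hypothetical one chosen to satisfy the SCT assumption, not the instance of Figure 2.5.

# Hypothetical instance (every communication time <= every processing time).
t = {"T1": 2, "T2": 3, "T3": 2, "T4": 2}                 # processing times
c = {("T1", "T3"): 1, ("T2", "T3"): 2, ("T2", "T4"): 1}  # communication times
preds = {"T1": [], "T2": [], "T3": ["T1", "T2"], "T4": ["T2"]}
topo = ["T1", "T2", "T3", "T4"]                          # a topological order

b, special = {}, {}
for i in topo:
    if not preds[i]:
        b[i] = 0
        continue
    # special predecessor: the one maximizing b_s + t_s + c_si
    s = max(preds[i], key=lambda k: b[k] + t[k] + c[(k, i)])
    others = [b[k] + t[k] + c[(k, i)] for k in preds[i] if k != s]
    b[i] = max([b[s] + t[s]] + others)
    special[i] = s

critical = [(k, i) for i in topo for k in preds[i] if b[k] + t[k] + c[(k, i)] > b[i]]
print(b)         # release times: {'T1': 0, 'T2': 0, 'T3': 3, 'T4': 3}
print(critical)  # arcs of the critical subgraph: [('T2', 'T3'), ('T2', 'T4')]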
2.3 Time complexity measures
It is assumed now that a particular model of parallel computation has been chosen and a schedule of the algorithm has been done. Let us consider a computational problem parameterized by a variable n representing the problem size. Time complexity is generally dependent on n. A few concepts are described that are sometimes useful when comparing sequential and parallel algorithms. Suppose a parallel algorithm using p processors, terminating in time Tpar. Let Tseq be the optimal processing time of the sequential algorithm running on one processor. The ratio

    S(n, p) = Tseq(n) / Tpar(n, p)                                (2.1)

is called the speedup S of the algorithm, and describes the speed advantage of the parallel algorithm compared to the best possible sequential algorithm. The ratio

    E(n, p) = S(n, p) / p = Tseq(n) / (p · Tpar(n, p))            (2.2)

is called the efficiency E of the algorithm, and measures the fraction of time during which a typical processor is usefully running. Ideally, S(n, p) = p and E(n, p) = 1, in which case the availability of p processors allows us to speed up the computation by a factor of p. For this to occur, the parallel algorithm should be such that no processor ever remains idle or communicates. This ideal situation is practically unattainable. A more realistic objective is to aim at an efficiency that stays bounded away from zero as n and p increase.
It is obvious that S(n, p) ≤ p. The proof can be done easily by contradiction: if S(n, p) > p then it is profitable to run the parallel algorithm on a network of p virtual processors mapped to one node processor, where the p virtual processors share the time of the node processor. Such a static schedule of the parallel algorithm could be used as a new sequential algorithm with a processing time shorter than Tseq. This is a contradiction because Tseq is the processing time of the best possible sequential algorithm. Q.E.D. In practice the situation is even worse (communication overhead, scheduler overhead, etc.).

Another fundamental issue is whether the maximum attainable speedup, given by Tseq(n)/Tpar(n, ∞), can be made arbitrarily large as n is increased. In certain applications, the required computations are quite unstructured, and there has been considerable debate on the range of achievable speedups in real world situations. The main difficulty is that some programs have sections that are easily parallelizable, but also have sections that are inherently sequential. When a large number of processors is available, the parallelizable sections are quickly executed, but the sequential sections lead to bottlenecks, because just one processor is working and the other processors perform so-called busy waiting. This observation is known as Amdahl's law and can be quantified as follows: if a program consists of two sections, one that is inherently sequential and one that is fully parallelizable, and if the inherently sequential section consumes a fraction f of the total computation, then the speedup is limited by

    S(n, p) ≤ lim_{p→∞} 1 / (f + (1 − f)/p) = 1/f                 (2.3)
On the other hand, there are numerous computational problems for which f decreases to zero as the problem size n increases, and Amdahl’s law is no more a concern.
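For instance, with a sequential fraction f = 0.1 the speedup can never exceed 1/f = 10, however many processors are used. A few lines of Python (with made-up numbers) make the bound concrete.

def amdahl_speedup(f, p):
    """Speedup of a program whose inherently sequential part is a fraction f
    of the total work, run on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

f = 0.1
for p in (2, 8, 64, 1024):
    print(p, round(amdahl_speedup(f, p), 2))
# tends to 1/f = 10 as p grows: 1.82, 4.71, 8.77, 9.91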
2.4 Communication
In many parallel and distributed algorithms and systems the time spent on interprocessor communication is a sizable fraction of the total time needed to solve a problem. In this case the term communication/computation ratio is often applied in the literature, and it can be given as

    (Tpar − Tinst) / Tinst                                        (2.4)

where Tpar is the time required by the parallel algorithm to solve the given problem, and Tinst is the corresponding time that can be attributed just to computation and busy waiting, that is, the time that would be required if all communication were instantaneous. This section is devoted to a discussion of a number of factors affecting the communication penalty.
2.4.1 Communication model
To solve different communication problems it is necessary to specify the basic terminology used to model real communication. When the source and destination processors are not directly connected, a packet must travel over a route involving several processors. There are several possibilities for routing the packet:
1) a store-and-forward (sometimes called message-switched) packet switching data communication model, in which a packet may have to wait at any node for some communication resource to become available before it gets transmitted to the next node. In some systems, it is possible that packets are divided and recombined at intermediate nodes on their routes, but this possibility will not be considered in this thesis.
2) circuit switching, where the communication resources needed for packet transfers are reserved via some mechanism before the packet's transfer begins. As a result the packet does not have to wait at any point along its route. Circuit switching is almost universally employed in telephone networks, but is seldom used in data networks or in parallel and distributed computing systems. It will not be considered further in this thesis.

In the case of two processors p1 and p2 directly connected by one bidirectional communication link, packets can be transmitted in both directions, but the two following cases can appear:
1) the packet can be communicated between the two processors in just one direction at a given time, either from p1 to p2, or from p2 to p1. This kind of communication, called half-duplex, occurs for example in radio communication on one frequency.
2) two packets can be communicated in opposite directions simultaneously, from p1 to p2 and from p2 to p1. This mode, called full-duplex, is used for example in a normal telephone conversation (with the exception of when you are talking to your wife).

In addition it is necessary to specify the memory/communication link interface for each processor:
1) when each processor can use just one link at a time, the communication is called 1-port (sometimes processor-bound or whispering)
2) on the other hand, when the processor can use all links simultaneously, the communication bound is ∆-port (sometimes link-bound or shouting)
3) between these two frequent cases there is the communication bound where k links (where k is less than all the processor links) can be used simultaneously. This bound is called k-port and will not be considered in this thesis.

The following notation will be used: F1 and H1 indicate full duplex and half duplex 1-port models; F∗ and H∗ indicate full duplex and half duplex ∆-port models.

And now it is necessary to specify the communication between two connected
processors. This communication is influenced by the length L of the message. The most general formulation of the communication time between two neighbours is therefore the sum of:
1) the start-up β, corresponding to register/memory initialisation and the sending/receiving procedures;
2) the propagation time Lτ, proportional to the message length L and to the propagation time τ of a unit-length message (the link bandwidth is 1/τ).
Such a model is called linear time and is given by the following equation:

    T_neighbour-to-neighbour = β + Lτ                                        (2.5)

In practical applications it is very often the case that β ≫ Lτ. In such cases it is sufficient for the theoretical analysis to use a model called constant time, in which every communication between two neighbours costs one time unit. In this model the original messages can be recombined or split into separate parts without any influence on the unit communication time of the new messages:

    T_neighbour-to-neighbour = 1                                             (2.6)
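The two cost models can be written down directly (a Python sketch added for this text; the values of beta, tau and L are illustrative assumptions, not parameters of any machine discussed in the thesis):

    def linear_time(L, beta, tau):
        """Equation (2.5): start-up plus length-proportional propagation."""
        return beta + L * tau

    def constant_time(L):
        """Equation (2.6): every neighbour-to-neighbour communication costs one unit."""
        return 1

    # with beta = 50 us and tau = 0.01 us per data unit, a 100-unit message costs 51 us,
    # i.e. the start-up dominates and the constant time model is a fair approximation
    print(linear_time(100, beta=50.0, tau=0.01))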
2.4.2 Network topologies
In systems whose principal function is numerical computation, the network typically exhibits some regularity and is sometimes chosen with a particular application in mind. Some example network topologies are presented in this section, with a focus on their communication properties. Topologies are usually evaluated in terms of their suitability for some standard communication tasks (see Section 2.4.3 on page 25). The following are two typical criteria:
(a) The diameter r of the network is the maximum distance between any pair of nodes, the distance of a pair of nodes being the minimum number of links that have to be traversed to go from one node to the other. For a network of diameter r, the time for a packet to travel between two nodes is O(r), assuming no queueing delays at the links.
(b) The connectivity of the network provides a measure of the number of independent paths connecting a pair of nodes. We can talk here about the node or the arc connectivity of the network, which is the minimum number of nodes (or arcs, respectively) that must be deleted before the network becomes disconnected. In some networks a high connectivity is desirable for reliability purposes, so that communication can be maintained even if several link and node failures occur.
Some specific topologies are considered:
1) Linear processor array. Here there are p processors numbered 1, 2, ..., p, and there is a link (i, i+1) for every pair of successive processors (see Figure 2.8). The diameter and connectivity
properties of this topology are the worst possible.

Figure 2.8: Some specific topologies (linear array, ring, mesh, and mesh with wraparound (torus))

2) Ring. This topology has the property that there is a path between any pair of processors even after any one communication link has failed. The number of links separating a pair of processors can be as large as ⌈(p − 1)/2⌉.
3) Tree. A tree network with p processors provides communication among all processors with a minimal number of links (p − 1). One disadvantage of a tree is its low connectivity; the failure of any one of its links creates two subsets of processors that cannot communicate with each other. Furthermore, depending on the particular type of tree used, its diameter can be as large as p − 1 (note that the linear array is a special case of a tree). The star network has the minimal diameter among tree topologies; however, the central node of the star handles all the network traffic and can become a bottleneck.
4) Mesh. In a d-dimensional mesh the processors are arranged along the points of a d-dimensional space that have integer coordinates, and there is a direct communication link between nearest neighbours. This is in fact a d-dimensional version of the linear array. A mesh with wraparound (torus) has, in addition to the links of the ordinary mesh, links between the first and the last processor in each dimension; it is in fact a d-dimensional version of the ring.
5) Hypercube. A hypercube consists of 2^d processors, consecutively numbered with binary integers using a string of d bits.
Figure 2.9: Hypercube interconnection networks for d = 0 to 4
Each processor is connected to every other processor whose binary pid (processor identity number) differs from its own in exactly one bit. This connection scheme places the processors at the vertices of a d-dimensional cube. Formally, the d-dimensional cube is the d-dimensional mesh that has two processors in each dimension. Hypercube interconnection networks for d varying from 0 to 4 are shown in Figure 2.9. The hypercube can be defined recursively: a hypercube of order 0 is a single node, and a hypercube of order d + 1 is constructed by taking two hypercubes of order d and connecting their respective nodes. This interconnection has several properties of great importance for parallel processing, such as these:
a. As the number of processors increases, the number of connection wires and related hardware (such as ports) increases only logarithmically, so that systems with a very large number of processors become feasible. The diameter of a d-cube is d, that is, log p, where p = 2^d is the number of processors.
b. A hypercube is a superset of other interconnection networks such as arrays, rings, trees, and others, because these can be embedded into a hypercube by ignoring some hypercube connections.
c. Hypercubes are scalable, a property that results directly from the fact that hypercube interconnections can be defined recursively.
d. Hypercubes have simple routing schemes. A message-routing policy may be to send the message to the neighbour whose binary pid agrees with the pid of the final destination in more bits. The path length for sending a message between any two nodes is exactly the number of bits in which their pids differ; the maximum is of course d, and the average is d/2.
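As an illustration of the routing policy in property d (a sketch added for this text, not code from the thesis), the path can be obtained by fixing the differing pid bits one dimension at a time; its length is the number of differing bits:

    def hypercube_route(src: int, dst: int, d: int):
        """List of processor pids visited when routing from src to dst in a d-cube."""
        path, node = [src], src
        for bit in range(d):              # fix the differing bits one by one
            mask = 1 << bit
            if (node ^ dst) & mask:       # this bit still disagrees with the destination
                node ^= mask              # move to the neighbour across dimension 'bit'
                path.append(node)
        return path

    # routing from 0000 to 1011 in a 4-cube takes 3 hops (3 differing bits)
    print(hypercube_route(0b0000, 0b1011, 4))    # [0, 1, 3, 11]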
Figure 2.10: Hierarchy and duality of the basic communication problems: total exchange (pATA), gossiping (ATA), multinode accumulation (ATA), scattering (pOTA), broadcasting (OTA), gathering (pATO), single node accumulation (ATO), point to point (OTO)
2.4.3 Global communication primitives
Communication delays required by some standard tasks are typical of many algorithms performing regular numeric computations (see Chapter 3). The tasks described in the rest of this section will be called global communications or global communication primitives in the rest of this thesis.
1) Broadcasting (single node broadcast, One to All, diffusion). The same packet is sent from a given processor to every other processor. To solve the single node broadcast problem it is sufficient to transmit the packet along a spanning tree rooted at the given node, that is, a spanning tree of the network together with a direction on each link of the tree such that there is a unique positive path from the given node (called the root) to every other node. With an optimal choice of such a spanning tree, a single node broadcast takes O(r) time, where r is the diameter of the network, as shown in Figure 2.12(a). Radio broadcasting is an example of this global communication primitive.
2) Gossiping (multinode broadcast, All to All). Gossiping is a generalized version of the single node broadcast, where a single node broadcast is performed simultaneously from all nodes. To solve the multinode broadcast problem, one spanning tree per node has to be specified. The difficulty here is that some links may belong to several spanning trees; this complicates the timing analysis, because several packets can arrive simultaneously at a node and require transmission on the same link, which results in a queueing delay.
Let us imagine an example from everyday life: there are x women, and each of them knows a part of a story which is not known to the others. They communicate by phone and exchange all details of the story. How many calls are necessary so that all of them know the whole story?
3) Single node accumulation (All to One). A packet is sent to a given node from every other node. It is assumed that packets can be "combined" for transmission on any communication link, with a "combined" transmission time equal to the transmission time of a single packet. This problem arises, for example, when it is desired to form at a given node a sum consisting of one term from each node, as in an inner product calculation (see Figure 2.12(b)); the addition of scalars at a node can be viewed as "combining" the corresponding packets into a single packet of the same length.
4) Multinode accumulation (All to All). This involves a separate single node accumulation at each node. For example, a certain method for carrying out parallel matrix-vector multiplication involves a multinode accumulation (see the example given in [7]).
5) Scattering (personalized One to All, distribution). This problem involves sending a separate packet from a single node to every other node. It usually appears in the initialisation phase of a parallel algorithm, when data are distributed over the processor network.
6) Total exchange (personalized All to All, complete exchange, multiscattering). A packet is sent from every node to every other node (here a node sends different packets to different nodes, in contrast with the multinode broadcast problem, where each node sends the same packet to every other node). This problem arises frequently in connection with matrix computations.
7) Gathering (personalized All to One). Gathering involves collecting a packet at a given node from every other node. This global communication occurs, for example, when data are collected from distributed sensors.

Hierarchy
Note that the total exchange problem may be viewed as a multinode version of both a single node scatter and a single node gather problem, and also as a generalization of gossiping, where the packets sent by each node to different nodes are different. The communication problems form a hierarchy in terms of difficulty, as illustrated in Figure 2.10. A directed arc from problem A to problem B indicates that an algorithm that solves A can also solve B simply by omitting certain algorithm parts. Figure 2.10 is an improved version of the similar one given in [7], where the author omitted the directed arcs from gossiping to broadcasting and from multinode accumulation to single node accumulation.
The point-to-point (OTO) communication is added to the figure to complete the logical picture, although it is not in fact a global communication. In particular, a total exchange algorithm can also solve the multinode broadcast (accumulation) problem; a multinode broadcast (accumulation) algorithm can also solve the single node gather (scatter) problem; and a single node scatter (gather) algorithm can also solve the single node broadcast (accumulation) problem. Figure 2.11 illustrates the hierarchy relation of gossiping and gathering on a given instance of four nodes arranged in an oriented ring.

Figure 2.11: Hierarchy example (gossiping versus gathering to node 1 on an oriented ring of four nodes; a label t|mj next to a link denotes the transmission of message mj in time step t)

Duality
It can be shown that a single node accumulation problem can be solved in the same time as a broadcast problem. Moreover, any single node accumulation algorithm can be viewed as a broadcast algorithm running in reverse time, and the converse is also true. As shown in Figure 2.12, the broadcast uses a tree rooted at a given node (node 1 in the figure); the time next to each link is the time at which the transmission of the packet on the link begins. The single node accumulation problem involving the summation of n scalars a1, ..., an (one per processor) at the given node (node 1 in the figure) takes 3 time units. So the single node accumulation and the broadcasting take the same amount of time if a single packet in the latter problem corresponds to the combined packets of the former problem. This relation, called duality (indicated in Figure 2.10 by horizontal bidirectional arcs), is observed between some communication problems in the sense that the spanning tree(s) used to solve one problem can also be used to solve the dual problem in the same amount of communication time. It is important to notice that the above-mentioned abstractions (the definition of the communication primitives, the hierarchy and the duality) are made regardless of the network being used.
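Both the spanning tree broadcast of primitive 1 and the duality just described can be made concrete with a small added sketch (assuming a hypercube of p = 2^d nodes and the constant time model; this is an illustration, not code from the thesis). The schedule below broadcasts from node 0 along a binomial spanning tree in d steps, which is O(r) for this topology; read backwards, the same schedule is a single node accumulation at node 0:

    def hypercube_broadcast_schedule(d: int):
        """For each time step, the list of (sender, receiver) transmissions."""
        have = {0}                            # nodes that already hold the packet
        schedule = []
        for k in range(d):                    # in step k send across dimension k
            step = [(node, node ^ (1 << k)) for node in sorted(have)]
            schedule.append(step)
            have |= {r for _, r in step}
        return schedule

    for t, step in enumerate(hypercube_broadcast_schedule(3), start=1):
        print("step", t, step)
    # step 1 [(0, 1)]
    # step 2 [(0, 2), (1, 3)]
    # step 3 [(0, 4), (1, 5), (2, 6), (3, 7)]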
Figure 2.12: Duality example: (a) broadcasting from node 1, (b) single node accumulation at node 1

Example 2.4: Let us now calculate the communication time for the scattering (gathering) and the gossiping (multinode accumulation) implemented on different topologies (1-port and unidirectional).
I. Ring (see Figure 2.8 on page 23)
I.i. The scattering algorithm could be designed in OCCAM as follows (code for the i-th processor):
    PROC scattering (VAL INT i, CHAN OF [L]REAL64 in, out)
      SEQ
        IF
          i = 1
            [p*L]REAL64 data:
            SEQ k = 0 FOR (p-1)
              out ! [data FROM (((p-k)-2)*L)+1 FOR L]
          TRUE
            [L]REAL64 data.for.me, data.for.others:
            SEQ
              SEQ k = 0 FOR (p-i)
                SEQ
                  in ? data.for.others
                  out ! data.for.others
              in ? data.for.me
Assuming L to be the message length, the solution time for scattering is given by:

    s_H1^ring = (2(p − 2) + 1) · (β + Lτ) = (2p − 3) · (β + Lτ)              (2.7)
I.ii. Supposing p to be an even value, a gossiping algorithm could be designed in OCCAM as follows (code for i-th processor).
    PROC gossiping (VAL INT i, CHAN OF [L]REAL64 in, out)
      [p][L]REAL64 data:
      SEQ
        IF
          (i REM 2) = 0
            SEQ k = 0 FOR (p-1)
              SEQ
                out ! data[((p-k)+i) REM p]
                in ? data[((p-k)+(i-1)) REM p]
          (i REM 2) = 1
            SEQ k = 0 FOR (p-1)
              SEQ
                in ? data[((p-k)+(i-1)) REM p]
                out ! data[((p-k)+i) REM p]
Then the solution time for gossiping is given by:

    g_H1^ring = 2(p − 1) · (β + Lτ) = (2p − 2) · (β + Lτ)                    (2.8)
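For a quick numerical feeling for equations (2.7) and (2.8), the following added sketch evaluates both ring times under the linear time model (the values of beta, tau and L are illustrative assumptions):

    def ring_scatter_time(p, beta, tau, L):
        return (2 * p - 3) * (beta + L * tau)      # equation (2.7)

    def ring_gossip_time(p, beta, tau, L):
        return (2 * p - 2) * (beta + L * tau)      # equation (2.8)

    beta, tau, L = 5e-6, 1e-8, 1000                # start-up [s], time per unit [s], message length
    for p in (4, 16, 64):
        print(p, ring_scatter_time(p, beta, tau, L), ring_gossip_time(p, beta, tau, L))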
II. Torus (see Figure 2.8 on page 23)
II.i. The algorithm scattering messages from node processor 1 can be designed, for example, in the following two phases:
1) scattering in the upper horizontal ring;
2) scattering in all vertical rings.

    s_H1^torus = (2√p − 3) · (β + √p·Lτ) + (2√p − 3) · (β + Lτ)              (2.9)
II.ii. The algorithm performing the gossiping in the torus topology can work, for example, in these two phases (√p is assumed to be an even value):
1) gossiping in all horizontal rings;
2) gossiping in all vertical rings.

    g_H1^torus = (2√p − 2) · (β + Lτ) + (2√p − 2) · (β + √p·Lτ)              (2.10)

2.5 Conclusions
This chapter is a basic one introducing the terminology and some elementary methods from the field of parallel processing. Being a summary of contemporary
literature, this chapter contains just a few original ideas. Some of the chapter's most distinctive features are:
• It covers the majority of the important topics in parallel processing.
• It systematically classifies the large and redundant terminology of parallel processing needed for comprehensive reading of the rest of this thesis.
• It introduces the concept of dependencies and simple techniques for parallelism analysis in sequential algorithms.
• It summarizes the results and the references of static scheduling of nonpreemptive tasks on identical processors.
• It presents global communication primitives, which are important in many algorithms performing regular numeric computations.
There are several books covering various aspects of parallel processing. Among the basic textbooks in parallel processing are [7] by Bertsekas & Tsitsiklis and [65] by Moldovan. Many aspects of scheduling theory are presented in [16] by Chrétienne et al. and [27] by El-Rewini et al.
Chapter 3
An example of parallel algorithm - gradient training of feedforward neural networks

This chapter presents a usual approach to parallel processing, where the parallelisation is not done automatically. It is a slightly modified version of the journal article [45], to appear in Parallel Computing. A message-passing architecture is presented to simulate multilayer neural networks, adjusting the weights for each pair consisting of an input vector and a desired output vector. First, the multilayer neural network is defined, and the difficulties arising from its parallel implementation are clarified. Then the implementation of a neuron, split into the synapse and body, is proposed by arranging virtual processors in a cascaded torus topology. Mapping virtual processors onto node processors is done with the intention of minimizing external communication. Then internal communication is reduced and an implementation on a physical message-passing architecture is given. A time complexity analysis arises from the algorithm specification and some simplifying assumptions. Theoretical results are compared with experimental ones measured on a transputer-based machine. Finally, the algorithm based on the splitting operation is compared with a classical one. This chapter does not require a deep understanding of neural networks; excellent survey articles dedicated to this branch of artificial intelligence are [61] by Lippmann and [48] by Hush & Horne.
3.1 Algorithms for neural networks
The neural approach to computation deals with problems for which conventional computational approaches have proven ineffective. To a large extent, such problems arise when a computer interfaces with the real world. This is difficult
because the real world cannot be modelled with concise mathematical expressions. Some problems of this type are image processing, character and speech recognition, robot control and language processing. Programs simulating neural networks (NN) are notorious for being computationally intensive. Many researchers have therefore programmed simulators of different neural networks on different parallel machines (e.g. [6, 8]). Some implementations of algorithms such as self-organizing networks [69, 24] or heterogeneous neural networks [57] have been realized on transputer-based machines [26, 74, 79, 86]. For more references see the bibliography of neural networks on parallel machines [85]. A large number of neural network implementations on message-passing architectures have been reported in the last few years. These implementations usually deal with a conventional neural network adjusting its parameters (weights) after performing back propagation on a large number of input/output vectors. Such algorithms have a high degree of data parallelism, so they are intuitively easier to decompose, and many of them have already achieved linear speed-up. The aim of this chapter is to describe the implementation of a parallel neural network algorithm performing back propagation on a single sample pair, consisting of an input vector and a desired output (target) vector, at a given time. The weights are then adjusted for each sample input/output pair; this loop is called an epoch. In contrast with conventional neural networks, this processing introduces a specific noise that is convenient in certain applications. Such networks are sometimes called neural networks with stochastic gradient learning.
3.2 Neural network algorithm specification
The neural network under consideration is a multilayer neural network using error back propagation with stochastic learning. The sigmoid activation function is used. The neuron under consideration is shown in Figure 3.1.
Figure 3.1: Artificial neuron j in layer l

As shown in Figure 3.2, consecutive layers are fully interconnected. The following equations specify the function of the stochastic gradient learning algorithm
simulating a multilayer neural network with one input, two hidden and one output layer. N_l is the number of neurons in layer l, k denotes the algorithm iteration index, I_j^l(k) denotes the input to the cell body of neuron j in layer l, u_j^l(k) denotes the output of neuron j in layer l, δ_i^l(k) denotes the error back-propagated through the cell body of neuron i in layer l, w_ij^l(k) denotes the synapse weight between cell body i in layer l − 1 and cell body j in layer l, η^l denotes the learning rate, and α^l denotes the momentum term in layer l.

Activation - Forward Propagation

    I_j^l(k) = Σ_{i=1}^{N_{l−1}} [w_ij^l(k) · u_i^{l−1}(k)]      ∀l = 1...3, ∀j = 1...N_l        (3.1)

    I_j^0(k) ... the j-th neural network input

    u_j^l(k) = f(I_j^l(k)) = 2 / (1 + e^{−I_j^l(k)}) − 1         ∀l = 0...3, ∀j = 1...N_l        (3.2)

Error Back Propagation - Output layer

    δ_i^l(k) = f′(I_i^l(k)) · (u_i^desired(k) − u_i^l(k))        for l = 3                       (3.3)

Error Back Propagation - Hidden layers

    δ_i^l(k) = f′(I_i^l(k)) · Σ_{j=1}^{N_{l+1}} (δ_j^{l+1}(k) · w_ij^{l+1}(k))    for l = 2, 1   (3.4)

Learning - Gradient Method (k = 1, 2, 3, ...)

    Δw_ij^l(k) = η^l · δ_j^l(k) · u_i^{l−1}(k) + α^l · Δw_ij^l(k − 1)        ∀l = 1...3          (3.5)

    w_ij^l(k) = w_ij^l(k − 1) + Δw_ij^l(k)                       ∀l = 1...3                      (3.6)
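Equations (3.1) to (3.6) can be condensed into a short NumPy sketch of one iteration k (an illustration written for this text, not the transputer implementation discussed later; W and dW are dicts of weight and weight-increment matrices of shape (N_{l-1}, N_l) indexed by l, u0 is the activated input vector, and the values of eta and alpha stand for η^l and α^l):

    import numpy as np

    def f(x):        return 2.0 / (1.0 + np.exp(-x)) - 1.0     # equation (3.2)
    def f_prime(x):  return 0.5 * (1.0 - f(x) ** 2)            # derivative of the bipolar sigmoid

    def one_iteration(u0, u_desired, W, dW, eta=0.1, alpha=0.5):
        # activation - forward propagation, equations (3.1) and (3.2)
        u, I = [u0], [None]
        for l in (1, 2, 3):
            I.append(W[l].T @ u[l - 1])
            u.append(f(I[l]))
        # error back propagation, equations (3.3) and (3.4)
        delta = {3: f_prime(I[3]) * (u_desired - u[3])}
        for l in (2, 1):
            delta[l] = f_prime(I[l]) * (W[l + 1] @ delta[l + 1])
        # learning - gradient method with momentum, equations (3.5) and (3.6)
        for l in (1, 2, 3):
            dW[l] = eta * np.outer(u[l - 1], delta[l]) + alpha * dW[l]
            W[l] = W[l] + dW[l]
        return u[3], W, dW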
Figure 3.2: Example of a multilayer neural network (NN 2-4-4-2)

In this thesis a stochastic gradient learning algorithm is assumed. The term "stochastic" is used because the weights are updated in each cycle (activation ⇒ back propagation ⇒ learning ⇒ activation ⇒ ...). Such processing introduces a little noise into the learning procedure, which can be advantageous in certain neural network applications. Let us consider a neural network used for the simulation of a non-linear system with an unknown model, or a neural network used as a controller [39] interacting with a controlled system. In such cases we deal with the problem of the dynamic behaviour of the neural network algorithm. The problem is difficult to understand if the NN behaviour is described just in terms of matrix operations; that is why a Petri net representation of the algorithm is given in paragraph 5.2.2.
3.3 Simple mapping
How the simulation task of the NN with various configurations and sizes is divided into subtasks is important for efficient parallel processing. The data partitioning approach proposed in [81] is dependent on the learning algorithm and needs duplication of the stored data. The network partitioning approach used by many researchers (e.g. [8, 56]), in this chapter called the "classical algorithm", uniformly divides the neurons of each layer among p node processors (NP). Each processor then simulates N0/p + N1/p + N2/p + N3/p neurons. One part of the activation phase in the second hidden layer is represented in Figure 3.3. The problem due to this partitioning is seen from the Petri net representation: each neuron at
each processor has to receive the outputs of the previous layer from all other processors.

Figure 3.3: The activation in the second hidden layer (4 neurons in both hidden layers mapped on 4 NPs) and its Petri net representation

In order to avoid this problem we split the neuron into synapses and a cell body. The splitting operation makes it possible to divide the computation of one neuron into several processes and to minimize the communication, as shown in the following section and proven in Section 3.9.
3.4 Cascaded torus topology of virtual processors
In this section the algorithm running on a network of virtual processors (VPs) will be considered; hence we do not have to care about load balancing and training data delivery. Problems of this type will be solved in the following two sections; here we focus on the algorithmic matters, so that the results will be applicable to several architectures. The network of VPs arranged in a Cascaded Torus Topology (CTT) of size N0 − N1 − N2 − N3, corresponding to the neural network given in Figure 3.2, is shown in Figure 3.4. The VPs are divided into three categories:
• synapse virtual processors (SVPs)
• cell virtual processors (CVPs)
• input/output virtual processor (IO)
Figure 3.4: The network of virtual processors arranged in the cascaded torus topology

... (N0 + N3) because there is relatively less work for the communication subsystem included in Troot.
Figure 3.10: Experimental results achieved on a T-node machine (speedup S versus the number of node processors p for neural networks 30-150-150-30, 64-64-64-64 and 32-32-32-32)

    Number of node processors    1    4    6    9   15   20   25   30
    Execution time [ms]        753  212  144   99   63   48   40   34
    Speedup                      1  3.5  5.2  7.6 11.9 15.5 18.7 21.8
Table 3.1: Numerical values for neural network with 30-150-150-30 neurons
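The speedup row of Table 3.1 is simply S(p) = T(1)/T(p); the following added one-liner reproduces it from the execution times (small deviations come from the rounding of the reported times):

    times_ms = {1: 753, 4: 212, 6: 144, 9: 99, 15: 63, 20: 48, 25: 40, 30: 34}
    for p, t in times_ms.items():
        print(p, round(times_ms[1] / t, 1))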
3.9 Comparison with a classical algorithm
In the following analysis we will distinguish between the "classical algorithm" and the one explained in Sections 3.4 to 3.8, the "splitting algorithm". In the case of the classical algorithm it is assumed that each node processor handles one partition of neurons (refer to Figure 3.3), as shown in Section 3.3. All weights coming into a neuron are stored at the same NP as the neuron; in other words, the neuron is not split into synapses and a cell body. To derive the time complexity of the classical algorithm, let us assume the same conditions as in paragraph 3.7, with the exception of assumption 4. This means that the messages will differ in length and are of type β + x·L·τ, where x is a count of data units and L is the data unit length. Using the terminology of Section 2.4.3, let us imagine one iteration of the
classical algorithm:
• calculate the input layer at the ROOT and distribute the results [u_1^0, ..., u_{N0}^0] to the processor network (scattering);
• for the 1st, 2nd and output layers do: calculate [u_1^l, ..., u_{Nl}^l] and exchange the results with all other node processors (gossiping);
• collect the results [u_1^3, ..., u_{N3}^3] at the ROOT (gathering);
• calculate the error at the ROOT and distribute the results [e_1^3, ..., e_{N3}^3] to the processor network (scattering);
• calculate [δ_1^3, ..., δ_{N3}^3];
• for the 2nd and 1st layers do: calculate the partial sums of errors, exchange the results (gossiping), add the partial sums and calculate [δ_1^l, ..., δ_{Nl}^l];
• update the weights.
As argued by [67], there is an upper bound for the gossiping problem. Let us omit assumptions 2) and 3) from Section 3.7 and consider a general topology in which each node processor has ∆ full-duplex links able to work in parallel (∆-port). During scattering, node processor 0 has to send (p − 1) packets of length n/p over ∆ links, so the solution time for scattering s_F∗ is at least ((p − 1)/∆) · (n/p) · Lτ. If this topology has a diameter r, the solution time for scattering s_F∗ is also at least rβ. Then the lower bound for scattering is:

    s_F∗(n) ≥ max( rβ , Lτ · ((p − 1)/∆) · (n/p) )                           (3.11)
This shows that Troot is at least proportional to n in both algorithms (classical and splitting). Communication with the ROOT could be accelerated using processing elements having more communication links, arranged in a convenient architecture; this would help the classical algorithm more, because in the splitting algorithm the connection of four of the links is already predefined. Assuming ∆ is a constant given by the processor hardware, the mentioned acceleration is only a constant depending on ∆ and the given topology. Concerning the hierarchy of basic communication problems, it is evident that gossiping takes at least the same time as scattering (s_F∗ ≤ g_F∗). During gossiping in the general topology, any node processor has to receive (p − 1) packets of length n/p from ∆ links, so the lower bound for gossiping (used only by the
classical algorithm) is also at least proportional to n. In the case of the classical algorithm this means that T_comm.clas = 5 · g_F∗(n). On the other hand, in the case of the splitting algorithm we communicate only n/√p data units in the vertical and horizontal rings, so T_comm = 11 · g_F∗(n/√p); please refer to Eq. (3.8). So finally we can write:

    T_splitting = T_root(n) + T_comm(n/√p) + T_comp(n²/p)                    (3.12)

    T_classical = T_root.clas(n) + T_comm.clas(n) + T_comp(n²/p)             (3.13)

The above equations express the difference between the two algorithms. The computational workload is the same, and the time for communication with the ROOT can differ, but it is a function of n in both cases. The only difference is in the communication time inside the processor network, which is decreased by a factor of √p in the case of the splitting algorithm. This difference is significant in the case of a large processor network. The comparison of equations (3.12) and (3.13) shows that the splitting algorithm is faster than the classical one, but the difference is not large. The splitting algorithm is better than the other known algorithms in the case of fully connected neural networks adjusting the weights for each input/output pair. The classical network partitioning approach is the most effective in the case of neural networks with sparse connections between layers. A data partitioning approach can be used only in the case where the neural network does not use stochastic learning; in such a case, separate input/output pairs are treated in different processors, each of them containing the whole neural network. When using a parallel computer with a high communication/computation ratio, the data partitioning algorithm is probably the only one achieving reasonable speedup.
3.10 Conclusion
The problem of multilayer neural network simulation on message-passing multiprocessors was addressed in this chapter. The benefit of this chapter for the rest of the thesis lies mainly in the fact that we have presented a typical algorithm performing regular numeric computations. Our parallelisation strategy needed a deeper analysis of the algorithm structure. It was argued that splitting the neuron into synapses and a cell body makes it possible to efficiently simulate a neural network of a given class. The decomposition and the mapping onto this architecture were proposed, as well as a simple and convenient message-passing scheme. The experimental results show a very good speedup, especially for networks having many neurons in the hidden layers. The time complexity analysis matches the experimental results well and facilitates estimation of the parallel execution time for large processor networks.
This example of a parallel algorithm reveals several interesting points:
• Fine-grain partitioning leads to small granularity, which increases parallelism. On the other hand, small granularity can increase communications, in addition to increasing software complexity.
• Good overall speedup is achieved through a compromise between granularity and communication.
• Even if the data parallelism is very low, it is possible to obtain good results with the use of structural parallelism.
• Parallelism detection in iterative algorithms is a very complex task and it needs deep algorithm analysis.
Chapter 4
Structural analysis of Petri nets

Petri Nets make it possible to model and visualize behaviours comprising concurrency, synchronization and resource sharing. As such they are a convenient tool to model parallel algorithms. The objective of this chapter is to study structural properties of Petri Nets, which offer a profound mathematical background originating mainly from linear algebra and graph theory. Carl Adam Petri is a contemporary German mathematician who defined a general purpose mathematical tool for describing relations existing between conditions and events [76]; this work was conducted in the years 1960-62. From that time these nets have been developed in the USA, notably at MIT in the early seventies. Since the late seventies, European researchers have been very active in organizing workshops and publishing conference proceedings on Petri Nets (PN) in the LNCS series by Springer-Verlag. Murata [66] provides a good survey on properties, analysis and applications of Petri Nets. Several books on Petri Nets, where theory takes an important place, have been published [10, 22, 75]. This chapter gives the basic terminology first, then the notion of linear invariants is introduced. It is argued that only positive invariants are of interest when analyzing structural net properties, and the notion of a generator is introduced. Then three existing algorithms finding a set of generators are explained and implemented. The importance of these vectors lies in their usefulness for analyzing net properties, especially parallelism detection, as will be seen in the following chapters. Time complexity measures are given and an original algorithm, first reducing fork-join parts of PNs and then finding a set of generators, is proposed. The three existing algorithms were implemented in Matlab and tested on various examples.
4.1 Basic notion
A Petri Net is a particular kind of directed graph, together with an initial state called the initial marking. The underlying graph of a Petri Net is a directed,
weighted, bipartite graph consisting of two kinds of nodes, called places and transitions, and of arcs connecting places to transitions or transitions to places.

Definition 4.1 A Petri net is a four-tuple <P, T, Pre, Post> such that:
P is a finite and non-empty set of places (represented as a vector with entries P1, ..., Pi, ..., Pm),
T is a finite and non-empty set of transitions (represented as a vector with entries T1, ..., Tj, ..., Tn),
Pre is an input function representing weighted arcs connecting places to transitions, called the precondition matrix, of size [m, n],
Post is an output function representing weighted arcs connecting transitions to places, called the postcondition matrix, of size [m, n].
Figure 4.1: An example of Petri Net

In graphical representation (see Figure 4.1), places are drawn as circles and transitions as bars or boxes. Arcs are labeled with their weights (positive integers); a k-weighted arc can be interpreted as a set of k parallel arcs, and labels for 1-weighted arcs are usually omitted. A marking assigns to each place a nonnegative integer. If a marking assigns to place Pi a nonnegative integer k, we say that Pi is marked with k tokens.

Definition 4.2 A marked Petri net is a five-tuple <P, T, Pre, Post, M0> such that M0 is an initial marking.

Pictorially, we place k black dots (tokens) in place Pi. A marking is denoted by M, a vector with entries M(1), ..., M(i), ..., M(m), where m is the total number of places. The i-th component of M, denoted by M(i), is the number of tokens in place i.
The behaviour of dynamic systems can be described in terms of system states and their changes. In order to simulate the dynamic behaviour of a system, a state or marking in a Petri Net is changed according to the following firing rule:
1) A transition Tj is said to be enabled if each input place Pi is marked with at least Pre_ij tokens, where Pre_ij is the weight of the arc from Pi to Tj.
2) An enabled transition may or may not fire (depending on whether or not the event associated with the transition actually takes place).
3) A firing of an enabled transition Tj removes Pre_ij tokens from each input place Pi of Tj, and adds Post_kj tokens to each output place Pk of Tj, where Post_kj is the weight of the arc from Tj to Pk.
A firing sequence is an ordered set of firing operations. To a firing sequence is associated a characteristic vector s, whose i-th component indicates the number of times the transition Ti is fired in the sequence. From the marking in Figure 4.2 one can have, for example, the firing sequences A = T1, B = T1 T2 T3 or C = T1 T2 T3 T1 T3, whose characteristic vectors are s_A = (1, 0, 0), s_B = (1, 1, 1), s_C = (2, 1, 2). A characteristic vector may correspond to several firing sequences: for example (1, 1, 1) corresponds to both T1 T2 T3 and T1 T3 T2. But not all vectors s whose components are positive or zero integers are possible; for example, there is no firing sequence with s = (0, 1, 1) from M0, since neither transition T2 nor transition T3 can be fired before a firing of transition T1.
Figure 4.2: A Petri Net (the billiard balls example in [22])

If a firing sequence s is applied from marking M, then the reached marking M′ is given by the fundamental equation:

    M′ = M + Post · s − Pre · s        for M ≥ 0, s ≥ 0                      (4.1)
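The firing rule and the fundamental equation (4.1) translate directly into a few lines of Python (an added illustration; the two-place net at the end is hypothetical and serves only to show the calls):

    import numpy as np

    def enabled(Pre, M, j):
        """Tj is enabled iff every input place holds at least Pre[i, j] tokens."""
        return np.all(M >= Pre[:, j])

    def fire(Pre, Post, M, j):
        """Fire the enabled transition Tj and return the new marking."""
        assert enabled(Pre, M, j)
        return M - Pre[:, j] + Post[:, j]

    def reached_marking(Pre, Post, M, s):
        """Fundamental equation (4.1) for the characteristic vector s of a firing sequence."""
        return M + (Post - Pre) @ s

    # hypothetical two-place cycle: T1 moves the token from P1 to P2, T2 moves it back
    Pre  = np.array([[1, 0], [0, 1]])
    Post = np.array([[0, 1], [1, 0]])
    M0   = np.array([1, 0])
    print(fire(Pre, Post, M0, 0))                             # [0 1]
    print(reached_marking(Pre, Post, M0, np.array([1, 1])))   # back to [1 0]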
A transition without any input place is called a source transition, and one without any output place is called a sink transition. A place Pi having the same transition Tj as input and output (Pre_ij = Post_ij ≠ 0) is called a self-loop place (for example place P4 in Figure 4.1). A transition Tj having one input place Pi and one output place identical to the input one (Pre_ij = Post_ij) is called a self-loop transition.
Definition 4.3 Let a PN with P = {P1, ..., Pm} and T = {T1, ..., Tn} be given. A matrix C = (c_ij), 1 ≤ i ≤ m, 1 ≤ j ≤ n, is called the incidence matrix of the PN iff:

    C = Post − Pre                                                           (4.2)

Then the fundamental equation (4.1) can be rewritten as:

    M′ = M + C · s        for M ≥ 0, s ≥ 0                                   (4.3)

Remarks: the incidence matrix cannot hold information about self-loop places and self-loop transitions; it is sometimes called the change matrix.

Definition 4.4 A Petri Net is pure iff:

    ∀Pi ∈ P, ∀Tj ∈ T:  Post_ij × Pre_ij = 0                                  (4.4)

Definition 4.4 implies that a pure net does not contain any self-loop place or self-loop transition and is fully representable by the incidence matrix C.
4.2 Linear invariants
In this section we introduce structural properties of Petri Nets. The structural properties are those that depend on the topological structure of a Petri net. They are independent of the initial marking M0 in the sense that these properties hold for any initial marking, or are concerned with the existence of certain firing sequences from some initial marking. Thus these properties can often be characterized in terms of the incidence matrix C and its associated homogeneous equations or inequalities. That is why we introduce the term linear invariants, comprising P-invariants and T-invariants defined further in this chapter. Let us consider the PN given in Figure 4.2. The sum

    M(1) + M(3)                                                              (4.5)

is equal to 1 for M0 = (1, 1, 0, 0). Firing T1 or T2 does not change this sum; in fact, no firing changes it. So we can write for any marking M of the given Petri Net:

    M(1) + M(3) = 1                                                          (4.6)
That is why the subnet formed by the set of places P1 and P3 and their input and output transitions is called a conservative component. Similarly, considering the circuit of places P1, P2 and P3 in Figure 4.1 we can write:

    2M(1) + M(2) + M(3) = 2                                                  (4.7)
In order to have a general rule we multiply each term of the fundamental equation (4.3) by f^T from the left side and we obtain:

    f^T · M′ = f^T · M + f^T · C · s                                         (4.8)

It is clear that the only possibility for f^T to fulfil f^T · M′ = f^T · M = constant for any firing vector s is to satisfy the following set of linear equations:

    f^T · C = 0                                                              (4.9)
In a similar way we introduce a repetitive component. Let us look again at Figure 4.2: firing the transitions T1, T2 and T3 gives again the same marking. The subnet formed by these transitions and their input and output places is called a repetitive component.

Definition 4.5 Let a finite PN system with P = {P1, ..., Pm} and T = {T1, ..., Tn} be given.
1. A vector f ∈ Z^m is called a P-invariant of the given PN iff:

    C^T · f = 0                                                              (4.10)

2. A vector s ∈ Z^n is called a T-invariant of the given PN iff:

    C · s = 0                                                                (4.11)

Finding a solution of the homogeneous set of equations (4.10) or (4.11) is an easy task, solved by the Gauss elimination method in polynomial time. But in the following analysis we will be interested in a positive version of the linear invariants.

Definition 4.6 Let a finite PN system with P = {P1, P2, ..., Pm} and T = {T1, T2, ..., Tn} be given.
1. A vector f ∈ Z^m is called a positive P-invariant of the given PN iff:

    C^T · f = 0   ∧   f_i ≥ 0  ∀i = 1, ..., m                                (4.12)

2. A vector s ∈ Z^n is called a positive T-invariant of the given PN iff:

    C · s = 0   ∧   s_i ≥ 0  ∀i = 1, ..., n                                  (4.13)
Linear invariants were introduced by Lautenbach [58]. In the remainder of this chapter only P-invariants will be considered; to obtain T-invariants, we only need to transpose the matrix C and use the same method.
4.3 Finding positive P-invariants
We are looking for solutions of the equation C^T · f = 0 for f ≥ 0. This is in fact a set of n homogeneous equations (n is the number of transitions) in m variables, the entries of the vector f = (f1, ..., fm)^T (m is the number of places):

    c_11·f1 + c_21·f2 + ... + c_m1·fm = 0
    c_12·f1 + c_22·f2 + ... + c_m2·fm = 0
    ...
    c_1n·f1 + c_2n·f2 + ... + c_mn·fm = 0        (n equations)
Example 4.1: Let us consider the Petri net given in Figure 4.3. The set of equations is:

                      | −1  1  0 |
      [f1 f2 f3 f4] · |  1 −1  0 |  =  0
                      |  0  1 −1 |
                      | −1  0  1 |

      −f1 + f2 − f4 = 0
       f1 − f2 + f3 = 0
          −f3 + f4 = 0

When solving this set of three equations with four variables we can deduce the following:
1) from the third equation we obtain f3 = f4;
2) then from the first equation f3 = f2 − f1;
3) the second equation is a linear combination of the first one and the third one.
So the solution of this set of equations is f = (f1, f2, f2 − f1, f2 − f1)^T. By varying the two parameters f1 and f2 we obtain all possible P-invariants. In other words, the dimension of the P-invariant subspace (which is Kernel(C^T)) is dim(P-invariant subspace) = 2. With respect to linear algebra (see Newman [68]) we can deduce:

    dim(subspace of P-invariants) = m − rank(C)                              (4.14)

For example, one basis of the P-invariant subspace is:
Figure 4.3: A Petri Net with two positive P-invariants and one T-invariant
    F = [f^1 f^2] = |  1  0 |        (f^1 for f1 = 1, f2 = 0; f^2 for f1 = 0, f2 = 1)
                    |  0  1 |
                    | −1  1 |
                    | −1  1 |

But the first P-invariant f^1 is not positive, so we can perform a new variation of the parameters f1 and f2 in order to obtain positive P-invariants:

    F = [f^1 f^2] = |  1  0 |        (f^1 for f1 = 1, f2 = 1; f^2 for f1 = 0, f2 = 1)
                    |  1  1 |
                    |  0  1 |
                    |  0  1 |
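The computation of Example 4.1 can be checked with a computer algebra system (an added illustration; sympy's nullspace returns a rational basis of Kernel(C^T), from which the positive basis above is obtained by recombination):

    from sympy import Matrix

    C = Matrix([[-1, 1, 0],
                [ 1,-1, 0],
                [ 0, 1,-1],
                [-1, 0, 1]])              # incidence matrix of the net in Figure 4.3

    print(C.T.nullspace())                # basis of the P-invariant subspace, dimension m - rank(C) = 2

    f1 = Matrix([1, 1, 0, 0])             # the two positive P-invariants found above
    f2 = Matrix([0, 1, 1, 1])
    print(C.T * f1, C.T * f2)             # both products are the zero vector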
It is not always possible to find positive P-invariants (see Figure 4.4, where we can find just one P-invariant, (−1, 1, 1, 0, 0)^T, which is negative). From equation (4.14) we deduce:

    dim(subspace of positive P-invariants) ≤ m − rank(C)                     (4.15)
Definition 4.7 An invariant f = (f1, ..., fm)^T, solution of C^T · f = 0, is called standardized iff:
1) f1, ..., fm ∈ Z;
2) for the first fi ≠ 0 it holds that fi > 0;
3) f cannot be divided by an element k ∈ N, k > 1 (without destroying condition 1).
Figure 4.4: A Petri net with one negative P-invariant

Example 4.2: If

    f^1 = (0, −1/7, +3, −1/3)^T

the equivalent standardized invariant is:

    f^2 = (0, 3, −63, 7)^T
Definition 4.8 A standardized invariant f ≥ 0 is called minimal iff it cannot be composed of other k standardized invariants in the form:

    f = Σ_{i=1}^{k} λ_i f^i        for λ_i ∈ Q+                              (4.16)

Example 4.3: Among the following three invariants only f^1 and f^2 are minimal, because f^3 = 0.5 f^1 + 0.5 f^2:

    f^1 = (2, 2, 0, 1)^T,   f^2 = (0, 2, 0, 1)^T,   f^3 = (1, 2, 0, 1)^T
Figure 4.5: Event graph with two generators

More generally, a given invariant f can be written as a composition of invariants called generators:

    f = Σ_{i=1}^{g} λ_i x^i                                                  (4.17)

with factors λ_i, generators x^i and g the number of generators. The generators x^i are the invariants that are used for the composition. A composition of the form (4.17) is obviously simpler if the factors λ_i are elements of Z+. Unfortunately, in the case of high simplicity of the composition, the complexity of calculating the generators x^i is much higher than the complexity of calculating a basis of the P-invariant subspace. For example, the subspace of positive P-invariants of the Petri net in Figure 4.5 is given by two generators x^1 = (1 1 0)^T and x^2 = (0 1 1)^T; this subspace is shown in Figure 4.6. Kruckenberg and Jaxy [52] considered several algorithms calculating the generators and divided the computations into five levels, as shown in Table 4.1.

    Level    λ_i ∈     Generators x^i       Set {x^i}
    1        Q         x^i ∈ Z^m            base
    2        Z         x^i ∈ Z^m            base
    3        Q+        x^i ≥ 0              unique
    4        Z+        x^i ≥ 0              unique
    5        {0,1}     x^i ∈ {0,1}^m        unique

Table 4.1: Generator computational levels
60
CHAPTER 4. STRUCTURAL ANALYSIS OF PETRI NETS
20
P2
15 10
beginning of discrete solution subspace
5 P3 0 10
P1 10 5 0
0
5 generators
Figure 4.6: Subspace of positive linear invariants
Figure 4.7: A Petri Net with four generators

It is evident that in the case of event graphs (the subclass of PNs where each place has no more than one input and one output arc, with weight one) λ_i ∈ Z+ and x^i ∈ {0,1}^m holds already for the third level. In other words, in the case of event graphs the sets of generators X for the third, fourth and fifth level are identical. In the rest of this thesis the generators of the third level will be called simply generators (or Q+-generators) and the set of P-invariant generators will be denoted by the matrix X of size [m number of places, g number of generators]. It is clear that the number of generators can be larger than the dimension of the P-invariant subspace. This fact is demonstrated in Figure 4.7, where we find a unique set of four generators X, and by choosing three of them we obtain a P-invariant basis F.
    X = [x^1 x^2 x^3 x^4] = | 1 0 1 0 |        F = | 1 0 1 |
                            | 0 1 0 1 |            | 0 1 0 |
                            | 1 1 0 0 |            | 1 1 0 |
                            | 0 0 1 1 |            | 0 0 1 |
                            | 1 1 1 1 |            | 1 1 1 |
When solving Example 4.1 it was possible to use the following approaches to find positive P-invariants:
1) find a basis of (m − rank(C)) linearly independent P-invariants by a modified Gauss Elimination Method (GEM) favouring positive P-invariants (see subsection 4.3.1);
2) find a set of positive P-invariant generators by solving a set of equations (see subsection 4.3.2);
3) first find a basis of a certain type and then construct the generators by varying
the basis vectors (see subsection 4.3.3).
4.3.1 An algorithm based on the Gauss Elimination Method
The method introduced in this subsection is explained in detail in [92]; it is extended to find P-invariants and T-invariants all at once in [71]. I am very grateful to professor Valette for various consultations on the subject. The algorithm performs Gaussian row operations on the set of equations in order to find a maximum number of positive P-invariants.

Algorithm 4.1:
% GAUSS - Modified Gauss's algorithm /see polycop by Valette page 40/.
% F = GAUSS(C) is the base of the Petri Net specified by
% the incidence matrix C. Rows of F are the P-invariants.
% To find T-invariants use F = GAUSS(C^T)
FT = identity matrix of size m*m
while (dim(C) ≠ 0)                      % end test
  % phase 1 - transition with one input and no output, or one output and no input;
  % a place connected to this transition can never form a positive conservative component
  % (e.g. a source transition can generate an infinite number of tokens to just one place)
  j = 1
  while (j ≤ number of columns in C)
    if there is a unique nonzero element Cij in column C:j
      delete column C:j
      delete row Ci:
      delete row FTi:
    else
      j = j + 1
    end
  end
  CATCHED = FALSE
  % phase 2.1 - transition with one output and at least one input
  j = 1
  while ((j ≤ number of columns in C) and (not CATCHED))
    if there is a unique positive element Cij in column C:j
      for all rows k with negative element Ckj
        % annulate element Ckj
        Ck:  = a*Ck:  + b*Ci:
        FTk: = a*FTk: + b*FTi:
      end
      delete column C:j
      delete row Ci:
      delete row FTi:
      CATCHED = TRUE
    else
      j = j + 1
    end
  end
  % phase 2.2 - transition with one input and at least two outputs
  j = 1
  while ((j ≤ number of columns in C) and (not CATCHED))
    if there is a unique negative element Cij in column C:j
      for all rows k with positive element Ckj
        % annulate element Ckj
        Ck:  = a*Ck:  + b*Ci:
        FTk: = a*FTk: + b*FTi:
      end
      delete column C:j
      delete row Ci:
      delete row FTi:
      CATCHED = TRUE
    else
      j = j + 1
    end
  end
  % phase 3 - transition with more inputs and outputs
  j = 1
  while ((j ≤ number of columns in C) and (not CATCHED))
    if there are at least two positive and two negative elements in C:j
      choose one row Ci: with positive element Cij
      choose one row Cy: with negative element Cyj
      for all rows k with negative element Ckj except row Cy:
        % annulate negative element Ckj
        Ck:  = a*Ck:  + b*Ci:
        FTk: = a*FTk: + b*FTi:
      end
      CATCHED = TRUE                    % the algorithm will continue in phase 2.2
    else
      j = j + 1
    end
  end
  % phase 4 - transition with many inputs and no output, or many outputs and no input;
  % places connected to these transitions can take part
  % just in a negative conservative component (even if a
  % source transition generates an infinite number of tokens,
  % these tokens are subtracted owing to the negativeness of the component)
  j = 1
  while ((j ≤ number of columns in C) and (not CATCHED))
    if there are at least two positive or two negative elements in C:j
      choose one row Ci: with nonzero element Cij
      for all rows k with nonzero element Ckj except row Ci:
        % annulate nonzero element Ckj
        Ck:  = a*Ck:  + b*Ci:
        FTk: = a*FTk: + b*FTi:
      end
      delete column C:j
      delete row Ci:
      delete row FTi:
      CATCHED = TRUE
    else
      j = j + 1
    end
  end
end

End of algorithm
Example 4.4: Let us apply Algorithm 4.1 to the Petri net shown below (places P1-P8, transitions T1-T6). The initial incidence matrix and the initial matrix FT are:

          T1  T2  T3  T4  T5  T6
    P1     1  −1   0   0   0   0
    P2     1   0   0  −1   0   0
    P3     0  −1   0   1   0   0
C = P4     0   1   0   0   0  −1        FT = identity matrix of size 8x8
    P5     0   0   1   0  −1   0             (rows labelled P1, ..., P8)
    P6     0   0   0   1  −1   0
    P7     0   0   0  −1   0   1
    P8     0   0   0   0  −1   1

Step 1 - phase 1 executed (row 5 and column 3 deleted); the remaining rows of FT correspond to P1, P2, P3, P4, P6, P7, P8.
Step 2 - phase 2.1 executed (row 4 and column 2 deleted); rows: P1+4, P2, P3+4, P6, P7, P8.
Step 3 - phase 3 executed (row 5 replaced by a combination of rows 5 and 3); rows: P1+4, P2, P3+4, P6, P3+4+7, P8.
Step 4 - phase 2.1 executed (row 6 and column 4 deleted); rows: P1+4+8, P2, P3+4+8, P6, P3+4+7.
Step 5 - phase 2.2 executed (row 2 and column 2 deleted); rows: P1+4+8, P2+3+4+8, P2+6, P3+4+7.
Step 6 - phase 4 executed (row 1 and column 1 deleted); rows: P−1+2+3, P−1+2−4+6−8, P3+4+7.

After step 6 the remaining entries of C are zero and the algorithm terminates with the basis:

          P1  P2  P3  P4  P5  P6  P7  P8
    FT = | −1   1   1   0   0   0   0   0 |
         | −1   1   0  −1   0   1   0  −1 |
         |  0   0   1   1   0   0   1   0 |
Algorithm 4.1 is successful in solving Example 4.4 because the basis F contains the maximum number of positive P-invariants (in this case just one invariant, P3 P4 P7). Example 4.5 shows a case where the algorithm finds only negative P-invariants in spite of the existence of one positive P-invariant P4 P6.
Example 4.5: Let us apply Algorithm 4.1 to the Petri net shown below (places P1-P6, transitions T1-T4). The initial matrices are:

          T1  T2  T3  T4
    P1     0   1  −1   0
    P2     1   0   0  −1
C = P3     0   0  −1   1        FT = identity matrix of size 6x6
    P4     0   1   0  −1             (rows labelled P1, ..., P6)
    P5     1  −1   0   0
    P6     0  −1   0   1

Step 1 - phase 3 executed (row 6 replaced by a combination of rows 6 and 1); the rows of FT correspond to P1, P2, P3, P4, P5, P1+6.
Step 2 - phase 2.2 executed (row 5 and column 2 deleted); rows: P1+5, P2, P3, P4+5, P1+6.
Step 3 - phase 3 executed (row 4 replaced by a combination of rows 4 and 3); rows: P1+5, P2, P3, P3+4+5, P1+6.
Step 4 - phase 2.2 executed (row 2 and column 3 deleted); rows: P1+5, P2+3, P3+4+5, P1+2+6.
Step 5 - phase 4 executed (row 1 and column 1 deleted); the algorithm terminates with:

          P1  P2  P3  P4  P5  P6
    FT = | −1   1   1   0  −1   0 |
         | −1   0   1   1   0   0 |
         |  0   1   0   0  −1   1 |
Example 4.5 proves that Algorithm 4.1 is not always able to find the maximum number of positive P-invariants. The reason lies in phase 3, where one combination of input and output arcs is chosen among several possible ones.
Figure 4.8: Subnet of the PN representation of a Neural network algorithm
Figure 4.8 shows a Petri net with generators x^1 to x^5:
           P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14
     x^1 |  0  0  0  1  0  0  1  1  1   0   0   0   0   1 |
     x^2 |  0  0  1  1  0  0  1  0  1   0   0   1   0   1 |
XT = x^3 |  0  0  1  1  0  0  1  0  0   1   0   0   0   0 |
     x^4 |  0  0  0  0  0  0  0  0  0   0   1   1   0   1 |
     x^5 |  0  0  0  1  0  0  1  1  0   1   1   0   0   1 |
The fact that the matrix X has rank 4 shows that four linearly independent positive P-invariants can be found in Figure 4.8, but the algorithm finds a basis containing just one positive P-invariant:
          P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14
     f1 | −1  0  0  1  1  0  0  0  0   0   0   0   0   0 |
     f2 | −1 −1  0  0  1  1  0 −1 −1   0   0   0   0  −1 |
     f3 |  0  0  0  0  1  0  0  1  1   0   0   0   0   1 |
FT = f4 |  1  0  0  1  0  0  0 −1 −1   1   0   0   0  −1 |
     f5 |  0  0  0  0  0  0  0  0 −1   1   1   0   0   0 |
     f6 |  0  0  1  0  0  0  0 −1  0   0   0   1   0   0 |
     f7 | −1 −1  0  0  0  0  0 −1 −1   0   0   0   1  −1 |
By permuting rows and columns of the matrix C before running the algorithm we can obtain a basis with four positive P-invariants. This again demonstrates the deficiency of Algorithm 4.1, because the result should not depend on the permutation of the input matrix C.
4.3.2 An algorithm based on combinations of all input and all output places
The algorithm presented in this subsection was published in [63]. It generalises in some sense Jensen's rules [49], offering a systematic way of finding all invariants. In the sense of Table 4.1, Algorithm 4.2 finds generators of the 3rd level.

Algorithm 4.2:

%SILVA - algorithm Alfa2 /Martinez&Silva - LNCS 52/
%X = SILVA(C) is a non-negative matrix such that:
%1) each positive p-invariant can be written as a lambda
%   combination of the columns of X
%2) no column of X can be written as a lambda
%   combination of other columns of X
%Remark: lambda is a vector of non-negative rational numbers Q+
%phase 1 - initialize X^T
X^T = identity matrix of size m*m
for j = 1 : n                          %eliminate all transitions
  %phase 2.1 - generate new places resulting as a non-negative linear combination
  %of one input and one output place of transition j (new places are neither
  %input nor output places of transition j)
  for all rows p with positive element C(p,j)
    for all rows q with negative element C(q,j)
      %combine row p and row q, annulling the j-th entry,
      %and append this row to C and X^T
      C(new,:)   = a*C(q,:)   + b*C(p,:)
      X^T(new,:) = a*X^T(q,:) + b*X^T(p,:)
    end
  end
  %phase 2.2 - eliminate input and output places of transition j
  for all rows i with nonzero element C(i,j)
    delete row C(i,:)
    delete row X^T(i,:)
  end
  %phase 2.3 - eliminate the non-minimal generators
  for all rows i                       %when X^T(i,:) is already a completed P-invariant
    if C(i,:) = zero line              %has the same effect as the usual condition
                                       %for a P-invariant  X^T(i,:) x Coriginal = 0
      %find the non-zero indices in row X^T(i,:)
      arr = vector of q indices of X^T(i,:) that are non-zero
      %create a submatrix of the original incidence matrix
      M = Coriginal(arr,:)
      if q ≠ rank(M) + 1
        delete row C(i,:)
        delete row X^T(i,:)
      end
    end
  end
end
End of algorithm
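To make the flow of Algorithm 4.2 concrete, here is a minimal Python sketch of its phases 1, 2.1 and 2.2 (the non-minimality filter of phase 2.3 is shown separately after Example 4.6); the function name and data layout are illustrative and not taken from [63]:

def positive_invariants(C):
    """C: m x n integer incidence matrix as a list of lists (places x transitions).
    Returns rows of X^T whose supports are candidate positive P-invariants;
    the non-minimality filtering of phase 2.3 is omitted here for brevity."""
    m, n = len(C), len(C[0])
    # each working row carries (current row of C, corresponding row of X^T)
    rows = [(list(C[i]), [1 if k == i else 0 for k in range(m)]) for i in range(m)]
    for j in range(n):                                   # eliminate transition T_j
        pos = [r for r in rows if r[0][j] > 0]
        neg = [r for r in rows if r[0][j] < 0]
        kept = [r for r in rows if r[0][j] == 0]
        new = []
        for cp, xp in pos:                               # phase 2.1
            for cq, xq in neg:
                a, b = cp[j], -cq[j]                     # a*cq[j] + b*cp[j] == 0
                new.append(([a*cq[k] + b*cp[k] for k in range(n)],
                            [a*xq[k] + b*xp[k] for k in range(m)]))
        rows = kept + new                                # phase 2.2
    return [x for _, x in rows]

Applied to the incidence matrix of Example 4.6 below, the returned rows should include the supports {P2,P3}, {P5,P6} and {P1,P4,P5}, plus non-minimal combinations that phase 2.3 would remove.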
Example 4.6: Let us apply Algorithm 4.2 to the Petri net shown below:
[Figure: a Petri net with places P1-P6 and transitions T1-T4; its structure is captured by the incidence matrix C below.]

C =
      T1  T2  T3  T4
P1     1  -1   0   0
P2    -1   0   1   0
P3     1   0  -1   0
P4    -1   0   0   1
P5     0   1   0  -1
P6     0  -1   0   1

X^T =
      P1  P2  P3  P4  P5  P6
P1     1   0   0   0   0   0
P2     0   1   0   0   0   0
P3     0   0   1   0   0   0
P4     0   0   0   1   0   0
P5     0   0   0   0   1   0
P6     0   0   0   0   0   1
Step 1 - transition T1 deleted

C =
        T2  T3  T4
P5       1   0  -1
P6      -1   0   1
P1+2    -1   1   0
P2+3     0   0   0
P1+4    -1   0   1
P3+4     0  -1   1

X^T =
        P1  P2  P3  P4  P5  P6
P5       0   0   0   0   1   0
P6       0   0   0   0   0   1
P1+2     1   1   0   0   0   0
P2+3     0   1   1   0   0   0
P1+4     1   0   0   1   0   0
P3+4     0   0   1   1   0   0
Step 2 - transition T2 deleted

C =
          T3  T4
P2+3       0   0
P3+4      -1   1
P5+6       0   0
P1+2+5     1  -1
P1+4+5     0   0

X^T =
          P1  P2  P3  P4  P5  P6
P2+3       0   1   1   0   0   0
P3+4       0   0   1   1   0   0
P5+6       0   0   0   0   1   1
P1+2+5     1   1   0   0   1   0
P1+4+5     1   0   0   1   1   0
Step 3 - transition T3 deleted and the non-minimal invariant P1+2+3+4+5 eliminated

C =
          T4
P2+3       0
P5+6       0
P1+4+5     0

X^T =
          P1  P2  P3  P4  P5  P6
P2+3       0   1   1   0   0   0
P5+6       0   0   0   0   1   1
P1+4+5     1   0   0   1   1   0
Step 4 - transition T4 deleted

C = [ ]

X^T =
          P1  P2  P3  P4  P5  P6
P2+3       0   1   1   0   0   0
P5+6       0   0   0   0   1   1
P1+4+5     1   0   0   1   1   0
Example 4.6 shows a case where the non-minimal invariant P1+P2+P3+P4+P5 is eliminated because it contains the already existing invariant P2+P3.
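The non-minimality test of phase 2.3 can be phrased as a rank condition: a candidate invariant with support of size q is kept only when q = rank(C_supp) + 1, where C_supp is the original incidence matrix restricted to the support rows. A small sketch of this check (using NumPy; the function name is illustrative, not from the thesis):

import numpy as np

def is_minimal_support(x, C):
    """x: candidate P-invariant over the places, C: original incidence matrix.
    Keep x only if |supp(x)| == rank(C[supp, :]) + 1 (phase 2.3 of Algorithm 4.2)."""
    supp = np.flatnonzero(x)
    return len(supp) == np.linalg.matrix_rank(C[supp, :]) + 1

# Example 4.6: the candidate P1+...+P5 fails the test, P2+P3 passes.
C = np.array([[ 1, -1,  0,  0],
              [-1,  0,  1,  0],
              [ 1,  0, -1,  0],
              [-1,  0,  0,  1],
              [ 0,  1,  0, -1],
              [ 0, -1,  0,  1]])
print(is_minimal_support(np.array([1, 1, 1, 1, 1, 0]), C))  # expected False
print(is_minimal_support(np.array([0, 1, 1, 0, 0, 0]), C))  # expected True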
4.3.3
An algorithm finding a set of generators from a suitable Z basis
Kruckenberg and Jaxy [52] propose an algorithm which computes generators, based on the algorithm [51] by Kannan and Bachem for calculating a Hermite normal form.

Z basis coming from a Hermite normal form

Equations 4.12 and 4.13 can be regarded as a linear homogeneous Diophantine system:

A · x = 0,   x ∈ Z     (4.18)
where the matrix A corresponds to the transpose of the incidence matrix C of the associated Petri Net. Theories of sets of linear Diophantine equations can be found in the paper by Fiorot/Gorgan [30] and the book by Newman [68]. Algorithms for solving a set of linear Diophantine equations are described in Bradley [9] and Frumkin [32]. The basic idea used in their algorithms is to triangularize the augmented coefficient matrix of the system by a series of column (row) operations consisting of adding an integer multiple of one column (row) to another, multiplying a column (row) by -1, and interchanging two columns (rows). Frumkin [32] has observed that the growth of intermediate expressions in the algorithm by Bradley [9] can be exponential in the number of equations. Kannan and Bachem present in their paper [51] two polynomial algorithms. In the following we describe how the construction of the Hermite normal form can be used in solving linear Diophantine equations.
Theorem 4.1 Given a nonsingular (m,m) integer matrix A, there exists an (m,m) unimodular matrix K (a matrix whose determinant is either 1 or -1) such that AK is lower triangular with positive diagonal elements. Further, each off-diagonal element of AK is nonpositive and strictly less in absolute value than the diagonal element in its row.
Proof: see Hermite [46]

AK is called the Hermite normal form of A. The original algorithm computing the Hermite normal form is concerned with square nonsingular integer matrices, but an examination of the procedures (see Kannan/Bachem [51]) shows that the algorithm works on rectangular matrices as well.

Algorithm 4.3:

%Hermite Normal Form of Rectangular matrices - Kannan & Bachem
%SIAM J. Comput., Vol. 8, No. 4, November 1979, chap. 2, rectang. pp. 500-504
%Supposing A to be a rectangular integer matrix with full row rank.
%K = HNFR(A) is a square unimodular matrix (determinant being +1 or -1)
%such that A*K is lower triangular with positive diagonal elements.
%Further, each off-diagonal element of A*K is nonpositive and
%strictly less in absolute value than the diagonal element in its row.
[t, u] = size(A)
%phase 1 - initialize K
K = identity matrix of size u*u
%phase 2
%permute the columns of A so that every principal minor from (1,1) to (t,t)
%is nonsingular
for i = 1 to t
  %find column j making the principal minor non-singular
  j = i
  d = 0
  while ((j < u) & (d = 0))
    j = j + 1
    d = det(submatrix of A consisting of rows [1:i] and columns [1:(i-1), j])
  end
  interchange columns i and j of A and of K
end
%by joining an identity matrix (u-t, u-t) and a zero matrix (u-t, t),
%make matrix A square
%put the first principal minor into HNF
if A(1,1) < 0
  A(:,1) = -A(:,1)
  K(:,1) = -K(:,1)
end
%phase 3 to phase 6
for i = 1 to (u-1)
  %phase 4
  %put the (i+1)(i+1) principal minor into HNF
  for j = 1 to i
    %phase 4.1 - calculate the greatest common divisor
    [r, p, q] = gcd(A(j,j), A(j,i+1))      %where r = p*A(j,j) + q*A(j,i+1)
    %phase 4.2 - perform elementary column operations in A and K
    %so that A(j,i+1) becomes zero
    if r ≠ 0
      D(1,1) = p
      D(2,1) = q
      D(1,2) = -A(j,i+1)/r
      D(2,2) = A(j,j)/r
      A(:,[j,i+1]) = A(:,[j,i+1]) * D
      K(:,[j,i+1]) = K(:,[j,i+1]) * D
    end
    %phase 4.3
    if j > 1
      reduce the off-diagonal elements in column j
    end
  end
  %phase 5
  reduce the off-diagonal elements in column i+1
end
End of algorithm
Algorithm 4.3 performs elementary column operations on the matrix A(n,m) (corresponding to C^T). These operations are memorized in the matrix K(m,m) (serving in a similar way to the matrix F in Algorithm 4.1 mentioned in subsection 4.3.1) and are performed in such a way that the product A · K is in Hermite normal form:

A(n,m) · K(m,m) = ( h1 · · · hs | 0 · · · 0 ),   s = rank A     (4.19)

where the first s columns h1, ..., hs form a lower triangular block (with h11, ..., hss on the diagonal and zeros above the diagonal) and the remaining m − s columns are zero.

The next theorem shows how a Z-basis (the generators of the 2nd level in the sense of Table 4.1) of a homogeneous linear Diophantine system can be found.
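The following Python sketch illustrates this mechanism under simplifying assumptions: it performs plain integer column operations instead of the full Kannan-Bachem procedure, records them in K, and returns the columns of K that correspond to zero columns of A·K; the helper name z_basis is invented for illustration.

def z_basis(A):
    """A: n x m integer matrix (rows = equations). Returns integer column vectors k
    with A*k = 0; together they form a Z-basis of the homogeneous system (4.18)."""
    n, m = len(A), len(A[0])
    A = [row[:] for row in A]
    K = [[1 if i == j else 0 for j in range(m)] for i in range(m)]  # unimodular record

    def colop(j, k, a, b, c, d):   # (col_j, col_k) <- (a*col_j + b*col_k, c*col_j + d*col_k)
        for M, rows in ((A, n), (K, m)):
            for i in range(rows):
                M[i][j], M[i][k] = a * M[i][j] + b * M[i][k], c * M[i][j] + d * M[i][k]

    row, piv = 0, 0
    while row < n and piv < m:
        for j in range(piv + 1, m):
            a, b = A[row][piv], A[row][j]
            if b == 0:
                continue
            # extended Euclid: x0*a + y0*b == r0 (== +-gcd)
            x0, x1, y0, y1, r0, r1 = 1, 0, 0, 1, a, b
            while r1:
                q = r0 // r1
                r0, r1 = r1, r0 - q * r1
                x0, x1 = x1, x0 - q * x1
                y0, y1 = y1, y0 - q * y1
            if r0 < 0:
                x0, y0, r0 = -x0, -y0, -r0
            colop(piv, j, x0, y0, -b // r0, a // r0)   # unimodular 2-column update
        if A[row][piv] != 0:
            piv += 1
        row += 1
    # columns of K whose image under A is zero form the Z-basis
    return [[K[i][j] for i in range(m)]
            for j in range(m) if all(A[i][j] == 0 for i in range(n))]

For the incidence matrix of Example 4.6, z_basis applied to the rows of C^T should return r = 6 − rank(C) = 3 integer vectors; they may still contain negative entries and have to be combined into nonnegative generators, which is the role of the JAXY procedure below.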
Theorem 4.2 Let A be an (n,m) integer matrix and K an (m,m) unimodular matrix such that

A · K = (h1, ..., hs, 0, ..., 0)     (4.20)

and the columns h1, ..., hs are linearly independent. Then the set B = {k_{s+1}, ..., k_m} formed from the last r = (m − s) columns of K is a Z-basis of 4.18.
Proof: see Newman [68]

This reasoning is clear when we take into account that all the columns of B satisfy the basic equation of P-invariants 4.12 (see Equation 4.19).

Finding Q+ generators from a Z basis

To demonstrate a procedure finding a set of Q+ generators we consider first the following set of linear homogeneous inequalities:

B · y ≥ 0
    y ≥ 0     (4.21)

B is a Z-basis of size (m, r), where m is the number of places in the associated Petri net, r = m − rank(C), and y is a column vector of size (r, 1). Sets of linear equations may (in an obvious manner) be considered as a special case of sets of inequalities. The general idea for finding the Q+ generators is the following: The Z-basis B contains the complete information needed to generate all integer solutions of the problem - by linear combination (with integer factors) of the vectors of B. The set of all nonnegative solutions is a subset of all integer solutions, and the set of the Q+ generators is a subset of the set of all nonnegative solutions. Therefore it is possible to generate the Q+ generators by suitable linear combinations of the vectors of the Z-basis B. The problem is only to construct the Q+ generators by a systematic reduction procedure. A procedure described in Algorithm 4.5, taken from the paper [52], leads to the result. The following Algorithm 4.4 shows how a set of Q+ generators is found using the procedures HNFR (Hermite normal form of rectangular matrices) from Algorithm 4.3 and JAXY from Algorithm 4.5 (explained afterwards).

Algorithm 4.4:

%Q+ generator - Kruckenberg & Jaxy (LNCS 266, pp.104-131)
%Let C be an incidence matrix representing a Petri Net free of sink and
%source places (places having exclusively output or input arcs).
%X = QGEN(C) is a nonnegative matrix such that:
%1) each positive p-invariant can be written as a lambda
%   combination of the columns of X
%2) no column of X can be written as a lambda
%   combination of other columns of X
%Remark: lambda is a vector of nonnegative rational numbers Q+
%Calls - HNFR, JAXY
[m, n] = size(C)
%delete linearly dependent columns (then C has full column rank)
for j = 1 to n
  if column C(:,j) is a linear combination of C(:,1), ..., C(:,j-1)
    delete column C(:,j)
  end
end
%find the Z basis of the linear Diophantine system
K = hnfr(C')
B = K(:, (rank(C)+1) : last column)     %B is the Z-basis
%find a set of Q+ generators X
Y = jaxy(B)
X = B * Y
End of algorithm
So now we explain how a set of Q+ generators is found from the Z basis B(m,r) (Algorithm 4.5, consisting of the procedure JAXY). The idea is based on Theorem 4.3.

Theorem 4.3 Given Y^old, a finite set of Q+ generators of the set of linear homogeneous inequalities:

     y ≥ 0
B̃ · y ≥ 0     (4.22)

where B̃ is a given (k,r) integer matrix with 1 ≤ k ≤ m and y is a column vector of size (r,1) (the matrix B̃ is identical to the first k lines of the matrix B of size (m,r)). Then Y^new, a new finite set of Q+ generators of the set enlarged by a new constraint (the (k+1)-th line b of matrix B):

     y ≥ 0
B̃ · y ≥ 0
 b · y ≥ 0     (4.23)
is created by the following rules: 1) old generators satisfying constraint b · y ≥ 0 are put among new generators
2) let POZ be the set of old generators satisfying the constraint b · y > 0 and NEG the set of old generators satisfying the constraint b · y < 0. Find all couples [yi, yj] for which the conditions a), b) and c) are fulfilled and make the new generator ŷ = |b · yi| · yj + |b · yj| · yi.
Conditions:
a) yi ∈ POZ and yj ∈ NEG
b) the vectors yi and yj annul at least (r − 2) linear inequalities of the set 4.23 simultaneously
c) there does not exist a third vector yl (l ≠ i, j) which annuls all those inequalities which are annulled by yi and yj
Proof:
Rule 1) the proof is trivial because the new constraint does not change anything about the capability of the vector y to be a generator.
Rule 2) here we should recall that each generator annuls at least (r − 1) inequalities, but two generators never annul the same (r − 1) inequalities. So the generator ŷ annuls one equation by its definition and (r − 2) equations by being a linear combination of two vectors annulling the same (r − 2) equations. In addition we have to exclude the case when such a vector already exists (condition c)).

When applying the rules of Theorem 4.3 iteratively we calculate the set of Q+ generators:

Algorithm 4.5:

%Kruckenberg & Jaxy (LNCS 266, Theorem 6.4, page 121)
%Supposing B to be a rectangular integer matrix with full column rank
%( size(B,2) = rank(B) )
%Y = JAXY(B) is a matrix of Q+ generators such that X = B*Y
%fulfils the following:
%1) each nonnegative vector of the subspace given by base B
%   can be written as a lambda combination of the columns of X
%2) X >= 0
%3) no column of X is a lambda combination of other columns
%Remark: lambda is a vector of nonnegative rational numbers Q+
[m, r] = size(B)
M = identity matrix of size r*r    %subspace (corresponds to y >= 0 from 4.22)
Y = identity matrix of size r*r    %old generators
for k = 1 : m     %each place in the Petri net gives one constraint
                  %each iteration of this loop is based on Theorem 4.3
  %phase 0
  if equation k was not already present, then
    %phase 1 - update the subspace
    M = k-th line of B joined to M
    %calculate the scalar products
    S = B(k,:) * Y
    POZ = indices of generators with S > 0
    ZER = indices of generators with S = 0
    NEG = indices of generators with S < 0
    %find new generators of the updated subspace
    NG = [ ]
    %phase 2 - keep generators with nonnegative scalar product
    for j = 1 to length of POZ
      NG = [NG, Y(:,POZ_j)]
    end
    for j = 1 to length of ZER
      NG = [NG, Y(:,ZER_j)]
    end
    %phase 3 - for generators i, j whose scalar products differ in sign
    for i = 1 to length of POZ
      ii = POZ_i
      for j = 1 to length of NEG
        jj = NEG_j
        %phase 3.1 - do generators ii, jj annul at least r-2 equations?
        if annul
          ANNU = TRUE
          %phase 3.2 - does there exist any vector annulling the same equations?
          if exist
            EX = TRUE
          end
        end
        if ANNU & (not EX)
          %add a new generator
          alpha = abs(S(1,ii)) / gcd(S(1,ii), S(1,jj))
          beta  = abs(S(1,jj)) / gcd(S(1,ii), S(1,jj))
          NG = [NG, alpha*Y(:,jj) + beta*Y(:,ii)]
        end
      end
    end
    %new generators are old generators for the next iteration
    Y = NG
  end
end
End of algorithm
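A minimal Python sketch of one application of Theorem 4.3, i.e. one iteration of the main loop of Algorithm 4.5 (phase 0, the skipping of duplicate constraints, is omitted; all names are illustrative):

def add_constraint(generators, rows, b):
    """Update the Q+ generators of {y >= 0, rows*y >= 0} when b*y >= 0 is added.
    generators: list of r-dimensional integer vectors; rows: lines of B added so far."""
    def annulled(y):
        # inequalities of the current system annulled by y
        unit = [('u', i) for i, v in enumerate(y) if v == 0]
        extra = [('b', i) for i, row in enumerate(rows)
                 if sum(r * v for r, v in zip(row, y)) == 0]
        return set(unit + extra)

    s = [sum(bi * yi for bi, yi in zip(b, y)) for y in generators]
    keep = [y for y, sy in zip(generators, s) if sy >= 0]          # rule 1)
    new, r = [], len(b)
    for y_i, s_i in zip(generators, s):
        if s_i <= 0:
            continue
        for y_j, s_j in zip(generators, s):
            if s_j >= 0:
                continue
            common = annulled(y_i) & annulled(y_j)
            if len(common) < r - 2:                                 # condition b)
                continue
            if any(common <= annulled(y_l) for y_l in generators
                   if y_l is not y_i and y_l is not y_j):           # condition c)
                continue
            new.append([abs(s_i) * vj + abs(s_j) * vi               # rule 2)
                        for vi, vj in zip(y_i, y_j)])
    return keep + new, rows + [b]

Iterating add_constraint over the rows of B, starting from the unit vectors e1, ..., er as generators, reproduces the behaviour of the JAXY procedure.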
Example 4.7: Let us apply Algorithm 4.4 to the Petri net shown below.
First of all we find a Z-basis B applying procedure hnfr (given in the Algorithm 4.3):
B =
      b1  b2  b3  b4  b5
P1    -1   1   1  -1   1
P2     0   0   1  -1   0
P3    -1   1   0  -1   0
P4     1   0   0   0   0
P5     0   1   0   0   0
P6     0   0   1   0   0
P7     0   0   0   1   0
P8     0   0   0   0   1
Afterwards we apply the procedure jaxy (the 8 steps of this procedure correspond to the 8 Petri net places):

Step 0 - initialize M and Y

M = identity matrix of size 5x5 (columns b1 ... b5)

Y =
      y1  y2  y3  y4  y5
b1     1   0   0   0   0
b2     0   1   0   0   0
b3     0   0   1   0   0
b4     0   0   0   1   0
b5     0   0   0   0   1
Step 1 - y1, y2, y3 are chosen in phase 2 (e.g. y1 shows that b2 is OK for place P1) and y4 to y9 are chosen in phase 3.1 (e.g. y4 shows a combination of b1 and b2 to be OK for P1)

M =
      b1  b2  b3  b4  b5
       1   0   0   0   0
       0   1   0   0   0
       0   0   1   0   0
       0   0   0   1   0
       0   0   0   0   1
      -1   1   1  -1   1

Y =
      y1  y2  y3  y4  y5  y6  y7  y8  y9
b1     0   0   0   1   0   1   0   1   0
b2     1   0   0   1   1   0   0   0   0
b3     0   1   0   0   0   1   1   0   0
b4     0   0   0   0   1   0   1   0   1
b5     0   0   1   0   0   0   0   1   1
Step 2 - y1 to y7 are chosen in phase 2

M =
      b1  b2  b3  b4  b5
       1   0   0   0   0
       0   1   0   0   0
       0   0   1   0   0
       0   0   0   1   0
       0   0   0   0   1
      -1   1   1  -1   1
       0   0   1  -1   0

Y =
      y1  y2  y3  y4  y5  y6  y7
b1     0   1   0   0   1   0   1
b2     0   0   1   0   1   0   0
b3     1   1   0   0   0   1   0
b4     0   0   0   0   0   1   0
b5     0   0   0   1   0   0   1
Step 3 - y1 to y4 are chosen in phase 2, y5 and y6 are chosen in phase 3.1, y6 is eliminated in phase 3.2 (y6 annuls the 3rd, 4th and 7th equations in M, but these are already annulled by y3)

M =
      b1  b2  b3  b4  b5
       1   0   0   0   0
       0   1   0   0   0
       0   0   1   0   0
       0   0   0   1   0
       0   0   0   0   1
      -1   1   1  -1   1
       0   0   1  -1   0
      -1   1   0  -1   0
Step 4,5,6,7,8 - phase 0 (these lines are already present in M)

The resulting set of generators is:

X^T =
      P1  P2  P3  P4  P5  P6  P7  P8
       1   0   1   0   1   0   0   0
       1   1   0   0   0   1   0   0
       1   0   0   0   0   0   0   1
       0   0   0   1   1   0   0   0
       1   0   0   0   1   1   1   0

Y =
      y1  y2  y3  y4  y5
b1     0   0   0   1   0
b2     1   0   0   1   1
b3     0   1   0   0   1
b4     0   0   0   0   1
b5     0   0   1   0   0
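As a quick numerical sanity check of this example (a sketch using NumPy, not part of the thesis), the generators should satisfy X = B·Y with X ≥ 0:

import numpy as np

B = np.array([[-1, 1, 1, -1, 1],
              [ 0, 0, 1, -1, 0],
              [-1, 1, 0, -1, 0],
              [ 1, 0, 0,  0, 0],
              [ 0, 1, 0,  0, 0],
              [ 0, 0, 1,  0, 0],
              [ 0, 0, 0,  1, 0],
              [ 0, 0, 0,  0, 1]])
Y = np.array([[0, 0, 0, 1, 0],
              [1, 0, 0, 1, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0]])
X = B @ Y                    # columns of X are the Q+ generators of Example 4.7
print((X >= 0).all())        # expected: True
print(X.T)                   # expected to match the rows of X^T above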
4.3.4
Time complexity measures and problem reduction
When using the mentioned algorithms for further analysis it is important to know their time complexity. Algorithm 4.1 works in polynomial time, but its results are only of reduced applicability, because it is not able to find the maximum number of positive P-invariants. The time complexity of non-polynomial algorithms is often evaluated in terms of the size of the algorithm output. In this case it is sufficient to find an instance in which the algorithm output is of non-polynomial size; then we can say that the algorithm runs in non-polynomial time. Such an instance is shown in Figure 4.9, where g (the number of Q+ generators) is given by Equation 4.24:

g = k^n     (4.24)
Figure 4.9: Example of Petri Net with exponential number of generators

To facilitate the analysis of large systems, we often reduce the system model to a simpler one, while preserving the system properties to be analyzed. There exist many transformation techniques for Petri Nets [66, 91]. In our analysis we will use only very simple reduction rules which are a subset of more complex rules preserving liveness, safeness and boundedness. We will take only binary Petri Nets into consideration. In the case of the example in Figure 4.9 (a Fork-Join Petri Net) we can apply a reduction procedure (see Algorithm 4.6) and represent the Petri Net in the form of a 'recipe' containing the same information as the set of Q+ generators (the set of Q+ generators can be obtained as an evolution of the 'recipe'). The 'recipe' uses two operators (each operator specifies a relation between two nodes N1 and N2 representing either two places or transitions): ∥ denoting the parallel relation, and a serial relation operator (written below as juxtaposition).

Algorithm 4.6:

%FJ - algorithm reducing a fork-join part of a Petri Net
%[PPre,PPost,T,P] = FJ(Pre,Post) are nonnegative matrices such that:
%PPre and PPost are precondition and postcondition matrices of the
%reduced network and T and P are sets of transitions and places where
%each entry is a parallel/serial combination of the original Petri Net
%transitions/places
PPre = Pre
PPost = Post
T = [T1, T2, ..., Tn]
P = [P1, P2, ..., Pm]
while not KERNEL
  %place with one input and one output arc
  if there exist Px, Ty, Tz such that
       PPost(x,y) is the only nonzero element in PPost(x,:) and in PPost(:,y) and
       PPre(x,z) is the only nonzero element in PPre(x,:) and in PPre(:,z)
    then eliminate Px, Ty and Tz = Ty Px Tz
  %transition with one input and one output place
  elseif there exist Tx, Py, Pz such that
       PPre(y,x) is the only nonzero element in PPre(y,:) and in PPre(:,x) and
       PPost(z,x) is the only nonzero element in PPost(z,:) and in PPost(:,x)
    then eliminate Tx, Py and Pz = Py Tx Pz
  %two places with one input and one output transition
  elseif there exist Px, Py such that
       PPre(x,:) = PPre(y,:) and PPre(x,:) has just one nonzero element and
       PPost(x,:) = PPost(y,:) and PPost(x,:) has just one nonzero element
    then eliminate Py and Px = Py ∥ Px
  %two transitions with one input and one output place
  elseif there exist Tx, Ty such that
       PPre(:,x) = PPre(:,y) and PPre(:,x) has just one nonzero element and
       PPost(:,x) = PPost(:,y) and PPost(:,x) has just one nonzero element
    then eliminate Ty and Tx = Ty ∥ Tx
  else     %no serial/parallel relation found
    KERNEL = TRUE
  end
end
End of algorithm
When applying Algorithm 4.6 to the example in Figure 4.9 we obtain in polynomial time the 'recipe' T1 (P11 ∥ · · · ∥ P1k) T2 (P21 ∥ · · · ∥ P2k) · · · Tn (Pn1 ∥ · · · ∥ Pnk), containing the same information as the set of Q+ generators (the set of Q+ generators can be obtained by evolution of the 'recipe' to the form ( )∥( )∥( )). Unfortunately not all Petri Nets are representable by such a 'recipe', but we can always run Algorithm 4.6 before running the algorithms finding the set of Q+ generators in order to restrict the nonpolynomial execution time to the kernel (non-reducible) part of the Petri Net.
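To illustrate the blow-up that the 'recipe' avoids, the following sketch expands the recipe of Figure 4.9 into its generators, assuming (as the count g = k^n suggests) that the fork-join stages form a cycle, so that every generator picks exactly one place from each parallel group; the helper name is hypothetical:

from itertools import product

def expand_recipe(groups):
    """groups: list of n lists, each holding the k parallel places of one stage,
    e.g. [["P11", "P12"], ["P21", "P22"]]. Returns the k**n generator supports."""
    return [set(choice) for choice in product(*groups)]

n, k = 3, 2
groups = [[f"P{i}{j}" for j in range(1, k + 1)] for i in range(1, n + 1)]
gens = expand_recipe(groups)
print(len(gens))   # k**n = 8 generators, while the recipe itself lists only n*k = 6 places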
A Petri Net of this kind is shown in Example 4.8.
Example 4.8: Let us apply Algorithm 4.6 to the Petri Net shown below.

C =
      T1  T2  T3  T4  T5  T6  T7
P1     1  -1  -1   0   0   0   0
P2     0   1   1  -1   0   0   0
P3     0   0   0   1  -1   0   0
P4     1   0   0   0   0  -1   0
P5     0   0   0   0   1  -1   0
P6     0   0   0   0   1   0  -1
P7     0   0   0   0   0   1  -1
P8     0   0   0   0   0   1  -1
P9    -1   0   0   0   0   0   1
The result is seen below - this Petri Net is the kernel, which cannot be written in the form of Fork-Join relations.

C =
       T1  T5  T6  T7
PP1     1  -1   0   0
P4      1   0  -1   0
P5      0   1  -1   0
P6      0   1   0  -1
PP2     0   0   1  -1
P9     -1   0   0   1
Here the 'recipe' for PP1 is P1 (T2 ∥ T3) P2 T4 P3 and the 'recipe' for PP2 is P7 ∥ P8. The Q+ generators of the kernel can be found either by Algorithm 4.4 or by Algorithm 4.2:
XX^T =
      PP1  P4  P5  P6  PP2  P9
        1   0   0   1    0   1
        0   1   0   0    1   1
        1   0   1   0    1   1
The Q+ generators X of the original Petri Net are found by evolution of PP1 and PP2.
X^T =
      P1  P2  P3  P4  P5  P6  P7  P8  P9
       1   1   1   0   0   1   0   0   1
       0   0   0   1   0   0   1   0   1
       0   0   0   1   0   0   0   1   1
       1   1   1   0   1   0   1   0   1
       1   1   1   0   1   0   0   1   1
The reduction of Fork-Join Petri Nets leads to a polynomial-size 'recipe' containing the information about the structural properties. This fact reveals the possibility of basing scheduling algorithms for Fork-Join Petri Nets on the 'recipe' rather than on the set of Q+ generators. This observation corresponds to studies proving the general scheduling problem to be NP-hard, while polynomial algorithms are known only for special classes of DAGs (such as Fork-Join graphs and trees) [16].
4.3.5
Conclusion
The objective of this chapter was to study structural properties of directed, weighted, bipartite graphs called Petri Nets. Some of the most distinctive features of the chapter are:
• It introduces the notion of minimal standardized invariants called generators, which are used in data dependence analysis in Chapter 5 and in scheduling algorithms in Chapter 6.
• It proves the reduced applicability of Algorithm 4.1, because it is not able to find the maximum number of positive P-invariants.
• It shows an implementation of two different algorithms finding the set of Q+ generators, accompanied by illustrative examples.
• It points out the non-polynomial size of the set of Q+ generators and describes a simple and original reduction method.
One of the fundamental features of this chapter is its close relation to other disciplines, namely convex geometry [17] and integer linear programming [93]. Therefore we can find many algorithms, some of them dating from the last century [28, 29], using different terminologies. Thus non-negative left annullers of a net's flow matrix are called positive P-invariants, or P-semiflows, or directions of a positive cone. In a similar way the set of Q+ generators is called the 'set of minimal support invariants', or the 'set of extremal directions of a positive cone', or simply the 'generator of P-semiflows'. A recent article [19] by Colom & Silva highlights the connection between convex geometry and Petri Nets and presents two algorithms using heuristics for selecting the columns to annul. Performance evaluation of several algorithms can be found in [87] by Treves.
Chapter 5

Petri net based algorithm modelling

This chapter focuses on algorithm representation by means of Petri Nets. Algorithm modelling and parallelism analysis are essential for designing new parallel algorithms, transforming sequential algorithms into parallel forms, mapping algorithms onto architectures, and, finally, designing specialized parallel computers. Some of these problems, like scheduling, will be addressed in the next chapter. PNs are frequently used in modelling, designing, and analyzing concurrent systems (see [82], [11], [12]) owing to their capability to model and to visualize behaviours including concurrency, synchronization and resource sharing. Dependency analysis based on PNs for synthesizing large concurrent systems was given by Chen et al. [13]. The proposed method, knitting, synthesizes large PNs by combining smaller PNs, and the basic properties are verified by dependence analysis. This chapter gives some additional terminology first; then various modelling techniques are briefly compared. It is stated that a model can be based either on the problem analysis or on the sequential algorithm. The problems are modelled as noniterative or iterative ones and the corresponding algorithms are modelled as acyclic or cyclic ones. Finally an attempt is made to put DDGs and Petri Nets on the same platform by removing antidependencies and output dependencies. This chapter contains many original ideas; that is why it is written in an educative manner. It is possible that some ideas were just rediscovered, and I will be very grateful for all comments on the subject.
5.1
Additional basic notions
This paragraph introduces additional terminology needed for further analysis of PN models and for comparison of various formalisms.
Figure 5.1: Implicit place
5.1.1
Implicit place
Definition 5.1 Let a PN with P = P1, ..., Pm and T = T1, ..., Tn be given. A place Px is called an implicit place iff the two following conditions hold:
1) any reachable marking M(Px) is equal to a positive linear combination of markings of places from P
2) M0(Px) does not impose any additional condition on firing its output transitions

The meaning of condition 2 from Definition 5.1 is explained in Figure 5.1. The PN given in Figure 5.1(a) shows a case where the place P1 is implicit to places P2 and P3. The structural property reflected by condition 1 from Definition 5.1 is fulfilled because M(P1) = M(P2) + M(P3). Condition 2 is fulfilled because M0(P1) ≥ M0(P2) + M0(P3). In Figure 5.1(b) the place P1 is not implicit, because condition 2 is not fulfilled. The notion of an implicit place is important for structural parallelism detection, because implicit places do not bring any information about data dependencies.
5.1.2
FIFO event graph
A basic assumption that will be made throughout this thesis is that both places and transitions are First In First Out channels.

Definition 5.2 A place pi is FIFO if the k-th token to enter this place is also the k-th token which becomes available in this place. A transition tj is FIFO if the k-th firing of tj to start is also the k-th to complete. An event graph is FIFO if all its places and transitions are FIFO.

This assumption is needed when cyclic algorithms are modelled using event graphs.
5.1.3
Uniform graph
Let us introduce a new graph theory formalism, called uniform graph. Uniform graphs are used to model cyclic algorithms (see chapter 9 in [16]) with uniform constraints. The algorithms with uniform constraints consist of n statements having in general the following form: xi (k) = fi (x1 (k − βi1 ), ..., xn (k − βin ))
(5.1)
where k is the iteration index (k = 0, 1, 2, ...) and βi1 to βin are constants from N. That means that there is no statement of the form x2(k) = f2(x1(2k − 3)) (linear constraints) or of the form x2(k + 2) = f2(x1(k + 1)) (general uniform constraints). It is important to notice here that only general uniform constraints will be considered further in this thesis. For more details see Equation 5.3.

[Figure 5.2 compares the expressive power of: general Petri Nets; (Max,+) algebra; uniform graph; marked event graph; directed graph; unmarked event graph; DAG.]
Figure 5.2: Expressive power of different modelling methods
Definition 5.3 The triple (G, L, H) is a uniform graph if:
G(V, E) is a directed graph, where V is a set of vertices and E is a set of edges,
L : E → {N, 0} is a length associated with each edge,
H : E → N is a height associated with each edge.

The notion of the uniform graph is given here in order to compare modelling methods based on various formalisms coming from different scientific branches. Figure 5.2 summarizes in brief the relationships among some modelling
methods used in this thesis. The time aspect (usually associated with vertices or edges) is not taken into consideration, for the sake of figure simplicity. Figure 5.2 shows a hierarchy of different modelling methods given by their generality; two methods in the same rectangle have the same expressive power. In a certain sense it clarifies why only DAGs and marked event graphs are used to model and schedule algorithms. An event graph, in contrast to a general Petri Net, does not contain structural conflict, which introduces nonlinear behaviour. So event graphs are more popular among theoreticians analyzing net properties and solving scheduling problems. An unmarked event graph containing a cycle does not tell whether the cycle contains zero tokens (evoking a deadlock situation), one token (reflecting the fact that all transitions in the cycle have to be fired in sequence), or more tokens. So only acyclic versions of directed graphs are of interest. For a deeper discussion on this subject see paragraph 6.1.3. In the following text only event graphs instead of general PNs will be used, because no conditional branching, represented by structural conflict, will be taken into consideration.
5.2
Model based on the problem analysis
In this paragraph we will focus on the situation when the model is constructed directly from the problem specification. It means that we do not make use of any sequential algorithm, because a sequential algorithm specifies the order in which the instructions will be performed. In other words, when we construct a model directly at the moment of the problem analysis and there is no conditional branching, then the model contains just data dependencies.
5.2.1
Noniterative problems
The situation mentioned above is seen from Figure 5.3, serving as a data-flow model of the simple numerical problem given by the equation y = (x^2 + 1)(x^2 − 1). Figure 5.3(a) shows a Petri Net model constructed in the following way:
• transitions correspond to computational blocks (e.g. procedures or separate instructions)
• data are represented by places
• the input relation of data to the computational blocks is represented by matrix Pre
• the output relation of data to the computational blocks is represented by matrix Post
• presence of a token in a place signifies validity of the data
For a correct use of PNs it is necessary to impose the following restriction: each datum is represented by as many places as the number of times the datum is used. The example given in Figure 5.3(a) shows a data-flow computation where places P1 and P2 represent the same value.
Figure 5.3: Representation of data flow by means of PNs and DAG Let us clerify a relationship between two modelling methods - Petri Nets and DAGs. Figure 5.3(b) shows a DAG representation of the same problem as a Petri Net in Figure 5.3(a). In this case the DAG has the same information for scheduling purposes as the Petri Net given in Figure 5.3(a) because: • no program cycles are assumed • no gain of pipe-line parallelism, enabled by more tokens, is assumed Another example of a noniterative problem is the matrix-vector multiplication, see Figure 5.4. For a fixed matrix size [3,3] and a vector with three entries we can write the following equation:
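A minimal sketch of the Pre/Post construction above for the example y = (x^2 + 1)(x^2 − 1); the transition names (SQR, DEC, INC, MUL) follow Figure 5.3(a), while the matrix layout and variable names are assumptions made for illustration:

import numpy as np

# places: P0=x, P1=x^2, P2=x^2 (copy), P3=x^2-1, P4=x^2+1, P5=y
# transitions: SQR, DEC, INC, MUL
Pre = np.array([[1, 0, 0, 0],      # P0 is read by SQR
                [0, 1, 0, 0],      # P1 is read by DEC
                [0, 0, 1, 0],      # P2 is read by INC
                [0, 0, 0, 1],      # P3 is read by MUL
                [0, 0, 0, 1],      # P4 is read by MUL
                [0, 0, 0, 0]])
Post = np.array([[0, 0, 0, 0],
                 [1, 0, 0, 0],     # SQR writes x^2 into P1 ...
                 [1, 0, 0, 0],     # ... and into its copy P2
                 [0, 1, 0, 0],     # DEC writes x^2 - 1 into P3
                 [0, 0, 1, 0],     # INC writes x^2 + 1 into P4
                 [0, 0, 0, 1]])    # MUL writes y into P5
C = Post - Pre                     # incidence matrix of the data-flow net
M0 = np.array([1, 0, 0, 0, 0, 0])  # one token in P0: the input x is available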
a11 a12 a13 a21 a22 a23 a31 a32 a33
.
b1 b2 b3
=
c1 c2 c3
(5.2)
This example will be used later to show detection of data parallelism and global communications.
92 a11
CHAPTER 5. PETRI NET BASED ALGORITHM MODELLING b1
a12
b2
a13
b3
a21
b1
c1
a22
b2
a23
a31
b3
b1
a32
b2
a33
c3
c2
Figure 5.4: Matrix[3,3]-vector[3] multiplication
5.2.2
Iterative problems
PN model of a PD controller Let us imagine a simulation problem shown on Figure 5.5 where a discrete time linear system of second order with a PD controller is modelled. w(k)
e(k)
P -1
z
e(k-1)
D
p(k)
u(k)
System
y(k)
d(k)
Figure 5.5: Discrete time linear system The following equations hold for the specific blocks of Figure 5.5 (please notice that the order in which the equations are written is not important): • Sum block: e(k) = f1 (w(k), y(k)) • Controller: p(k) = f2 (e(k)) d(k) = f3 (e(k), e(k − 1)) u(k) = f4 (p(k), d(k))
b3
5.2. MODEL BASED ON THE PROBLEM ANALYSIS
93
• System: x1 (k + 1) = f5 (x1 (k), x2 (k), u(k)) x2 (k + 1) = f6 (x1 (k), x2 (k), u(k)) y(k) = f7 (x1 (k), x2 (k))
1
x1 T5
e w
T1
T2
p
e
1
e
u T4
T3
d
1 1
x2
1
u
x1
x1
T7
1
y
x2
T6
1
x2
Figure 5.6: PN model of the discrete time linear system of second order with PD controller shown in Figure 5.5 The general structure of the iterative problem under consideration consists of n statements having in general the following form: xi (k + αi ) = fi (x1 (k − βi1 ), ..., xn (k − βin ))
(5.3)
where: k is the iteration index (k =0,1,2,...,) α,β are constants from Z (this assumption implay that also negative number of tokens will be under considaration) f1 ,...,fi ,...,fn are functions Rn −→ R represented by the PN transitions x1 ,...,xi ,...,xn are variables from R represented by the PN places (each variable is represented by so many PN places as many times it is read) Rule 5.1: The PN model of iterative problems consisting of statements given by Equation 5.3 can be constructed in the following way: 1) create the transition Ti for each function fi with the input places corresponding to the input variables (matrix P re) 2) put β tokens into the transition input places 3) draw arcs from transitions to places (matrix P ost) 4) put α tokens into the transition output places.
94
CHAPTER 5. PETRI NET BASED ALGORITHM MODELLING
Remark: Notice that if some α and β are negative numbers then the resulting number of tokens can be positive. Finally, a positive number of tokens in the PN model corresponds to variables that have to be initialized. An algorithm, reading data before their actualisation, implies that a certain part of the model is not live (e.g. comprising a negative number of tokens in a place). Neural network PN model Behaviour of the neural network algorithm given by equations 3.1 to 3.6 in Section 3.2 is modelled by the PN in Figure 5.7. Place and transition representations are as follows: P0 ... inputs to neural network T0 ... sigmoid function in input layer P1 ,P10 ,P2 ,P20 ,P200 ,P3 ,P30 ,P300 , P4 ... outputs from layers T1 ,T2 ,T3 ... activation procedures P5 ... desired network outputs T4 ... error evaluation in output layer T5 , T6 ... error evaluation in hidden layers P6 ,P60 ,P7 ,P70 ,P8 ... error values T7 ,T8 ,T9 ... learning procedures (when α = 0) P9 ,P90 ,P900 ,P10 ,P100 ,P1000 ,P11 , P110 ... weights The initial markings in P9 ,P90 ,P900 ,P10 ,P100 ,P1000 ,P11 , P110 represent initial weights generated by a random generator.
5.2.3
Model reduction
In order to show which transitions could be computed in parallel it is needed to simplify the PN model - eliminate the places that do not influence sequential execution of any transitions because this sequential execution is given by other data dependencies. Implicit places and self-loop places are places of this kind. Correct elimination of implicit places and self-loop places preserves the properties of liveness, safeness and boundedness because this elimination does not change the graph of reachable markings. Manual reduction Let us consider for example the Petri net given in Figure 5.7. Then the manual model reduction could be done as follows: First of all we can eliminate the self-loop places P90 ,P100 and P110 . The place P30 is implicit to the places P3 -P4 -P6 and the place P20 is implicit to the places P2 -P3 -P4 -P6 -P7 . The resulting PN is shown in Figure 5.8.
5.2. MODEL BASED ON THE PROBLEM ANALYSIS
9:; 0! $ &'
-'
-(
&(
9?:;@