THE PAPIA2 MACHINE: HARDWARE AND SOFTWARE ARCHITECTURE
A. Biancardi, V. Cantoni, M. Ferretti and M. Mosconi Dipartimento di Informatica e Sistemistica University of Pavia Via Abbiategrasso 209 I-27100 PAVIA, ITALY Tel: int +39.382.391350
ABSTRACT This paper presents the overall structure of PAPIA2, a pyramid system belonging to the family of massively parallel machines. It embeds the topology of the quad-pyramid into a highly regular, fault-tolerant, eight-connected processor array by means of specially reconfigurable near-neighbor interconnections. The system comes with a fully-fledged software environment designed to optimize the use of machine resources. The highly interactive graphic tools help in understanding the machine's capabilities, provide a valuable testbed for the machine instruction set, and offer a suitable context for monitoring program execution.
1. Introduction
In 1984 a consortium of five Italian universities initiated a project for the design and construction of a new system for image analysis. This system was called PAPIA (Pyramidal Architecture for Parallel Image Analysis), and can be classified in the family of "compact" quad-pyramids. The novel feature of PAPIA, with respect to other quad-pyramid proposals, consisted of the three-dimensional topology of the network of PEs included in the chip. In fact, it is based on a chip of 5 PEs which make up the basic two-level pyramid. This chip is the building block used to assemble the pyramid.
A small working prototype of PAPIA was already running in 1987. It consisted of a pyramid of four levels, with a base of 8x8 PEs, for a total of 85 processors. An industrial version of the system was not released because of the mismatch between the usual image size and the feasible machine size which could be implemented with a 5-PE chip. On the basis of this experience a new project started in 1988 at the Department of Computer Engineering of Pavia University, with the purpose of maintaining the full capability of the pyramidal topology, while overcoming the limitations imposed by the cumbersome interconnection needs and functional specialization of the initial solution. This new system, called PAPIA2, is described here. First, some preliminary statements are given on the overall system specifications. Subsequently the main characteristics of the PEs and of the hardware architecture are described, together with some highlights of the software environment.
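The processor counts quoted above follow directly from the quad-pyramid geometry, in which each level has one quarter of the PEs of the level below. The following minimal sketch (the function name is ours, introduced only for illustration) reproduces the 5 PEs of the two-level chip and the 85 PEs of the four-level prototype:

```python
def pyramid_pe_count(base_side: int) -> int:
    """Total number of PEs in a quad-pyramid with a base of base_side x base_side.

    The side is halved at every level until the single-PE vertex is reached,
    so base_side is assumed to be a power of two.
    """
    if base_side < 1 or base_side & (base_side - 1):
        raise ValueError("base_side must be a positive power of two")
    total = 0
    side = base_side
    while side >= 1:
        total += side * side  # PEs in the current level
        side //= 2            # the next level has half the side
    return total


two_level_chip = pyramid_pe_count(2)   # 4 + 1
prototype = pyramid_pe_count(8)        # 64 + 16 + 4 + 1
```

Since the total grows roughly as 4/3 of the base area, the count is dominated by the base, which is why a 5-PE building block scales poorly to usual image sizes.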
2. The Hardware Architecture
The embedding of a quad-pyramid onto a flat array, which preserves modularity, recursiveness, and fault tolerance, has been described in several papers(1, 2). In this section the basic characteristics of this approach are summarized and its implementation in the PAPIA2 machine is described in detail. The system core is an array of PEs with the following structure:
• the flat array of PEs is conceived as a multi-grid physical structure which allows simultaneous operation on the same image at multiple resolutions;
• three operative modalities are possible:
1) the base modality, in which the image is processed at full resolution in the SIMD mode (the machine works as an eight-connected flat array);
2) the “horizontal” modality, which allows multi-resolution operation by configuring the array as a set of disjoint, non-interfering, four-connected meshes (this excludes full resolution);
3) the “vertical” modality, which allows data exchanges between pairs of consecutive layers, according to the quad-tree topology;
• two different regular meshes make up the array, as shown in figure 1. The PEs communicate with each other through a switch lattice. Both meshes are programmable: the array of PEs with the above-mentioned Multi-SIMD modality, and the switch mesh with a SIMD modality;
• each processor plays at most two roles: one at full resolution, and a second one at a reduced resolution. When the array operates in the horizontal modality, those PEs that work only at full resolution short-port themselves (are by-passed) in the North-South and East-West directions, and vice versa.
The simplicity of this hardware solution makes it easy to identify the PEs belonging to a given level, and to establish the role of any given PE. Detours for by-passing faulty PEs are easily found, and the array reconfiguration is implemented in an effective way without modifying the general control and code distribution.
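The quad-tree topology used by the vertical modality can be sketched in logical coordinates as follows. This is only an illustration of the standard quad-pyramid parent/child relation; the function names and the (level, row, column) addressing are our assumptions, and the physical placement of each node in the PAPIA2 flat array (described in (1, 2)) is not modelled.

```python
def parent(level, row, col):
    """Logical parent of a pyramid node; level 0 is the base."""
    return (level + 1, row // 2, col // 2)


def children(level, row, col):
    """The four logical children of a node that lies above the base."""
    if level == 0:
        raise ValueError("base nodes have no children")
    return [(level - 1, 2 * row + dr, 2 * col + dc)
            for dr in (0, 1) for dc in (0, 1)]


# Every child of a node maps back to that node through parent().
node = (1, 2, 3)
kids = children(*node)
```

In the vertical modality a data exchange involves only such parent/child pairs, i.e. two consecutive layers at a time.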
2.1. The Interconnection Network
As shown in figure 1, the distribution of switches and PEs follows a bidirectional 4-connectivity: each PE (switch element) is connected to four switches (PEs). However, at full resolution the array connectivity is eight, which is achieved by distributing part of the interconnection logic in the switches, as described in (2).
The near-neighbor access modality follows a broadcasting-gating scheme. Each PE broadcasts its binary signal to all its direct neighbors (8 in the base, and 4 in the higher levels and in the vertical mode), and gates in a subset of neighbors on the basis of an enabling vector distributed by means of the opcode, in SIMD mode, to all the PEs in the level. In particular, near-neighbor operations can be applied in recursive mode, thus allowing the identification of connected components, through a seed-growing technique, in one elementary machine operation. Note that this propagation, which allows the manipulation of connected components as atomic data, can be executed simultaneously at different resolutions (excluding full resolution) without interference among planes. This is a very peculiar capability, usually not implemented even in the richest reconfigurable mesh arrays(3,4). These features (including the 8-connection capability of the base) are realized by the above-mentioned bidirectional 4-connected scheme, which minimizes the inter-chip and inter-board connections.
Besides the broadcasting/gating interconnection network, a second network realized in the PE mesh allows a column-wise parallel data distribution. This second interconnection structure is composed of one delay element per PE, arranged as a row-distributed shift register operating in complete overlap with the PE. The main reason for introducing this shift register is to solve the image I/O bottleneck.
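The effect of a recursive near-neighbor operation can be illustrated in software as a seed-growing propagation. The sketch below is a sequential stand-in, not a model of the hardware: on the machine the propagation settles asynchronously within a single instruction, whereas here a work queue is used. The names, the list-of-lists image representation, and the choice of clipping the seed to the image are our assumptions; `enabled` plays the role of the enabling vector, expressed as the offsets from which a PE accepts data.

```python
from collections import deque

NEIGHBORS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
NEIGHBORS_4 = [(-1, 0), (0, -1), (0, 1), (1, 0)]


def propagate(seed, image, enabled=NEIGHBORS_8):
    """Grow `seed` inside the binary `image` along the enabled directions."""
    rows, cols = len(image), len(image[0])
    # Assumption: a seed pixel only counts where the image is set.
    grown = [[int(bool(seed[r][c]) and bool(image[r][c])) for c in range(cols)]
             for r in range(rows)]
    frontier = deque((r, c) for r in range(rows) for c in range(cols) if grown[r][c])
    while frontier:
        r, c = frontier.popleft()
        for dr, dc in enabled:
            # The PE at (nr, nc) gates in its neighbor at (nr + dr, nc + dc) = (r, c).
            nr, nc = r - dr, c - dc
            if 0 <= nr < rows and 0 <= nc < cols and image[nr][nc] and not grown[nr][nc]:
                grown[nr][nc] = 1
                frontier.append((nr, nc))
    return grown


image = [[1, 1, 0, 0],
         [0, 1, 0, 1],
         [0, 0, 0, 1],
         [1, 0, 0, 1]]
seed = [[1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
component = propagate(seed, image)
```

The result contains only the connected component touched by the seed; restricting `enabled` to `NEIGHBORS_4` gives the four-connected behaviour of the upper levels.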
2.2. The Elementary Processor
The block diagram of the elementary processor is shown in figure 2. As in most single-bit processors for fine granularity, the major components of the PE are: i) the ALU; ii) the set of core registers; iii) the interconnection logic with near neighbors; iv) the local on-chip memory.
a) The ALU
As mentioned above, the PE adopts bit-serial arithmetic. The ALU is a standard full adder, and the basic boolean operations (namely And, Or, Ex-Or, Ex-Nor, plus Not) between the input signals are achieved by suitably setting one of the inputs. In particular, it is possible to implement data register exchanges through the ALU (following Ni(5) this can be called a TALU). The three input signals are respectively: 1) the accumulator A; 2) either the on-chip RAM or the near-neighbor result; 3) either the carry register C, or the accumulator A, or a signal set by the chip decoder. The two output signals (SM for sum, and CY for carry) are recirculated respectively to the A register, which acts as accumulator, and to either the carry register C or the accumulator.
b) The Registers
The structure of the PE is extremely simple, and the boolean registers for storing intermediate results are kept to a minimum. In all, five registers make up the whole set: apart from the A and C registers already mentioned, there are the so-called Mask register, the I/O register for loading/unloading bit-planes, and the UP register. Their functions are briefly reviewed here.
As for A, its usage as accumulator has already been covered. In near-neighbor operations, it is both the source of the broadcast data and the recipient of the newly computed value. To implement effectively the subset of recursive near-neighbor operations, the A register is designed to operate also as an asynchronous Set-Reset flip-flop (during the other operations, it works as a synchronous master-slave D-type flip-flop). In the same type of operations, the C registers of the plane where the operation is running store the binary image over which the propagation occurs(6).
The Mask register implements the conditioning logic, whereby only the PEs with the register set to one carry out the current instruction. In practice, the content of the Mask register is used to condition the control signals that strobe the data into the other registers and into the local memory. It is usually loaded from the memory (this operation is conditional on the content of the Mask register itself), but can be set unconditionally through a special instruction. Besides realizing this limited operation autonomy in the PEs (see (7) for a detailed discussion of PE processing autonomy in SIMD systems), the Mask register allows the implementation of rather complex operations, such as integer multiplication, in a non-conventional way.
The UP register is a configuration register. It specifies whether a PE belongs only to the base (UP=0), or to another plane as well (UP=1). It can only be loaded from memory. By loading a pre-stored bit-plane from memory into the UP registers of the mesh array, the topology of interconnections within the array can be configured at run time. In this way the pyramid vertex can be dynamically placed in the array where it is most appropriate according to the algorithm and to the image. Moreover, such a scheme is the basis for implementing reconfiguration strategies, should any PEs be faulty.
Finally, the I/O register is used to realize the column-wise parallel data distribution that loads (unloads) images into (from) the array, as described in section 2.1.
c) The Near Neighbor Circuitry
Two sections of the PE handle the near-neighbor data. The first one is the gating logic, which accepts the four data which come from the four switches around the PE, and produces the Or signal subsequently fed to the ALU for non-recursive and recursive near-neighbor operations. The second section handles the short-porting logic required in those PEs that do not belong to the upper levels of the pyramid. It is realized through a set of multiplexers, to which the four incoming values are routed with a minimum of intervening gates to minimize signal delays. The control of these multiplexers takes into account the value of the UP register and the position of a PE within the chip, and is therefore consistent with the reconfiguration capability based on the content of the UP register.
d) The Instruction Set
The instruction length is 20 bits, including one byte for memory address or near-neighbor selection. The other twelve bits are used, in an almost orthogonal way, to identify ten instruction classes. These classes include simple and recursive near-neighbor operations, load and store memory operations, arithmetic and logical operations with an operand from memory, logical and register transfer operations, and finally the class of set/reset operations on the basic registers.
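The way a full adder yields the boolean operations listed in item a), and the way the Mask register of item b) conditions an operation, can be sketched as follows. This is an illustrative model only: the full-adder identities are standard, but which input the chip actually forces for each operation is not stated here, so the choices below are assumptions, as are all function names. Operands are little-endian lists of bits, mimicking bit-planes held in one PE.

```python
def alu(a, b, c):
    """One-bit full adder: returns (SM, CY) for inputs a, b, c."""
    sm = a ^ b ^ c
    cy = (a & b) | (a & c) | (b & c)
    return sm, cy


def boolean_ops(a, b):
    """Boolean operations obtained by forcing one adder input (assumed mapping)."""
    xor_, and_ = alu(a, b, 0)   # third input 0: SM = Ex-Or, CY = And
    xnor_, or_ = alu(a, b, 1)   # third input 1: SM = Ex-Nor, CY = Or
    not_a, _ = alu(a, 0, 1)     # second input 0, third input 1: SM = Not a
    return {"and": and_, "or": or_, "xor": xor_, "xnor": xnor_, "not": not_a}


def bit_serial_add(x_bits, y_bits, mask=1):
    """Add two equal-length operands one bit per step; a masked-out PE is unchanged."""
    if not mask:
        return list(x_bits)     # Mask = 0: no register or memory is strobed
    carry = 0                   # plays the role of the C register
    result = []
    for x, y in zip(x_bits, y_bits):
        sm, carry = alu(x, y, carry)
        result.append(sm)
    return result               # the final carry is dropped in this sketch


total = bit_serial_add([1, 1, 0, 0], [1, 0, 1, 0])          # 3 + 5, least significant bit first
untouched = bit_serial_add([1, 1, 0, 0], [1, 0, 1, 0], mask=0)
```

An n-bit addition therefore takes n elementary steps per PE, while all enabled PEs of a level proceed in lockstep.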
2.3. The On-Chip Memory
The on-chip memory is composed of 32 Kbit, allocated in blocks of 256 bits to each of the 8x16 PEs. In order to optimize the silicon area, the memory is designed as a dynamic RAM laid out in an array of 8x32. The activity of the memory is conditioned by the Mask register during the store operation, while the load operation is executed even if the PE is not masked. PEs that play two roles (base and one upper level) logically split the local RAM into two halves: one third of the nodes of the pyramid mapped into the array has only 128 bits of on-chip RAM. All the PE registers can be loaded with the memory content in one clock cycle. Moreover, data from memory can be used, together with the accumulator A, as operands in many dyadic operations. Storing is implemented directly from the A register (A and A ), and the I/O register.
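The memory figures above can be cross-checked with a few lines of arithmetic. The constant names are ours; note also that reading the 8x32 layout as the organization of a single PE's block is an assumption suggested by the fact that 8 x 32 = 256, not something stated explicitly.

```python
PES_PER_CHIP = 8 * 16          # PEs on one chip
BITS_PER_PE = 256              # local memory block of each PE
DRAM_ROWS, DRAM_COLS = 8, 32   # assumed to describe one PE's block

chip_bits = PES_PER_CHIP * BITS_PER_PE   # total on-chip memory in bits
block_bits = DRAM_ROWS * DRAM_COLS       # bits in one 8x32 array
bits_per_role = BITS_PER_PE // 2         # share of a PE that plays two roles
```

With these values the chip total is 32 x 1024 bits, and a PE serving both the base and an upper level leaves 128 bits to each of its two roles.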
3. The Software Environment
The design of a new architecture gives the opportunity to rethink the requirements, characteristics and structure its software development support ought to provide. Dealing with PAPIA2 and other fine-grained architectures, the traditional approach of grouping tools with textual interfaces and loose integration is clearly insufficient. The main reasons are that the objects to be worked upon are images, and the implementation of computer-vision tasks calls for an extension of common high-level languages and their debugging/development tools if image manipulation has to be transparent to the programmer.
The inadequacy of existing environments stems from the absence of a global, guiding philosophy for the design phase: a deep foundation capable of coping with the difficulties of fine-grained machine programming and the peculiarities of image-processing software. The starting point has been the identification of the techniques that underpin the creation of an application, i.e. a problem-solving program. Low- and intermediate-level stages of image analysis are mainly a collection of basic, "atomic" transformations of images either into images or into symbolic data. Thus the border can easily be drawn between the application domain, where problem-related flow control and task invocation lie, and what can be defined as the system library domain, where problem-independent code resides. This approach can be restated with the application machine terminology of Lawson(8) in his quest for understandable systems. Within this framework an application is seen as a collection of (data-dependent) calls to "situation handlers", that is, specific routines that perform a simple action and provide an "easy route" for the task-resource mapping problem. Hence the target has been the release of a collection of situation handlers, or the "system library", which in turn requires addressing the problem of its creation in terms of its contents (what is placed within the library) and implementation (language/approach).
The latter point deserves particular attention. In fact, owing to the novelty and complexity of the PAPIA2 architecture, it is necessary to precede the library programming with an acquaintance period, which may be further subdivided into two stages. The first step occurs when the would-be programmer acquires a basic knowledge of the architecture (processing element model, instruction set, communication capabilities, etc.). Later, the programmer analyzes the code-creating process, trying to understand strong and weak points of the architecture or to locate potential flaws. Furthermore, it should be emphasized that the central role of images must always be retained, at first as a visual feedback of instruction execution, later as a visibility feature for the monitoring of procedures. Thus four main requirements were singled out: i) easy opcode creation; ii) graphical machine status visualization; iii) minimization of global complexity; iv) consistency of user-interface among the various tools. These requirements are common to three key activities, which will be analyzed in the following sections:
• getting to know the machine
• programming macro-sequences
• extending the macro-sequences into parametrized functions.
3.1 A Soft Approach to PAPIA2 Understanding
The approach to the programming of a new, non-traditional machine is often somewhat slow and complex. The excitement and the curiosity created by the new resources available are usually severely tested by the continuous consulting of manuals (and data sheets) which are not as handy or clear as we would like. Furthermore, the understanding of most basic functions usually relies on instinct and sagacity, and the validation of new ideas can require a big effort. The aim of easing the acquaintance with PAPIA2, in order to optimize the work of new users and to permit an effective test by potential clients, has been principally accomplished by means of a couple of graphic tools: Xilebo and DataPeeker, respectively represented in figures 3 and 4. The completion of a software simulator "at the machine level" of PAPIA2, which is capable of executing instructions written in the PAPIA2 micro code, allows the use of these tools (and more generally of the whole software environment) even when the machine is not available, for both testing and development purposes.
Xilebo (namely X Interface for the LEarning of Basic Operations) is a tool devised to provide programmers with a workbench (a window) where they can "build" up instructions to be executed by the machine (or the simulator) and experiment with their effects immediately. The PE diagram in the Xilebo window shows the main components of the processor in the form of buttons. By highlighting them, a new scroll list is generated, containing the associated basic operations, from which the user can make a selection. This choice may reduce the number of allowable operative modalities and consequently of the remaining options for code completion (near neighbors and memory locations).
If Xilebo permits the exploration of the basic mechanisms of PAPIA2, the true feedback to the user is given by DataPeeker, a visibility program for the pictorial monitoring of every action performed by the processor array (or the simulator). Figure 4 shows the main window of DataPeeker and two of its monitor windows. Single-bit registers and memory locations over the whole array (as well as memory segments to handle grey-scale cases) are depicted in the form of images, where each pixel represents the value for a PE. This pictorial representation of data meets the need of highly-parallel machine programmers to reduce potentially huge amounts of information to manageable and meaningful proportions. For this reason, the graphic solution is far more effective than a textual one and seems very natural in an image-processing context.
3.2 Programming Primitives at the Machine Level
A fundamental step in the production of optimized code consists in developing and testing primitives and basic functionalities, while comparing different implementation alternatives. This activity is a part of the machine validation process. On the hardware side this low-level approach is capable of supplying reliable performance figures to justify design options, while from the software point of view it also represents a valuable testbed for the chosen instruction set. This phase is quite attractive because it provides a fruitful interaction between the hardware designer and the machine-level programmer (iterative refining). The PAPIA2 software environment supports this early stage of system development by means of the above-mentioned tools Xilebo and DataPeeker, in addition to the integrated debugger Xifedi(9). The human-computer interaction provided in this context by Xilebo proves to be simple enough for novices, yet fast enough for experts. In addition to the primary gain of fast instruction retrieval, instruction correctness is automatically ensured and typing effort is minimized. A "macro recorder" is also included in the workbench for direct storage of sequences of instructions, and facilities for further text editing in the X Window context are provided as well.
By means of the standard user-interface of the module named Xifedi (X-Interface For Enhanced Debugging), illustrated in figure 5, users are allowed to control the execution of programs, single stepping through the code, setting or removing breakpoints, and pasting macro-sequences. In this context, DataPeeker plays the role of the graphic counterpart of Xifedi and permits the pictorial monitoring of system behaviour in a natural and profitable way. The crucial units in the resulting monitoring process are therefore images, whose possible differences in shade or shape (with respect to the expected results) are readily perceived by the user and can indicate peculiarities of the system behaviour or bugs in the programs observed. A major facility of the monitoring tool is the display of data the way they are logically organized, i.e. as binary or grey-scale images at the various levels of a pyramid. Indeed, a pyramid modality is also applicable to data representation (see figure 4, where pyramids are shown as a set of images, one for each plane, halving the resolution to keep the dimensions constant). This feature is essential for an effective monitoring of multiresolution algorithms.
Both DataPeeker and Xifedi are wholly machine independent. They interact with the processor array (or with the simulator for step-by-step debugging) by sending textual commands to a control program (with an embedded interpreter). These commands are interpreted by the control module, which delivers the services requested by front-end programs, including the dispatching of machine opcodes, the extraction of monitored entities and their conversion into a machine-independent format understandable by DataPeeker.
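One plausible reading of the pyramid display described above is that every plane is drawn at the size of the base, so that a pixel of level k covers 2^k x 2^k base pixels. The sketch below illustrates this reading with plain lists; it makes no claim about how DataPeeker is actually implemented, and both function names are hypothetical.

```python
def replicate(plane, factor):
    """Enlarge a 2-D plane so that each pixel becomes a factor x factor square."""
    return [[value for value in row for _ in range(factor)]
            for row in plane for _ in range(factor)]


def pyramid_views(planes):
    """planes[0] is the base and each following plane has half its side.

    Returns one image per plane, all with the dimensions of the base.
    """
    return [replicate(plane, 2 ** level) for level, plane in enumerate(planes)]


planes = [[[1, 0], [0, 1]],   # 2x2 base
          [[1]]]              # 1x1 vertex
views = pyramid_views(planes)
```

Displaying all planes at a common size makes it easy to compare, at a glance, what each resolution level holds over the same image region.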
3.3 Development of the System Library
Moving from the early stages to this phase of software development (in our environment) implies a change of perspective for the programmer. Indeed, building primitives may be thought of as programming a single processor, assuming that its behaviour is repeated on the whole array. No particular emphasis is given to control structures (mainly loops), which are practically unused when dealing with binary images. Extending the macro-sequences into complex parametrized functions (that operate on fixed-point entities) imposes the use of suitable control structures. As a matter of fact, the most natural point of view is now that of the host, and building a procedure can be seen as programming the way in which the host must send commands to PAPIA2. The adopted solution is to use Tcl (Tool Command Language)(10) as the high-level language, which provides the control structures and host variable support, because its interpreter is already embedded in the control module (as previously stated). However, while for the stages described in the previous sections this customized interpreter only performs the actions associated with simple commands, in this context it is also responsible for all the control duties that are a typical prerogative of the host connected to a multiprocessor machine, including run-time generation of machine opcodes (and control commands). The choice of Tcl gives several benefits: it is the language upon which the whole environment is built; from the new programmer’s point of view, it imposes just a minimal number of changes to csh-like syntax and constructs; and furthermore it allows simple interfacing to C procedures.
Our solution provides an approach that clearly distinguishes parallel from serial constructs. Access to images is defined through the definition of an image-object. When an image-object is declared, a segment of PE memory is allocated for its storage and a new Tcl command is created with the same name as the image-object. In order to perform an image transformation it is sufficient to code, as a command, the destination object, the desired transformation and possible arguments. For example:
• to declare binary image alpha and grey-scale image beta, the following commands are issued:
    bina alpha
    grey beta
• to load image beta from disk:
    beta load
• to store the result of thresholding beta into alpha:
    alpha threshold beta
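The image-object mechanism can be mimicked by a toy host-side interpreter in which declaring an image allocates some PE memory and registers a new command under the image's name. The sketch below is written in Python rather than Tcl and is purely illustrative: the class, the 8 bit-planes assumed for a grey-scale image, the default threshold level of 128 and the in-memory stand-in for the disk are all our assumptions, not features of the PAPIA2 environment.

```python
GREY_PLANES = 8  # assumption: a grey-scale image occupies 8 bit-planes


class ImageEnvironment:
    def __init__(self, disk, pe_memory_bits=256):
        self.disk = disk                # name -> 2-D list of pixels (stand-in for files)
        self.capacity = pe_memory_bits  # bits of local memory per PE
        self.next_free = 0              # first unallocated bit-plane
        self.images = {}
        self.commands = {
            "bina": lambda name: self._declare(name, planes=1),
            "grey": lambda name: self._declare(name, planes=GREY_PLANES),
        }

    def _declare(self, name, planes):
        if self.next_free + planes > self.capacity:
            raise MemoryError("PE memory exhausted")
        self.images[name] = {"base": self.next_free, "planes": planes, "pixels": None}
        self.next_free += planes
        # Declaring an image also creates a command with the same name.
        self.commands[name] = lambda op, *args: self._apply(name, op, *args)

    def _apply(self, dest, op, *args):
        if op == "load":
            self.images[dest]["pixels"] = [row[:] for row in self.disk[dest]]
        elif op == "threshold":
            source = self.images[args[0]]["pixels"]
            level = int(args[1]) if len(args) > 1 else 128  # assumed default level
            self.images[dest]["pixels"] = [[int(p >= level) for p in row] for row in source]
        else:
            raise ValueError(f"unknown transformation: {op}")

    def run(self, script):
        for line in script.strip().splitlines():
            words = line.split()
            if words:
                self.commands[words[0]](*words[1:])


env = ImageEnvironment({"beta": [[10, 200], [130, 90]]})
env.run("bina alpha\ngrey beta\nbeta load\nalpha threshold beta")
```

After the script has run, `env.images["alpha"]["pixels"]` holds the binary result, and `alpha` and `beta` are themselves callable commands, which is the property that lets the destination object come first in each command.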
4. Conclusion
The newly engineered version of the PAPIA project has been illustrated, pointing out the full capability of the virtual pyramid approach. The design of both the hardware and software platforms has been completed. In particular, the new chip is undergoing the final stages of silicon fabrication. In the meantime, a double approach to system programming has been provided, which allows the creation of applications that fully exploit the hardware and its peculiarities, while retaining valuable user-friendliness. The industrialized version is expected to be finished by the end of next year.
References
1. V. Cantoni and S. Levialdi, "Multiprocessor Computing for Images," IEEE Proc., Vol. 76, No. 8, August 1988.
2. M. Albanesi, V. Cantoni, U. Cei, M. Ferretti, and M. Mosconi, "Embedding Pyramids into Mesh Arrays," in Reconfigurable Massively Parallel Computers, H. Li and Q. F. Stout, Eds., Englewood Cliffs, NJ: Prentice Hall, 1991, pp. 123-140.
3. H. Li and M. Maresca, "Connection Autonomy in SIMD Architectures: A VLSI Implementation," Journal of Parallel and Distributed Computing, Vol. 7, No. 2, pp. 302-320, 1989.
4. C. C. Weems and D. Rana, "Reconfiguration in the Low and Intermediate Levels of the Image Understanding Architecture," in Reconfigurable Massively Parallel Computers, H. Li and Q. F. Stout, Eds., Englewood Cliffs, NJ: Prentice Hall, 1991, pp. 88-105.
5. Y. Ni, A. Mérigot, and F. Devos, "Designing a VLSI Processing Element Chip for Pyramid Computer SPHINX," in Progress in Image Analysis and Processing, V. Cantoni, L. P. Cordella, S. Levialdi, and G. Sanniti di Baja, Eds., Singapore: World Scientific, 1990, pp. 759-766.
6. V. Cantoni, M. Ferretti, and M. Savini, "Connectivity and Spacing Checking with Fine Grained Machine," in Machine Vision for Inspection and Measurement, H. Freeman, Ed., San Diego, CA: Academic Press, 1989, pp. 85-100.
7. H. Li and M. Maresca, "Polymorphic-Torus Network," IEEE Trans. on Computer, Vol. 38, No. 9, pp. 1345-1351, 1989.
8. H. W. Lawson, "Application Machines: an Approach to Realizing Understandable Systems," Microprocessing and Microprogramming, Vol. 35, No. 1-5, pp. 5-10, 1992.
9. A. Biancardi, V. Cantoni, and M. Mosconi, "Program Development and Coding on a Fine Grained Vision Machine," to appear in Journal of Machine Vision and Applications.
10. J. K. Ousterhout, "Tcl: An Embeddable Command Language," Proc. 1990 Winter USENIX Conference.
Captions
Figure 1. A simplified scheme of the PAPIA2 flat pyramid. Square boxes represent PEs, circles represent switches.
Figure 2. Block diagram of the processing element of PAPIA2. The four connections from and to switches are on the top-left and bottom-left side respectively.
Figure 3.