Efficient SoC Design with Homogeneous Processor Arrays M. Zajc, R. Sernec, J. Tasič Digital Signal Processing Laboratory Faculty of Electrical Engineering University of Ljubljana Ljubljana, Slovenia
[email protected], http://ldos.fe.uni-lj.si
ABSTRACT
It is our opinion that the system-on-chip, all-in-one approach to parallel digital signal processing, combining a homogeneous processor array with large amounts of data memory, will be the core of future embedded digital signal processing applications. In this paper we present two radical approaches to efficient system-on-chip (SoC) design with homogeneous processor arrays for real-time applications. We propose two novel techniques to enhance the throughput of SoC-designed systolic arrays. A systolic array control mechanism based on very long instruction word (VLIW) principles is presented, and a proposal is given for simultaneous execution of independent algorithm data sets, or even different algorithms, on the programmable systolic array, termed multithreaded systolic computation. Simulation results of our multithreading approach are based on a set of linear algebra algorithms.
Keywords - System-on-Chip, processor array, systolic array, multithreading, VLIW, systolic array control.
1 Introduction
Modern digital signal processing (DSP) and image processing applications depend on sufficiently high processing throughput and the massive data bandwidth needed in computations [1-4]. The need for real-time, high-performance computational algorithms and architectures is one of the driving forces of modern signal processing technology and is behind the expansion of the semiconductor industry. The semiconductor industry is developing new products at an enormous pace, driven by Moore's law. The symbiotic integration of the once dissimilar memory and logic processes is very promising for processor array and memory integration on a single chip, which is the key to reliable massively parallel systems.
Tremendous advances in VLSI technology have opened new horizons. Developments in integrated-circuit technology have led to a rising interest in parallel or highly concurrent algorithms and architectures [5, 6]. As a result of the large transistor counts available on a single chip, different projects have emerged with the vision of integrating processor and memory on a single chip [7]. Current application specific integrated circuits (ASICs) provide the possibility of mixing memory and logic processes on the same chip. Many SoC projects and studies have been carried out in the past few years to show the benefits of system-scale integration [8, 9]. Most of the studies have dealt with vector processors or small-scale MIMD processor systems. Two well-known projects are IRAM (Intelligent RAM) [10] and CRAM (Computational RAM) [11]. These projects may mark the real start of various parallel SoC systems, including systolic arrays. DSP, image processing as well as linear algebra algorithms have found an efficient implementation medium in the parallel computing domain. These algorithms can be efficiently mapped onto array processors (systolic arrays, wavefront arrays and SIMD arrays), due to their regularity at the data level [3, 4, 5]. Our paper focuses on systolic arrays (Section 2), where algorithm execution can be triggered simultaneously on all data elements of a data set with a single instruction. The advantage of such an arrangement is that replication of execution resources may be employed. Each datum or part of a data set can be associated with a separate processing element. Processing elements form a parallel processing ensemble that is triggered by one instruction, where each processing element operates on a different data element.
1.1 The proposed approach
Our proposal for efficient design of real-time capable SoC structures consists of the following steps:
1. Apply all of the existing knowledge of systolic arrays and systolic algorithms to design a computational structure for DSP processing tasks.
2. Design a processing element and replicate it to form a systolic array.
3. Design a suitable, simple VLIW-based scalar processor that can also control the systolic array.
4. Integrate an embedded DRAM array with the systolic array to limit the external pin count and increase the data bandwidth between the systolic array and the global memory.
5. Add additional interfaces that suit real-time DSP processing (TDM, UTOPIA, Ethernet MAC, PCI, etc.).
The purpose of this paper is to focus on steps 1-3 and to propose new techniques that can enhance the throughput of such SoC-designed systolic arrays.
The use of the processor array enables concurrent execution of the selected algorithm on multiple data. The whole system requires efficient control and a simple programming paradigm. VLIW concepts offer a promising solution by shifting the majority of the hardware complexity to the compiler. The task of a single VLIW processor is three-fold:
- To execute all scalar operations within algorithms.
- To control the systolic array, which is composed of a number of processing elements.
- To enable simultaneous execution on the scalar processor and the processor array.
The VLIW approach is beneficial for SoC implementation in several respects. The VLIW control path is streamlined and simplified, thus allowing higher operating frequencies of the designed processor. Silicon resources are saved and can instead be devoted to implementing replicated processing elements within a homogeneous systolic array, which is favorable from the performance point of view. The traditional approaches and details of our novel approach based on VLIW principles are given in Section 4.
We also propose multithreaded systolic computation as an efficient throughput enhancement technique. In general, multithreading is a technique to hide various types of latencies. The term thread refers to a single path of execution or control flow. There exist several flavors of multithreading, which differ by thread switching intervals, synchronization mechanisms among threads and implementation requirements. Multithreading on systolic arrays enables simultaneous execution of independent algorithm data sets, or even different algorithms, on the systolic array level. This approach results in a higher throughput and improved utilization of the systolic array composed of pipelined functional units within processing elements. Unlike some traditional approaches, multithreading increases the throughput of a systolic array without changing the systolic algorithm. Details are given in Section 5 together with simulation results.
2 Systolic array computation

2.1 A historic overview
Research in the area of systolic arrays began at the end of the 1970s [3]. The original motivation behind the systolic array concept was its potential for VLSI implementation. In fact, only a few systolic algorithms have actually been implemented in VLSI chips. The main obstacle is their limited flexibility, since general-purpose (high volume) designs are the main driving force from the commercial point of view. The second problem was the technology available at that time. Nevertheless, several multiprocessor projects have been directly inspired by the systolic array concept, such as the Warp processor developed at Carnegie Mellon, the Saxpy Matrix-1 or the Hughes Research Labs Systolic/Cellular System. Some other projects are covered in [5, 12].
2.2 The systolic concept
Systolic algorithms are parallel versions of sequential algorithms, suitable to run on array processors that execute operations in the so-called systolic manner. Systolic arrays are generally classified as high-performance, special-purpose VLSI computer systems suitable for specific application requirements that must balance intensive computations with demanding I/O bandwidths. Systolic arrays are massively parallel architectures, organized as networks of identical and relatively simple processing elements, which execute operations synchronously. Modular processors interconnected with homogeneous (regular) and local interconnections provide basic building blocks for a variety of algorithms. Systolic algorithms address the performance requirements of special-purpose systems by achieving significant speedup, due to parallel processing and prevention of I/O and memory bandwidth bottlenecks. Data are pumped in a rhythmic manner from memory through the systolic array before the end result is returned to the memory. The global clock and explicit timing delays synchronize the system.

2.2.1 Systolic array features
The systolic array is a computing system that possesses several features very amenable to SoC designs [4]:
- Network. It is a computing network employing a number of processing elements with local interconnections.
- Homogeneity. Interconnections between processing elements are homogeneous (regular). The number of interconnections between processing elements is independent of the problem size. This is the first important feature that we exploit for SoC design.
- Locality. The interconnections are local. Only neighboring processing elements can communicate directly. This is the second important feature required to achieve high-speed VLSI SoC realizations.
- Boundary. Only boundary processing elements in the network communicate with the outside world. This eliminates the classical memory and I/O bandwidth bottleneck.
- Modularity. The network consists of one or, at most, a few types of processing elements. If there is more than one type of processor, the systolic array can usually be decomposed into distinct parts with only one processor type. This feature enables quick, high performance SoC designs.
- Rhythm. Data are computed and passed through the network in a rhythmic and continual manner.
- Synchrony. A global clock synchronizes the execution of instructions and data interchange between processing elements.
- Extensibility. The computing network may be extended arbitrarily.
- Pipelinability. Pipelining on the array level, i.e. between processing elements, is present.
From the features presented above we can summarize that a large number of processing elements work in parallel on different parts of the computational problem. Data enter the systolic array only at the boundary. Once input into the systolic array, data are reused many times before they are output. In general, several data flows move at constant velocities through the array and interact with each other, while processing elements repeatedly execute one and the same function. Only initial data and final results are transferred between the host computer/global memory and the systolic array. As far as VLSI/SoC implementation is concerned, certain constraints should be kept in mind and incorporated in the design methodology. The most important are short communication paths, limited input/output interaction and synchronization. These constraints are all inherent features of systolic arrays. Features like homogeneity, modularity and locality are especially favorable from the VLSI/SoC point of view and make systolic arrays ideal candidates for SoC implementation [4, 5, 12].

2.2.2 A model of processing on systolic arrays
Viewed from the processing element perspective, each systolic algorithm includes three distinct processing phases:
- Data input. A new data item is input into the processing element from its neighbors, or from global memory for bordering processing elements of the systolic array.
- Algorithm processing. The algorithm is processed as specified by the processing element definition of the systolic algorithm.
- Data output. The result of the algorithm processing phase is output to the designated neighbors of the processing element, or to global memory for bordering processing elements of the systolic array.
All three phases constitute what is known as a systolic cycle. Figure 1 depicts the classical single-threaded systolic processing phases of a processing element, which comprise a systolic cycle. Observe the distinctive I/O communication phases and the longer algorithm-processing phase. Note that the input phase cannot begin before the data output phase is finished. The data communication phases are the synchronization mechanisms between processing elements in the systolic array.
Figure 1: Systolic processing phases.
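To make the three phases concrete, the following minimal Python sketch (our illustration, not part of the original design) models one processing element stepping through a systolic cycle; the port names and the multiply-accumulate body are assumptions chosen for illustration.

```python
# Minimal sketch of a single-threaded systolic cycle for one processing element (PE).
# The multiply-accumulate body and the port names are illustrative assumptions.

class ProcessingElement:
    def __init__(self):
        self.acc = 0.0          # local state kept between systolic cycles
        self.west = None        # value received from the western neighbour
        self.north = None       # value received from the northern neighbour

    def systolic_cycle(self, west_in, north_in):
        # Phase 1 -- data input: latch values from neighbours (or from global
        # memory for bordering PEs of the array).
        self.west, self.north = west_in, north_in

        # Phase 2 -- algorithm processing: here a multiply-accumulate step,
        # as used by many DSP and linear algebra systolic algorithms.
        self.acc += self.west * self.north

        # Phase 3 -- data output: forward the inputs to the eastern and
        # southern neighbours; only now may the next input phase start.
        return self.west, self.north


pe = ProcessingElement()
east, south = pe.systolic_cycle(west_in=2.0, north_in=3.0)   # sample #1
east, south = pe.systolic_cycle(west_in=1.0, north_in=4.0)   # sample #2
print(pe.acc)   # 2*3 + 1*4 = 10.0
```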
2.3 Trends in systolic computation
Systolic arrays are finding their way into many practical applications. In recent years, several applications of systolic arrays have been presented, ranging from radar signal processing to low-level image processing problems [5]. Typical applications include image sequence analysis, DNA sequence analysis, visual surface control, image processing, cryptography and computer tomography. General-purpose programmable systolic arrays are also found on the market. Recently, Systolix announced the PulseDSP architecture. PulseDSP technology uses a large number of highly efficient processors arranged in a systolic array. The first PulseDSP-based product is a wide-bandwidth sigma-delta A/D converter from Analog Devices (AD7725) [13]. Full custom designs of systolic algorithms are also available. For example, low-level image processing algorithms for motion estimation have already been implemented in hardware as systolic array structures [14]. Furthermore, systolic algorithms for DSP problems have found an efficient implementation medium in FPGAs, either during the development stage or when multiple FPGAs are employed to fit larger systolic arrays on a single printed circuit board.
3 Systolic array SoC concept

3.1 On the suitability of systolic processing for SoC
We propose a homogeneous, modular, expandable systolic array combined with a large global memory. There are two main reasons for the suitability of a SoC-integrated systolic array with the memory system:
- Off-chip memory buses are slow. A systolic array inherently depends on a continuous supply of data from the memory. Inability to realize this at a fast enough data rate can cause significant performance degradation. Off-chip inter-processing-element communication slows the data exchange, which ultimately slows down the whole systolic array, too.
- Package pin-count limitation. A rectangular systolic array can be connected to the memory system on all four borders. Using a narrower memory interface can again cause slowdown.
Integration of the systolic array and memory on the same piece of silicon real estate alleviates these problems altogether. Homogeneity of the systolic array is a very important factor in the design of such a system, since it provides the following benefits:
- Only one processing element design is reused in the whole systolic array.
- Performance scales linearly with the number of processing elements in the systolic array.
- The systolic array design cycle can be shortened, due to reuse of the same processing elements that form the processing ensemble.
A general model of programmable systolic processor arrays suitable for SoC integration works as follows: data samples from the on-chip global memory are pumped through the systolic array, processed, and the end results are returned to the same global memory. The systolic array is integrated within a unified architectural framework [1]. There is no explicit division between the "host" computer and a separate systolic array (Figure 2). The main scalar processor acts as the "host", since it fetches the instruction stream and decodes it. Instructions are then executed on the main scalar processor or on the systolic array. We can summarize the features of systolic arrays suitable for SoC as follows:
- Control of a systolic array is not limited to any particular implementation. Later we shall see that the VLIW approach presents a natural way to efficiently control such a parallel SoC processing architecture.
- The number of functional units within the scalar part of the architecture can be arbitrary.
- The topology of physical interconnections between processing elements is not limited to a specific network. The implementation should be most suitable for VLSI realization, where perpendicular interconnections are preferred.
- Due to synchronous data pumping through the systolic array, a separate global data memory bank is assigned to each row/column of processing elements to assure conflict-free accesses (a minimal sketch of this bank arrangement follows below). Furthermore, there is no need for additional crossbar buses between the memory and the processing element array since, by definition, only nearest-neighbor data transfers are allowed in systolic processing.
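As an illustration of the conflict-free memory arrangement in the last item, the sketch below (our assumption about one possible organization, not a specification from the paper) dedicates one global memory bank to each row of the PE array, so every row can stream its data without contending with the others.

```python
# Sketch: one dedicated global memory bank per PE row, so rows never contend.
# The bank contents and the simple pop-based feed are illustrative assumptions.

ROWS = 4                                       # rows of the PE array
banks = [list(range(r * 100, r * 100 + 8))     # one data queue per row, standing in
         for r in range(ROWS)]                 # for an embedded DRAM bank

def feed_row(row):
    """Next datum for the bordering PE of this row; no cross-row accesses occur."""
    return banks[row].pop(0) if banks[row] else None

# One systolic step issues ROWS independent, conflict-free memory accesses.
step_inputs = [feed_row(r) for r in range(ROWS)]
print(step_inputs)   # [0, 100, 200, 300]
```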
Figure 2: Unified scalar-systolic array concept as a System-on-Chip design.
3.2 The proposed architecture
Our target is a programmable systolic array implemented as a SoC. By programmability we understand the possibility of executing different systolic algorithms on the same systolic array. To achieve this, we integrate a general VLIW scalar processor with a systolic array composed of multifunctional processing elements. All operations within a processing element are programmed and scheduled via the VLIW scalar processor. To generalize and achieve the execution of the widest possible set of systolic algorithms on our system, we designed the systolic array as a rectangular array of processing elements. Note that such a topology can also execute systolic algorithms for linear, square, triangular or combined forms [1, 2]. Although each part of the systolic array can proceed at its own pace, different parts must be synchronized at the borders. This is achieved with appropriate programming and a novel VLIW systolic array control technique [1].
3.3 The algorithm selection
A basic core of mathematical algorithms arising in problems of modern multimedia and image processing consists of linear algebra and linear operator theory. Specific tasks that need to be performed in real time in DSP systems include matrix-vector multiplication, matrix-matrix multiplication and addition, matrix inversion, solution of linear systems, least-squares approximate solution of linear systems, eigensystem solution, generalized eigensystem solution, and singular value decomposition of matrices. These algorithms are computationally intensive, requiring O(N³) operations for each data set of N input data samples [4].
Figure 3: Systolic array. An array of interconnected processing elements.
To practically present the use of the concepts described above, we focus on two well-known linear algebra algorithms, namely Gaussian elimination and Givens rotations. The systolic algorithms for Gaussian elimination and Givens rotations are well known [4, 5]. The algorithm set fits well into the concept from Section 3. In addition, both systolic arrays are composed of a triangular sub-array on the left-hand side (dotted processing elements are not active) and a square sub-array on the right-hand side (Figure 3). All processing elements in the systolic array with the same fill pattern execute the same operation as defined by the systolic algorithm. Gaussian elimination is a standard method for solving linear systems of equations and is one of the most widely used algorithms for LU decomposition of matrices. The Gaussian elimination algorithm is mainly based on the multiply-accumulate operation, with the exception of the row multiplier computation, which requires a division (Figure 4). The Givens rotations algorithm is a representative of a computationally demanding algorithm. Givens rotations are usually considered as a method for QR decomposition of matrices. More generally, Givens rotations can be used for QR decomposition as well as for orthogonal triangularization. The algorithm is very attractive due to its numerical properties, i.e. stability and accuracy. As such it is a vital part of many computationally demanding methods such as solving linear systems of equations, least squares and eigenvalue problems. The algorithm is rich in divide and square-root operations (Figure 5). A special combination of the two algorithms, a result of a generalization of Gaussian elimination, is especially attractive. The resulting algorithm is also known as the Modified Faddeev Algorithm (MFA) [15]. The algorithm opens a new approach to systolic design with a fixed-topology systolic array and a flexible algorithm capable of executing a set of linear algebra algorithms [2]. The abovementioned systolic algorithms were modeled, simulated and verified using the Simulink environment [16].
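To illustrate the two processing element types of the Gaussian elimination array, the sketch below follows the standard systolic LU-style formulation (our paraphrase; variable and function names are illustrative assumptions): the diagonal element produces the row multiplier via a division, while the off-diagonal elements perform multiply-accumulate updates.

```python
# Sketch of the two PE behaviours in the systolic Gaussian elimination array.
# Standard LU-style formulation; names and state handling are illustrative.

def diagonal_pe(state, a_in):
    """Diagonal (boundary) PE: keeps the pivot and emits the row multiplier."""
    if state["pivot"] is None:
        state["pivot"] = a_in          # first element of the column becomes the pivot
        return 0.0                     # nothing to eliminate yet
    return a_in / state["pivot"]       # row multiplier m = a_in / pivot (division)

def off_diagonal_pe(state, a_in, m_in):
    """Off-diagonal PE: multiply-accumulate update a' = a_in - m * u."""
    if state["u"] is None:
        state["u"] = a_in              # store the pivot-row element passing through
        return 0.0, m_in
    return a_in - m_in * state["u"], m_in   # updated element, forwarded multiplier

d_state, o_state = {"pivot": None}, {"u": None}
m = diagonal_pe(d_state, 4.0)            # pivot row arrives first
off_diagonal_pe(o_state, 8.0, m)         # pivot-row element stored
m = diagonal_pe(d_state, 2.0)            # m = 2/4 = 0.5
print(off_diagonal_pe(o_state, 7.0, m))  # (7 - 0.5*8, 0.5) -> (3.0, 0.5)
```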
Figure 4: Gaussian elimination. Operation count for: a) Diagonal, b) Off-diagonal processing element types.
Figure 5: Givens rotations. Operation count for: a) Diagonal, b) Off-diagonal processing element types.

The SoC systolic array requires realization of the "algorithm-program-SoC control" chain. Descriptions of two novel approaches to achieve this are given below. VLIW control of the complete systolic array is presented as a cost-effective, scalable way to realize the end part of the chain, whereas multithreading augments the middle, program part. Multithreading is particularly attractive as a method to increase the utilization of resources on the SoC and thus its throughput.
4 VLIW based systolic array control

4.1 Systolic array control issues
In this section we describe an efficient mechanism, based on VLIW (Very Long Instruction Word) principles and supporting systolic computation models, capable of controlling spatially distinct parts of the systolic array with different instruction streams. The proposed mechanism is also easy to implement. Our study of several DSP and linear algebra algorithms showed the benefit of restricting the control mechanism of systolic arrays to the following control requirements [1]:
- The mesh topology of the systolic array can be reconfigured into 1D linear, triangular and rectangular forms.
- All processing elements can execute the same instruction stream.
- Processing elements can be selectively masked out (a typical feature of SIMD processor array control).
- Separate control is provided for processing elements on the main diagonal (linear algebra algorithms).
- Even portions of the processing element array can be decomposed into rows and columns (sub-arrays) and can be independently controlled by separate instruction streams.
The abovementioned restrictions result in several possible spatial control patterns on systolic arrays [1].

4.1.1 Classical approaches to systolic array control
There are two traditional approaches to processor array control [5]:
- A processing element is implemented as a self-contained processor, with its own program and sequencing logic. The negative aspect is the large real estate overhead due to multiple redundant copies of program storage and instruction control logic, which lowers the efficiency of this model for SoC processor arrays.
- The whole parallel processor ensemble is controlled via horizontally microcoded engines attached to some host computer. The array controller interprets high-level commands and generates wide horizontal microcode control words for processing element control and global memory addressing.
4.2 Novel systolic array control
Based on the abovementioned facts, we propose control of the systolic array and of each processing element based on VLIW principles. The solution presents a combination of a smart compiler and VLIW control with simple control signal generation logic. In our architecture the SoC systolic array is highly orthogonal and all processing element operations are exposed to the programmer, who has direct control over all operations executed on the processor array. Instead of a rugged interface between the source program, library routines and a classical micro-coded systolic array controller, we have only one source of information that generates instructions for the main scalar processor as well as for the systolic array. The most notable difference of this approach, compared to those described previously, is that the application is executed within a single environment. Scalar and array processing are integrated within the same architectural specification. This results in object code output from the compiler that is directly executable on a target systolic parallel implementation. All libraries used within the compiler framework are coded in the object code of the same architecture as is used to run the scalar portions of the application. A VLIW architecture is intimately connected with sophisticated instruction scheduling and operation-to-functional-unit assignment problems. These are handled by a smart compiler that processes the source code along two paths: one for the detection of instruction level parallelism and the other for the extraction of data parallelism. All operations operating on vectors or matrices are converted to systolic forms with the help of a library containing known systolic solutions of data parallel problems (DSP, algebra operations, etc.). The compiler generates a VLIW program from a systolic algorithm description. Detailed instruction scheduling is done according to the specifics of the systolic array implementation. The resultant code is run on the VLIW scalar processor, with a plurality of processing elements in the systolic array (Figure 6).
Figure 6: Embedded homogeneous processor array SoC development framework.

Program execution proceeds as follows: the main scalar processor decodes VLIW instructions and forwards operations/instruction streams to the systolic array whenever non-scalar instructions appear. The process of generating instruction streams is contained within the compilation framework. The VLIW program additionally encodes information on different spatial instruction control patterns vital for executing a multi-instruction-stream systolic algorithm on the systolic array. The idea of explicit systolic array control is further developed to allow control of only specific portions of the array. This is advantageous for systolic arrays that include two or more different processing element types on the same processor array. Our prior analysis of systolic algorithms revealed several distinct spatial control patterns that can cover almost all known DSP and algebra systolic implementations [1]. The VLIW control mechanism can also be implemented at the level of operations within the VLIW. This offers the possibility that a single VLIW controls a systolic array composed of different processing element types. In addition, each processing element can execute several operations in parallel. Note that with this approach we can promote parallel execution at the following four levels (see the sketch after this list):
- Concurrent instruction/operation execution on the scalar VLIW processor.
- Overlapped operation of the scalar processor and the systolic array.
- Data parallel execution on the whole systolic array triggered by a single instruction.
- Parallel instruction/operation execution on a processing element within the systolic array triggered by a single VLIW.
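As a sketch of what such a control word might look like, the example below bundles scalar operation slots with a systolic-array control field that selects a spatial control pattern and the operations to broadcast. This is a hypothetical encoding of ours for illustration only; the field names and patterns are assumptions, not the encoding used in [1].

```python
# Hypothetical VLIW word layout: scalar operation slots plus one systolic-array
# control field. Field names, patterns and mnemonics are illustrative assumptions.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ArrayControl:
    pattern: str                     # spatial control pattern, e.g. "all", "diagonal", "row", "column"
    selector: Optional[int] = None   # which row/column, when the pattern needs one
    pe_ops: List[str] = field(default_factory=list)   # operations broadcast to the selected PEs

@dataclass
class VLIWWord:
    scalar_ops: List[str]                      # operations for the scalar VLIW functional units
    array_ctrl: Optional[ArrayControl] = None  # None -> purely scalar instruction

program = [
    VLIWWord(scalar_ops=["load r1, [r0]", "add r2, r2, 1"]),
    VLIWWord(scalar_ops=["nop"],
             array_ctrl=ArrayControl(pattern="diagonal",
                                     pe_ops=["div", "recv_north", "send_east"])),
    VLIWWord(scalar_ops=["branch loop"],
             array_ctrl=ArrayControl(pattern="all", pe_ops=["mac"])),
]

for w in program:
    # The main scalar processor decodes the word, executes the scalar slots itself
    # and forwards the array control field to the systolic array, if present.
    print(w.scalar_ops, "->", w.array_ctrl.pattern if w.array_ctrl else "scalar only")
```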
5 Multithreaded systolic computation

5.1 Traditional approaches to throughput enhancement
The throughput of the systolic array can be limited for various reasons: low systolic array efficiency, data synchronization between parts of the systolic array with different types of processing elements (e.g. recursive least squares), data dependencies within a processing element, and long latency of operations within the processing element. The systolic array efficiency and the systolic cycle length depend on the complexity of the systolic algorithm and on the implementation of the processing element. Both may limit the systolic array throughput. In order to solve the mentioned problems, several systolic algorithm transformation techniques have been devised: c-slowing, folding, double pipelining, fast designs and two-level pipelining [4]. They operate on the algorithm level.
5.2 A novel approach to throughput enhancement
The main problem addressed in this section is the possibility of increasing the throughput of systolic arrays beyond the limit set by the systolic cycle, without applying any systolic algorithm transformation/mapping techniques. Rather, this can be achieved with the combination of multithreading and systolic array computing, termed multithreaded systolic computing, presented in the sequel [1]. Unlike classical algorithm level transformations, multithreading can increase the throughput of the systolic array without changing the original systolic algorithm. We have to note that even if a certain algorithm is already 100 % efficient, the incorporation of multithreading on pipelined systolic processing elements can improve the throughput of the same algorithm by a constant factor. In general, multiple independent algorithm threads (i.e. instances of the same algorithm) are interleaved within a given systolic cycle on the same systolic array. Data from multiple threads are pumped through the multithreaded systolic array in packets, resulting in a dramatic improvement of the net throughput. All threads share the same resources. We can define several sources of data sets available within systolic arrays that can be treated as unrelated threads:
- Data vectors from different algorithm partitions.
- A loop-unrolled systolic processing element algorithm.
- Multiple instances of the same algorithm.
- Simultaneous execution of different types of algorithms.
- Suitable combinations of the above.
Each thread can be another instance of the same algorithm or a different type of algorithm. Data streams from the systolic algorithms execute operations without noticing the presence of the others. All data streams share the processing element resources, including inter-processing element communication data paths. Note that we assume the same processing element I/O bandwidth, and thus the same bisection bandwidth, as in the original single-threaded systolic array. The performance increase is due to the elimination of true data hazards within each algorithm, better functional unit utilization and larger amounts of usable instruction level parallelism uncovered within the longer basic blocks constituting instruction loops. Functional unit pipelines within processing elements are kept busy by interleaving independent threads running simultaneously through multithreaded systolic computation. As a side effect of multithreading, the efficiency of each processing element improves, since a group of algorithms can finish execution in a shorter time than it would in the serial case.
5.2.1 A model of processing on multithreaded systolic computation
We can examine multithreaded systolic computation on a logical level, where all threads are concurrently executing on the systolic array with the granularity of one processing element clock cycle. The implementation of multithreaded programs uses a data packet transfer approach. For each iteration of the systolic program, M data elements from all M threads are input, processed and output. A graphical representation of I/O activities within the multithreaded systolic computation is shown in Figure 7. Data elements of the input data vectors of M threads are time-multiplexed into a multithreaded systolic array at a rate of one element per processing element clock cycle. Data elements of the result vectors are output at the same rate. This whole process constitutes a multithreaded systolic cycle. The process repeats for all subsequent elements and all threads. Since threads are independent data sets, there are no data dependencies present within the processing element pipelines.
Figure 7: Logical representation of multithreaded systolic computation.

Since all functional units within processing elements are pipelined, data are pumped at the pipeline rate of the functional units, compared to the systolic cycle rate in traditional systolic arrays (Figure 1). Within one ordinary systolic cycle for each thread, multiple unrelated data samples from other threads are input, processed and output (Figure 8). After a data sample of the first thread (O #1,1) exits the systolic processing element, input of the next sample of the first thread (I #2,1) can begin. The process is repeated for each subsequent thread. All threads run in lock step, i.e. they are synchronized for each iteration of new data samples for the set of threads (loop iteration). Serialization of data elements from multiple threads occurs at the I/O phase, since we assume that only a single-threaded processing element's I/O bandwidth is available.
Figure 8: Multithreaded systolic processing phases.
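The following sketch is a simplified throughput model of ours, not the simulator used in [1]: it interleaves samples from M independent threads through one pipelined functional unit with an assumed 5-stage latency. Because consecutive samples belong to different threads, there are no true data dependencies between them, and a new sample can enter the pipeline every clock cycle.

```python
# Sketch: interleaving M independent threads through one pipelined functional unit.
# The 5-stage latency mirrors the pipelined units of Table 1; the rest is illustrative.

PIPE_STAGES = 5

def cycles(samples, threads):
    """Clock cycles to push `samples` dependent samples per thread through one
    pipelined functional unit, with `threads` independent threads interleaved."""
    group_period = max(PIPE_STAGES, threads)   # issue one sample per thread, then
                                               # wait out the intra-thread dependency
    return (samples - 1) * group_period + threads + PIPE_STAGES - 1

for m in (1, 2, 4):
    serial = m * cycles(16, 1)     # the m algorithm instances run one after another
    interleaved = cycles(16, m)    # the m instances share the pipeline cycle by cycle
    print(f"{m} thread(s): serial {serial} cycles, interleaved {interleaved} cycles, "
          f"speedup {serial / interleaved:.2f}")
```

With these assumptions the speedup approaches the number of threads (about 1.98 for two threads, 3.86 for four), which is consistent with the asymptotic behaviour reported below.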
5.3 Simulation environment
The simulation is focused on the execution of computations within each processing element of the multithreaded systolic computation. We functionally simulate multithreaded systolic computation as it is executed on a processing element, employing a parameterized VLIW processor simulator. Three different models of the processing element are used. The parameters and names of the three simulated processing element models are given in Table 1. Within this simulation environment we have direct control over the number of each type of functional units: adders (Add), multipliers (Mul) and divider/square-rooter units (Div/Sqrt). All mathematical functional units are pipelined into 5 stages, except for the non-pipelined divide/square-root units with a latency of 17 processing element clock cycles. The number of memory and inter-processing element communication functional units is fixed at two (model A) and eight (models B, C), respectively, each with a two-stage pipeline. The issue width describes the maximum number of operations simultaneously executable within each processing element. There are no restrictions on the type of operations that can proceed concurrently, as long as there are no data or structural hazards present. Functional correctness of all programs on the systolic array level was verified through the Simulink simulation environment [17].
5.4 Results
The results present speedups achievable on a multithreaded systolic computation running the presented algorithms on the selected processing element models. Speedups are based on the execution time ratios of multithreaded versus single-threaded systolic computing. We show that a collection of algorithms executing simultaneously on a multithreaded systolic array is faster than serial execution of the same algorithms on a single-threaded systolic array. Each task separately will not be executed in a shorter time, though. Simulation results are collected in Figure 9 and Figure 10. From Figure 9 we can observe that the speedup curve flattens already with two threads, since the algorithm contains long latency arithmetic operations, non-pipelinable operations and an insufficient number of resources available within a given processing element. A similar observation applies to Figure 10, although higher speedups are achieved since the long latency operation count is smaller for this algorithm. When multiple long latency functional units are available, the speedup continues to increase. This suggests that it pays off to equip processing elements with multiple long latency functional units if a systolic array is going to be more linear algebra oriented. Multithreaded systolic computation creates numerous effects:
- Processing element utilization is increased. This is due to the fact that there are many independent operations available and these can be executed concurrently.
- The net throughput increases as a direct consequence of multithreading. As long as there are enough pipelined functional units and inter-processing element communication channels, the speedup curve can experience a proportional increase.
- Basic blocks of the multithreaded program code executed on processing elements become longer, which results in the opportunity for extraction of more instruction level parallelism.
- The need for register storage increases proportionally to the number of threads.
PE model name | Issue width | Add (units / latency) | Mul (units / latency) | Div/Sqrt (units / latency)
A             | 5           | 2 / 3                 | 2 / 3                 | 1 / 17
B             | 8           | 2 / 5                 | 2 / 5                 | 2 / 17
C             | 8           | 4 / 5                 | 4 / 5                 | 4 / 17

Table 1: Description of processing element models used in the simulations.
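The sketch below simply restates the three processing element models of Table 1 as plain configuration records, as one might feed them to a parameterized simulator; the class and field names are our own and do not correspond to the simulator used in [1].

```python
# Sketch: the three PE models of Table 1 as configuration records.
# Class/field names are illustrative; only the numbers come from Table 1.

from dataclasses import dataclass

@dataclass(frozen=True)
class FunctionalUnit:
    count: int      # number of units of this type in the PE
    latency: int    # latency in PE clock cycles

@dataclass(frozen=True)
class PEModel:
    name: str
    issue_width: int            # max operations issued per cycle
    add: FunctionalUnit
    mul: FunctionalUnit
    div_sqrt: FunctionalUnit    # non-pipelined divide/square-root units

PE_MODELS = {
    "A": PEModel("A", 5, FunctionalUnit(2, 3), FunctionalUnit(2, 3), FunctionalUnit(1, 17)),
    "B": PEModel("B", 8, FunctionalUnit(2, 5), FunctionalUnit(2, 5), FunctionalUnit(2, 17)),
    "C": PEModel("C", 8, FunctionalUnit(4, 5), FunctionalUnit(4, 5), FunctionalUnit(4, 17)),
}

print(PE_MODELS["C"].div_sqrt)   # FunctionalUnit(count=4, latency=17)
```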
Figure 9: Modified Faddeev algorithm; Givens rotations step. Speedup.
Figure 10: Modified Faddeev algorithm; Gaussian elimination step. Speedup.

Multithreading is very effective in increasing the throughput and utilization of systolic arrays. A linear increase in throughput is observed as long as processing element functional units are pipelined and algorithms do not experience very low computation-to-communication ratios. In the case of linear algebra algorithms that require lengthy square root and division operations, multithreading cannot provide any significant benefit. The use of multiple non-pipelinable functional units is therefore mandatory. Multithreaded systolic computation also increases the number of available operations within a given systolic cycle and, since threads are independent of each other, more instruction level parallelism can easily be extracted from the code. Furthermore, pipelined functional units can be kept busy, producing results at the theoretical maximum throughput rate. We have to note that the linear algebra algorithms which we selected for evaluating our ideas require very complex, long latency operations within each processing element (Figure 4 and Figure 5). Classical DSP algorithms (linear FIR and IIR filtering, pattern matching, etc.) require only multiply-accumulate type operations, though. As such, they can experience even better speedup results [1].
6 Conclusions
In this paper a SoC implementation study that assumes the integration of a systolic array and a global memory is presented. We believe that a system-on-chip, all-in-one approach to parallel digital signal processing, combining a homogeneous processing element array with large amounts of data memory, will be the core of future embedded digital signal processing applications. Two novel approaches are presented. Processing elements within the systolic array should support multithreading as an efficient speedup mechanism. This is especially appealing because it reuses existing algorithm definitions and promises to increase the design and test productivity of complex SoC systems. Since the control unit is also integrated on the same chip, it should be of the VLIW type, which results in a straightforward single-architecture environment for parallel SoC DSP processing. The additional programmability of individual processing elements of the multithreaded systolic computation offers an additional degree of freedom that enables straightforward implementation of various systolic algorithms. The observed speedups asymptotically approach the number of threads. The ideas outlined in the paper are equally applicable to SIMD processor arrays.
7 References
[1] R. Sernec, "Massively Parallel Architectures for Digital Signal Processing", Ph.D. Thesis, University of Ljubljana, 2000.
[2] M. Zajc, "Faddeev algorithm based systolic structures for digital signal processing", Ph.D. Thesis, University of Ljubljana, 1999.
[3] H. T. Kung, C. E. Leiserson, "Systolic Arrays (for VLSI)", Technical Report CS 79-103, Carnegie Mellon University, 1978.
[4] N. Petkov, Systolic Parallel Processing, North-Holland, 1993.
[5] K. J. R. Liu, K. Yao (Eds.), High Performance VLSI Signal Processing: Innovative Architectures and Algorithms, vol. I, II, IEEE Press, 1998.
[6] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
[7] Special issue on: The Future of Microprocessors, Computer, vol. 30, no. 9, September 1997.
[8] D. Burger, "System-Level Implications of Processor-Memory Integration", 24th ISCA, June 1997.
[9] P. M. Kogge, "Processor-In-Memory (PIM) Chip Architecture for Petaflops Computing", http://cesdis.gsfc.nasa.gov/~creschke/peta/report/node39.html, page accessed 17 January 1998.
[10] D. Patterson et al., "A case for intelligent RAM", IEEE Micro, vol. 17, no. 2, pp. 34-44, March/April 1997.
[11] D. Elliott et al., "Computational RAM: The case for SIMD computing in memory", 26th ISCA, 1997.
[12] Special issue on: Systolic Arrays, Computer, vol. 20, no. 7, July 1987.
[13] www.systolix.co.uk, page accessed March 2001.
[14] Y. Katayama, T. Kitsuki, Y. Ooi, "A block processing unit in a single-chip MPEG-2 video encoder LSI", Proc. of SIPS'97, pp. 459-468, Leicester, 1997.
[15] J. G. Nash, S. Hansen, "Modified Faddeeva Algorithm for Concurrent Execution of Linear Algebraic Operations", IEEE Trans. on Computers, vol. 37, no. 2, February 1988.
[16] www.mathworks.com, page accessed May 2001.
[17] M. Zajc, R. Sernec, J. Tasič, "Modelling, simulation and verification of massively parallel algorithms", Proceedings of SIM'01, WSES conference, Malta, September 2001.