Appears in 2nd ISPA (2004)

A Parallel Reed-Solomon Decoder on the Imagine Stream Processor¹

Mei Wen, Chunyuan Zhang, Nan Wu, Haiyan Li, Li Li
Computer School, National University of Defense Technology
Changsha, Hunan, P. R. of China, 410073
[email protected]
Abstract. The increasing gap between processor and memory speeds is a well-known problem in modern computer architecture. The Imagine stream architecture can overcome this bandwidth bottleneck for computationally intensive applications through its distinctive memory hierarchy and stream processing model. Good performance has been demonstrated on media processing and on some scientific computing domains. Reed-Solomon (RS) codes are powerful block codes widely used as an error correction method. RS decoding demands high memory bandwidth and intensive arithmetic because of its complex and specialized processing (Galois field arithmetic) and its real-time requirements. It is usually implemented on a specialized processor or DSP, which achieves high performance but lacks flexibility. This paper presents a software implementation of a parallel Reed-Solomon decoder on the Imagine platform. The implementation requires careful stream programming, since the memory hierarchy and cluster organization of the underlying architecture are exposed to the Imagine programmer. Results demonstrate that Imagine achieves performance comparable to the TI C64x. This work is part of an ongoing effort to validate the efficiency of the stream architecture and to extend its application domain.
1 Introduction

RS codes [1] are powerful block codes widely used as an error correction method in areas such as digital communication, digital disc error correction, digital storage, and wireless data communication systems. An RS encoder takes a block of digital data and adds extra "redundant" symbols; the RS decoder processes the received code block and attempts to correct burst errors that occurred in transmission or storage. RS(n,k) is defined over the finite field GF(2^m): each block has length n ≤ 2^m − 1 and contains k data symbols and (n−k) parity symbols. The decoding process uses this parity information to identify and correct up to t errors, where t = (n−k)/2. For example, RS(204,188) over GF(2^8) can correct up to t = (204−188)/2 = 8 symbol errors per block.
Imagine [4] is a prototype processor of stream architecture² developed by Stanford University in 2002, designed to be a stream coprocessor for a general-purpose processor that acts as the host.
¹ This work was supported by the 973 Project and the 863 Project (2001AA111050) of China.
² There are several kinds of stream architectures; their common feature is taking streams as architectural primitives in hardware. In this paper, "stream architecture" means the Imagine stream architecture.
It contains a host interface, stream controller, streaming memory system, microcontroller, 128 KB stream register file (SRF), eight arithmetic clusters, local register files (LRF), and a network interface. Each cluster consists of eight functional units: three adders, two multipliers, one divide/square-root unit, one communication unit, and one local scratch-pad memory. The inputs of each functional unit are provided by the LRF in its cluster. The microcontroller issues VLIW instructions to all the arithmetic clusters in a SIMD manner. The main idea of stream processing is to organize related data words into records. Streams are ordered, finite-length sequences of data records of arbitrary type (all records in one stream share the same type). The stream model decomposes an application into a series of computation kernels that operate on data streams. A kernel is a small program, executed in the arithmetic clusters, that is repeated for each successive element of its input streams to produce an output stream for the next kernel in the application. Imagine is programmed at two levels: the stream level (using StreamC) and the kernel level (using KernelC) [2, 3].
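To make the model concrete, the following is a minimal plain-C sketch of a kernel (illustrative only: the record type and names are assumptions, and real kernels are written in KernelC and run in SIMD across the eight clusters):

#include <stddef.h>

/* A record groups related data words; a stream is an ordered sequence of
   records of one type. */
typedef struct { float re, im; } Record;

/* A kernel is a small loop body applied to every record of its input
   stream, emitting one output record per iteration for the next kernel.
   On Imagine the streams live in the SRF and all operands come from the
   per-cluster LRF. */
void scale_kernel(const Record *in, Record *out, size_t n, float gain)
{
    for (size_t i = 0; i < n; i++) {
        out[i].re = in[i].re * gain;
        out[i].im = in[i].im * gain;
    }
}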
2 Imagine Implementation

The Peterson-Gorenstein-Zierler (PGZ) algorithm [5] is a popular method for RS decoding. We parallelize and optimize the PGZ algorithm so that it suits Imagine's memory hierarchy and parallel processing.
Fig. 1. Stream/kernel diagram and stream program structure for RS decoding
The RS decoding application has natural stream features, and there is no dependence between RS code blocks. According to the PGZ algorithm, the whole RS decoding process can be decomposed into four kernels: syndrome, bm, chsrh, and forney, corresponding respectively to syndrome computation, the Berlekamp-Massey (BM) algorithm, the Chien search, and the Forney algorithm. The data flow diagram is shown in Figure 1: the relationship between the kernels of the RS decoder is a complex producer-consumer model in which streams are produced and consumed, and the four kernels are organized into a four-stage pipeline (not including initialization).
For the RS decoding algorithm, the parallelism exploitable on the Imagine architecture includes instruction-level parallelism (ILP), data-level parallelism (DLP), and thread-level parallelism (TLP).

An ILP approach to partitioning the RS algorithm for Imagine sends elements from a single stream to all eight clusters, so the data held by the clusters is redundant; where necessary, records are computed redundantly on all clusters. This is the simplest parallel approach: it minimizes inter-cluster communication and exploits the large computing capability of the Imagine architecture, but the redundancy wastes bandwidth, LRF capacity, and compute capability.

For records with only weak dependences, DLP is a better approach: it exploits parallelism well while reducing redundancy. Here we introduce a new concept, the frame. A frame is a group of one or more related records. The records in a frame need not be contiguous, but their stride should ideally be an integral multiple of the number of clusters. Dependences between frames are very weak, while the data dependences within a frame may be complicated, so a frame can be regarded as a sub-stream. Although the input stream is the same, in the DLP implementation it is partitioned into several independent frames. With this approach, Imagine's stream referencing methods [3] give every cluster a different frame on each iteration while keeping throughput high. Data in the clusters has little redundancy and the ALUs are well utilized. However, communication overhead grows because weak dependences still exist between frames of the input and output streams; when the computation per record is large, this overhead is acceptable. Thus the bm and forney kernels are implemented with this approach, as sketched below. The syndrome kernel could also adopt it, but its records are dependent, which makes frame partitioning difficult and the communication overhead heavy, so ILP is the better choice there.

A third implementation of the RS decoder exploits TLP: each cluster receives a separate data stream and acts as a full RS decoder. However, this approach is not useful for applications that require a fast real-time RS decoder, because of its long latency.

The approaches above can be mixed to exploit parallelism efficiently. During practical design it is necessary to consider the parallelism of each stream and kernel and choose a reasonable parallel approach. For more details, see [9].
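The following is a minimal plain-C sketch of the frame idea (names and the frame-per-code-block choice are assumptions for illustration; the real implementation is StreamC/KernelC, and the SIMD hardware distributes records across clusters implicitly):

#include <stddef.h>
#include <stdint.h>

#define NCLUSTERS 8

/* Here one frame is one RS(204,188) code block; frames are mutually
   independent, so they can be decoded in any order. */
typedef struct { uint8_t sym[204]; } Frame;

static void decode_frame(Frame *f)     /* stand-in for the bm/forney work */
{
    for (int i = 0; i < 204; i++)
        f->sym[i] ^= 0;                /* placeholder computation */
}

/* Cluster c takes frames c, c+8, c+16, ...: the stride is a multiple of
   the cluster count, so the eight clusters read consecutive records from
   the SRF while each works on an independent frame. */
void run_dlp(Frame *stream, size_t nframes)
{
    for (int c = 0; c < NCLUSTERS; c++)          /* clusters run in SIMD */
        for (size_t f = (size_t)c; f < nframes; f += NCLUSTERS)
            decode_frame(&stream[f]);
}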
3 Performance Evaluation³
Statistics on the bandwidth requirements of the memory units at each level of Imagine and of a general-purpose processor are given in [10]. The conclusion is that the memory hierarchy of the stream processor, together with stream processing, distributes the bandwidth requirement across the levels of the memory hierarchy (shown in Figure 3(a)).
³ All TI DSP data in this paper were obtained with CCS2 (fast simulator, -o3 flag). The RS decoding code is provided by TI [5], and its C code is not optimized. All Imagine data in this paper were obtained with the Imagine simulator ISIM.
The main data accesses are concentrated in the LRF. This reduces off-chip memory references and resolves the bandwidth bottleneck, greatly increasing performance.
[Bar chart: utilization (0-100%) of the arithmetic, comm, SP reg, and uc units for the syndrome, bm, chsrh, and forney kernels.]
Fig. 2. (a) Utilization of the main functional units (b) chsrh kernel schedule
The utilization of Imagine's functional units can reach a very high level (more than 90%) with appropriate ILP or DLP, as shown in Figure 2(a). Figure 2(a) also exposes the program characteristics of each kernel. The bottlenecks of the kernels are: scratch-pad register references in the syndrome kernel, communication in the bm kernel, and computation in the chsrh kernel, which is consistent with the theoretical analysis and is helpful for extending the Imagine hardware.
Taking the chsrh kernel as an example, its main computational operation is addition. Figure 2(b) shows a partial visualization of the inner loop of chsrh after scheduling and software pipelining (the loop is unrolled twice; the schedule was created with iscd [2]). Each functional unit of an Imagine cluster appears across the top, and the cycles of the kernel run down the side. Rectangles indicate operations on a functional unit; the envelope-style operations are added during scheduling. There are 16 table lookups, matching the operations of the SP unit on the scratch-pad (SP) register shown in the figure. The adders are almost completely filled, while the multipliers and the divide unit are nearly idle. High local bandwidth keeps the computation units running at full load.
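As an illustration of why chsrh is addition-dominated, the following is a hedged plain-C sketch of a Chien search inner loop for t = 8 (names are illustrative, not the paper's KernelC, and the field polynomial 0x11D is an assumption). Each iteration performs eight antilog lookups and eight XOR additions, so unrolling twice would account for the 16 scratch-pad lookups visible in the schedule.

#include <stdint.h>

#define T 8                            /* RS(204,188): t = (204-188)/2 = 8 */

static uint8_t gf_exp[512], gf_log[256];

static void gf_init(void)              /* build antilog/log tables once */
{
    unsigned x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;                       /* multiply by alpha */
        if (x & 0x100) x ^= 0x11D;     /* reduce by the field polynomial */
    }
    for (int i = 255; i < 512; i++)
        gf_exp[i] = gf_exp[i - 255];   /* duplicate to avoid a mod */
}

/* Evaluate the error-locator polynomial Lambda at successive powers of
   alpha by keeping each coefficient in the log domain and advancing its
   log by j per step: one lookup plus one XOR per coefficient.  Call
   gf_init() first. */
int chien_search(const uint8_t lambda[T + 1], int n, int err_pos[])
{
    int llog[T + 1], nerr = 0;
    for (int j = 0; j <= T; j++)
        llog[j] = lambda[j] ? gf_log[lambda[j]] : -1;  /* -1 marks zero */
    for (int i = 0; i < n; i++) {
        uint8_t sum = lambda[0];               /* constant term */
        for (int j = 1; j <= T; j++) {
            if (llog[j] < 0) continue;         /* zero coefficient */
            llog[j] = (llog[j] + j) % 255;     /* coeff_j *= alpha^j */
            sum ^= gf_exp[llog[j]];            /* one scratch-pad lookup */
        }
        if (sum == 0)                          /* root found: error here */
            err_pos[nerr++] = i;
    }
    return nerr;
}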
[Log-scale bar charts. (a) Bandwidth of memory, SRF, and LRF. (b) Cycle counts (labels: Syndrome, Peak, BM, chien, Forney, RS) for c6711 C code, c64x C code, c64x assembly optimizer output, and Imagine StreamC/KernelC code.]
Fig. 3. (a) Bandwidth hierarchy (GB/s) (b) Performance for RS(204,188) (cycles)
For comparison, we take the TMS320C67x as a reference general-purpose processor and the TMS320C64x as a reference special-purpose processor, because of its GMPY4 instruction for RS decoding. The TI DSP algorithm is similar to the Imagine version. Figure 3(b) presents the running time (not including initialization) of each module on the different chip simulators for RS(204,188), which is widely used in ADSL modems. Because Galois field multiplication constitutes the majority of the computation in RS
decoding, the difference in how each processor executes Galois field multiplication is the key source of the performance gap. As a result, the gap between the general-purpose and special-purpose processors is very clear, and Imagine achieves performance comparable to the C64x.
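To make the multiplier cost concrete, the following is a sketch of a pure-ALU GF(2^8) multiply (the field polynomial 0x11D is an assumption; a real RS(204,188) decoder may use a different one). A general-purpose processor needs roughly eight dependent shift/XOR steps per product; the table-based form used in the Chien sketch above replaces these with lookups, and the C64x GMPY4 instruction computes four such products in hardware, which is consistent with the gaps in Figure 3(b).

#include <stdint.h>

/* Shift-and-XOR multiply in GF(2^8): a carry-less multiply followed by
   reduction modulo the field polynomial x^8+x^4+x^3+x^2+1 (0x11D). */
uint8_t gf_mul_alu(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;                    /* conditionally accumulate */
        b >>= 1;
        /* multiply a by x, reducing if the high bit falls off */
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
    }
    return p;
}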
4 Conclusion

This paper discusses how to develop an efficient implementation of a complete RS decoder on the Imagine platform and compares the experimental results with several TI DSPs. The benefits that the memory hierarchy of the stream architecture and stream processing bring to stream applications are significant: it is a bandwidth-efficient architecture that supports a large number of ALUs. This work has shown that stream processing is applicable to RS decoding. Our research shows that whether an application can be expressed in streams (data is not reused once it has flowed past, and there is a clean producer-consumer model) is very important for exploiting the stream architecture's advantages. Typical stream applications, including media processing, RS decoding, network processing, and software-defined radio, have native stream features, so they are best suited to the stream architecture. Some classes of scientific problems are also well suited to stream processors [6]. However, the Imagine processor does not achieve high performance on applications poorly matched to its architecture, such as transitive closure [7]. Its complex programming model is another shortcoming: programmers must organize data into streams, write programs at two levels, and pay extra attention because of the exposed memory hierarchy. Ongoing work extends the domains of stream applications and investigates stream architecture [8] and stream scheduling in more depth.
References
1. Shu Lin, D. J. Costello, Error Control Coding: Fundamentals and Applications, 1983.
2. Peter Mattson et al., Imagine Programming System Developer's Guide, http://cva.stanford.edu, 2002.
3. Beginner's Guide to Imagine Application Programming, http://cva.stanford.edu, March 2002.
4. Imagine project, http://cva.stanford.edu/Imagine/project/.
5. TI, Reed Solomon Decoder: TMS320C64x Implementation, 2000.
6. Jung Ho Ahn, W. J. Dally, et al., Evaluating the Imagine Stream Architecture, ISCA 2004.
7. Gordon Griem, Leonid Oliker, Transitive Closure on the Imagine Stream Processor, 5th Workshop on Media and Streaming Processors, San Diego, CA, December 2003.
8. Mei Wen, Nan Wu, Chunyuan Zhang, et al., Multiple-dimension Scalable Adaptive Stream Architecture, in: Proc. of the Ninth Asia-Pacific Computer Systems Architecture Conference, Springer LNCS 3189, 2004, 199-211.
9. Nan Wu, Mei Wen, et al., Programming Design Patterns for the Imagine Stream Architecture, 13th National Conference on Information Storage Technology, Xi'an, China, 2004.
10. Mei Wen, Nan Wu, et al., Research of Stream Memory Hierarchy, 13th National Conference on Information Storage Technology, Xi'an, China, 2004.