Programmable Parallel Data-path for FEC

Rizwan Asghar, Dake Liu
Dept. of EE, Linköping University, SE-581 83 Linköping, Sweden
{rizwan, dake}@isy.liu.se
Abstract – We are working towards a flexible forward error correction (FEC) engine with programmability and re-configurability. The design focus is to provide a platform which supports as many of the available algorithms as possible (such as Viterbi, Reed-Solomon, Turbo, LDPC, interleaving and de-interleaving) as well as newly evolving algorithms. The initial study presented in this paper provides the basis for the evolution of low cost hardware multiplexing of multiple FEC algorithms.

I. INTRODUCTION

In telecommunication, forward error correction (FEC) is a system of error control for data transmission, where the sender transmits the messages after adding redundant data and the receiver is able to detect and correct errors. The advantage is that retransmission may be avoided, at the cost of higher bandwidth. We intend to research a platform for the implementation of the available error correcting code algorithms. The goal of the research is to find a way to design and merge currently used FEC algorithms into a flexible ASIC module, maximizing the flexibility while minimizing the silicon cost. Flexibility here stands for the ability to adapt the hardware to different algorithms and to configure the hardware towards different scales of an algorithm. In addition to flexibility and low silicon cost, low memory cost and the possibility of achieving lower power will also be explored. The research includes the selection of algorithms for implementation, the partitioning and fine tuning of algorithms for hardware multiplexing, and the design of a memory architecture for parallel memory accesses and memory reuse. This paper presents the initial study done to investigate a flexible FEC engine. Section II describes some background and Sections III - V give an overview of the current situation and future challenges. The proposed implementation methodology is discussed in detail in Section VI, and the expectations and final remarks are given in Sections VII - VIII.

II. Background

The main categories of FEC are block coding and convolutional coding. Within these categories there are many classes of algorithms: block codes include Reed-Solomon, BCH, Hamming, etc., while convolutional codes are typically decoded with the Viterbi algorithm. A relatively new class of iterated short convolutional codes is the Turbo codes, which are replacing plain convolutional codes. Turbo codes closely approach the theoretical limits given by
Shannon's theorem, with much less decoding complexity than the Viterbi algorithm applied to long convolutional codes. Error correction using the above mentioned algorithms is very demanding in terms of implementation. In the design of programmable baseband processors [1], instruction-level acceleration has been suggested, e.g. Galois field operations for Reed-Solomon [2] and add-compare-select instructions for Viterbi [3]. However, even with such instructions the MIPS cost is still high. Many implementations of the forward error correction algorithms have been done in the past, but most of them are ASIC based and only a few are in the programmable domain; the ASIC approach provides the least flexible environment as far as error correcting codes are concerned.

III. Situation I

Implementation of error correcting schemes in ASIC has provided the ultimate performance for current wireless and radio applications, but the world is now moving towards re-configurability and programmability of complex systems. This motivates the development of re-configurable and programmable FEC subsystems in order to reach the ultimate design portability and reusability. The parameters required for the implementation of programmable FEC subsystems, which can cover the requirements of the coming years, are suggested in Table 1.

Table 1: Future Requirements for Programmable FEC Engine

Sr. No. | Subsystem / Area | Requirement
1.      | Performance      | ~10 GOPS
2.      | Resolution       | 3 ~ 5 bits
3.      | Computation      | Parallel
4.      | Memory Access    | Parallel
5.      | Programmability  | Yes (Low Level)
6.      | Power            | Low Power (10 ~ 100 mW)

IV. Situation II

Many algorithms and applications will be studied and explored. The implementation will be done through hardware multiplexing of multiple algorithms and applications, keeping in view that a terminal does not use all the algorithms at the same time. It may happen that the implementation is not optimal for one particular algorithm or application, but overall it will give a benefit in terms of the different constraint parameters when compared to a number of ASICs doing the same job. The low cost hardware multiplexing of multiple FEC algorithms can be achieved by partitioning and fine tuning the
algorithms. This is an iterative process over the available algorithms, extracting the useful functions and features which can be multiplexed. In multimode applications a number of ASIC cores are used on chip to meet the requirements, but only a few of them work at the same time. The rest remain idle but still consume static (leakage) power. The increase in leakage power in new process technologies will restrict the use of many cores on a chip. In this situation the use of a few multiplexed architectures instead of many cores may be very useful.

V. Review and Challenges
The initial study shows that the main challenge of the investigation will be the convergence of hardware multiplexing, which is not explicit. Achieving a convergence of hardware multiplexing of roughly 60% can provide the basis for a platform giving better results than separate ASIC cores. The other challenges, related to memory access, are listed below:
• Maximum reuse of memories.
• Design of the datapath and micro-architecture to achieve maximum utilization of the data precision in a memory word, so that memory costs can be minimized (a small packing sketch follows this list).
• Parallel memory access algorithms for multiple parallel computing modes and multiple algorithms.
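The second point can be illustrated with a minimal C sketch, not taken from this work: it assumes 5-bit soft values packed into 32-bit memory words so that the available word width is used as fully as possible; the widths and names are example assumptions only.

/* Illustrative only: pack 5-bit soft values into 32-bit memory words so
 * that the word width is utilized; 5 bits and 32-bit words are assumed
 * example figures, not values taken from this work. */
#include <stdint.h>

#define SOFT_BITS  5u
#define SOFT_MASK  ((1u << SOFT_BITS) - 1u)
#define PER_WORD   (32u / SOFT_BITS)            /* 6 soft values per word */

/* Store soft value v at logical index i in the packed buffer. */
void pack_soft(uint32_t *buf, unsigned i, unsigned v)
{
    unsigned word = i / PER_WORD;
    unsigned off  = (i % PER_WORD) * SOFT_BITS;
    buf[word] = (buf[word] & ~(SOFT_MASK << off)) | ((v & SOFT_MASK) << off);
}

/* Read the soft value back from logical index i. */
unsigned unpack_soft(const uint32_t *buf, unsigned i)
{
    unsigned word = i / PER_WORD;
    unsigned off  = (i % PER_WORD) * SOFT_BITS;
    return (buf[word] >> off) & SOFT_MASK;
}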
VI. Implementation Methodology

The ASIP design methodology will be adapted to meet the set goals. By using low cost hardware multiplexing of multiple FEC algorithms, a large benefit in terms of area, power and silicon cost may be gained, as shown in Fig. 1.
Figure 1: Benefit of HW multiplexing: (a) separate ASICs for App-1, App-2 and App-3; (b) a single ASIP with HW MUX.
Looking at the different algorithms related to forward error correction, the observation is that the behaviour of computing, memory access and the related costs is very diverse. The behaviour of the different costs of some of the studied algorithms is summarized in Table 2.

Table 2: Behaviour of different costs of FEC algorithms

Algorithm                      | Behaviour of DP               | Behaviour of ADG              | Memory Configuration
Interleaving / De-interleaving | Soft bit parallel computation | LUT or Polynomial             |
Viterbi                        | ACS array                     | Trellis                       | LUT
Turbo                          | Interleaving + Viterbi        | Shift Reg + Counter + Viterbi | Interleaving + Viterbi
Reed-Solomon                   | Galois Field operations       | RISC-like                     | Register File and FIFO Buffer
Others                         | Under study                   |                               |
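As a concrete illustration of the "LUT or Polynomial" address generation behaviour noted for interleaving in Table 2, the following hedged C sketch contrasts a table-driven generator with a quadratic polynomial one; the block length and polynomial coefficients are example values, not parameters chosen by this work.

/* Two interleaver address generators (sketch): a LUT-based one and a
 * quadratic-polynomial one, pi(i) = (f1*i + f2*i*i) mod K. The values
 * K = 40, f1 = 3, f2 = 10 below are example figures only. */
#include <stdint.h>
#include <stdio.h>

uint32_t lut_address(const uint32_t *lut, uint32_t i)
{
    return lut[i];                       /* one table read per address */
}

uint32_t poly_address(uint32_t i, uint32_t K, uint32_t f1, uint32_t f2)
{
    /* computed in 64 bits to avoid intermediate overflow */
    return (uint32_t)(((uint64_t)f1 * i + (uint64_t)f2 * i * i) % K);
}

int main(void)
{
    for (uint32_t i = 0; i < 8; i++)
        printf("%u -> %u\n", i, poly_address(i, 40, 3, 10));
    return 0;
}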
The design steps to be used in the implementation are listed below:

a. Code / Application Profiling
b. Architecture Exploration
c. Architecture Selection
d. Designing Instruction Set
e. Micro-architecture Design
f. Firmware Design

a) Code / Application Profiling
There are many ways to understand the application behind the processor. Reading related texts and the related standard specifications is one way, and is a necessary first step. However, the knowledge gained from reading cannot be quantitatively accurate. An accurate way is to understand the application through code profiling, which is an accepted way to understand a system. After collecting enough application documents and source codes, the first design step will be application profiling, to understand and predict the running statistics of the applications. Based on the results from source code profiling, there should be sufficient understanding of the performance-critical path in the source code and of the function coverage. The profiling will also expose the structural and run-time behaviour of the code, such as the coverage of arithmetic operations, the costs of algorithms and the costs of memory accesses. In addition, the opportunities for parallelization, for further performance enhancement, will be explored. After the source code selection and profiling, the following questions will be answered:
• How do we define the scope of the instruction set?
• How do we predict the coverage of future applications?
• How do we translate the function coverage into instruction set coverage?
• Which functions have the highest MIPS cost and should be accelerated?
• Which functions appear most frequently and should be accelerated?
• What will the datapath architecture be?
• What memory and bus architecture reaches the minimum on-chip memory cost?
These answers are of great importance for a highly optimized and efficient design in terms of performance, silicon cost and power.

How to Profile?

Selecting or designing a profiler is essential for profiling. A profiler is a software development tool which analyzes the source code and collects and reports the execution statistics of the source code. A profiling tool usually includes three parts: the code analyzer (static profiling), the instrumentation part, and the part for run-time statistics. The first step of source code profiling is to analyze the source code (also called static profiling). The analysis exposes the code structure using a task flow graph (TFG). A fine-granularity flowchart can be used to expose the arithmetic computing in detail, while a coarse-granularity flow graph exposes the control behaviour of the source code by hiding the arithmetic computing inside basic blocks. After accumulating the arithmetic operations in each basic block, the worst case run time can be identified by accumulating the total computing cost along each path in the flowchart. The dynamic behaviour of the source code execution can be further exposed by running the source code with typical inputs. Instrumentation should be performed before dynamic profiling. Instrumentation (or probing) means marking which executions should be counted by inserting additional counters into the original source code. The inserted counter code shall not change the program function; it is only used to monitor the program behaviour by counting the interesting executions and gathering the counted information into the profiling score, which is a log file. A profiling process can be further divided into five parts:
• Design or select a profiler
• Source code analysis (static profiling)
• Prepare and configure the profiling tool (instrumentation)
• Run dynamic profiling
• Analyse the results of static and dynamic profiling
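A minimal C sketch of the instrumentation idea described above, with counters that observe basic-block executions and a run-time part that dumps them to a log file; the counter layout, the saturating-add example function and the log format are invented here purely for illustration.

#include <stdio.h>

/* Counters inserted by a (hypothetical) instrumentation pass. They only
 * observe execution and do not change the program function. */
static unsigned long bb_count[3];          /* one counter per basic block */

int saturating_add(int a, int b)
{
    bb_count[0]++;                         /* probe: function entry       */
    int s = a + b;
    if (s > 127)  { bb_count[1]++; s = 127;  }   /* probe: positive clip  */
    if (s < -128) { bb_count[2]++; s = -128; }   /* probe: negative clip  */
    return s;
}

int main(void)
{
    for (int i = -200; i < 200; i += 10)
        (void)saturating_add(i, 50);

    /* Dynamic profiling output: the "profiling score" log file. */
    FILE *log = fopen("profile.log", "w");
    if (log) {
        for (int i = 0; i < 3; i++)
            fprintf(log, "bb%d %lu\n", i, bb_count[i]);
        fclose(log);
    }
    return 0;
}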
The selection of the language used for profiling is another important aspect. Profiling can be performed on source code or on assembly code. Profiling on source code is fast but in general not accurate enough; the accuracy of source code profiling depends on the programming style. Profiling on assembly code is accurate but slow, and might even be impossible during the early design phase. IR (intermediate representation) level profiling has been investigated in recent years [4, 5]. IR is a language representation below C and above assembly; it is used inside a compiler as an intermediate language. During compilation, a compiler first translates the C code to IR and then, after code optimization, finally translates the optimized IR to the target assembly language. Because all micro operations must be exposed when translating C to IR, and because IR contains only one operation per line, accurate profiling is feasible and memory costs are exposed during profiling at the IR level. The authors believe that IR level profiling will become popular for embedded processor design in the future.
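To make the point about IR exposing micro operations concrete, the following sketch shows one C statement and, in comments, a schematic three-address decomposition of the kind an IR would contain; the notation is generic and does not claim to be the IR of any particular compiler.

#include <stdint.h>

/* One C statement hides several micro operations that an IR makes explicit. */
void mac_element(int16_t *y, const int16_t *a, const int16_t *b, int16_t c, int i)
{
    y[i] = a[i] * c + b[i];
    /* Schematic three-address decomposition, one micro operation per line:
     *   t0 = load  a[i]      ; memory access
     *   t1 = mul   t0, c     ; arithmetic
     *   t2 = load  b[i]      ; memory access
     *   t3 = add   t1, t2    ; arithmetic
     *        store y[i], t3  ; memory access
     * Profiling at IR level counts every such line, so memory access costs
     * are exposed alongside arithmetic costs. */
}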
Profiling efficiency and profiling accuracy based on different languages are intuitively depicted in Fig. 2.
Figure 2: Different kinds of profiling (efficiency versus accuracy for MATLAB, C/C++, IR and ASM level; source code profiling versus assembly language profiling)

b) Architecture Exploration

The purpose of architecture exploration (also called design space exploration) is to decide on the architecture that best suits the targeted class of applications. The decision includes how many function modules are required, how these modules are inter-connected (the relations between modules), and how the ASIP is connected to the embedded system.

c) Architecture Selection
In principle there are many different architectures to select from, and it is not an easy task to decide on a suitable architecture for an application class. It is impossible to check and compare all possible architectures; that is why most architecture decisions are based on experience, and drawbacks usually appear only after the architecture has been selected. There are two general ways to make the architecture decision: one way is to use reference architectures; the other is to generate a custom architecture dedicated to the task flow of the application. For the case of a flexible FEC engine, general purpose processors and cores may of course be used if they are not already overloaded in a particular application, but they cannot provide the ultimate performance. In order to investigate the selection of an architecture for the flexible FEC engine, the following basic requirements apply:
• A medium size multi-port register file to accept and supply multiple data at the same time
• A wide bus between the register file and the on-chip vector memory
• An on-chip vector memory with multiple separately addressable physical memory blocks (a small bank-mapping sketch follows this list)
• A wide bus between the on-chip vector memory and the main memory
• A main memory with multiple simultaneously accessible physical memory blocks
• A flexible permutation network that can shuffle data between the main memory and the on-chip vector memory
• Finally, the most important thing: a strong programmer toolchain and a methodology to utilize the hardware
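As a hedged illustration of the multi-bank memory requirement, the C sketch below maps a linear address onto a (bank, row) pair with low-order interleaving, so that consecutive elements land in different banks and can be fetched in one parallel access; the bank count is an example value, not a design decision of this work.

#include <stdint.h>

#define NUM_BANKS 8u    /* example value; a real design would match this
                           to the datapath parallelism */

typedef struct { uint32_t bank; uint32_t row; } bank_addr_t;

/* Low-order interleaving: address k goes to bank k mod NUM_BANKS, so
 * NUM_BANKS consecutive elements can be read in a single parallel access.
 * A permutation network would then route each bank output to its lane. */
bank_addr_t map_address(uint32_t linear)
{
    bank_addr_t a;
    a.bank = linear % NUM_BANKS;
    a.row  = linear / NUM_BANKS;
    return a;
}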
d) Designing Instruction Set

As soon as the architecture is decided, the instruction set of the architecture can be designed to meet the balance of requirements. The inputs to the assembly instruction set design are the results of the source code profiling (coverage and performance). The design itself is the mapping of functions to instructions, balancing performance against costs, and its output is the assembly instruction set manual. Designing an instruction set includes:

• Design of arithmetic instructions and accelerated arithmetic instructions for the functional algorithms and data quality control (a behavioural sketch of one such accelerated instruction follows this list)
• Design of memory access and addressing algorithms to supply enough data in time for all arithmetic computing
• Design of program flow control
• Design of instructions for I/O access and for supporting accelerators
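The sketch below is a behavioural C model of one possible accelerated arithmetic instruction of the kind mentioned in the first item: a Galois field multiplication as needed by Reed-Solomon decoding. The choice of GF(2^8) and the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D) are assumptions made for the example only.

#include <stdint.h>

/* Behavioural model of a possible accelerated instruction: multiplication
 * in GF(2^8) with primitive polynomial 0x11D (an example assumption). */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;                /* add (XOR) the current partial product */
        b >>= 1;
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry)
            a ^= 0x1D;             /* reduce modulo the primitive polynomial */
    }
    return p;
}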
The quality of the designed instruction set cannot be evaluated before complete benchmarking and essential assembly code profiling. The toolchain for assembly programming should therefore be prepared for assembly level benchmarking and profiling as soon as the assembly instructions are specified.

e) Micro-architecture Design
Micro-architecture design of an ASIP specifies the hardware implementation of the assembly instruction set and its peripheral modules. The inputs of the micro-architecture design are the ASIP architecture specification and the assembly instruction set manual; the output is the micro-architecture specification for RTL coding. The micro-architecture design can be divided into three steps:

i. Partition each assembly instruction into micro operations and allocate each micro operation to a specific hardware module.
ii. Collect all micro operations allocated to a hardware module and specify the hardware multiplexing for RTL coding of that module; do this for all modules.
iii. Fine tune the inter-module specifications of the ASIP architecture specification and finalize the top level architecture.
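To make step ii more concrete, here is a hedged cycle-level C sketch of how micro operations from different instructions could share one datapath module: a single adder/comparator serves both an ordinary addition and a Viterbi-style add-compare-select, selected by an opcode. The opcode names and operand widths are invented for the example.

#include <stdint.h>

/* Behavioural sketch of one hardware-multiplexed datapath module. */
typedef enum { OP_ADD, OP_ACS } opcode_t;

typedef struct { int32_t result; int32_t decision; } dp_out_t;

dp_out_t dp_module(opcode_t op, int32_t a, int32_t b, int32_t m0, int32_t m1)
{
    dp_out_t out = {0, 0};
    switch (op) {
    case OP_ADD:                     /* a plain add reuses the same adder      */
        out.result = a + b;
        break;
    case OP_ACS: {                   /* add-compare-select for Viterbi/Turbo   */
        int32_t p0 = a + m0;         /* path metric 0 + branch metric 0        */
        int32_t p1 = b + m1;         /* path metric 1 + branch metric 1        */
        out.result   = (p0 <= p1) ? p0 : p1;   /* surviving path metric        */
        out.decision = (p0 <= p1) ? 0 : 1;     /* decision bit for traceback   */
        break;
    }
    }
    return out;
}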
Similar to the architecture design, there are two ways to design the micro-architecture: one way is to use a reference micro-architecture; the other is to generate a custom micro-architecture dedicated to the task flow of the application.

f) Firmware Design

Firmware is, by definition, the fixed software in a product; it does not change while the system is running. The firmware for the FEC related algorithms is not very difficult: it comprises only a few lines of code for each algorithm.
VII. Expectation

The final expectation is a piece of silicon with the proposed architecture, as shown in Fig. 3.
Figure 3: Proposed architecture for Flexible FEC Engine

Many aspects of the proposed architecture remain to be analysed and explored. The datapath complexity arising from the irregularity of data, data storage and data access will be handled. Vector computing will be explored by using one parallel datapath to handle many data types and many computing algorithms, as parallel as possible and with low silicon cost.

VIII. Summary

In this paper the starting point for research on a flexible forward error correction engine with programmability has been discussed. The initial study and some thoughts for the future research have been presented. Based on these ideas, further investigation can be done in this area.
REFERENCES

[1] Eric Tell, "Design of Programmable Baseband Processors", Ph.D. thesis, Linköping University, Sweden, 2005.
[2] H. Michel Ji, "An optimized processor for fast Reed-Solomon encoding and decoding", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. III-3097 - III-3100, May 2002.
[3] Jeong Hoo Lee, Weaon Heum Park, Jong Ha Moon, and Myung H. Sunwoo, "Efficient DSP architecture for Viterbi decoding with small trace back latency", Proceedings of the IEEE Asia-Pacific Conference on Circuits and Systems, pp. 129-132, Dec. 2004.
[4] Kingshuk Karuri, Mohammad Abdullah Al Faruque, Stefan Kraemer, Rainer Leupers, Gerd Ascheid, Heinrich Meyr, "Fine-grained Application Source Code Profiling for ASIP Design", DAC 2005, June 13-17, 2005, Anaheim, California, USA.
[5] Kingshuk Karuri, Christian Huben, Rainer Leupers, Gerd Ascheid, Heinrich Meyr, "Memory access micro-profiling for ASIP design", Proceedings of the Third IEEE International Workshop on Electronic Design, Test and Applications (DELTA'06).
[6] Dake Liu, "Embedded DSP Processor Design: Application Specific Instruction Set Processors", Elsevier (Morgan Kaufmann), to be published 2008.