
Semi-automatic Source Code Generation for the Hardware-Implementation of a Parallelizable and Configurable Error Correction Unit

Jens Spinner and Jürgen Freudenberger
HTWG Konstanz University of Applied Sciences, Konstanz, Germany
Institute for System Dynamics (ISD)
Email: [email protected]

Abstract—Bose-Chaudhuri-Hocquenghem (BCH) codes are among the most widely used error correction codes. They are used in DVB-T receivers and are deployed for error correction in flash memories. Today, codes with large codeword lengths and large error correction capabilities are used, which poses a new challenge for the hardware development of the encoder and decoder. This article presents a parallel BCH error correction unit with configurable error correction capability. Furthermore, it describes how tools for source code generation can be used to support the development process.

I. INTRODUCTION

In this work we consider a BCH error correction unit for flash memory applications. The importance of flash memories as non-volatile mass storage is continuously increasing. A flash memory consists of floating gates in which the information is stored. These floating gates can hold their state without a power supply. However, errors occur when the information is read. The error rate depends on the storage density, the flash technology used (multi-level cell or single-level cell) and on the number of read and write cycles [1]. In general, the error statistics of a flash memory can be modeled as a binary symmetric channel. Hence, an error correction coding (ECC) unit is required to correct these errors. Today, BCH codes (cf. [2]) are typically used for error correction [3], [1].

The error correction module in the flash controller must be flexible enough to provide the different error correction capabilities required by a wide range of flash technologies and applications. The design of parallelized BCH encoder/decoder units is described, for instance, in [3], [4]. In this article we also consider the design of a parallel BCH error correction unit. However, the proposed design is configurable for different error correction capabilities. Furthermore, we describe how tools for source code generation can be used to support the development process.

The hardware description of a BCH ECC unit contains generic source code, which controls the data flow. In addition, it contains specific source code which depends on the particular BCH code, i.e. the generator polynomial, the symbol alphabet and the error correction capability of the BCH code. With the tools of model driven architecture (MDA) [5], this code-dependent part of the hardware description can be designed on a higher abstraction level. Using a source code generator, these parameters are translated into Verilog code. A corresponding design process for a powerful parallel processing encoder unit is presented in this article.

Section II gives a brief overview of the encoding process. Section III explains the implementation of an encoder that can be generated with the methods explained in Section IV. For the decoding process, Section V explains how a syndrome module and a Chien search module can be generated. Overall results and conclusions are given in Section VI.

II. BCH ENCODER

For encoding of cyclic codes, like BCH codes, an information polynomial u(x) is multiplied with a generator polynomial g(x). This can be implemented with a linear feedback shift register (LFSR) whose taps correspond to the coefficients of the generator polynomial (see Fig. 1). The encoder shifts the information bitwise into the LFSR. After encoding, the register contains the parity bits s(x). The length of the generator polynomial depends on the alphabet size and on the number of minimal polynomials used, which determines the number of correctable bit errors t (the error correction capability).

Fig. 1. Shift register for serial encoding

Depending on the requirements, this encoding sequence can be parallelized [6]. A high throughput requires a high degree of parallelization, but the higher the parallelization, the larger the number of required gates. In a partially parallel encoder the number of information bits processed in parallel can be chosen arbitrarily. A useful choice for the parallelization degree is the size of a byte or a word.

In this article a source code generator is presented which produces a parallel encoder in an efficient way. The generic part of the code, which is excluded from the generation, is mainly concerned with the control of the data flow, such as reading the information from a source and writing the parity to the memory. The generated Verilog source code depends entirely on the generator polynomial and the parallelization degree of the desired implementation. Such a source code generation has the advantage that the generator polynomial and the parallelization degree can be described on a higher description level. Hence, it is easy to implement and adjust a large generator polynomial or different parallelization degrees. Generator polynomials for BCH codes can be obtained e.g. from Matlab. To the best of our knowledge, existing methods for source code generation are limited to cyclic redundancy check (CRC) codes [7].

III. PARALLEL PARITY CALCULATION

As already mentioned, the parity bits s(x) can be calculated by multiplication of the information u(x) by the generator polynomial g(x).
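As a point of reference for the parallel scheme, the serial encoding from Section II can be sketched in Python as follows. The sketch models a division-type LFSR bit by bit and assumes systematic encoding, i.e. that the register finally holds the remainder of u(x)·x^l divided by g(x). The function name `lfsr_parity`, the coefficient ordering and the example polynomial g(x) = x^3 + x + 1 (a Hamming code with three parity bits) are assumptions made for illustration and are not taken from the described implementation.

```python
def lfsr_parity(info_bits, gen_poly):
    """Serially compute parity bits with a division-type LFSR.

    info_bits: list of 0/1, highest-order coefficient first.
    gen_poly:  coefficients of g(x), lowest order first, e.g.
               [1, 1, 0, 1] for g(x) = 1 + x + x^3 (assumed example).
    Returns the register contents [s_0, ..., s_{l-1}].
    """
    l = len(gen_poly) - 1            # degree of g(x) = number of parity bits
    reg = [0] * l
    for b in info_bits:
        fb = b ^ reg[l - 1]          # new bit XORed with the last register bit
        # Shift by one position; a tap adds the feedback where g_i = 1.
        for i in range(l - 1, 0, -1):
            reg[i] = reg[i - 1] ^ (fb & gen_poly[i])
        reg[0] = fb & gen_poly[0]
    return reg


# u(x) = x^3: the parity is x^6 mod g(x) = x^2 + 1.
print(lfsr_parity([1, 0, 0, 0], [1, 1, 0, 1]))
```

Each loop iteration corresponds to one clock cycle of the serial encoder, which is the cost that the parallel calculation described next is meant to reduce.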
[8] describes how the decomposition of s(x) for m different extensions s′(x) can be formulated as

$$s'(x) = R_{g(x)}\left[t_0(x)\,x^{l} + \dots + t_{m-1}\,x^{l+m-1}\right] + R_{g(x)}\left[s_0\,x^{m} + s_1\,x^{m+1} + \dots + s_{n-m}\,x^{n}\right] \tag{1}$$

where l is the degree of g(x) and $R_{g(x)}[\cdot]$ denotes the remainder of the division by g(x). b(x) in Fig. 2 is the information fragment of m bits that has to be processed in parallel. Figure 1 shows how each new information bit which is shifted into the LFSR is XORed with the last remaining bit in the shift register. [8] introduces the simplification

$$t_i(x) = b_i(x) + s_{i+m-1}(x) \tag{2}$$

for a new information fragment. The first summand inserts the new information bit into the parity calculation and the second summand represents the feedback of the already calculated parity.

The integration of the new information bits is done by multiplication of these information bits with the generator polynomial. This means that for the parallel calculation each information bit is combined with the generator polynomial shifted according to the bit position. The first information bit is combined with the generator polynomial shifted by zero, the second information bit is combined with g(x) shifted by one, and so on. These shifted generator polynomial fragments are static for a given g(x). Therefore, a bitmask can be created for each parity bit, which is combined bitwise with the information t(x) by AND gates. The resulting bits are combined by an m-input XOR gate (addition in the Galois field GF(2)). There are exactly degree(g(x)) bitmasks of size m. Each bitmask column represents a logic circuit for one LFSR tap. A schematic of the parallel encoder is depicted in Fig. 2, where the orange blocks represent the tap logic.

Fig. 2. Parallel parity calculation

A configurable bit-wise BCH encoder was presented in [9], where multiplexers are used to configure the LFSR taps for the different generator polynomials. Similarly, a configurable parallel encoder can be achieved by implementing the logic circuits for different BCH codes. During operation of the encoder unit, the required logic is selected with a multiplexer for each register tap.

IV. AUTOMATED GENERATION OF BITMASKS

Bitmasks as described in [10] can be created by shifting symbols in the LFSR corresponding to the generator polynomial. This can be achieved with a scripting language like TCL, which can easily handle lists. The LFSR is modeled by a list where each element represents a bit location. Taking the input signal into account, the elements of the list are shifted according to the generator polynomial. Figures 3 to 6 present an example for m = 2 and a Hamming code with three parity bits [2]. At the beginning the shift register is in the initialization state (Fig. 3). The information bits b0 and b1 as well as the initial value or the feedback value from s0 to s2 have not yet been shifted.
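The list-based symbol shifting can be sketched as follows. This is a minimal illustration under several assumptions: it uses Python instead of TCL, it represents each register bit as a set of symbols so that XOR becomes a symmetric difference, and it keeps the input symbols b_k and the old register symbols s_j separate instead of substituting their sums by t_x. The register indexing (s_{l-1} at the feedback end), the polynomial g(x) = x^3 + x + 1 and the Verilog signal names `b`, `s` and `s_next` are hypothetical choices, not the names used by the described generator.

```python
def symbolic_masks(gen_poly, m):
    """Shift symbols through the LFSR of g(x) for m steps.

    gen_poly: coefficients of g(x), lowest order first.
    Returns a list of l sets; entry i names the inputs ('b', k) and the
    old register bits ('s', j) whose XOR yields the new register bit i.
    """
    l = len(gen_poly) - 1
    reg = [{("s", i)} for i in range(l)]        # initialized state
    for k in range(m):                          # b_0 enters first
        fb = {("b", k)} ^ reg[l - 1]            # symmetric difference = XOR
        new = []
        for i in range(l):
            shifted = reg[i - 1] if i > 0 else set()
            new.append(shifted ^ fb if gen_poly[i] else set(shifted))
        reg = new
    return reg


def to_verilog(masks):
    """Render each symbolic mask as one Verilog continuous assignment."""
    lines = []
    for i, terms in enumerate(masks):
        names = [f"{kind}[{idx}]" for kind, idx in sorted(terms)]
        rhs = " ^ ".join(names) if names else "1'b0"
        lines.append(f"assign s_next[{i}] = {rhs};")
    return "\n".join(lines)


def apply_masks(masks, b, s):
    """Evaluate the masks numerically: select the inputs and XOR them."""
    value = {"b": b, "s": s}
    out = []
    for terms in masks:
        bit = 0
        for kind, idx in terms:
            bit ^= value[kind][idx]
        out.append(bit)
    return out


masks = symbolic_masks([1, 1, 0, 1], m=2)       # g(x) = 1 + x + x^3
print(to_verilog(masks))
```

For this assumed example the sketch prints three assignments, `s_next[0] = b[1] ^ s[1]`, `s_next[1] = b[0] ^ b[1] ^ s[1] ^ s[2]` and `s_next[2] = b[0] ^ s[0] ^ s[2]`. Applying the masks with `apply_masks` to two consecutive 2-bit fragments gives the same register contents as four serial shifts, which is a simple way to check generated logic before synthesis.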

Fig. 3. Example for Hamming code - initialized state

As described in (2), b0 and s0 are combined and shifted into the LFSR. This is repeated with b1 and s1 (see Figs. 4 and 5). In the next step the sums sx + bx are substituted by tx (see Fig. 6). After m shifts the contents of the registers determine the bitmasks for the tap logic.

Fig. 4. Example for Hamming code - shifting b0

Fig. 5. Example for Hamming code - shifting b1

V. BCH DECODER

Fig. 7. BCH decoder scheme

Let r(x) be the received data, i.e. r(x) is a polynomial that represents the read data, and let α be the primitive element of the Galois field. A BCH code is decoded in three steps [11]. First, the 2t−1 syndrome values $S_i$, i = 1, ..., 2t−1 have to be calculated. This is done by evaluating r(x) at $\alpha^i$, i.e. $S_i = r(\alpha^i)$. Based on these syndrome values, the error location polynomial λ(x) can be determined using the Berlekamp-Massey algorithm (BMA) or the Euclidean algorithm. This error location polynomial has to be evaluated by the Chien search in order to get the error bit positions e(x). Finally, the received information can be corrected using these error positions. This BCH decoder scheme is shown in Fig. 7.

A. Syndrome calculation

For a hardware implementation, this syndrome calculation can also be expressed by

$$S = r \cdot H^{T} \tag{3}$$

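with

$$H^{T} = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \alpha & \alpha^{2} & \cdots & \alpha^{2t} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha^{n-1} & \alpha^{2\cdot(n-1)} & \cdots & \alpha^{2t\cdot(n-1)} \end{pmatrix}. \tag{4}$$

Fig. 6. Example for Hamming code - substitution and conversion into bitmasks for Verilog source code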
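Equations (3) and (4) can be illustrated on a small field. The following Python sketch computes the powers of α, which are the entries of the matrix in (4), and then forms one syndrome value per matrix column. It is only an illustration under stated assumptions: the field GF(2^4) with primitive polynomial x^4 + x + 1, the codeword length n = 15, t = 2 and the function names `gf_powers` and `syndromes` are chosen for this example and do not describe the generated Verilog modules. Field elements are stored as integers whose bit i is the coefficient of x^i.

```python
def gf_powers(prim_poly, m):
    """Return [alpha^0, ..., alpha^(2^m - 2)] for the given primitive polynomial."""
    powers, x = [], 1
    for _ in range((1 << m) - 1):
        powers.append(x)
        x <<= 1                      # multiply by alpha
        if x >> m:                   # degree m reached: reduce modulo prim_poly
            x ^= prim_poly
    return powers


def syndromes(r, t, powers):
    """Compute S = r * H^T, one value per column j = 1..2t of (4).

    r: received bits, r[i] is the coefficient of x^i.
    """
    n = len(powers)
    if len(r) > n:
        raise ValueError("received word longer than codeword length n")
    result = []
    for j in range(1, 2 * t + 1):
        s = 0
        for i, bit in enumerate(r):
            if bit:                  # binary r: the product is a conditional XOR
                s ^= powers[(i * j) % n]
        result.append(s)
    return result


powers = gf_powers(0b10011, 4)       # x^4 + x + 1 (assumed example polynomial)
# Codeword given by g(x) = x^8 + x^7 + x^6 + x^4 + 1 of the (15, 7) BCH code.
print(syndromes([1, 0, 0, 0, 1, 0, 1, 1, 1], 2, powers))   # all zero
# A single error at position 3 gives S_j = alpha^(3j).
print(syndromes([0, 0, 0, 1], 2, powers))
```

Because r is binary, every matrix entry contributes either nothing or a fixed constant, so each output bit of a syndrome is an XOR over a fixed subset of the received bits. This is what makes the matrix multiplication logic suitable for automatic generation.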

The syndrome values can be computed either serially by a shift register or fully in parallel with a matrix multiplication as in (3) [1]; we use the parallel implementation. In order to design a configurable decoder, the syndrome module always calculates the maximum number of syndrome values, which is determined by the highest error correcting capability. Because these values are calculated in parallel, the syndrome module does not require additional configuration logic. The two main tasks of the source code generator are creating the matrix multiplication logic and calculating the powers of the primitive element α, which are the elements of this matrix. The meta parameters for the source code generator are the codeword size n, the error correction capability t and the primitive polynomial, which determines the primitive element α.

B. Chien search

The error positions are determined by the roots of the error location polynomial λ(x), which is calculated using the BMA. The implementation of the BMA is not discussed in this paper. The Chien search successively evaluates the error location polynomial for different powers of the primitive element α [12]. The evaluation of the error location polynomial

$$\lambda(\alpha^{l}) = \sum_{i=0}^{t} \lambda_i \alpha^{il} \tag{5}$$

can also be implemented in serial or in parallel [2]. The task of the source code generator here is to generate the multiplications $\lambda_i \alpha^{il}$ and the summation. Given the bit error correction capability t and the parallelization degree m, the number of multiplication instances is t · m. The number of summation instances is equal to m, where each instance contains t single summations (XOR gates), one for each correctable error that is implemented. Because all multiplications are performed in parallel, a configurable decoder can be designed by configuring only the summation. The number of multiplications is determined by the maximum error correcting capability. The summations are bit-wise XOR operations where the number of inputs depends on the selected error correction capability and can be configured using multiplexers.
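The evaluation in (5) can be sketched in software as follows. This is a serial Python illustration, not the parallel hardware described above; it assumes the field GF(2^4) with primitive polynomial x^4 + x + 1 and uses log/antilog tables for the Galois field multiplication. The helper names `make_tables`, `gf_mul` and `chien_search` are hypothetical. The mapping from a root α^l to an error position depends on how λ(x) is defined; the comment in the example assumes λ(x) = ∏(1 + α^{p_k} x), which is a common convention but is not stated in this paper.

```python
def make_tables(prim_poly, m):
    """Build antilog (exp) and log tables for GF(2^m)."""
    n = (1 << m) - 1
    exp, log, x = [0] * n, [0] * (n + 1), 1
    for i in range(n):
        exp[i], log[x] = x, i
        x <<= 1                      # multiply by alpha
        if x >> m:
            x ^= prim_poly           # reduce modulo the primitive polynomial
    return exp, log


def gf_mul(a, b, exp, log):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return exp[(log[a] + log[b]) % len(exp)]


def chien_search(lam, exp, log):
    """Return all l in 0..n-1 with lambda(alpha^l) = 0, following (5).

    lam: coefficients [lambda_0, ..., lambda_t] as field elements.
    """
    n = len(exp)
    roots = []
    for l in range(n):
        acc = 0
        for i, coeff in enumerate(lam):
            # One multiplication lambda_i * alpha^(i*l), then XOR summation.
            acc ^= gf_mul(coeff, exp[(i * l) % n], exp, log)
        if acc == 0:
            roots.append(l)
    return roots


exp, log = make_tables(0b10011, 4)
# lambda(x) = (1 + alpha^3 x)(1 + alpha^10 x) = 1 + alpha^12 x + alpha^13 x^2
lam = [1, exp[12], exp[13]]
roots = chien_search(lam, exp, log)
print(roots)                                     # exponents l of the roots
print([(len(exp) - l) % len(exp) for l in roots])  # positions under the assumed convention
```

In a parallel implementation with degree m, the body of the outer loop would be instantiated m times so that m values of l are tested per clock cycle, which corresponds to the t · m multiplication instances mentioned above.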

VI. RESULTS AND CONCLUSIONS

The proposed 8-bit parallel ECC unit was implemented in Verilog and tested on an FPGA. The BCH encoder and decoder can be configured for correction capabilities of 12, 24, 40, 48, 60, 72 and 96 bit errors. The generated Verilog source code was successfully synthesized for the Xilinx Virtex4 (xc4vlx200) FPGA. The resources needed for each module are listed in Table I.

Tab. I. HARDWARE RESOURCES

                        slices    registers    LUTs
encoder                   5542         3888    7693
syndrome computation      1680         1913    2518
Chien search             24398         4189   34691

This article shows that most logic elements of a BCH ECC unit can be generated automatically. In particular, the encoder, the syndrome calculation and the Chien search of the proposed implementation are based on generated Verilog code. In this experimental implementation the ratio of lines of generated code to lines of generic code is about 14:1 for the encoder as well as for the syndrome module and about 11:1 for the Chien search. In the current implementation the BMA and the Galois field multipliers are not generated automatically. The code generation for these modules is the subject of future research.

REFERENCES

[1] R. Micheloni, A. Marelli, and R. Ravasio, Error Correction Codes for Non-Volatile Memories. Springer, 2008.
[2] S. Lin and D. Costello, Error Control Coding. Prentice Hall, 2004.
[3] W. Liu, J. Rho, and W. Sung, “Low-Power High-Throughput BCH Error Correction VLSI Design for Multi-Level Cell NAND Flash Memories,” in 2006 IEEE Workshop on Signal Processing Systems Design and Implementation. IEEE, Oct. 2006, pp. 303–308.
[4] C. Wang, Y. Gao, L. Han, and J. Wang, “The design of parallelized BCH codec,” 3rd International Congress on Image and Signal Processing (CISP), 2010.
[5] P. Boulet, J.-L. Dekeyser, C. Dumoulin, and P. Marquet, “MDA for SoC Embedded Systems Design, Intensive Signal Processing Experiment,” in FDL03, 2003.
[6] X. Zhang and K. K. Parhi, “High-Speed Architectures for Parallel Long BCH Encoders,” in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 13, No. 7, Jul. 2005.
[7] M. Sprachmann, “Automatic Generation of Parallel CRC Computations,” in Design & Test of Computers, IEEE, 2001.
[8] T. V. Ramabadran, “A Tutorial on CRC Computation,” in IEEE Micro, 1988.
[9] T.-H. Chen, Y.-Y. Hsiao, Y.-T. Hsing, and C.-W. Wu, “An adaptive-rate error correction scheme for NAND flash memory,” in 27th IEEE VLSI Test Symposium, May 2009, pp. 53–58.
[10] Perez, “Bytewise CRC-Calculations,” in IEEE Micro, 1983.
[11] M. Bossert, Kanalcodierung. Teubner, 1998.
[12] R. T. Chien, “Cyclic Decoding Procedure for the Bose-Chaudhuri-Hocquenghem Codes,” in IEEE Trans. Inform. Theory, 1964.