Dynamic Low-Density Parity Check Codes for ... - Semantic Scholar

6 downloads 0 Views 144KB Size Report
“bathtub curve” of failure rates classically begins with a very high rate of infant mor- tality, then smooths to a relatively low rate of failure for some time, and then ...
Dynamic Low-Density Parity Check Codes for Fault-tolerant Nanoscale Memory Shalini Ghosh and Patrick D. Lincoln {shalini,lincoln}@csl.sri.com Computer Science Laboratory, SRI International, Menlo Park, CA

Abstract. New bottom-up techniques can build silicon nanowires (dimension < 10 nm) that exhibit remarkable electronic properties, but with current assembly techniques yield very high defect and fault rates. Nanodevices built using these nanowires have static errors that can be addressed at fabrication time by testing and reconfiguration, but soft errors are problematic, with arrival rates expected to vary over the lifetime of a part. In this paper, we propose using a special variant of low-density parity codes (LDPCs) — Euclidean Geometry LDPC (EG-LDPC) codes — to enable dynamic changes in level of fault tolerance. Apart from high error correcting ability and sparsity, a special property of EG-LDPC codes enables us to dynamically adjust the error correcting capacity for improved system performance (e.g., lower power consumption) during periods of expected low fault arrival rate. We present a system architecture for nanomemory based on nanoPLA building blocks using EG-LDPCs, and an analysis of its fault detection and correction capabilities.

1

Introduction

The continued scaling of photolithographic fabrication techniques down to 32 nm and beyond face enormous technology and economic barriers. Self-assembled devices such as silicon nanowires or carbon nanotubes show promise to not only achieve aggressive dimensions, but to help address power and other gating issues in system architecture, while potentially helping contain rampant increases in fabrication capital costs. However, assembling high-quality large-scale nanoelectronic circuits (e.g., with Langmuir-Blodgett or related methods) has proven challenging. Among the major challenges are extremely high defect and fault rates in assembled devices. Apart from fabrication errors, nanoscale devices are also more prone to soft errors than microscale devices. Current-day microscale devices (e.g., gates, PLAs, memories) constructed using top-down lithographic techniques have error rates of less than 1%. But computing and storage components built using nanoscale elements (e.g., bistable and switchable organic molecules, carbon nanotubes, single-crystal semiconductor nanowires) may have an order of magnitude higher rates of faults (as high as 10%) [1]. Static defects can be handled using testing and reconfiguration [2], though this presents increasing challenges as technology scales. Because of high softerror rates, our envisioned nanomemories would also need to employ online error correcting codes (ECC), as most modern memory subsystems already do [3]. At the highest level, the work presented here aims to enable controlled reduction in fault tolerance during a part’s lifetime. We work toward enabling the engineering of defect and fault tolerance, traded against system power and other key parameters, because soft error rates vary over the lifetime of complex electronic components. The

“bathtub curve” of failure rates classically begins with a very high rate of infant mortality, then smooths to a relatively low rate of failure for some time, and then slowly rises as the part nears the end of its life cycle. In addition, environmental stresses may change the expected fault rate (temperature, radiation levels). Classically, one must design error correcting codes (ECCs) to tolerate the maximum failure rate expected over the entire lifetime of a device, resulting in “wasteful” tolerance of “too many” faults during the bulk of the component’s lifetime. But with dynamic coding, if we are able to diagnose and reconfigure out subcomponents after a time of highfault-rate infant mortality, and we expect fault arrival rates to be greatly reduced for a significant period of time, we would like to turn off some of the power-hungry ECC circuits. Later, if fault rates were to climb again, we would wish to turn those ECC circuits back on. At the nanoscale, we desire an error correcting system that has (1) high error detecting and correcting ability, to tolerate relatively high soft error rates; (2) sparse modularity, so that the encoders and decoders can be synthesized using simple modular hardware; and (3) dynamic error-correcting capability, to be able to trade off between error correction and system performance in the presence of variable fault rates. In this paper, we propose the use of low-density parity codes (LDPCs) for nanomemories [4] that satisfy all these properties. We propose the use of a variant of a particular type of LDPC code, Euclidean Geometry (EG) LDPC, where the H-matrix can be interpreted as an incidence matrix in a finite geometrical space spanned by m-tuples over GF (2s ). EG-LDPCs are decodable using a one-step majority decoder, thus making it unnecessary to have complex iterative decoders. EGLDPC codes are sparse, and their encoding and parity check matrices have a regular modular structure, which enables the error correction rate to be changed dynamically. The sparseness (enabling low fan-in circuit implementation) and modularity make the dynamic EG-LDPC code easy to implement using nanoscale hardware. Various nanodevices have been proposed in the literature (e.g., nanowire-based PLA, quantum nanodots) to serve as the basic building block of memory and computation elements. In this paper, we propose a design using nanowire crossbar arrays built using flow techniques [1], which can serve as memory cores, programmable PLA planes, and programmable crossbar interconnects. We analyze the properties of EGLDPC codes that make it suitable as ECC for nanomemory, and suggest how the coder and decoder circuits can be efficiently fabricated using nanoPLA memory units and gates. We also consider that the encoder and checker circuits themselves can suffer faults, since they may be fabricated out of fault-prone nanoscale components. We provide an analysis of fault detection and correction capacities of nanomemories with EG-LDPC codes, and design an overall system architecture based on nanoPLA building blocks.

2

Overall Model

Figure 1 shows the overall system architecture of the nanomemory with EG-LDPC ECC. During a write operation, the incoming word to be stored in memory is encoded by the encoder and the code word is stored in memory. During a read operation, a code word is retrieved from the memory and checked by the checker unit. Finally, the majority logic unit decodes the syndrome and does the error correction. The

Fig. 1. Overall architecture of ECC Fig. 2. Modular structure of encoder, deNanomemory. coder, and checker blocks.

controller unit controls the error detection and correction capability of the ECC unit. We now describe these different components and their implementation using nanoPLA components. Encoder and Checker: As explained in Section 1, each submatrix Hi in H can be expressed in the form Pi H1 , where Pi is a permutation matrix. Note that since LDPC is a linear code and H1 is in the standard form, the generator matrix G can also be expressed in a block-modular form as H [4], such that GT = (GT1 GT2 . . . GTγ ) and each submatrix Gi = Pi G1 . Figure 2 shows the modular design of the encoder circuit, the hardware implementation of G – a tree of nanoPLA XOR gates implement the encoder block G1 , and a reconfigurable nanoPLA crossbar [1] implements the permutation block Pi . The same modular structure of Figure 2 can be used for implementing the checker circuit. Decoder and Controller: The majority-logic (ML) decoder requires two key hardware units: XOR and majority, which can both be implemented using nanoPLA components. A controller circuit is used to generate the control signals for the selective enabling and disabling of the encoder, checker, and decoder components, to turn off unused hardware modules and thereby control the level of error correction in the dynamic code. Since we need only one controller for a complete memory bank, this module can be implemented using micro-level circuitry. Dynamic Error Correction: Since the H (G) matrix is modular, the controller can selectively enable modules of the ECC circuit at different times of the operation, thereby changing γ and giving the required dynamic error correcting capability. This selective enabling and disabling of the decoder, checker, and encoder blocks by the controller can give significant power savings during normal operations of the ECC nanomemory, particularly if failed components can be configured out of the system.

15 14

0.8

13

0.7

12

0.6

11 Gamma

Probability of Correct Operation

1 0.9

0.5 n = 32

0.4

9

n = 16 0.3

8

n=8

0.2

7

0.1

6

0

2

4

6

8

10 Gamma

12

14

16

Fig. 3. Probability of correct detection as a function of the number of dynamic EGLDPC encoder, checker and decoder units (γ) utilized, for m = 3, 4, 5.

3

10

5 0.01

0.02

0.03 0.04 Probability of failure

0.05

0.06

Fig. 4. Change of γ (for correct operation of the ECC-memory with > 99% probability) with different values of p.

Analysis and Experiments

Fault Tolerance: Let ǫe , ǫm , and ǫc be random variables signifying the number of errors in the encoder, memory, and checker blocks respectively. We assume that the majority logic decoder is fault free. According to Section 2, when γ components are enabled by the controller, the overall error correcting capacity of the ECC memory system (with ML decoding) is γ/2. Therefore, the majority-logic decoder will be able to correct errors as long as ǫe + ǫm + ǫc ≤ γ/2. When γ units are enabled by the controller, the code is a (γ, ρ)-regular Gallager code, that is, the check matrix unit H1 has a single 1 in each column and ρ 1’s in each row, while the encoder matrix unit G1 has ρ 1’s in each column and a single 1 in each row. Consequently, all XOR gates realizing G1 and H1 matrices are ρ-input XOR gates. Let pe and pc be the errors in one bit location of an XOR gate in the encoder and checker, respectively, and pm be the probability of a nanoPLA memory junction losing its charge. Assuming errors to be i.i.d. and the total  number of code bits to be n, we get Prob(e out of n code bits have errors) = ne (p)e (1 − p)n−e , where putting p = pj gives us the error probability in the memory, and so on. We see that the distribution of errors in the overall ECC memory system is the sum of binomial random variables. Therefore, the probability of the majority decoder not having any error is given by the cumulative distribution of the convolution of binomial distributions. Figure 3 shows how the probability of error-free operation of a memory system with EG-LDPC error correction changes along with different values of γ (the number of stages in the dynamic error control), for pe , pc = 1% (error rates for encoder and checker) and pm = 2% (error rate for memory). As shown in the figure, for an 8-bit memory, γ = 6 gives almost 100% probability of correct operation; the same high reliability is obtained using γ = 8 for a 16-bit memory and γ = 12 for a 32-bit memory.

Dynamic Coding: Along the bathtub curve, the probability of failure of components varies — high in the beginning, low during normal operations, and finally high again. Figure 4 shows how the γ value required (for > 99% probability of normal operation) varies with changing probability of failure of each component (in this case, we assume m = 5 and p = pe = pc = pm ). When p = 0.06, the required γ = 15, but as p decreases to 0.01, γ decreases steadily to 5.

4

Related Work and Conclusions

LDPC codes defined over other finite geometries have been studied lately [5], and hardware for performing efficient decoding of LDPC codes has been proposed [6], including at nanoscale [7]. The work in this paper is focused on ECC for nanomemory. Other error-correcting codes have been used at nanoscale – recently, a hierarchical fault-tolerance technique, using Hamming codes, was proposed for nanocomputing operations [8]. One important aspect of our current work is that the ECC circuit components (e.g., encoder, checker) can have faults in them, which are accounted for during error correction. Another scheme for handling faulty hardware modules was considered in von Neumann’s seminal fault-tolerant multiplexing scheme [9]. In the future, we would also like to perform our analysis with faults in the majority logic decoder circuit. In the nanodomain, asymmetric and unidirectional faults might occur – we plan to investigate such error models, study the prevalence of these types of errors through fault simulations in models of nanocomputing, and handle such faults using corresponding error-correcting codes, for example, asymmetric codes, unidirectional codes [4]. Acknowledgment: This material is based upon work supported by Harvard University Agreement #03-130068 under AFRL/DARPA prime contract #FA8650-06-C-7618.

References 1. Dehon, A.: Nanowires-based programmable architectures. ACM Journal on Emerging Technologies in Computing Systems 1(2) (July 2005) 109–162 2. Naeimi, H., Dehon, A.: A greedy algorithm for tolerating defective crosspoints in nanopla design. In: Proceedings of the International Conference on Field-Programmable Technology. (December 2004) 49–56 3. Ghosh, S., Basu, S., Touba, N.A.: Selecting error correcting codes to minimize power in memory checker circuits. Journal of Low Power Electronics 1(1) (April 2005) 63–72 4. Lin, S., Costello-Jr., D.J.: Error Control Coding. Pearson Prentice Hall (2004) 5. Tang, H., Xu, J., Kou, Y., Lin, S., Abdel-Ghaffar, K.A.S.: On algebraic construction of Gallager and circulant low-density parity-check codes. IEEE Transactions on Information Theory 50(6) (June 2004) 1269–1279 6. Murphy, G., Popovici, E.M., Bresnan, R., Marnane, W.P., Fitzpatrick, P.: Design and implementation of a parameterizable LDPC decoder IP core. In: Proceedings of International Conference on Microelectronics. (2004) 7. Beiu, V., Aunet, S., Nyathi, J., Rydberg-III, R.R., Djupdal, A.: The vanishing majority gate trading power and speed for reliability. In: Proceedings of NanoArch. (2005) 8. Rao, W., Orailoglu, A., Karri, R.: Architectural-level fault tolerant computation in nanoelectronic processors. In: 2005 International Conference on Computer Design. (2005) 533–542 9. von Neumann, J.: Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata Studies (1956) 43–98

Suggest Documents