Efficient in-memory computing architecture based on crossbar arrays

Bing Chen, Fuxi Cai, Jiantao Zhou, Wen Ma, Patrick Sheridan, and Wei D. Lu*
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, USA
[email protected]

Abstract
To solve the "big data" problems that are hindered by the Von Neumann bottleneck and semiconductor device scaling limitations, a new efficient in-memory computing architecture based on crossbar arrays is developed. The corresponding basic operation principles and design rules are proposed and verified using emerging nonvolatile devices such as very low-power resistive random access memory (RRAM). To validate the computing architecture, we demonstrate a parallel 1-bit full adder (FA) both by experiment and simulation. A 4-bit multiplier (Mult.) is further obtained from a programmed 2-bit Mult. and a 2-bit FA.

Introduction
The ever-increasing logic and memory capabilities of digital computers have historically been driven by semiconductor device scaling. However, continued scaling faces significant fundamental, technical and cost challenges. Additionally, as shown in Fig. 1, Von Neumann computing architectures increasingly suffer from the Von Neumann bottleneck, where significant energy cost and wire delay are caused by frequent data flow between the CPU and the memory. Alternative architectures such as computing-in-memory and computing-with-memory approaches have attracted tremendous interest. Recent developments of crossbar memory structures based on emerging devices such as RRAM [1-2], which offer fast operation speed, low power, high reliability and high-density integration [3-4], may lead to creative and feasible solutions for efficient in-memory computing architectures. Several computing schemes based on binary RRAM devices have been proposed [5-6].
However, prior studies mostly focused on cell-level demonstrations, while system-level analysis and demonstrations are necessary to advance the development of efficient RRAM-based computing schemes. In this work, we perform a complete analysis of an RRAM crossbar-based in-memory computing architecture, and demonstrate a prototype highly parallel and reconfigurable computing system through experiment and simulation.

Experimental
The RRAM devices, based on a two-terminal Cu/Al2O3/poly-Si crossbar structure, were fabricated on top of a SiO2/Si wafer. 50 nm of polysilicon (1000 Ω/square, ~1×10^19 cm^-3 boron doping level) was deposited using low-pressure chemical vapor deposition (LPCVD). The bottom poly-Si electrodes were patterned by photolithography and reactive ion etching (RIE) using the photoresist as the etching mask. Immediately after a 1 M HF dip to remove the native oxide, 5 nm of Al2O3 was deposited using an atomic layer deposition system at 150 °C. Copper top electrodes were then patterned using photolithography and liftoff processes. Au was deposited on top of the Cu as a passivation layer [7]. The active device area is 2 μm × 2 μm. The electrical measurements were performed using a Keithley 4200 semiconductor analyzer. For system- and circuit-level analysis, a compact model [8] was developed and used to simulate the array/circuit behaviors.

978-1-4673-9894-7/15/$31.00 ©2015 IEEE

Computing architecture and design principle
A possible system-level integration protocol of the proposed computing architecture is illustrated in Fig. 2. Key to this approach is to parse operations into more fundamental computing elements whose inputs/outputs can all be stored in the system (discussed later), so that the desired output for any given input can be directly read out instead of having to be computed. The system consists of RRAM crossbar arrays, read-in/write-in buffers, read-out buffers, a decoder, a pulse-train generator, and a controller which manages the system operation. If the data-result table of the target function has already been stored in the system, the input signal/data from the I/O is fed into the read-in buffer, and the corresponding results can be found in the array and output to the read-out buffer. If the data-result relation of a function has not been stored, a new table entry mapping this function needs to be written into the array. In this case, corresponding programming pulse-trains are generated and applied to the crossbar array to generate and store the results of the new function for all input scenarios simultaneously; the process is then repeated. The newly stored function can then be reused in future read operations.

During programming, the basic logic is NOR, since it can be readily mapped onto the RRAM array and can implement any Boolean logic element to be stored in the array. As shown in Fig. 3, the basic wired-NOR logic can be achieved either in one step, by applying Vdd and Vdd/2 on the output cell and input cells, respectively, in which case the output cell is programmed into either HRS or LRS depending on the input cell states (approach 1); or in two steps, by first performing a wired-OR operation during read and then programming the output cell into HRS if the output current is high or LRS if the output is low (approach 2). Approach 1 is verified experimentally using the RRAM arrays, as shown in Fig. 4.
Here Vdd = 4 V is applied on the output cell C and ½Vdd (2 V) is applied on input cells A and B. The value (high resistance state HRS = 0, low resistance state LRS = 1) of cell C after programming is determined by the values of the input cells A and B, with C = NOT(A + B), i.e., C = A NOR B. The programmed logic elements (input/output values for a given function) can be directly read out by a data-matching method without additional decoders, as illustrated in Fig. 5. First, both the input data and its complement are applied to the corresponding WLs for all BLs, and the currents on the BLs are read out. Only the BL whose stored input cell values match the current input values will produce a current smaller than a predefined threshold current I0 and can thus be correctly identified. This process essentially performs an RRAM-based content-addressable memory (CAM) function. The desired result can then be directly read out from the identified address (BL) by applying Vread to the output WL. By combining stored logic (analogous to local look-up tables) and direct read-out with the non-volatile storage capabilities of RRAM arrays, truly co-located memory and logic can be achieved, and this in-memory computing architecture offers fast operation speed, low power, high reliability, and high function density.
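The stored wired-NOR behavior can be illustrated with a minimal behavioral model. The following Python sketch is hypothetical (not the authors' implementation); cell states are abstracted to 0 = HRS and 1 = LRS.

```python
# Behavioral sketch (assumption, not the authors' code) of the wired-NOR
# programming step from approach 1. Cell states: 0 = HRS, 1 = LRS.
def wired_nor(a, b):
    """State of output cell C after applying Vdd to C and Vdd/2 to A, B."""
    # If either input cell is in LRS, the programming condition leaves the
    # output cell in HRS; only for inputs (0, 0) is C programmed to LRS.
    return 0 if (a or b) else 1

# NOR is functionally complete, so any Boolean element can be parsed
# into wired-NOR steps, for example:
def nor_not(a):
    return wired_nor(a, a)           # NOT from one NOR

def nor_or(a, b):
    return nor_not(wired_nor(a, b))  # OR from two NORs
```

Because NOR is universal, this single programming primitive suffices to store any Boolean data-result table in the array.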

17.5.1

IEDM15-459

Device performance
Unlike memory applications, the Ron/Roff ratio and the programming/read currents are the most important parameters in this work, due to the need to read many cells on one BL in parallel and to program cells across the whole array in each programming step. The Cu/Al2O3/poly-Si based RRAM crossbar array in this work shows superior performance for in-memory computing applications, such as ultra-low operation current (< 100 nA), large effective Ron/Roff ratio (> 10^3), and good retention and uniformity (Figs. 7-9). The experimental I-V curve can be well matched using a compact, dynamic device model [8] (Fig. 7), allowing realistic simulations of the computing system at large scale.

Circuit demonstration, verification and discussion
Here, a 1-bit full adder was demonstrated as a proof of principle of the proposed computing architecture. As shown in Fig. 10, the 1-bit FA is implemented on board by four 1×16 RRAM crossbar arrays. The logic steps and the pulse-trains used to program the stored logic are shown in Fig. 11a-b. The stored logic functions are verified by resistance mapping of the RRAM array (Fig. 10b), which shows that the desired logic results can be correctly stored and read out using the proposed parallel write and read schemes (Fig. 10c). Fig. 11 shows the simulated programming process of the 1-bit FA in an 8×16 crossbar array to highlight the details of system operation. The FA function is parsed into NOR operations in (a) and stored in the array using the pulse trains shown in (b). The cell values in the RRAM array after the programming stage are shown in (c), and (d) plots the current on 4 representative BLs during the programming stage, highlighting the evolution of the cell states and verifying the feasibility of the proposed programming (function storage) scheme.
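To make such parsing concrete, a 1-bit full adder can be expressed entirely in 2-input NOR operations. The sketch below is a hypothetical decomposition; the exact one used in Fig. 11a may differ.

```python
def nor(a, b):
    """2-input NOR, the basic stored-logic primitive."""
    return int(not (a or b))

def xor(a, b):
    # XOR from five NORs: (a OR b) AND NOT(a AND b)
    return nor(nor(nor(a, a), nor(b, b)), nor(a, b))

def and_(a, b):
    return nor(nor(a, a), nor(b, b))

def or_(a, b):
    return nor(nor(a, b), nor(a, b))

def full_adder(a, b, cin):
    """Return (sum, carry_out) built only from NOR gates."""
    s = xor(xor(a, b), cin)
    cout = or_(and_(a, b), and_(cin, xor(a, b)))
    return s, cout
```

Every call to `nor` above corresponds to one wired-NOR step that can be mapped onto the array, so the adder's full truth table can be generated and stored in parallel.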
After function storage, the desired output can be directly read out for any given input during the computing stage, as shown in Fig. 12, which presents the simulated current output during the read operation using the data-matching method. In this stage, an identification signal corresponding to the input and its complement is first applied to the input WLs (WL1-6). For example, identification signal 020202 is applied to the first 6 WLs for input ABC = 101, where "2" in the address code represents a 2 V read voltage pulse and "0" represents ground. If the signal matches the input configuration (and its complement) stored in a BL, the current in this BL will be low, since all Vread pulses are applied to devices at HRS and 0 V is applied to devices at LRS. A low current I = n×Vread/RHRS is obtained for the matched BL, where n is the number of "0"s (cells at HRS receiving Vread) in the input and its complement and RHRS is the resistance at HRS. In contrast, any other BL shows a much higher current, because at least one input is not perfectly matched with the address code and produces a high current of I = Vread/RLRS (RLRS: resistance at LRS). As a result, the corresponding BL can be directly identified using a simple comparator on the output current, without any address decoding scheme. As shown in Fig. 12, input values 111, 101, 100 and 000 are correctly identified as stored in BLs 1, 3, 4 and 8 (marked in green), and the corresponding output values for Ci and S are then correctly read out as 10, 01, 11 and 00, respectively. A 4-bit multiplier is further developed to verify that programmed arrays with different stored logic functions can be combined to process more complex tasks, as illustrated in Figs. 13-15. The 4-bit Mult. is based on the same approach and
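The read margins described above can be reproduced with a small numerical sketch. The device values below are hypothetical, and the per-bit cell pair (x, x̄) encoding and WL ordering are assumptions made for illustration.

```python
def bl_current(stored_bits, query, v_read=2.0, r_lrs=1e5, r_hrs=1e8):
    """BL read current (A) under the data-matching scheme (sketch).

    Assumption: each stored input bit x occupies two cells, (x, NOT x),
    with 1 = LRS and 0 = HRS; the identification signal places Vread on
    the pattern (NOT q, q) for each query bit q, so on a matching BL
    every Vread pulse lands on an HRS cell.
    """
    current = 0.0
    for x, q in zip(stored_bits, query):
        cells = (x, 1 - x)                      # stored pair (x, x-bar)
        volts = ((1 - q) * v_read, q * v_read)  # identification signal
        for state, v in zip(cells, volts):
            current += v / (r_lrs if state else r_hrs)
    return current

# Matched BL: I = n*Vread/RHRS; any mismatch adds at least Vread/RLRS.
i_match = bl_current([1, 0, 1], [1, 0, 1])  # stored input matches query
i_miss = bl_current([1, 1, 1], [1, 0, 1])   # one bit mismatched
```

With these numbers the matched and mismatched currents differ by roughly Roff/Ron, which is what lets a single threshold comparator identify the correct BL.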


is achieved by reading out results from a 2-bit Mult. and a 2-bit FA. The 2-bit Mult. and the 2-bit FA are programmed using the same parallel NOR logic discussed earlier; the corresponding resistance mappings are illustrated in Fig. 14. 1011×1101 was calculated as an example. First, the partial products 11×01, 11×11, 10×01 and 10×11 were obtained by reading out results from the 2-bit Mult. Second, the partial products were shifted and added by the 2-bit FA. The correct result is then obtained, as shown in Fig. 15. More complex functions can either be parsed into such basic functions stored in small RRAM arrays, or into more complex sub-functions which require larger arrays to store the required data-result tables. In other words, a proper tradeoff needs to be made between time (operation steps) and space (array size) in the crossbar-array-based computing architecture for complex functions.

In addition, critical circuit performance parameters such as power consumption, speed and operation current were analyzed through simulation using the dynamic model with realistic device parameters. Fig. 16 shows the calculated average power consumption per operation during the programming and read-out stages. Employing the data-matching method significantly reduces the power consumption of the system, and a low energy consumption of 3.67 nJ was obtained for a complete 4-bit multiplication. Besides its effect on power consumption, the operation current of the RRAM is also critical for the circuit's reliability: since many devices are programmed or read at the same time, the maximum current through a BL/WL in the worst case scales with the array size, so a small programming current is desirable. Fig. 17 shows the importance of the Ron/Roff ratio in this in-memory computing scheme. If the wire resistance can be ignored, the maximum array size that can still operate reliably is found to be proportional to the Ron/Roff ratio of the RRAM cell.
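The shift-and-add composition can be sketched as follows, where `mult2` is a hypothetical stand-in for a look-up into the programmed 2-bit Mult. array:

```python
def mult2(x, y):
    """Stand-in for reading a product from the stored 2-bit Mult. table."""
    assert 0 <= x < 4 and 0 <= y < 4
    return x * y

def mult4(a, b):
    """4-bit multiply from four 2-bit partial products.

    The partial products are shifted and summed, which is the role
    played by the 2-bit FA in the scheme above.
    """
    a_hi, a_lo = a >> 2, a & 0b11
    b_hi, b_lo = b >> 2, b & 0b11
    return ((mult2(a_hi, b_hi) << 4)
            + ((mult2(a_hi, b_lo) + mult2(a_lo, b_hi)) << 2)
            + mult2(a_lo, b_lo))
```

For the example in the text, `mult4(0b1011, 0b1101)` reproduces 1011×1101 = 143; larger operands decompose the same way at the cost of more read-out steps, which is the time/space tradeoff noted above.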
As a result, RRAM cells that offer low programming/read current, high Ron/Roff ratio and fast operation speed are desired for the proposed computing architecture.

Summary
Key achievements of this work: 1) a novel RRAM crossbar-based computing architecture is developed and experimentally demonstrated for low-current, highly parallel and reconfigurable computing, particularly for "big data" processing and low-power ASICs; 2) core operations and design principles are developed for circuit design automation and verification; 3) a 1-bit FA and a 4-bit Mult. are demonstrated by experiment and simulation to verify the proposed architecture.

Acknowledgment
The authors would like to thank Dr. Z. Mohammed and Q. Wang for helpful discussions. This work was supported in part by the AFOSR through MURI grant FA9550-12-1-0038 and by the National Science Foundation (NSF) through grant CCF-1217972.

References
[1] R. Waser et al., Adv. Mater. 21, 2632 (2009)
[2] H.S. Wong et al., Proc. IEEE 100, 1951 (2012)
[3] K.H. Kim et al., Nano Lett. 12, 389 (2012)
[4] M.-J. Lee et al., IEDM 2008, 85
[5] J. Borghetti et al., Nature 464, 873 (2010)
[6] A. Siemon et al., Adv. Funct. Mater., 201500865 (2015)
[7] S. Gaba et al., IEEE Electron Device Lett. 35, 1239
[8] P. Sheridan et al., Nanoscale 3, 3833 (2011)


Fig.1 Schematics of the Von Neumann bottleneck due to data exchange between memory and arithmetic units, and a solution based on crossbar in-memory computing.

Fig. 2 Schematic illustration of a possible operation flow. Locally stored logic and parallel programming and read out are developed for efficient computing.

Fig. 4 Measured current response of the NOR logic when applying ½Vdd on inputs A and B and Vdd on output C. The testing circuit is based on approach 1 shown in Fig. 3.

Fig. 7 Measured and simulated I-V of the device. Very low (< 100nA) programming current is obtained. Good matching is achieved by the compact model.

Fig. 3 Schematic of the wired-NOR logic implementation using approach 1 (left) and approach 2 (right).

Fig. 5 The proposed data-matching method during function readout. Only the BL whose stored input cell values match the input produces a current smaller than I0.

Fig. 8 Measured on-state and off-state resistance distributions of 20 randomly chosen devices. Each device is switched for 5 DC cycles. Good uniformity and a high on/off ratio are achieved.

Fig. 6 (a) Optical microscope image of the fabricated RRAM array. (b) Schematic of the device and array structure. (c) Fabrication process flow of the Cu/Al2O3/poly-Si RRAM array.

Fig. 9 Measured retention behavior of the device in the on state and off state. The device is stable at 85 °C.

Fig. 10 Experimental demonstration of the 1-bit full adder. (a) Photo of the board and the RRAM chip. (b) Measured resistance mapping after storing the function in the RRAM array for 4 input cases. (c) Measured BL current during the read-out process from the RRAM array. The correct stored input can be identified and the correct output can be read out from the array.


Fig. 11 Proposed programming process of the 1-bit full adder. (a) Logic functions used for the 1-bit FA. (b) Programming pulse trains applied on the WLs. (c) Resistance mapping after programming. (d) Response current on BLs 1, 2, 4 and 8, showing cell evolution during the programming stage.

Fig. 12 Simulated readout process of the 1-bit FA. The logic results can be correctly located (middle panel, highlighted by the green windows) and read out (bottom panel).

Fig. 13 Operation scheme of the 4-bit Mult. through logic stored in 2-bit Mult and 2-bit FA.

Fig. 16 Simulated average power consumption per operation during the read and programming processes for different operations. The data-matching method significantly reduces power consumption.

Fig. 17 Current ratio during read out as a function of Ron/Roff ratio and array size. Higher Ron/Roff ratio enables reliable logic operations in a larger array.

Fig. 14 Simulated resistance mappings of (a) the programmed 2-bit Mult. and (b) the programmed 2-bit FA. Correct data-result tables are stored and can be correctly read out.

Fig. 15 Simulated readout process of the 4-bit Mult. (a) Current response during read of the multiplication terms from the 2-bit Mult. (b) Current response during read of the sum operation from the 2-bit FA. 1011×1101 = 10001111 is demonstrated.
