Enabling Architectural Innovations Using Non-Volatile Memory
Vijaykrishnan Narayanan†, Vinay Saripalli†, Karthik Swaminathan†, Ravindhiran Mukundrajan†, Guangyu Sun†, Yuan Xie† and Suman Datta‡
† Department of Computer Science and Engineering
‡ Department of Electrical Engineering
The Pennsylvania State University
Email: vija[email protected]

(NVM) technologies, such as Spin Torque Transfer Magnetoresistive Random Access Memory (STT-MRAM, also referred to as MRAM), Phase Change Memory (PCRAM) and Ferro-electric Memory (Fe-RAM) are emerging as promising alternatives for use in future memory hierarchies. They show promise to achieve the speed of SRAM, the density of DRAM, and the non-volatility of Flash memory in a single technology.
ABSTRACT
The emergence of non-volatile memory technologies such as Spin Torque Transfer Magneto-resistive Random Access Memory and Phase Change Memories provides new opportunities for architectural innovations. While the zero off-state leakage, fast read access and high densities of these memories make them attractive options as compared to SRAM, their high write energies and latencies as well as their endurance are a concern. We provide three different architectural techniques that utilize the STT-MRAM characteristics to enable new functionalities. First, we show how exploiting the higher density of STT-MRAM in embedded multi-tasked systems can reduce the context switch overhead. Second, we use the STT-MRAM to create a reliable copy of SRAM structures vulnerable to radiation-induced transient errors to improve reliability. Finally, we show a hybrid cache architecture that uses a mix of emerging TFET technology and STT-MRAM technology. Our results indicate that active leakage is still a concern in STT-MRAM structures.
Figure 1 shows the comparison of different memory technologies with regard to latency, energy and area. It can be observed that no single memory technology is the best in all these design metrics. NVM technologies provide advantages of small cell size, zero idle power consumption and data retention ability in presence of supply loss as compared to SRAM cells. All these alternatives are as dense as DRAM, while providing faster read access speed and no refresh energy requirements. In addition, these NVM memories are considered more robust to radiation-induced soft errors and power supply fluctuations [4, 5, 6]. The maturity of the stacked 3D technology has enabled system architects to exploit the advantages of these NVM technologies in combination with traditional CMOS technology. However, NVM technologies also share concerns of increased write latencies, larger write energies and limited endurance.
Categories and Subject Descriptors
B.3.0 General.

General Terms: Design

Keywords
Emerging Technology, Reliability, TLB, STT-MRAM, PCRAM, TFETs.
1. INTRODUCTION
With the continued scaling of transistor dimensions, the number of processors in a single chip continues to increase. While four to eight cores in a chip are quite common, chips with as many as 48 cores [1] and 100 cores [2] have been announced. However, the growing disparity between the on-chip logic and the off-chip memory poses a significant challenge to system performance. Consequently, memory hierarchy design is a critical challenge for the emerging chip multi-processor paradigm.

SRAM and DRAM cells have been the two main conventional technologies employed in the design of processor memory hierarchies. However, technology scaling of SRAM and DRAM is increasingly constrained by fundamental technology limits to mitigate the power and memory walls. Non-volatile memory
Figure 1. Comparison of Memory Technologies (data from [7])

The unique features of NVM technologies offer new opportunities for architects to reduce check-pointing overheads [8], reduce context switch overheads and enable non-volatile configuration for FPGAs [9]. In all these applications, the non-volatile nature and their high densities are the key features exploited. Further, the slow write speed or the write energies are not critical impediments in these applications. For example, in an FPGA, the write speed and energy are not critical as the frequency of reconfigurations is generally low. However, the ability to retain
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GLSVLSI’11, May 2–4, 2011, Lausanne, Switzerland. Copyright 2011 ACM 978-1-4503-0667-6/11/05...$10.00.
the state of the configuration when powered down obviates the need for external configuration memory.

Architects have also started exploring memory hierarchies that employ a heterogeneous mix of both emerging and conventional technologies that provide large improvements in performance and power-efficiency as compared to current systems. These systems exploit the features of NVM while masking their drawbacks by supplementing the NVM with other memory technologies. As an example, SRAM memory cells provide very fast read and write accesses but do not achieve densities similar to that of magnetic-RAM (MRAM) cells. Consequently, replacing SRAM caches with MRAM caches can yield a larger and less leaky memory sub-system. However, the higher write latencies and energy of MRAM when compared to SRAM result in a design trade-off when considering such replacements. A heterogeneous architecture that can steer writes to SRAMs while retaining most other accesses in MRAMs can capture the best characteristics of the different technologies. Another approach that has been recently proposed to reduce the write overheads associated with MRAM is to provide a tradeoff between the duration of data retention and the write energies/access times. In [10], the authors show a technique to reduce the write energy by decreasing the data retention to finite times. Such tradeoffs can be used to design energy-efficient memory structures such as caches where the data needs to be valid only for a finite time. Further, the fast writes achievable by such data retention tuning of MRAMs enable snapshot copies of system state, enhancing the data reliability.
Figure 2: IPC of single process workload with the corresponding multiprocess workload IPCs: The black bars represent each workload running in isolation with no context switch overhead. The adjacent grey bars correspond to the performance of that workload when run in conjunction with each of the other workloads indicating performance loss due to context switch.

The use of MRAM in designing the TLB instead of using SRAM based TLBs offers multiple advantages. Firstly, MRAM offers about three to four times higher density than SRAM, resulting in more TLB entries in an iso-area structure. Consequently, one can maintain the state for multiple applications (processes) at the same time in the same die area. This eliminates the need to flush the TLB, which is one of the major reasons for performance losses during context switches. Secondly, since writes to TLBs (that are expensive in MRAM as compared to SRAMs in energy and latency) are quite infrequent and the read latency of MRAM is similar to that of SRAM, MRAM based TLBs offer similar performance. Finally, the non-volatility of MRAM can be exploited to preferentially retain entries that are accessed more often across context switch boundaries while incurring zero standby power when gated. These properties make an MRAM based TLB an attractive proposition for embedded devices.
In this paper, we provide three case studies that highlight novel architectural innovations enabled by non-volatile memories. First, we show how the high-density of MRAM can be exploited to implement larger translation look aside buffers (TLB) in an iso-area design compared to SRAM TLBs to mitigate context-switch overheads in multi-programmed embedded systems. Second, we show how MRAMs can be used to enhance the reliability of datapath instruction execution when exposed to radiation-induced soft-errors. Finally, we will show that, contrary to popular belief, the leakage power of NVM remains a concern, and propose an energy-efficient hybrid memory architecture composed of ultra-low leakage Inter-band Tunnel Field Effect (TFET) Transistor-based memories and MRAMs.
We explore two MRAM based TLB designs to evaluate these benefits over SRAM based TLBs. We consider a pure technology replacement (higher capacity due to increased density) and a partitioned design where TLB entries of four different processes are stored concurrently (Figure 3). The pure technology replacement approach has little impact on the context switch overhead or the overall performance of the application in a multi-tasking environment. In contrast, the partitioned TLB design permits us to simply switch the TLB along with the context while allowing the TLB entries of other processes to be kept in supply-gated mode consuming no static power. Thus, in a 4-partition TLB, only 1 partition is active at any point of time while the other partitions are shut down. This results in significant energy savings. In addition, this approach avoids the performance unfriendly flushing of the TLB at context switches and TLB “cold” misses at the beginning of a new context.
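The partitioned design can be illustrated with a small behavioral model. The sketch below is not the evaluated hardware: the class name `PartitionedTLB`, the LRU replacement inside a partition, and the rule for reclaiming a partition when more processes than partitions exist are all assumptions made for illustration. It only shows the key idea that a context switch re-selects a retained partition instead of flushing entries.

```python
from collections import OrderedDict


class PartitionedTLB:
    """Toy model of a partitioned non-volatile TLB (hypothetical, for illustration).

    Each process owns one partition. Only the active partition is powered;
    the others are supply-gated but keep their entries.
    """

    def __init__(self, num_partitions=4, entries_per_partition=32):
        self.entries_per_partition = entries_per_partition
        self.partitions = [OrderedDict() for _ in range(num_partitions)]
        self.owner = [None] * num_partitions  # pid that owns each partition
        self.active = None                    # index of the powered partition
        self.flushes = 0                      # partitions that had to be cleared

    def context_switch(self, pid):
        """Activate the partition of `pid`; entries of other processes are kept."""
        if pid in self.owner:
            self.active = self.owner.index(pid)
            return
        if None in self.owner:
            victim = self.owner.index(None)   # a free partition is available
        else:
            # Assumed reclaim rule: take the partition after the active one.
            victim = (self.active + 1) % len(self.partitions)
            self.partitions[victim].clear()
            self.flushes += 1
        self.owner[victim] = pid
        self.active = victim

    def lookup(self, vpn):
        """Return the cached frame number, or None on a TLB miss."""
        part = self.partitions[self.active]
        if vpn in part:
            part.move_to_end(vpn)             # refresh LRU position
            return part[vpn]
        return None

    def insert(self, vpn, pfn):
        """Install a translation in the active partition (LRU eviction assumed)."""
        part = self.partitions[self.active]
        if vpn in part:
            part.move_to_end(vpn)
        elif len(part) >= self.entries_per_partition:
            part.popitem(last=False)          # evict least recently used entry
        part[vpn] = pfn
```

In this model a process that is switched back in finds its earlier translations still present, which is the source of the avoided "cold" misses described above; a flush occurs only when a partition has to be reclaimed for a new process.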
2. Reducing Context Switch Overheads
Next generation embedded devices are expected to offer full-fledged multi-processing capabilities of current desktops. For example, the recently released iPhone 4 runs a multi-tasking operating system (iOS4) whereas iPhone 3 only allows multi-tasking for a limited set of applications pre-loaded on the device, and previous versions did not support any multi-tasking at all. However, due to limited hardware resources in the embedded systems, context switch overheads are a critical concern while providing multi-tasking ability.

To support multi-process workloads, embedded systems typically schedule applications to run on the processor core that are swapped in and out based on the scheduling policy in use. When a process is swapped out, its context in the TLB must be replaced with that of the newly scheduled process. The overhead incurred on account of context switches is significant in terms of its effect on performance (Notice the IPC degradation for the multi-process workload in Figure 2) for a multi-programmed workload of embedded applications from the MiBench suite.
As an early demonstrator of these benefits of NVMs, Dong et al. [8] used PRAMs to reduce the check-pointing overhead of exascale systems. They utilized the significantly higher bandwidth and lower latency provided by PRAMs stacked using 3D technology on a processor as compared to disk based check-pointing systems. This idea can be extended to enhance the reliability of processor data paths. A significant part of a contemporary superscalar processor is composed of memory elements. These components include structures such as register files, issue queue, reorder buffer, load-store queues that are critical to maintaining the correct architectural state of the program execution. A key reliability concern is the bit flips in these memory elements arising from radiation-induced transient errors. These memory elements are more vulnerable when they are highly utilized since there are fewer errors that are inconsequential to the correct micro-architectural state.
Figure 3: Comparison of SRAM based TLB with partitioned MRAM based TLB. In the MRAM based TLB, only 1 partition is active while the remaining 3 partitions are switched off.

Our evaluation with a multi-programmed workload of embedded applications from the MiBench suite showed a performance improvement of 24% when using the partitioned TLB design as compared to the SRAM based TLB design. The partitioned MRAM TLB design allows the application in the multi-programmed environment to approach the performance when it is the only application executing (See single process bar in Figure 2). Further, we were able to obtain leakage energy savings of around 97% under iso-area conditions and 95% under iso-capacity conditions.
We have observed that there are periods during the execution of an application when these memory structures have high utilization with little change in their state. We have developed a predictor to identify these periods (not discussed for brevity) and have referred to these periods of execution as high vulnerability and low throughput periods of an application. For example, such periods are observed when high latency L2 cache misses occur.

To mitigate the effects of transient failures during these high vulnerability periods, we perform a snapshot copy of the data in different micro-architectural structures like Reorder Buffer, Issue Queue and Load/Store Queue into an MRAM buffer (See Figure 5). At the end of this low-throughput period, we partially restore all the non-modified data from MRAM into the SRAM buffer, mitigating errors from radiation-induced soft errors on these entries. Since the snapshot period is large, the high write latency to the back-up MRAM structure is not a significant performance impediment as compared to completely replacing these SRAM structures with MRAM memories.
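The snapshot-and-restore sequence can be sketched in a few lines. This is a simplified software model under stated assumptions, not the hardware mechanism: the class `SnapshotProtectedBuffer`, the per-entry "modified" tracking, and the `inject_bit_flip` helper are hypothetical names introduced only to show why unmodified entries recover from a bit flip while modified entries do not.

```python
class SnapshotProtectedBuffer:
    """Toy model of an SRAM structure backed by an MRAM snapshot (hypothetical)."""

    def __init__(self, size):
        self.sram = [0] * size   # vulnerable working copy
        self.mram = None         # soft-error-immune snapshot, if one is held
        self.modified = set()    # entries written since the snapshot was taken

    def write(self, index, value):
        """Normal update by the pipeline; marks the entry as modified."""
        self.sram[index] = value
        if self.mram is not None:
            self.modified.add(index)

    def begin_low_throughput_period(self):
        """Copy the whole structure into the MRAM buffer."""
        self.mram = list(self.sram)
        self.modified.clear()

    def end_low_throughput_period(self):
        """Restore every entry that was not legitimately modified."""
        for index, saved in enumerate(self.mram):
            if index not in self.modified:
                self.sram[index] = saved
        self.mram = None
        self.modified.clear()

    def inject_bit_flip(self, index, bit):
        """Model a radiation-induced upset in the SRAM copy only."""
        self.sram[index] ^= 1 << bit
```

An entry that is flipped but never written during the period is overwritten with its snapshot value at the end of the period. An entry that was written in the meantime is left alone, so it stays unprotected in this model, which is consistent with the statement that complete soft error tolerance is not guaranteed.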
Figure 4: IPC Comparison of Individual Components of multiprocess workloads. The adjacent bars denote the performance of the benchmark when run in a multi-process workload for the SRAM TLB, a technology replacement MRAM TLB with no partitioning and the partitioned MRAM TLB.
3. Resilient Architectures using NVM
Non-Volatile memory cells are ideal candidates for check-pointing the state of the machine for error resiliency. If an error is discovered in the machine state, the check-point can be used to roll back the system to a known correct state.
Figure 5: MRAM Snapshot Copy
Certain aspects of NVM such as PRAM and MRAM are relevant in the context of resilient architectures. The MRAM memory state is immune to soft errors resulting from radiation since the data storage in MRAM utilizes magnetic orientation to store data instead of charge. Further, due to their non-volatile nature, transient power glitches or power loss do not cause loss of the check-pointed state. Finally, they exhibit faster access speeds and energy-efficiencies as compared to other storage technologies such as Flash or disk arrays used for check-pointing.
While a completely soft error tolerant operating environment cannot be guaranteed by using this technique, the vulnerability of the structures is reduced significantly at a minimal performance loss. This approach is comparable to selective instruction replication and comparison techniques used currently on the basis that many applications tolerate some errors. Our evaluation using SPEC OMP benchmarks indicates that the proposed scheme achieves better error-protection with lower performance loss (as
energy-delay trade-off diminishes for CMOS-based circuits. At low supply voltages, it is possible to take advantage of the steep sub-threshold slope to deliver higher Ion, while maintaining a good Ion/Ioff ratio. We use a modified version of the cache analysis tool CACTI [16], in order to evaluate the energy-delay performance of a TFET-based L2 cache. In order to overcome the problem of asymmetric conduction in TFETs, we implement a 6-T SRAM Cell with virtual-ground from [17] in CACTI. In the rest of this section we discuss the architecture and the utility of a hybrid TFET-MRAM L2 cache.
shown in Figure 6 in the combined architectural vulnerability-performance metric) as compared to selective duplication of instructions in the high vulnerability periods.
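One plausible way to write the combined metric named in the Figure 6 caption (execution time-AVF product normalized to baseline, where AVF is the architectural vulnerability factor) is given below. The symbols $T$ and $\mathrm{AVF}$ with subscripts are our notation, and the exact definition used in the evaluation is an assumption here.

```latex
M_{\text{norm}}
  = \frac{T_{\text{scheme}} \cdot \mathrm{AVF}_{\text{scheme}}}
         {T_{\text{base}} \cdot \mathrm{AVF}_{\text{base}}}
% Hypothetical numbers, for illustration only:
% a scheme with 5% longer execution time and 40% lower AVF gives
% M_norm = 1.05 * 0.60 = 0.63, i.e. lower (better) than the baseline value of 1.
```

Under this reading, a value below 1 means the vulnerability reduction outweighs the performance loss.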
Figure 6. Execution Time-AVF Product normalized to baseline (no optimizations to reduce vulnerability).

4. Energy Efficient Hybrid Cache Memories
Several recent efforts focus on exploiting the high density and low-leakage of NVM technologies in designing cache hierarchies. Many of these cache designs employ a mix of technologies to create a combination of desirable features. For example, writes can be steered away from NVM memories to SRAM memories in a hybrid cache to either increase endurance [11], reduce write latencies [12] or reduce energy consumption [13]. In contrast, the higher density of MRAM can provide higher capacity for the caches accentuating performance, while the low leakage of the MRAM cells compared to SRAM cells keeps leakage power in check.

Table I. Comparison of MRAM, SRAM and TFET Technologies.

               Read     Write    Leakage Power (mW)      Read @ 875 MHz   Write @ 875 MHz
               Energy   Energy   Bit-cells   Peripheral  (cycles)         (cycles)
               (nJ)     (nJ)
SRAM (2 MB)    0.62     0.62     1386        264         3                3
MRAM (8 MB)    0.76     5        15          215         3                10
TFET (2 MB)    0.15     0.15     0.1         0.7         5                5
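As a quick check on the leakage columns of the technology comparison table, the short sketch below adds the bit-cell and peripheral leakage for each technology and reports the share contributed by the peripherals. The dictionary layout and the function name `leakage_summary` are our own; the numbers are the bit-cell and peripheral leakage values listed in the table.

```python
# (bit-cell leakage, peripheral leakage) in mW, taken from the comparison table.
TABLE_LEAKAGE_MW = {
    "SRAM (2 MB)": (1386.0, 264.0),
    "MRAM (8 MB)": (15.0, 215.0),
    "TFET (2 MB)": (0.1, 0.7),
}


def leakage_summary(table):
    """Return total leakage and the peripheral fraction for each technology."""
    summary = {}
    for name, (cells_mw, peripheral_mw) in table.items():
        total_mw = cells_mw + peripheral_mw
        summary[name] = {
            "total_mw": total_mw,
            "peripheral_share": peripheral_mw / total_mw,
        }
    return summary


if __name__ == "__main__":
    for name, row in leakage_summary(TABLE_LEAKAGE_MW).items():
        print(f"{name}: total {row['total_mw']:.1f} mW, "
              f"peripheral share {row['peripheral_share']:.1%}")
```

For the MRAM array the total is 230 mW, of which 215 mW (about 93%) comes from the CMOS peripherals, which is why gating individual cache lines alone recovers little of the leakage.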
TFET caches have the potential to achieve even lower leakage than MRAM memory when powered on. This results primarily from the high leakage in the MRAM peripheral circuitry, which is composed of CMOS circuits, as shown in Table I. Further, TFET memories have lower dynamic write energies than MRAM. However, TFET read access times are inferior to that of MRAM cells as shown in Table I due to their low-voltage operation. A hybrid cache architecture similar to [13] with a majority of TFET cache ways and a minority of MRAM ways has been implemented.

The TFET hybrid cache architecture has a majority of TFET ways in order to reduce the total leakage of the cache. The primary motivation of this architecture is to keep as much of write intensive data in the TFET ways as possible and hence reduce the number of write operations to the MRAM. Similarly, this architecture also aims to keep as much of read intensive data in the MRAM ways as possible, in order to reduce the number of read operations to the TFET. In order to improve the performance and reduce the power of this hybrid L2 cache, a cache management policy has been implemented which can be described as follows:

• The cache controller is aware of the locations of TFET cache ways and MRAM cache ways. When there is a write miss, the cache controller first tries to place the data in the TFET cache ways. Similarly, when there is a read miss, the cache controller first tries to place the data in the MRAM cache ways.

• Considering the high probability that a processor core writes data to a specific group of cache lines repeatedly, data in MRAM caches is migrated to TFET caches if the same cache lines are frequently written to. Data in MRAM caches will be migrated to TFET caches when they are accessed by two successive write operations.
Due to the existence of this data migration policy, the number of write accesses from processor cores to MRAM caches can be reduced. Further, in this TFET-MRAM hybrid memory we not only steer writes to the TFET ways but also steer the reads to the MRAM ways in a similar manner.
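The placement and migration rules above can be sketched for a single cache set as follows. This is an illustrative model only: the class `HybridCacheSet`, LRU replacement within each region, the fallback to the other region when the preferred one is full but the other has a free way, and counting only processor writes (not fills) are all assumptions that the policy description does not specify.

```python
from collections import OrderedDict


class HybridCacheSet:
    """Toy model of one set of a TFET-MRAM hybrid cache (hypothetical)."""

    def __init__(self, tfet_ways=7, mram_ways=4):
        self.capacity = {"TFET": tfet_ways, "MRAM": mram_ways}
        # Per region: tag -> number of successive writes seen so far.
        self.ways = {"TFET": OrderedDict(), "MRAM": OrderedDict()}
        self.mram_writes = 0   # processor writes that landed in MRAM
        self.migrations = 0    # lines moved from MRAM to TFET

    def _find(self, tag):
        for region in ("TFET", "MRAM"):
            if tag in self.ways[region]:
                return region
        return None

    def _insert(self, region, tag):
        lines = self.ways[region]
        if len(lines) >= self.capacity[region]:
            lines.popitem(last=False)          # evict the LRU line (assumed)
        lines[tag] = 0

    def _place(self, tag, preferred):
        other = "MRAM" if preferred == "TFET" else "TFET"
        if len(self.ways[preferred]) < self.capacity[preferred]:
            region = preferred
        elif len(self.ways[other]) < self.capacity[other]:
            region = other                     # assumed fallback to a free way
        else:
            region = preferred
        self._insert(region, tag)
        return region

    def access(self, tag, is_write):
        """Perform one access and return the region that holds the line afterwards."""
        region = self._find(tag)
        if region is None:
            # Write misses prefer TFET ways, read misses prefer MRAM ways.
            region = self._place(tag, "TFET" if is_write else "MRAM")
        self.ways[region].move_to_end(tag)
        if not is_write:
            self.ways[region][tag] = 0         # a read breaks the write streak
            return region
        self.ways[region][tag] += 1
        if region == "MRAM":
            self.mram_writes += 1
            if self.ways[region][tag] >= 2:    # two successive writes: migrate
                del self.ways["MRAM"][tag]
                self._insert("TFET", tag)
                self.migrations += 1
                region = "TFET"
        return region
```

In this model a line that is read-allocated into MRAM and then written twice in a row ends up in a TFET way, so later writes to it no longer pay the MRAM write latency and energy.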
In contrast to the prior efforts that consider MRAMs as low leakage structures, we observe that the active leakage of the MRAM peripherals is still significant. Since the MRAM cells by themselves exhibit very low leakage, supply gating at fine granularity of cache lines without turning off peripheral blocks does not provide significant energy savings. Consequently, we explore a new hybrid design utilizing ultra-low leakage TFET devices in combination with MRAM.
Figure 7. (A) > 60 mV/decade threshold slope of a MOSFET (B) < 60 mV/decade threshold slope of a Tunnel-FET

Novel Inter-band Tunneling Field Effect Transistors (TFETs) have been experimentally demonstrated with the potential to show sub-60 mV/decade sub-threshold slope [14][15] (see Figure 7 for the steeper sub-threshold slope of TFETs). TFET devices can be used to achieve energy efficient operation at low VCC, where the
Figure 8 (A). The write intensity to MRAM before and after using the hybrid TFET-MRAM cache. (B). The read intensity to TFET before and after using the hybrid TFET-MRAM cache.

memory sub-system along with MRAM technology can provide additional benefits. To this end, we evaluate our scheme using an embedded processor-core operating at 875 MHz. We consider a 2 MB, 8-way L2 cache composed of TFETs as the base case. The TFET-MRAM hybrid L2 cache is composed of a majority of 7-ways of TFET memory and 4 ways of MRAM, since the area footprint of MRAM is 1/4th of that of the TFET. Parallel SPEC OMP benchmark applications are used to evaluate the TFET-MRAM hybrid memory. The experimental set-up is shown in Table II.
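The choice of 7 TFET ways plus 4 MRAM ways can be read as an iso-area substitution. The short derivation below is our interpretation of the 1/4 area ratio stated above; the symbol $A_{\text{way}}$ (area of one TFET way) is introduced here for illustration.

```latex
A_{\text{hybrid}}
  = 7\,A_{\text{way}} + 4 \cdot \tfrac{1}{4}\,A_{\text{way}}
  = 8\,A_{\text{way}}
  = A_{\text{TFET, 8-way}}
```

Under this assumption, one TFET way of the 8-way base case is traded for four MRAM ways without changing the array footprint.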
Figure 9: The comparison of IPC among 2MB TFET cache, 8MB MRAM cache and TFET-MRAM hybrid cache (Normalized to IPC of 8MB pure MRAM cache).

Table II. Configuration parameters for Hybrid TFET-MRAM study.

Processors
  # of cores          8
  Frequency           875 MHz
  Issue Width         1
Memory
  L1 Cache            Private, 32 + 32 KB, 2-way, 64B line, 1-cycle, write-back, 1 read/write port
  TFET L2             Shared 2MB, 8-way, 64B line, write-back, 1 read/write port
  MRAM L2             Shared 8 MB, 8-way, 64B line, write-back, 1 read/write port
  TFET-MRAM Hybrid    Shared 7-way (2MB) TFET + 4-way (1MB) MRAM, 64B line, write-back, 1 read/write port
  Main Memory         4 GB, 300 cycle latency

Figure 8A shows that the intensity of MRAM-write operations is reduced dramatically by using the TFET-MRAM hybrid, while Figure 8B shows that the intensity of TFET read operations is also reduced. As a result, the dynamic power associated with write operations to MRAM cells is also reduced and the performance is improved due to the lower write penalty of the TFET. Due to the reduction of read operations to the TFET cells the performance can also be improved.

Figure 9 shows the performance comparison between the 2MB TFET cache, 8MB MRAM cache and TFET-MRAM hybrid cache. On an average, the hybrid cache structure improves the performance by 11% (5.2%) compared to the pure MRAM (TFET) counterparts, with a maximum performance improvement of 53% (21.5%) for the swim benchmark. The increased capacity has a significant impact on reducing misses for this application.

Figure 10 shows the power comparison. We observe that the total power of the hybrid scheme is reduced by 90% compared to the MRAM-only cache, since the TFET-MRAM hybrid is composed of a majority of low leakage TFET cache lines. However, the total power consumption of the TFET-MRAM hybrid L2 cache is on average 5X larger than that of the TFET L2 cache. The peripheral leakage of the MRAM cache ways is significant compared to the TFET array. This study motivates the need for future efforts that either redesign the peripheral circuitry with low-leakage devices or employ fine-grain turn off techniques for the peripherals, not just the cache lines. However, these are challenging tasks given that the significant leakage of the peripherals stems from the desire to keep write latencies low. An alternate approach will be to study the effect of reducing the data retention time of the MRAM cells in conjunction with peripheral leakage optimization to obtain more leakage-efficient structures.

Figure 10. The comparison of total power consumption among 2MB TFET cache, 8MB MRAM cache and TFET-MRAM hybrid cache (Normalized to power of 8MB pure MRAM cache).
5. CONCLUSION
This work has introduced architectural innovations that can be enabled by the use of NVM technologies. Our study also indicates that future systems can concurrently exploit the best features of multiple technologies in hybrid, heterogeneous memory structures. A great deal of promise for additional architectural innovations is envisioned as these technologies mature and more device/circuit knobs of these are exposed to architects. For example, an energy-efficient memory allocation scheme or implicit garbage collection may be possible by tuning the data-retention time of the NVM technologies.

6. ACKNOWLEDGMENTS
This work was supported in part by NSF grants 1028807, 0916887, and 0903432. We also acknowledge discussions with Bhuvan Urgaonkar for the context switch overhead section.

7. REFERENCES
[1] J. Howard et al., “A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS,” IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 108-109, 7-11 Feb. 2010.

[2] Tilera: TILE-GX Processor family. http://www.tilera.com/sites/default/files/productbriefs/PB025_TILE-Gx_Processor_A_v3.pdf

[3] W. Mueller et al., “Challenges for the DRAM cell scaling to 40nm,” IEEE International Electron Devices Meeting (IEDM) Technical Digest, pp. 4 pp.-339, 5-5 Dec. 2005.

[4] L. O'Brien Quarrie, “Single Event Effects of Accelerated Terrestrial Cosmic Rays on Ferroelectric RAM,” in Proc. of NASA Military/Aerospace Programmable Logic Device Conference, 2008.

[5] D.H. Yoon et al., “FREE-p: Protecting Non-Volatile Memory against both Hard and Soft Errors,” to appear in Proc. of Intl. Conference in High Performance Computer Architecture, Feb. 2011.

[6] Everspin: The 16 Mbit MRAM chip (http://www.mraminfo.com/everspin-introduces-new-16-mb-mram)

[7] S.A. Wolf, J. Lu, M.R. Stan, E. Chen, and D.M. Treger, “The Promise of Nanomagnetics and Spintronics for Future Logic and Universal Memory,” Proceedings of the IEEE, vol. 98, no. 12, pp. 2155-2168, Dec. 2010.

[8] X. Dong, N. Muralimanohar, N.P. Jouppi, R. Kaufmann, and Y. Xie, “Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems,” in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), ACM, New York, NY, USA.

[9] L. Torres, Y. Guillemenet, and S.Z. Ahmed, “A Dynamic Reconfigurable MRAM based FPGA,” in Proc. of International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), 2010, pp. 31-40.

[10] C.W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M.R. Stan, “Relaxing non-volatility for fast and energy-efficient STT-RAM caches,” to appear in Proc. of Intl. Conference in High Performance Computer Architecture, Feb. 2011.

[11] P. Mangalagiri et al., “A low-power phase change memory based hybrid cache architecture,” in Proc. of 18th ACM GLSVLSI, 2008.

[12] X. Wu et al., “Power and Performance of Read-Write Aware Hybrid Caches with Non-volatile Memories,” in Design Automation and Test in Europe (DATE), 2009.

[13] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, “A Novel Architecture of the 3D Stacked MRAM L2 Cache for CMPs,” HPCA, Feb. 2009.

[14] S. Mookerjea, D. Mohata, R. Krishnan, J. Singh, A. Vallett, A. Ali, T. Mayer, V. Narayanan, D. Schlom, A. Liu and S. Datta, “Experimental demonstration of 100nm Channel Length In0.53Ga0.47As based vertical inter-band tunnel field effect transistors (TFETs) for ultra low-power logic and sram applications,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2009, pp. 1-3.

[15] W.Y. Choi, B.G. Park, J.D. Lee and T.J.K. Liu, “Tunneling field-effect transistors (TFETs) with subthreshold swing (SS) less than 60 mV/dec,” IEEE Electron Device Letters, vol. 28, no. 8, pp. 743-745, 2007.

[16] S. Thoziyoor, N. Muralimanohar, J.H. Ahn, and N.P. Jouppi, “Cacti 5.1,” HP Labs, Tech. Rep., 2008.

[17] J. Singh, K. Ramakrishnan, S. Mookerjea, S. Datta, N. Vijaykrishnan and D. Pradhan, “A novel Si Tunnel FET based SRAM design for ultra low-power 0.3V Vdd applications,” in Proc. 15th Asia and South Pacific Design Automation Conf. (ASPDAC), 2010, pp. 181-186.