Persistent-memory awareness in operating systems

Università degli Studi di Milano Bicocca
Dipartimento di Informatica, Sistemistica e Comunicazione
Degree programme in Computer Science (Corso di laurea in Informatica)

Persistent-memory awareness in operating systems

Supervisor: Flavio De Paoli
Co-supervisor: Leonardo Mariani
Final dissertation of: Diego Ambrosini, student ID 031852

Academic Year 2013-2014

Persistent-memory awareness in operating systems. Copyright © 2015 Diego Ambrosini. Some rights reserved.

This work is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0), http://creativecommons.org/licenses/by/4.0/, except for the following items:

- Figure 1.3: courtesy of TDK Corporation, see [76]. © TDK Corporation, all rights reserved.
- Figure 1.4: courtesy of Fujitsu Ltd, see [71]. © Fujitsu Ltd, all rights reserved.
- Figure 1.5: courtesy of Everspin Technologies Inc, see [66]. © 2013 Everspin Technologies Inc, all rights reserved.
- Figure 1.6: on the left, Figure 1.6a, courtesy of American Chemical Society, see [120, page 241, Figure 1]. © 2010 American Chemical Society, all rights reserved. On the right, Figure 1.6b, © Ovonyx Inc, all rights reserved.
- Figure 1.7: courtesy of John Wiley & Sons Inc, see [137, page 2632, Figure 1]. © 2009 John Wiley & Sons Inc, all rights reserved.
- Figure 1.8: see [52, page 6946, Figure 1], licensed under a CC-BY agreement. © 2013 the Owner Societies, some rights reserved.
- Figure 2.4: see [60, page 3, Figure 1], licensed under a CC-BY agreement. © 2014 Oikawa, some rights reserved.
- Figure A.1: courtesy of McGraw-Hill Education, see [57, page 247, chapter 5, Figure 5.5]. © 2001 The McGraw-Hill Companies, all rights reserved.

Abstract

Persistence, in relation to computer memories, refers to the ability to retain data over time, without the need of any power supply. Until now, persistence has always been supplied by hard disks or Flash memory drives. Regarding memories, persistence has so far been conceived as a slow service, while volatility has been associated with speed, as it happens in DRAM and in SRAM. Such a dichotomy represents a bottleneck that is hard to bypass. The panorama of memory devices is changing: new memory technologies are currently being developed, and are expected to be ready for commercialization in the next years. These new technologies will offer features that represent a major qualitative change: these memories will be fast and persistent.

This work aims to understand how these new technologies will integrate into operating systems, and to which extent they have the potential to change their current design. Therefore, in order to gain this understanding, I have pursued these goals throughout the work:

- to analyze the economic and technological causes that are triggering these qualitative changes;
- to present the new technologies, along with a classification and a description of their features;
- to analyze the effects that these technologies might have on models that are currently used in the design of operating systems;
- to present and summarize both the opportunities and the potential issues that operating system designers will have to manage in order to conveniently use such new memory devices;
- to analyze the proposals found in the scientific literature to exploit these new technologies.

Following the structure of the title, the first chapter is focused mainly on memory devices, whereas the second chapter is centered on operating systems.


The first chapter initially tries to grasp the causes of the expected technological change, beginning with economic observations. Subsequently, the chapter contains some considerations about how different but complementary aspects of the economic relation are urging the semiconductor industry to find new memory technologies, able to satisfy the increasing demand for features and performance. Afterwards, the focus shifts to current technologies and their features. After a brief summary of each specific technology, a short description of the issues shared among all current charge-based technologies follows. Then, the reader finds a presentation of each of the new memory technologies, presented following the order of the ITRS taxonomy related to memory devices: first the ones in a prototypical development stage (MRAM, FeRAM, PCRAM), followed by those in an emerging development stage (Ferroelectric RAM, ReRAM, Mott Memory, macromolecular and molecular memory).

The second chapter aims, in its first part, to understand the extent to which current foundational models (the Von Neumann model and the memory hierarchy) are influenced by the new technologies. As long as the computational model (fetch-decode-execute) does not change, the validity of the Von Neumann model seems to hold. Conversely, as far as the memory hierarchy is concerned, the changes might be extensive: two new layers should be added near DRAM. After these considerations, some additional observations will be made about how persistence is just a technological property, and how a specific model would be necessary to make explicit how an operating system uses it. Afterwards, there will be a description of the use of non-volatile memory technologies such as Phase Change RAM inside fast SSDs. Even if this approach is quite traditional, the scientific literature explains how faster devices would require a deep restructuring of the I/O stack. Such a restructuring is required because the current I/O stack has been developed concentrating on functionality, not efficiency. Fast devices would instead require high efficiency.

The second chapter then presents the most appealing use of persistent memories: as storage class memory, either as a replacement for common DRAM or in tandem with DRAM on the same memory bus. This approach has per se a higher level of complexity, and under the umbrella of SCM there are many viable declinations of use. Firstly, some preliminary observations common to all the approaches are made. Then, two easier approaches are presented (no-change and Whole System Persistence). Finally, the approaches that aim to develop a persistent-memory aware operating system are introduced: most of them use the file system paradigm to exploit persistence in main memory. The work proceeds by first presenting some peculiarities of the current I/O path used in the Linux operating system, remarking how caching has already moved persistence into main memory; afterwards, some other considerations about consistency are made. Those observations are then used to understand the main differences between standard I/O and memory operations. After a brief presentation of some incomplete approaches proposed by the author, a framework to classify the thoroughness of the different approaches follows. The work continues by reporting the efforts of the Linux community and then introduces each specific approach found in the literature: Quill, BPFS, PRAMFS, PMFS, SCMFS. Concluding the part about file systems, there are some remarks about integration, a means to use both file system services and memory services from the same persistent memory. Finally, persistent-memory awareness in user applications is presented, along with a brief introduction of the two main proposals coming from two academic research groups.

Abstract (italiano)

The concept of persistence, with regard to memory technologies, refers to the ability to retain data even without any electrical power supply. Until today, it has been the exclusive prerogative of slow storage devices, such as hard disks and Flash memories. Persistence has always been imagined as an intrinsically slow capability, whereas volatility, typical of DRAM and SRAM memories, has always been associated with their speed. This dichotomy is still a limit that is difficult to work around. The memory landscape, however, is undergoing structural changes: new technologies are under development and the semiconductor industry plans to begin their commercialization in the coming years. These new devices will have characteristics that represent a significant qualitative change with respect to current technologies: the most significant difference is that these memories will be fast and persistent.

This study proposes an analysis of how such new technologies may be integrated into operating systems, and of how large the repercussions on their design may be. The following topics are therefore addressed:

- an analysis of the economic and technological causes of these changes;
- a presentation of each of the new technologies, together with a classification and a brief evaluation of their characteristics;
- an analysis of the effects that these new memories may have on the main models used in the development of operating systems;
- an overview of the opportunities and of the potential problems that operating system developers should take into account in order to make the best use of such technologies;
- a review of the various proposals found in the literature to make the best use of the new persistent memories.

Closely following the title, the first part of the work focuses mainly on the new persistent-memory devices, whereas the second part focuses on operating systems. The first chapter examines the causes of these technological changes, starting from some considerations of an economic nature; it then shows how the different (though complementary) needs of semiconductor producers and of their consumers are progressively pushing research towards new memory technologies capable of satisfying the ever-growing demands for performance and functionality. Attention then moves to current memories and their characteristics. After a brief description of each technology, a short analysis of some of the problems common to all charge-based memory technologies is carried out. The new memories are then presented, following the order proposed by the ITRS (International Technology Roadmap for Semiconductors) taxonomy: first the devices whose production has already begun but whose maturity is still initial (MRAM, FeRAM, PCRAM), then the devices whose development is still in its early stages (Ferroelectric RAM, ReRAM, Mott Memory, macromolecular and molecular memories).

The second chapter, in its first part, addresses the question of whether such memories may undermine the validity of some models that are fundamental for the development of operating systems, such as the Von Neumann machine model and the memory hierarchy. It is pointed out that the validity of the Von Neumann model remains unchanged. However, it is shown how these memories bring important changes to the current memory hierarchy, which would see the addition of two new levels below the DRAM level. After these evaluations, further observations are proposed about the very concept of persistence: it is pointed out that it is essentially a property of some technologies, and that a model is needed that makes explicit how the operating system intends to use it. Subsequently, a description is given of how persistent memories (for example PCRAM) can be employed to build faster SSDs. Although such an approach is rather conservative, it is shown how a similar solution requires a deep modification of the mechanisms used to perform I/O operations. Current I/O management has in fact focused on offering numerous functionalities, while its efficiency has not been cared for over time: this new kind of memory, however, requires highly efficient software.

The second chapter proceeds by presenting the most interesting mode of use of the new persistent memories: on the memory bus, as a replacement for common DRAM or alongside it. Such an approach has a higher degree of complexity and can be declined in many different modes of use. Some preliminary observations, common to all the approaches, are made; then the simplest ones are presented (no-change and Whole System Persistence). Afterwards, the approaches that require modifying the operating system to achieve real awareness of persistent memories are introduced: most of them exploit the file system paradigm to achieve this goal. Some details of current I/O management in the Linux environment are presented, underlining how, through caching, persistence has already moved from slow devices to main memory; further considerations are then made about data consistency. These observations are then used to understand the main differences between I/O operations and memory operations, and to effectively approach the modifications to the operating system. After a brief presentation of some non-consolidated solutions, evaluation criteria are introduced to understand the effectiveness and depth of the approaches analyzed afterwards. The work continues with the presentation of the proposals documented both by the Linux developer community and by the scientific literature: Quill, BPFS, PRAMFS, PMFS, SCMFS. Concluding the part concerning operating systems, some observations are made on the concept of integration, that is, a method to allow the shared use of persistent memories by the kernel and by the file system. Finally, the work closes by touching on the topic of persistence awareness in applications, in order to evaluate the proposal of allowing applications direct use of persistent memories.

Contents

Copyright notes
Abstract
Abstract (italiano)
Contents
List of Figures
List of Tables
Glossary
Introduction

1 Technology
  1.1 Generic issues
    1.1.1 An economical view
    1.1.2 A technological view
  1.2 Technology - the present
    1.2.1 Mechanical devices
    1.2.2 Charge-based devices
    1.2.3 Limits of charge-based devices
  1.3 Technology - the future
    1.3.1 Prototypical
    1.3.2 Emerging
    1.3.3 From the memory cell to memories

2 Operating Systems
  2.1 Reference models
    2.1.1 The Von Neumann machine
    2.1.2 The memory and the memory hierarchy
    2.1.3 A dynamic view in time
    2.1.4 Viable architectures
  2.2 Fast SSDs
    2.2.1 Preliminary design choices
    2.2.2 Impact of software I/O stack
  2.3 Storage Class Memory: operating systems
    2.3.1 Preliminary observations
    2.3.2 No changes into operating system
    2.3.3 Whole System Persistence
    2.3.4 Persistence awareness in the operating system
    2.3.5 Adapting current file systems
    2.3.6 Persistent-memory file systems
    2.3.7 Further steps
  2.4 Storage Class Memory and applications

Conclusions

A Asides
  A.1 General
  A.2 Physics and Semiconductors
  A.3 Operating systems

B Tables

Bibliography

Acknowledgments

List of Figures

1.1 The ubiquitous memory hierarchy
1.2 ITRS Memory Taxonomy
1.3 The Flash memory cell
1.4 Ferroelectric crystal bistable behavior
1.5 Magnetic Tunnel Junction
1.6 Phase Change memory cell
1.7 RRAM memory cell
1.8 ElectroChemical Metallization switching process
2.1 The Von Neumann model
2.2 The memory hierarchy with hints
2.3 A new memory hierarchy
2.4 The Linux I/O path
A.1 Field effect transistor

List of Tables

B.1 Top 10 technology trends for 2015
B.2 Performance comparison between memories
B.3 4K transfer times with PCM and other memories
B.4 Bus latency comparison
B.5 HDD speed vs bus theoretical speed
B.6 Persistence awareness through file systems

Glossary

ACID: Atomicity, Consistency, Isolation, Durability
ATA: AT Attachment (I/O interface standard)
BE: Bottom Electrode
BIOS: Basic Input Output System
BTRFS: B-tree File System
CB: Conductive Bridge
CBRAM: Conductive Bridge RAM
DAX: Direct Access and XIP
DDR: Double Data Rate (DRAM technology)
ECC: Error Correcting Code
FLOPS: Floating Point Operations Per Second
FTL: Flash Translation Layer
FUSE: File system in USEr space
GPU: Graphics Processing Unit
HMC: Hybrid Memory Cube
HRS: High Resistance State
L1: Level 1 cache
L2: Level 2 cache
L3: Level 3 cache
LRS: Low Resistance State
LVM: Logical Volume Manager
MFMIS: Metal-Ferroelectric-Metal-Insulator-Semiconductor
MIM: Metal-Insulator-Metal
MTD: Memory Technology Device (Linux subsystem)
NVM: Non-Volatile Memory
PATA: Parallel ATA
PCI: Peripheral Component Interconnect (I/O interface standard)
PMC: Programmable Metallization Cell
RAMFS: RAM File System
RS: Resistive Switching
SATA: Serial ATA
STT-MRAM: Spin-Transfer Torque Magnetic RAM
TCM: ThermoChemical Mechanism
TE: Top Electrode
TER: Tunnel Electro-Resistance
TLB: Translation Look-aside Buffer
TMPFS: TEMPorary File System
VFS: Virtual File System
WAFL: Write Anywhere File Layout (file system)
XFS: X File System
XIP: eXecute-In-Place
ZFS: Z File System

Introduction

Computers are programmed to execute tasks, somehow manipulating data: they should have a means for retrieving such data, manipulating it, and finally storing it back when tasks are completed. Similarly to our human experience, the devices used in computers to store and retrieve data are called memories. Although memory naturaliter refers to the ability to remember data in time, computer science distinguishes between persistent and volatile memories, depending on the length of that time: in fact, some memory devices are defined as volatile, whereas some others are defined as persistent. The former class of devices cannot store any data permanently between power-off events, and hence that data gets lost when power is off. Conversely, memories belonging to the latter class feature the ability to remember the stored data in time, regardless of the power status: in these memory devices data persists in time (in practice, the parameter used to certify a memory as persistent is the ability to store data correctly for at least ten years).

Persistence and volatility are just properties of each specific memory technology, and have always been present in computing devices: for example, punched cards in the 50s offered persistence. Today, it is supplied by hard disks and SSDs. Conversely, volatility has always been present in the main memory. This scenario has remained almost unchanged throughout the years: since volatile memories were fast and expensive, whereas persistent ones were slow and cheap, persistence has always been relegated to I/O devices.

This work finds its raison d'être in the fact that the semiconductor industry is preparing to produce and sell new memory technologies whose features are much different from those seen until now. In time, this particular industry has continued to enhance its offer, producing memories gradually better from one generation to the next: it extensively took advantage of the benefits offered by technology scaling, reaching a continuous increase both in memory density (density, when referring to memory devices, is the quantity of bits per area achieved by a given technology: without changing the area consumed, a better density means a higher capacity) and in performance. Despite these enhancements, however, the fundamental features of memories have remained almost the same throughout the years: changes were almost always quantitative. The new technologies promise instead to be both fast and persistent, giving engineers the choice to use them in I/O devices or on the memory bus: either way, such memories would effectively represent a major qualitative change. Henceforth, the term persistent memory refers to those technologies that are able to offer persistence in main memory. Usually, such memories are also referred to as non-volatile memories (NVM). The first chapter of this work will focus on the specificities of these new devices.

Operating systems are developed to use computers conveniently and effectively, and are carefully designed to make the best use of each feature offered by their hardware. Volatility of the main memory is probably one of the most important assumptions that have always influenced the development of operating systems. Since these new technologies could move persistence into main memory, scientists and researchers are trying to understand which aspects of operating systems would need to be modified in order to adapt conveniently to such a major change. These modifications would then lead to persistence-aware operating systems: the efforts currently being made by the scientific community to manage this transition are the subject of the second chapter.

Chapter 1

Technology

Before analyzing computing models and operating system issues, it is useful to present the technological changes that are expected: this first part aims to describe the memory technologies used in current computing systems and presents those that we could use in the future.

1.1 Generic issues

Here follow some general considerations about the economic and technical aspects that can be useful to gain a better understanding of the peculiarities of both the current memory technologies and the ones that are currently competing to become those used tomorrow. The purpose of these observations is to share with the reader a sort of framework that can help to evaluate the causes that are leading to technological changes in memory devices and to understand the expectations placed on them.

1.1.1 An economical view

Throughout the world, each economic activity produces and distributes goods and services that are then sold to consumers (offer side). In turn, human needs, driving the demand for the goods produced by businesses, are the engine of each economic activity (demand side). The same relation exists in the semiconductor market: computing is realized by the semiconductor industry, in turn embodied by a myriad of firms that compete to survive, earn money and reach a leading position on the market (offer side). Consumers then buy semiconductor devices that satisfy their needs (demand side). With the intent of gaining a better insight into the reasons that are triggering a qualitative change in the memory devices panorama, I will focus briefly on some aspects of both the offer and the demand side of this tight relation.

Offer side - The need to pursue Moore's exponential prediction

When talking about trends in the semiconductor industry, the most frequently cited one is usually expressed as Moore's law (i.e. the number of transistors in integrated circuits grows exponentially). I will follow this tradition, believing that its use still offers a useful insight into the computer industry itself. To be exact, it has to be said that the semiconductor industry is trying to update Moore's law with a so-called "More than Moore" forecast [5]. Nonetheless, the More than Moore approach is still exponential, albeit in an equivalent form.

Moore's law, asserting the exponential growth of computational power, is an economic conjecture, not strictly an economic law. Moore rooted his thoughts in his experience in the semiconductor industry, and in the fundamental principle according to which each economic activity has the primary goal of maximizing its profitability. He observed how, in the semiconductor industry, every two years, its maximum-profitability point coincided with both:

- a doubling of the number of transistors in integrated circuits;
- a corresponding price fall of each transistor (cost per function).

This double result, which started from the inception of the semiconductor industry, has continued to occur until today.
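As a compact restatement of the first point (my own illustration, not a formula taken from the thesis or from the cited sources), the transistor count N of an integrated circuit under a two-year doubling period can be written as

    N(t) = N_0 \cdot 2^{(t - t_0)/2}

where N_0 is the count observed in a reference year t_0 and t is expressed in years, while the cost per function falls along the corresponding inverse trend.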

The former permitted the exponential increase of computational power, whereas the latter allowed computational resources to be transformed from a rare commodity into a widespread consumer good. From a business perspective, throughout the years the pursuit of Moore's trend has guaranteed the computer industry high revenues and the much desired maximization of profits. This has been possible because Moore founded his thoughts precisely on the profitability of the semiconductor industry [107]. Quite roughly, industry must optimize profitability, thus industry must follow Moore's law, as long as this result is achievable. This point is much discussed by analysts: achievability is a great concern.

A classic question is whether this exponential trend can continue also in the future: despite the use of the word "law", this exponential growth is not a guarantee; it is instead the result of continuous research efforts, technological advancements, accumulated know-how, and so on. Until now, a series of technologies has guaranteed to the computer industry the achievability of the exponential growth, but the question of whether in the future some other new technologies will permit the same pace is currently open. These concerns are not new: throughout the years, this question has been raised many times. Despite the concerns, Moore, and many other analysts afterwards, have been right to forecast the continuation of the trend until now. Even if Moore himself admits that "no exponential is forever", he explains that the computer industry is trying to delay its end forever [54]. As a consequence, throughout the years, the lifecycle of different technologies has been carefully managed to permit this continuous delay. As depicted in reports from the International Technology Roadmap for Semiconductors (ITRS, see section A.1), the semiconductor industry currently has the firm expectation of still being able to delay the end of exponential growth for many years: they expect that equivalent Moore will hold at least up to an impressive 2025 [83]. Through the years, achievability and delay were made possible by using two complementary strategies:

- lengthen as much as possible the life of current technologies (at least as long as production is profitable);
- promote research efforts to design, develop and finally produce next-generation technologies, in order to be ready to step to a better technology when current ones become obsolete or less profitable.

These two goals are the two alternative approaches that inspire each research effort made in universities, research laboratories, and industries. Since the 50s, for example, the former strategy has permitted an incredible reduction of the size of transistors (i.e. scaling: passing from one technology node to a smaller one, reducing the feature size), reaching today that of just a few nanometers. Usually the latter strategy, surely more challenging, does produce new technologies: as an example, hard disks during the 50s and Flash memories during the 80s were the offsprings of this strategy.
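To make the effect of scaling concrete (a back-of-the-envelope relation of my own, consistent with the cell areas expressed in F^2 units later in this chapter): a cell layout occupies a roughly fixed number of squared feature sizes, so shrinking the feature size F shrinks the cell area quadratically, and the achievable density per unit area, and with it the cost per function, improves accordingly:

    \text{density} \;\propto\; \frac{1}{F^{2}}

This is why moving to a smaller technology node has historically been the most direct way to keep following the doubling trend.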

Demand side - A selection among current trends and needs

As the offer of the industry must comply with the needs of the customers in order to be sold, a brief analysis of the current needs of semiconductor customers is valuable. Often, these needs profoundly influence the offer, forcing producers to adapt quickly to new challenges. As depicted in table B.1, current technology trends are focusing principally on the following areas:

Computing everywhere: the price of electronic devices is continuously decreasing (following the cost per function), thus facilitating their spread. Without lingering too much on the important role of computing in almost every human activity, the use of computing resources is spreading further: today portable devices like smartphones and tablets are fully fledged computers, just as laptops or traditional workstations. Wearable devices are just another step in the direction of computing everywhere. Another aspect of the same trend is the spread of smart logic into a plethora of simpler devices: washing machines, alarm systems, thermostats, home automation systems, TVs, and many other appliances used every day by millions of people. Even in much smaller devices computing is being offered as a standard feature: smart cards, networks of sensors, RFID devices, microdevices, all have some degree of computational power. One of the causes of such a widespread use of computing resources lies in a simple observation: if millions of transistors build up a CPU having both a relevant computational power and a relevant price, a few transistors can be used to build far simpler electronic devices still having (reduced) computational power at a much lower price. This in turn permits industries to select the needed balance between cost and functionality of their products.

Internet of Anything: this point follows the preceding one, as people need connectivity along with computing: without it, in a world where information and interaction are almost in real time, computing resources would be useless. So, just as computing resources are spreading across an infinite number of devices, the same devices are increasingly becoming able to connect to various networks. Focusing on the Internet, analysts are expecting an exponential growth of the devices using it: from the smallest devices to the biggest data centers, connectivity to the Internet is fundamental. As a consequence of this ceaseless growth, both the network traffic and the overall amount of data produced by each device increase.

Data center growth continues: just as the falling cost per function facilitates and urges the spread of cheap and ubiquitous devices, the same cost reduction permits the concentration of bigger masses of computing power: this is the case of data centers, like the ones currently built and used by companies such as Microsoft, Google, Amazon and Facebook to provide cloud services to their customers. Relating to data centers, analysts are expecting both:

- growth in the size of big data centers, reflecting the increasing use of data center services (XaaS, i.e. Anything as a Service, patterns, social networking, hybrid cloud patterns, and so on);
- growth in the number of big data centers, as businesses are increasingly using co-location [68].

Some scientific, academic, and government institutions are trying to build exascale-level supercomputers [12, 67, 70] in order to be able to solve huge computing tasks (i.e. simulation and modeling). An exascale supercomputer would be able to reach at least one exa-FLOPS of performance. Such efforts, too, go in the same direction of more computing power and more storage volume in data centers. Finally, Big Data also falls into this category of trends and is somehow related both to bigger data centers and to exascale computing. The Computer Desktop Encyclopedia refers to it as "the massive amounts of data collected over time that are difficult to analyze and handle using common database management tools. The data are analyzed for marketing trends in business as well as in the fields of manufacturing, medicine and science. The types of data include business transactions, e-mail messages, photos, surveillance videos, activity logs and unstructured text from blogs and social media, as well as the huge amounts of data that can be collected from sensors of all varieties" [65]. Massive amounts of data need huge databases and huge data storage, such as those found only in big data centers. Similarly, the trends about predictive analytics and context-rich systems, taken from table B.1, fall into this category too.

Security and safety concerns increase: this is both a current trend in technology and a consequence of the points just described. Since:

- computing devices are spreading;
- connectivity capabilities are spreading in every area where computing devices are used;
- the fields of application of such computing devices are increasing, touching very sensitive ones, such as those related to human health and medical science;
- each computing device generates data, and the whole amount of data generated each year is drastically increasing;
- the use of connected services is increasing, with the effect of placing a huge volume of data on the cloud,

the efforts to protect such devices and their data will be significant, as will those to engineer the safest ways to use them.

The ones described above are some of the major technology trends currently noticed by analysts. Anyway, since each trend is ultimately a specific pattern of use of bare hardware resources, those trends must translate into more specific requests placed on the semiconductor industry: in the end, the semiconductor industry produces transistors and integrated circuits. Given the trends just seen, the requests currently focus on these areas:

Speed: people are demanding an ever-increasing amount of information retrieved in real time. The use of web search engines, social networks and cloud computing platforms is extensive: users expect their queries to be answered extremely fast. In order to live up to these expectations, information technology businesses need technologies allowing extreme speed. People expect to use fast personal devices too (laptops, tablets, smartphones, and so on).

Computational power and parallelism: people demand computational power, not only speed. Tasks performed by modern smartphones increase continuously both in number and in complexity: hardware must have enough computational power to fulfill every request. Moreover, people expect many tasks to be executed concurrently, hence further increasing the demand for computing performance. Data centers too are no different in this respect: they are asked to solve many concurrent requests of continuously increasing complexity.

Power efficiency: while until a few years ago this matter was not of primary importance, nowadays it is indeed pivotal. Power efficiency is fundamental both in the domain of small devices and in that of the biggest installations. While it is simple to understand the need for power efficiency in a smartphone or (infinitely more) in a modern pacemaker, this issue also arises in relation to data centers. As an example, one of the Google data centers is adjacent to a power plant and its use of electrical power totals the huge figure of 75 MW [29]. Moreover, it is reported that data centers account for 1.3% of all electricity use worldwide and for 2% in the U.S. As data centers are expected to increase both in number and in size, this issue increases further in its significance.

Expectations on memories

As this work is focused on memories, it is important to remark how each of these requests directly influences the features that memories should have in order to satisfy the aforementioned needs. I have referred until now to those needs mainly as something related to computing devices, treated as a whole. However, while the effective use of each device has its own field of employment, each of them performs its specific job just by executing computations on some data: they differ according to the properties of the data upon which computations are made (i.e. the data managed by a phone is voice, whereas the data managed by an Internet router is IP packets). Data is indeed the object of each computation. While theoretically the simplest devices (for example, sensors) could just manipulate simple data and transmit it without the need to store and retrieve it, memory is nonetheless found in almost all computing devices. Since most of them must retrieve and store data in order to effectively perform a computation, the speed of each retrieval and each store represents an upper limit on the total speed of a computation. This observation evidences how the relation between computations and memory (as a whole, for the time being without making any distinction) is strict: as the market asks for speed, this request naturally reflects on memories. The same happens with the other requests: both those related to computational power and to power efficiency naturally reflect on memories.

A further fundamental feature considered when evaluating memories is density: since their purpose is the maintenance of data through time, a critical aspect is how much data can be contained in each memory chip. Given the technology trends just presented, this issue is expected to increase in its centrality: increasingly complex computations need to manage increasing quantities of data. The pervasive and increasing presence of computing devices also generates incredibly high amounts of data, increasing the mass of data that potentially should be stored somewhere. Bigger and bigger data centers, along with exascale supercomputing, need to manage, store and retrieve huge amounts of data too. Big Data intrinsically refers to the need to manage a huge and ever-increasing amount of data. All these requirements urge the semiconductor industry to develop denser memories to manage this growth.

The need for persistence

A further remark pertains to persistence: as persistent memories represent the triggering reason of this work, the question of whether the semiconductor industry or the market are demanding persistent memories arises naturally. Persistent-memory devices would be just perfect to employ as a storage medium. Until now, storage has always deeply suffered from the classical dichotomy fast-volatile and slow-persistent: storage has always been slow. Moreover, computers, sooner or later, must use a storage layer to permanently save the data that they manipulate. At this exact point, as computing devices need to use storage, computations pay a high price in latency: storage slowness represents an upper limit on the performance of storage-related operations. Referring to hard disks, while their capacities have increased more than 10,000-fold since the 80s, seek time and rotational latency have improved only by a factor of two, and this trend is expected to continue [112, 115]. So, as the need for data storage increases, the problems related to storage speed are expected to increase too. Fast and persistent memories would thus give the opportunity to overcome these limitations, permitting storage to become, at last, extremely fast: this achievement would represent a major innovation in many computing areas. As an example, a likely area of use among many would be data centers, where caching servers are extensively used to speed up data retrieval from storage. Persistent memory would represent at least a big simplification opportunity, as the caching server would become unnecessary: in turn, this simplification would represent an important opportunity for savings in cycles and energy and, ultimately, in money. From an engineering standpoint, moreover, persistent memories would represent a useful element of choice for engineers, whereas today there is no choice: either speed or persistence. Consequently, they would have the opportunity to build devices better suited to the high and frequently changing needs of the market. A last observation pertains to reality: as a matter of fact, most of the technologies that are currently under extensive research are persistent.

Memories - demand and offer

While the semiconductor industry is currently producing high volumes of DRAM and Flash memories, as well as high volumes of hard disks, these industries are nonetheless steadily preparing a transition towards new technologies: the current ones suffer from some limitations that are already challenging. Analysts and the semiconductor industry itself expect that in the next years those limitations will become overwhelming, eventually frustrating both the offer and the demand side of the economic relation:

The offer: current memory technologies have always benefited from technology scaling: it suffices to think how long technologies such as hard disks (1956) and DRAM (1968) have been on the market. However, current memory technologies are reaching a point where scaling alone is no longer easily achievable: while the reasons will be unfolded subsequently, the fact is that the semiconductor industry expects that such technologies (DRAM, Flash, hard disks) will become increasingly difficult to produce and to enhance, thus losing appeal.

The demand: current memory technologies are either fast and not dense, or dense and not fast. This fact, although frustrating, has always been perceived as a matter of fact. However, since the market is increasingly demanding both speed and density, current technologies are becoming increasingly unfit to fulfill such requirements. Moreover, even considering an increase in only speed or only density, this goal too is becoming increasingly hard to reach. The issue of power efficiency is only loosely considered in current technologies: unfortunately, each of them is power hungry. As an example, DRAM currently accounts for 30% of the total power consumption in a data center [30].

Summarizing, since current technologies are expected to reach their limits soon, the semiconductor industry is currently searching for new memory technologies that would allow both the continuation of the exponential growth and of the scaling trend for a long time. Those technologies, which will subsequently be referred to as prototypical and emerging, have the potential to succeed in this fundamental goal. Among them, some will prevail against others; some others may never be produced, whereas others will eventually succeed and become mainstream. In the case of a successful technology, anyway, that technology will have assured both the maximum achievable profitability to the producing businesses (as a mix of right timing, ease of production, low costs, minimum cost per function, good know-how, etc.) and the fulfillment of the requests coming from memory consumers, be they individuals or organizations.

1.1.2 A technological view

Leaving the economic considerations behind, here are introduced the taxonomies used to present current and future technologies, as well as the technical parameters used to present the peculiarities of each specific technology.

From a hypothetical perfect memory to the memory hierarchy

As it happens in any scientific field, a given resource is evaluated by measuring the score of some evaluating parameters; thus, if there were a perfect resource, it would maximize the score of each feature as if they were free variables. However, in the real world, most of the time some of the features are not free variables but are instead dependent on each other: improvements in some of the variables often come at the cost of some other variable. Since no perfect resource exists, the same happens with memories; if, however, a perfect memory existed, it would maximize each of the following features:

- quantity;
- speed;
- low cost;
- low power consumption;
- data retention;
- write endurance.

In real memories, however, some of these features (especially quantity, speed and cost) are mutually dependent and one is usually increased at the cost of the others: fast memories are expensive and slow memories are cheaper, and the quantity depends on the compromise between speed and cost. The reason for this correlation lies in the fact that technologies that focus on speed consume a bigger chip area than those that focus on quantity. In turn, this influences data density per area: slower technologies achieve better data density than faster ones, thus achieving a better cost per bit.

A modern computer (as well as a modern data center) can run and solve the same class of problems as a Turing machine [36]. In Turing machines memory exists in the form of an infinitely long tape containing infinite cells. Even if memory in that model is very simple, it is indeed infinite. Somewhat relating to that model, also in modern computing systems, no matter their size, memory (as a whole, not referring to the memory concept of the Von Neumann model) can be thought of as infinite. For example, disks, tapes and DVDs can be indefinitely added, switched and removed, effectively, although indirectly, creating an infinite memory. The memory hierarchy, as presented in figure 1.1, visually represents the consequence of the issues just presented: needing an infinite memory and having a limited amount of money, computers are necessarily engineered to use a few fast memories and a lot of slower but cheaper memories, with the intent of maximizing both performance and capacity while minimizing costs.

Figure 1.1: The ubiquitous memory hierarchy
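The tradeoff captured by the pyramid can be made tangible with a small back-of-the-envelope model (a sketch of my own with invented placeholder figures, not measurements from this work): a small, fast, expensive layer in front of a large, slow, cheap one keeps the total cost low, but the slow layer dominates the average access time as soon as even a few percent of the accesses miss the fast layer, which is precisely the bottleneck discussed in this chapter.

    # Illustrative two-level hierarchy model; all figures are placeholders,
    # not values measured in the thesis or taken from any datasheet.
    def average_access_time(hit_ratio, fast_latency_ns, slow_latency_ns):
        """Average latency seen by one access to the hierarchy."""
        return hit_ratio * fast_latency_ns + (1 - hit_ratio) * slow_latency_ns

    def total_cost(fast_gb, slow_gb, fast_cost_per_gb, slow_cost_per_gb):
        """Total price of building the two layers."""
        return fast_gb * fast_cost_per_gb + slow_gb * slow_cost_per_gb

    # Hypothetical setup: 8 GB of a 100 ns layer over 1000 GB of a 10 ms layer.
    amat_ns = average_access_time(0.95, 100, 10_000_000)
    cost = total_cost(8, 1000, fast_cost_per_gb=10.0, slow_cost_per_gb=0.05)
    print(f"average access time ~{amat_ns:,.0f} ns, cost ~{cost:.0f} units")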

Current memory technologies will be presented subsequently following the structure of figure 1.1, starting from the base of the pyramid and gradually reaching the top. Differently, the new ones will be presented without referring to the memory hierarchy pyramid: the taxonomy drafted by the ITRS will be used instead, as represented in figure 1.2, believing that it is more helpful to classify the technologies that will soon be presented.

Figure 1.2: ITRS Memory Taxonomy

The memory cell and its performance parameters

Electronic technologies that do not use mechanical moving parts are commonly defined as solid-state technologies. In the context of solid-state memory technologies a key concept is that of a memory cell. A memory cell is the smallest functional unit of a memory used to access, store and change the information bit, encoded as a zero or a one. Each memory cell contains:

- The storage medium and its switching mechanics: where the bit is encoded and the mechanism to execute the switch between 0 and 1.
- The access mechanism: the mechanism to select the correct memory cell.

This concept could also be used, although less usefully, in the case of mechanical technologies: in this context, the definition of memory cell refers to the storage medium only, since in mechanical technologies usually both the switching and the access mechanics are shared among the whole set of memory cells.

Referring thus to solid-state memory technologies, each different technology has a specific memory cell, with specific performances. The parameters usually measured to compare a specific technology with the others are:

- feature size F (length: µm or nm);
- cell area (measured in F^2);
- read latency (time: µs or ns);
- write latency (time: µs or ns);
- write endurance (scalar: maximum write cycles);
- data retention (time);
- write voltage (V);
- read voltage (V);
- write energy (pJ/bit or fJ/bit);
- production process (CMOS, SOI, others);
- configuration (3-terminal or 2-terminal);
- scalability.
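As an aside, these parameters lend themselves to a simple programmatic comparison; the sketch below is my own illustration with invented placeholder numbers (not figures from table B.2 or from any datasheet) and only shows one way such records could be ranked.

    # Placeholder records: field names follow the list above; the numbers are
    # invented for illustration and do not describe any real device.
    from dataclasses import dataclass

    @dataclass
    class MemoryCellSpec:
        name: str
        feature_size_nm: float        # F
        cell_area_f2: float           # cell area in F^2 units
        read_latency_ns: float
        write_latency_ns: float
        write_endurance_cycles: float
        retention_years: float
        write_energy_pj_per_bit: float
        terminals: int                # 2- or 3-terminal configuration

    candidates = [
        MemoryCellSpec("volatile-DRAM-like", 20, 6, 10, 10, 1e16, 0, 0.005, 3),
        MemoryCellSpec("persistent-PCRAM-like", 20, 4, 50, 500, 1e8, 10, 10.0, 2),
    ]

    # Example query: the densest candidate that retains data for >= 10 years.
    persistent = [c for c in candidates if c.retention_years >= 10]
    print(min(persistent, key=lambda c: c.cell_area_f2).name)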

Inside the memory cell

Despite the fact that each specific memory technology has its own memory cell details, different technologies can share similar approaches in some engineering aspects: the following insights could be useful to the reader in order to better follow the terms and the descriptions of each specific technology found subsequently.

The storage unit and the write/read logic: the storage unit is responsible for the effective data storage and retention. Each different technology defines, along with the storage unit, the mechanics (and the logic) to write and to read the data. There are two fundamental methods used to store data in solid-state memory technologies:

- As 3-terminal storage units: this approach uses modified Field Effect Transistors (FETs, which have a source, a drain and a gate electrode; see section A.2) to store data. Data is stored by modifying (raising or lowering) the threshold voltage of the transistor, thus influencing the current passage between the source and drain electrodes, while the potential difference applied to the gate electrode is kept fixed. In such devices, reading is usually performed by sensing the current flow at the drain electrode, applying a potential difference to both the source and the control gate: depending on the value of the threshold voltage previously set (and thus depending on the value of the bit stored), current flow is permitted or prevented.

- As 2-terminal storage units: technologies using this approach usually build each storage unit as a stack of one or more different materials enclosed between two electrodes (terminals). Storage units built this way usually have a resistive approach: data is read by sensing whether a probe current passes (if SLC) or how much current passes (if MLC) between the two electrodes. The writing mechanics, however, depend on the specific technology: some technologies must execute the write process using additional logic (i.e. standard MRAM, see section 1.3.1), whereas some newer ones feature the writing mechanics directly embedded into the storage units (i.e. all ReRAM technologies, see section 1.3.2). In particular, this last class of newer devices presents a configuration similar to that of fundamental electric devices like resistors, capacitors and inductors, all featuring the 2-terminal approach, and they are sometimes referred to as memristors (see section A.2). Sometimes, a specific physical property can be used to implement devices in both the 2-terminal and the 3-terminal configurations, as is the case of memory cells built using ferroelectric properties (FeFETs: 3-terminal, FTJs: 2-terminal).
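To illustrate the resistive read-out just described (a deliberately simplified sketch of my own, not a sensing circuit from the cited literature): an SLC cell only needs the sensed resistance to be compared against one threshold separating the low- and high-resistance states, whereas an MLC cell is compared against several thresholds to recover more than one bit per cell.

    # Simplified sense logic for a 2-terminal resistive storage unit.
    # Threshold values are arbitrary placeholders (in ohms); real devices
    # calibrate them per technology, and the 0/1 mapping is a convention.
    def read_slc(resistance_ohm, threshold_ohm=50_000):
        """1-bit cell: low-resistance state -> 1, high-resistance state -> 0."""
        return 1 if resistance_ohm < threshold_ohm else 0

    def read_mlc(resistance_ohm, thresholds_ohm=(20_000, 60_000, 120_000)):
        """2-bit cell: three thresholds split the range into four levels."""
        return sum(1 for t in thresholds_ohm if resistance_ohm >= t)  # 0..3

    print(read_slc(10_000), read_mlc(80_000))  # -> 1 2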

The access control: a memory uses memory cells as building blocks, but there must be a way to select single memory cells in order to execute a read or a write. Usually, two alternative approaches can be used:

- Active matrix: a transistor is used to access the storage unit. These technologies are usually referred to as 1T-XX technologies, where 1T stands for 1 transistor (the controlling transistor) and the XX part depends on the specific technology. This is the case for DRAM, a 1T-1C technology (C stands for capacitor);
- Passive matrix (crossbar): the storage unit is accessed almost directly, with at most the indirection of a non-linear element, used to avoid half-select problems.

Destructive vs non-destructive reads: some technologies suffer from the destruction of the data contained in memory cells when a read is performed: these types of read operations are called destructive reads. Such technologies usually have additional logic for executing a re-write of the same data after the read operation to prevent data loss. Obviously this is an issue that engineers would avoid where possible, thus preferring those technologies whose read operations are non-destructive. Needless to say, writes are intrinsically always destructive.

1.2 Technology - the present

The memory technologies that are currently mainstream and can be used by engineers in computing devices will now be presented. For each description the reader finds a brief explanation of the specificities of each technology, as well as a short dissertation about the claims of the scientific community about their limitations.

1.2.1 Mechanical devices

Magnetic tapes, CD-ROM, DVD, Blu-ray Disc

Even if not very interesting in the context of this work, it is worthwhile to cite the presence of these devices, as they represent the lower part of the memory hierarchy. These devices are specifically engineered to store data permanently through I/O operations at the lowest possible cost. At this level of the memory hierarchy, performance is not as important as storage capacity.

Hard disks

Data is stored persistently in a magnetic layer applied on the surface of one or many rotating discs. This technology is similar, mutatis mutandis, to the one used in vinyl music discs: a moving head follows concentric rings (tracks) to read the stored data. Data is retrieved (or written) by the head sensing the magnetic field coded into the magnetic layer. Data transfers are I/O bound, data is stored and accessed in blocks, and reading is not destructive. Even if the first hard disks appeared in 1956 (IBM RAMAC [79]), this technology is still alive and vital: it offers high storage density, a long data retention period, and a low price per bit. Hard disks are often equipped with some amount of cache memory, needed to raise their performance. This same effort to improve the performance of hard disks has led to Hybrid Hard Disks, i.e. hard disks with a fast-performing Flash cache [42]. These products attempt to approach Flash-like performance at the price of a common hard disk.

The memory hierarchy pyramid describes at a glance both the advantages of hard disks and their most noticeable shortcoming: high density at the cost of slow speed. Despite the age of this technology, it seems that scaling is still viable, as the densities per square inch are still increasing: while current ones are about one TB per platter, newer technologies are promising to be even higher [97, 88]. Unfortunately, hard disks have always suffered from slow transfer rates and very high (milliseconds) latencies; moreover, these undesirable aspects cannot be bypassed without workarounds: the mechanical nature of the hard disk is an intrinsic limit (the head has to physically reach the position to execute reads and writes). The mechanical nature of HDDs has other drawbacks: the rotation of the platters, the movement of the head over the disks, the read or the write process can all be a source of failures, either due to physical breakage or to external causes such as vibrations and accidental falls [133]. As for power efficiency, hard disks consume a high amount of electrical power: consumption could range from around 1.5 W to around 9 W each (the consumption, respectively, of a consumer low-power HDD [109] and of an enterprise HDD [105]). Consequently, referring to the needs previously described, hard disks do not seem to comply with those high claims: while they seem to be just perfect for long-term and high-volume data retention, they lose appeal when high throughput, low latencies and power efficiency are needed.
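To give an order of magnitude for the mechanical penalty just described (illustrative arithmetic with typical ballpark figures of my own, not measurements cited by the thesis): the average rotational latency alone is half a revolution, so a 7200 RPM drive spends roughly 4 ms waiting for the platter before any seek or transfer is accounted for.

    # Back-of-the-envelope random-access time for a rotating disk.
    # Seek time and transfer rate are assumed, typical ballpark values.
    rpm = 7200
    rotational_ms = 0.5 * 60_000 / rpm                  # half a turn: ~4.17 ms
    seek_ms = 9.0                                        # assumed average seek
    transfer_mb_s = 150.0                                # assumed sequential rate
    transfer_4kb_ms = (4 / 1024) / transfer_mb_s * 1000  # ~0.03 ms for 4 KB

    print(f"~{seek_ms + rotational_ms + transfer_4kb_ms:.1f} ms per random 4 KB access")

Compared with DRAM latencies in the tens of nanoseconds, this is roughly five orders of magnitude slower, which is the gap that caches and the memory hierarchy exist to hide.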

1.2.2 Charge-based devices

This class of devices, all solid-state, uses the electric charge of electrons to store bits into the memory cells: hence their name. Each description summarizes both the technological aspects that enable the bit storage and the switching mechanics, along with a short summary of the specific issues of each technology. Common issues of this class of devices are instead treated specifically afterwards.

NAND and NOR Flash memories This technology is based on the ability to build enhanced field-effect transistors that achieve the desired behavior, as happens in EPROMs and EEPROMs [50, pp. 9-4]. Memory cells built with this technology provide persistent storage through a 3-terminal configuration: data is stored by modulating the threshold voltage of an enhanced FET. These transistors are similar to standard FETs, except that they have two gates instead of one (control gate and floating gate). The control gate (the upper one) works as in standard FET technology, whereas the floating gate (the lower one) acts as an electron vessel: being made of a conductive material, it can hold floating electrons thanks to the insulating oxide layer that encloses it. The threshold value (and hence the stored data) is modified by filling or emptying the floating gate with electrons (filling is obtained through Fowler–Nordheim tunneling or hot-carrier injection, emptying usually through Fowler–Nordheim tunneling, see [50]). In an SLC configuration, a memory cell with a high threshold voltage is in the programmed state (0, non-conducting, vessel full); vice versa, a low threshold corresponds to the erased state (1, conducting, vessel empty). Reading is performed as previously described for 3-terminal memory cells and is not destructive, as it implies neither the erasing nor the programming process. Flash is a 1T technology, as each memory cell consists of exactly one transistor.

Figure 1.3: The Flash memory cell. © TDK Corporation

Depending on how the Flash memory cells are linked together, either NAND Flash or NOR Flash is produced. In either case, before being programmed a cell must be in the erased state. In both configurations the erase operation is slow (milliseconds), expensive (high power) and performed on groups of many bytes, called the erase size. Reading is not destructive; NOR Flash can be either I/O bound or memory bound, whereas NAND is only I/O bound. Research efforts on this technology are extensive, even though it can be considered mature: the use of a modified field-effect transistor as a memory device dates back to papers published in 1967 [138]. Those studies led to the development of EPROM and EEPROM, whose main principles are further exploited in Flash technology, conceived by Dr. Masuoka's research in the early 1980s [49, 124]: NOR Flash was officially presented in 1984, NAND in 1987 [98]. Flash memories can be used to build a large number of memory devices: when used to build SSDs, performance is definitely better than that of hard disks. The latency of Flash memory cells lies between tens and hundreds of microseconds; in particular, Flash SSDs offer a significant speedup over common hard disks, usually providing better latencies and higher throughput. Their power-efficiency advantage over hard disks is yet to be verified: while commercial documentation claims a clear power benefit over HDDs, the actual figures found in datasheets are less clear-cut (for example, comparing a 2.5-inch consumer HDD with a 2.5-inch consumer SSD, the SSD clearly wins when idle or in standby, whereas in read or write mode it can consume more energy, roughly 3 W vs 1.75 W, see [96, 109]). Flash memories, being solid-state, have no moving mechanical parts, which avoids mechanical failures. Flash memories certainly represent a first step in the direction demanded by customers. However, they are far from perfect. Common issues are:

Cell wearing, low endurance: currently on the order of 10^4–10^5 write cycles. The problem is rooted both in the write/erase process and in the materials used to build the Flash transistor: erasing and programming both require high energies to force electrons through the thin insulating oxide layer (tunnel oxide). This process gradually damages the oxide, eventually causing the loss of its insulating properties: since the floating gate is made of a conductive material, the damage causes the loss of the electrons it contains [61, 13]. To guarantee a long life to devices employing Flash memories, wear-leveling strategies must be adopted, distributing writes and erases across the whole set of cells so that these operations do not concentrate on a few of them.

Low reliability: the NAND Flash configuration is the one most used in SSDs, mainly because of the better density it achieves. This, however, has a cost in reliability: NAND Flash devices suffer from disturb effects (read and program disturbs) when reading and writing; moreover, devices of this type leave the factory without the guarantee that all cells are in optimal condition. For these reasons, ECC functions are needed when using Flash devices, especially in the NAND configuration. Such functions usually increase the complexity of either the hardware or the software, and therefore have a cost [102].

Complex writing mechanism: Flash memories must follow a rather involved procedure: to be programmed, a cell must first be in the erased state. Moreover, erases are expensive and erase sizes are substantial (8 KB to 32 KB): this forces each SSD to keep large reserves of erased blocks in order to speed up writes. In addition, since NAND Flash is I/O bound, transfers are made at least at block granularity, which increases inefficiency when only small amounts of data change. All these issues are usually managed in software layers (either in the operating system or in the SSD firmware) called Flash Translation Layers (FTLs), whose job is to hide these problems from the computer and to manage efficiently the required wear leveling, error correction and, in general, failure avoidance. However, while these layers do succeed in simulating a standard hard disk, all this complexity is expensive and can easily become a source of performance loss.
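As a purely illustrative aid, the following minimal sketch shows two of the ideas just discussed, out-of-place updates and a naive form of wear leveling; the geometry, data structures and policy are assumptions of the example and do not reflect the firmware of any real SSD.

/* Minimal sketch of a page-level Flash Translation Layer (illustrative only):
 * a logical page is never rewritten in place, and new data goes to the
 * least-erased block that still has free pages. Garbage collection omitted. */
#include <stdio.h>
#include <string.h>

#define BLOCKS          8     /* erase units                         */
#define PAGES_PER_BLOCK 4     /* program units inside an erase unit  */
#define LOGICAL_PAGES   16

static int l2p[LOGICAL_PAGES];      /* logical page -> physical page (-1 = unmapped) */
static int erase_count[BLOCKS];     /* wear counter per block                        */
static int next_free_page[BLOCKS];  /* pages are programmed sequentially             */

static int pick_block(void)         /* least-worn block with free pages */
{
    int best = -1;
    for (int b = 0; b < BLOCKS; b++)
        if (next_free_page[b] < PAGES_PER_BLOCK &&
            (best < 0 || erase_count[b] < erase_count[best]))
            best = b;
    return best;                    /* -1 means garbage collection would be needed */
}

static void write_logical(int lpage)
{
    int b = pick_block();
    if (b < 0) { puts("no free pages: GC required"); return; }
    int ppage = b * PAGES_PER_BLOCK + next_free_page[b]++;
    l2p[lpage] = ppage;             /* the old copy becomes stale, reclaimed later */
    printf("logical %2d -> physical %2d (block %d)\n", lpage, ppage, b);
}

int main(void)
{
    memset(l2p, -1, sizeof l2p);
    for (int i = 0; i < 6; i++)
        write_logical(3);           /* repeated updates land on different pages */
    return 0;
}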

Dynamic RAM DRAM is the main technology currently used by computing systems (from the smallest smartphone to the biggest supercomputer) to implement what Von Neumann called "the memory" in his well-known model. A DRAM memory cell currently consists of one control transistor and one capacitor (1T-1C technology), and cells are organized in grids. When a line is opened through the transistor, charge is free to leave the capacitor if it is charged (meaning its value is 1; 0 if the capacitor is empty). Because of this design the read operation is destructive: data loss is avoided by re-writing the cell whenever it was charged, at the cost of some additional complexity. For the same reason this memory is volatile, because capacitors discharge quickly: memory cells need to be refreshed to retain their data, i.e. the capacitors have to be recharged periodically (the typical refresh window is 64 ms for each cell), and when the computer is off all data is lost. Reads and writes are fast (latency in the nanosecond range, i.e. one or two orders of magnitude slower than the CPU clock) and each byte is directly addressable by the processor. Surprisingly, a memory made of capacitors that needed continuous refreshing was already present in a machine called Aquarius built at Bletchley Park (UK) during World War II [45]. The 1T-1C design, however, dates back to 1968, when Dr. Robert Dennard registered US patent no. 3,387,286 [24], improving previous designs that required more components. The limits of DRAM lie principally in its low density and high energy cost, as explained before. The scientific literature observes that as density increases, the total refresh time of each chip also increases, causing high overheads in both latency and bandwidth [37]. Another issue is that the growth in DRAM density cannot keep up with that of CPU cores: in recent years CPU core counts have doubled every two years while DRAM density has doubled only every three, so the memory-per-core ratio keeps shrinking [55]. Finally, even though DRAM is fast, its latency remains a bottleneck for the even faster CPU: latency improvements over time have been minimal (only about 20% in ten years), and this slow improvement trend is expected to continue. Just when the demand for speed is so pressing, this technology struggles to sustain the performance required by current processors.
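To make the refresh-overhead concern concrete, here is a rough estimate with representative, assumed DDR-class figures (not taken from any specific datasheet): if all rows must be refreshed within a 64 ms window using 8192 refresh commands, and each command keeps the device busy for about 350 ns, then

tREFI = 64 ms / 8192 ≈ 7.8 µs
overhead ≈ tRFC / tREFI = 350 ns / 7.8 µs ≈ 4.5%

i.e. the device is unavailable for several percent of the time, and the fraction grows with density, since higher-capacity chips need longer refresh bursts.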

SRAM Static RAM is the fastest and most expensive memory used outside the CPU cores: SRAM is usually located on the same die as the CPU, as close as possible to its cores, and serves mostly as hardware data and instruction cache. SRAM is also commonly found in other caches, such as those of hard disks, routers and other devices. These memories are volatile, although their design does not require refreshing as DRAM does. The classical cell design is 6T, i.e. six transistors per cell (see [72]); this is a major drawback, since it raises the cost of a single memory cell and limits density and scalability.

CPU Registers CPU registers represent the highest level of the memory hierarchy. They are completely integrated into the CPU core, work at full speed and sit outside the standard address space: registers are accessed directly by name. Registers are usually used to hold temporary data between load and store operations. Since they are fully integrated into the CPU cores, their number is very limited: every additional register means less space for computational functions. Information stored in registers is lost when power is off.

1.2.3 Limits of charge-based devices

Besides the specific limits of each technology just described, charge-based technologies share some common issues. The most important problem currently faced by researchers and engineers concerns scaling: feature sizes have reached 28 nm in DRAM and 16 nm in NAND Flash cells, and researchers wonder for how long even smaller sizes will be achievable [94, 108]. In fact, at these small sizes:

- memory cells are very close to each other, so the risk of cell-to-cell interference is high;

- the total number of electrons that can be effectively stored in a capacitor or in a floating gate is small: if the electrons are too few, current technologies cannot sense their level correctly. Moreover, in the case of capacitors, a small capacitor means a small charge, which in turn means higher refresh rates;

- each functional element of the memory cell is very small and very thin, so the risk of electron leakage is higher.

The semiconductor industry is currently trying to extend the life of these technologies for as long as possible. These efforts have many aspects in common across charge-based memories; the most used approaches are:

- Better materials: this approach obtains better properties while leaving the cell design unchanged. This is, for example, the case of high-k materials (k is the dielectric constant: high values give better insulation and electrical properties [101]) and of improved production processes. A similar approach is used in Flash memories when the conductive floating gate is replaced with an insulator (different from the oxide) able to trap electrons: this gives better resilience to tunnel-oxide damage as well as better isolation against cell-to-cell interference.

- 3D restructuring: this approach is a major modification of the structure of each memory cell, even if the functional logic of each technology does not change. 3D-stacked semiconductors are the object of extensive research efforts, since this approach would significantly lengthen the life cycle of both DRAM and Flash. The main advantage of 3D structures is that memory cells can be stacked vertically: vertical stacking both optimizes the use of chip area (density) and allows larger distances between memory cells, avoiding cell-to-cell interference [108]. Samsung is already producing SSDs with 3D vertically-stacked Flash memory cells that use charge-trap technology instead of the classic floating gate [96]. For DRAM, 3D restructuring is currently applied in a prototypical technology called Hybrid Memory Cube [78], whose promises are high: speed and performance close to those of the CPU, high power efficiency and much better density.

Despite these extensive efforts to delay the retirement of charge-based technologies, the semiconductor industry nonetheless firmly expects that, sooner or later, these technologies will become too difficult to produce and enhance. Following this expectation, the technologies presented next are the object of many research efforts aiming to obtain products that can both successfully replace current technologies and fully answer the ever-higher expectations of the market in the years to come.

1.3 Technology – the future

The next paragraphs present the technologies competing to become those used in the next-generation memory hierarchy. Before delving into the details of each specific technology, note that most of them share some common features:

- The charge-based approach is being abandoned in favor of the resistive approach. Technologies like DRAM and Flash memory use electron containers (capacitors, or any material able to retain electrons) to encode the information bit in the memory cell. This approach, however, has the disadvantages just shown. Research is therefore preferring the resistive approach: the information bit is encoded as a property of a specific switching material, a high-resistance or a low-resistance state. There is no need to store electrical charge: electrons are used just to check the memory cell status. This approach permits better performance and better scalability.

- Persistence: none of these new technologies is volatile; they retain data when power is off, as SSDs or HDDs do. Some of them still cannot guarantee a long retention time, but these are problems of the early engineering and development stages: these new memories are engineered to be persistent.

- These technologies carry the word RAM in their name: this is indeed a clue that they both approach the speed of RAM (sub-microsecond operations) and seem well suited to byte-addressable memories, as common DRAM is.

- Density is expected to be higher than that of DRAM. Some of these technologies, especially those in prototypical stages of development, still struggle to reach this goal, as their cell area in square features is too large. In general, however, these new technologies promise better density (with cell areas down to 4F², see table B.2).

- Endurance is limited: most of these technologies have limited endurance with respect to current DRAM, though better than Flash, as the most wearable of them withstands at least 10⁹ write cycles (a rough lifetime estimate follows this list).

- R/W asymmetry: most of these new technologies show different timings for read (faster) and write (slower) operations, as happens with Flash memories.
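To see why even 10⁹ cycles calls for wear-aware management once such memories sit on the memory bus, consider a deliberately simple, purely illustrative estimate:

worst case, no wear leveling:  10⁹ writes × 100 ns per write ≈ 100 s
ideal wear leveling over a 16 GiB module (2²⁸ cache lines of 64 B):  2²⁸ × 100 s ≈ 850 years

A single hot cache line rewritten every 100 ns would thus wear out in minutes, whereas the same write stream spread evenly over the whole device lasts for centuries; this is why wear leveling, already familiar from Flash, remains relevant for these technologies.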

1.3.1 Prototypical

Technologies in a prototypical development stage are already being produced commercially even though they are not mature. Production volumes are low and prices are consequently high. Prototypical products are often used in niche markets and usually suffer from the fact that some of their evaluation metrics have not yet reached the target levels. The research efforts undertaken to obtain better performance, higher density and, more generally, a product ready for high-volume production are usually extensive.

Ferroelectric RAM (still not resistive) FeRAM, or FRAM, uses ferroelectricity (see section A.2) to store the information bit in a ferroelectric capacitor, which remembers its bistable polarization state over time. The capacitor acts as a dipole whose polarization is changed by an electric field. One of the two polarization states is logically associated with a 0 value, the other with a 1. Given this switching mechanism, this memory still does not use the resistive approach. The technology uses the active-matrix configuration to access the capacitor through a transistor: it is a one transistor – one ferroelectric capacitor (1T-1FC) architecture. Similarly to DRAM, reading is destructive: to my knowledge, the literature never describes the read process precisely, apart from remarking its destructiveness. Current issues with this technology concern scalability and the production process used: as noted in table B.2, feature sizes are still at the 180 nm node, and the cell area of 22F² is so large that scaling becomes problematic. Another problem is cell wearing, since repeated switching degrades the cell's performance over time (dipole relaxation). Currently this memory is used in some embedded computing devices.

Figure 1.4: Ferroelectric crystal bistable behavior. © Fujitsu Ltd

MRAM and STT-MRAM This technology, also referred to as Magnetic RAM, uses memory cells built in a one transistor – one magnetic tunnel junction (1T-1MTJ) architecture (see section A.2). The MTJ element acts as a resistive switch (like those shown later for ReRAM, PCRAM and FTJ RAM), encoding a bit as a different resistance state.

Figure 1.5: Magnetic Tunnel Junction. © Everspin Technologies Inc.

The way MTJ elements are programmed differs between the classic MRAM design and the STT-MRAM design. The classic design uses magnetic fields induced by currents flowing in nanowires [66]; STT-MRAMs instead use the Spin Transfer Torque effect to change the magnetic polarization of the free layer [4, 119]. The memory cell is read by sensing the resistivity of the MTJ element, detecting whether current passes: the read operation is therefore not destructive, since the state of the free layer does not change upon reading. Even though some of the ideas used in MRAM date back to the magnetic core memories of the 1940s and 1950s [86], MRAM is based on physics research undertaken from the 1960s onwards on what is now called spintronics [63, 139], culminating in 1988 with the discovery of Giant Magneto-Resistance by Albert Fert and Peter Grünberg [100]. The first MRAM patent dates to 1994 (IBM, US patent no. 5,343,422).


Phase Change RAM (PCRAM – PCM) During the 1950s, Dr. B. T. Kolomiets conducted research on chalcogenide glasses, verifying their ability to change from an amorphous to a crystalline state under the effect of heating [43]. The crystalline state shows low electrical resistance and high light reflectivity, while the amorphous state shows the opposite features: high electrical resistance and low light reflectivity. This ability to change state was later referred to as phase change. Phase Change RAM exploits the resistivity difference between the two states of a phase-change material (the most commonly used compound is GeSbTe, or GST) to encode the logic 0 and 1 levels. Each PCRAM memory cell consists of a chalcogenide layer enclosed between two electrodes. The top electrode is directly attached to the chalcogenide layer, whereas the bottom electrode is linked to the phase-change layer via a small conductive interconnect surrounded by an insulator. The conductive interconnect, acting as a heater, is responsible for switching the state of the chalcogenide by means of the Joule effect. Each memory cell is programmed by modulating the use of the heater: either a short pulse of high current melts the crystalline state into the amorphous one, or a longer pulse of lower current induces the growth of crystals. Cell reading is performed with a low probe current, sensing its flow, which depends on the cell's resistivity; as a consequence, reading is not destructive. Phase Change RAM is a very promising, yet prototypical, technology. The first articles on electrical switching of phase-change materials date to 1968 [113], and a first prototype of a phase-change memory was presented by Intel in 1970 [94]. However, since it was expensive and power hungry, the technology languished during the 1970s and 1980s, until the current phase-change design was developed by exploiting the research efforts made during the 1980s and 1990s on optical storage: phase-change materials are the key to the writability and re-writability of CDs and DVDs [131]. Even though phase-change technology still has to improve to reach the expected performance and effective profitability, working prototypes have already been produced by Micron, Samsung and IBM, and phase-change memories are used in some very high-performing PCI Express SSDs [1].

Figure 1.6: Phase Change memory cell. (a) PCM switching mechanics, © American Chemical Society; (b) PCM mushroom type, © Ovonyx Inc.

1.3.2 Emerging

Besides PCRAM, MRAM and FeRAM, new emerging technologies are currently being developed; the best known among them is ReRAM. These emerging technologies (briefly described below) represent a real jump in sophistication with respect to the ones just explained. Knowledge and know-how are of course cumulative, and new technologies are developed only thanks to earlier efforts; nonetheless, the quality and depth of scientific and technical knowledge needed to turn the technologies described below into commercial (and profitable) products is impressive: chemical mastery, quantum physics, nano-ionics, materials science, nano- and sub-nano-scale production processes, and so on. Except for ReRAM, these technologies are still embryonic.

Ferroelectric Memory ITRS places in this emerging-memory category two different technologies, both based on ferroelectricity: Ferroelectric FET technology and Ferroelectric polarization ReRAM (Ferroelectric Tunnel Junction technology – FTJ).

Ferroelectric FETs, or FeFETs, resemble standard MOSFETs but use a ferroelectric layer, instead of an oxide layer, between the gate electrode and the silicon surface [27]. One polarization state permits current flow between the source and drain electrodes, whereas the other does not. A memory based on FeFETs would have memory cells very similar to those of Flash, since it would also be a 1T technology. Writing would be achieved by changing the polarity applied to the control gate, whereas reading would be performed by probing the current flow between source and drain; the read process would therefore not be destructive. Real FeFETs usually employ an additional insulating layer between the ferroelectric and the semiconductor to achieve better performance; this need has led to various configurations, such as MFIS and MFMIS (Metal-Ferroelectric-Insulator-Semiconductor and Metal-Ferroelectric-Metal-Insulator-Semiconductor, respectively). The first attempts to develop FeFET technology were made during the late 1950s, and a first patent using this approach was issued in 1957 (US patents 2,791,758 through 2,791,761). However, more than 50 years after that first patent, this technology has proved to be definitely non-trivial, suffering from still unaddressed problems, the biggest of which is data retention loss. New approaches are investigating organic ferroelectric materials. If improved, this technology would be interesting when applied to a DRAM-like memory: such a solution would be much more scalable, since it would not need a capacitor, thus reducing the minimum feature size.

Ferroelectric Polarization ReRAM – FTJ ReRAM uses a memory configuration similar to that of what is commonly called ReRAM, but the memory cell uses FTJ technology to encode information bits persistently and to allow non-destructive reads (in contrast to classic FeRAM technology, see section A.2). As stated in the 2013 ITRS report on emerging memory devices, although earlier attempts date back to 1970, the first demonstrations of TER came in 2009 [84] (TER, tunnel electro-resistance, is the mechanism used in FTJ). This memory technology is thus at a very early stage of development, and the literature reflects this: no industry player is yet producing it, not even as a prototype.

Resistive RAM (ReRAM) – Redox memories The term resistive suggests that this technology, as in the case of MRAM, uses resistivity to encode data in the memory cells: each cell has an RS (Resistive Switching) element, responsible for the actual storage of the data. In an SLC (single-level cell) approach, this element encodes a zero as a non-conducting state (high resistance state – HRS – RESET state), and a one as a conducting state (low resistance state – LRS – SET state). Each memory cell acts as a building block of larger memories, usually arranged in grids, as happens for DRAM. Resistive RAM (or Redox memory, as classified by ITRS) is in fact a generic term covering a series of different strategies adopted to induce the resistance switching of the RS element by means of chemical reactions (nanoionic reduction-oxidation effects) [137].

Figure 1.7: RRAM cell. © John Wiley & Sons

Whatever the specific switching process, the RS element is generally built as a capacitor-like MIM (Metal-Insulator-Metal) structure, composed of an insulating or resistive material I (usually a thin-film oxide) sandwiched between two (possibly different) electron conductors M, sometimes referred to as top electrode (TE) and bottom electrode (BE). These RS elements can be electrically switched between at least two different resistance states, usually after an initial electroforming cycle, which is required to activate the switching property [145]. Resistive RAM represents a class of very promising, yet emerging, technologies for the next-generation memory. Expectations are very high, even when compared to the prototypical technologies: high endurance, long retention, extreme speed and reliability, very low power consumption, and high scalability with relative ease of production. Moreover, scientists and researchers claim that a high degree of improvement is still achievable. Resistive RAM technologies are tightly linked to the research efforts made on thin-film oxides, especially during the 1960s and 1970s. The first patent on a memory technology using an array of cells containing bistable switchable resistors dates back to 1973 (U.S. patent 3,761,896). However, as pointed out in Kim's paper [40], that technology languished until the late 1990s and 2000s, when a newer approach was proposed and pursued. The current scientific literature on these technologies clearly shows the emerging nature of ReRAM: most papers focus on the core research side. Recurring topics are the need to model correctly the atomic behavior of the MIM compound during the resistive switch, the need to understand thoroughly the interactions between the materials used in the RS element, and the need to develop better laboratory tools and techniques to analyze precisely the resistive-switching mechanics. On the other hand, popularizing papers are scarce, and few papers are dedicated to implementors and computer scientists. Nonetheless, most of the big electronics industry players are currently developing prototypical memories based on redox memory cells, and some startup companies hold new intellectual property based on ReRAM projects. Even if profitability and commercialization of these products still seem quite far away, reliable samples have already been produced using current production processes [77]. Before presenting each specific switching mechanism, it is useful to specify the meaning of the following terms, as they are used frequently in the subsequent descriptions:

- filamentary / non-filamentary: in this context, filamentary means that the change in resistivity of the RS element is achieved by creating a filamentary conductive link between the two conductors (M), i.e. the passage of current is not uniform across the RS element and most of the resistive material continues to act as an insulator. Conversely, non-filamentary mechanisms achieve the resistive switch uniformly in the resistive material (I), i.e. the current passes through the whole volume of the resistive material [84, section 4.1.2.1];

- unipolar / bipolar: indicates whether the specific mechanism uses one fixed polarity between the two electrodes or inverts the polarity between them in order to switch the cell state. In the unipolar case the current has to be modulated somehow to produce the state switch; in the bipolar case the mechanism is simpler, as the polarity inversion itself causes the switch between states [84, section 4.1.2.1].

There are four different approaches to redox memories; each one uses a specific combination of the alternative features just presented.

ElectroChemical Metallization mechanism (ECM), sometimes referred to as Electrochemical Metallization Bridge, Conductive Bridge (CB) or Programmable Metallization Cell (PMC). This technology uses the filamentary bipolar approach: one of the two electrodes is electrochemically active, whereas the other is electrochemically inert. The I material is a solid electrolyte, allowing the movement of charged ions towards the electrodes. The change in resistance depends on the creation of a conductive path between the electrodes under the effect of an electric field applied between them.

Figure 1.8: ElectroChemical Metallization switching process. © 2013 Owner Societies

These are the reactions occurring under the effect of an electric field with a (sufficiently) positive potential applied to the active electrode (RESET to SET transition):

- Oxidation: the material of the active electrode loses electrons and disperses its ions (M^z+, cations) into the solid electrolyte (M → M^z+ + z e^-);

- Migration: the positively charged ions move towards the low-potential electrode under the effect of the high electric field;

- Reduction and electrocrystallization: on the surface of the inert electrode the reduction process takes place, where electrons from the electrode react with the arriving ions, forming a filament of the same metal as the active electrode, which grows preferentially in the direction of the active electrode (M^z+ + z e^- → M).

The memory cell then retains its SET state until a sufficient voltage of opposite polarity causes the opposite reactions, leading back to the RESET state. This approach to memory production is currently pursued by NEC (NanoBridge technology), Crossbar (PMC) and Infineon (CBRAM).

Metal Oxide – Bipolar filamentary: the Valence Change Mechanism (VCM), like ECM, relies on ion movement to achieve the resistive switch, in this case the movement of anions. It relies on defects in the crystal structure (usually oxygen vacancies), which are positively charged, and on the ability of anions to move through such holes in the I element. With reference to the redox terminology, in this context reduction refers to recreating the original crystalline structure by filling a vacancy (usually acquiring oxygen anions), whereas oxidation refers to creating a vacancy (usually losing oxygen anions). Reduction and oxidation have the effect of changing the atomic valence of the atoms of the crystal structure where the change happens, hence the name Valence Change. The resistive switch is induced by an electric field: one polarity creates a conductive channel of accumulated vacancies, whereas the other restores the anions to their place. Currently, Panasonic and Toshiba are developing ReRAM memories in their laboratories, and samples have already been demonstrated [84].

Metal Oxide – Unipolar filamentary: the ThermoChemical Mechanism (TCM) is another approach used to create a filamentary conductive link between the two electrodes of the MIM compound. The approach is somewhat similar to that of Phase Change RAM: instead of reversing the polarity of the electric field as in VCM and ECM, TCM modulates current and voltage (not pulse duration as in PCM) to induce SET-RESET and RESET-SET transitions while keeping a fixed polarity. The I layer never blocks the current completely: in the RESET state the current flow encounters high resistivity, whereas in the SET state the resistivity is low. To obtain a RESET-SET switch, a limited current under a high-potential electric field is used: the limited current induces Joule heating which, in turn, triggers a redox process (similar to that of VCM) that creates a filamentary breakdown of the oxide (I), leading to a conduction channel between the electrodes and to an immediate drop in resistivity. Conversely, to obtain a SET-RESET switch, a high current at low voltage is used: this current breaks the conductive link, as in a traditional household fuse. For this reason TCM is also referred to as a fuse-antifuse mechanism. This approach still seems far from maturity: ITRS does not report any big electronics firm pursuing it.

Metal Oxide – Bipolar non-filamentary: the last class of redox-based approaches uses a non-filamentary strategy, sometimes also referred to as interfacial switching. The resistance-switching mechanism is triggered by field-driven redistributions of oxygen vacancies close to the Insulator-Metal junctions. ITRS refers to this as the least mature approach within the ReRAM family.

Mott memory Researchers are investigating the feasibility of memory cells that use the Mott transition effect as the resistance-switching mechanism. Such memory cells could be configured either as modified FETs (as in FeFETs) or as MIM compounds (as in ReRAMs). Judging from ITRS documentation, from clues found on the web and from the scientific literature, research on this technology appears to be at a very early stage and there is no information about produced prototypes. Research efforts are still concentrated on the chemical and physical properties of Mott insulators.

Carbon Memory Some researchers have proposed carbon as a new material for building resistive, non-volatile memory cells. The investigated configurations include both 2-terminal and 3-terminal memory cells. In this approach, memory cells would exploit some of the physical and electrical features of carbon allotropes (diamond, graphite and fullerene), especially those of graphite (graphene and carbon nanotubes being the most common examples). Some approaches would use the transition between a diamond-like state (insulating) and a graphite-like state (conducting) as the switching mechanism; others would rely on local modifications of carbon nanotubes that induce a resistance switch; others again would use an insulating diamond-like carbon between conductors, as in electrochemical metallization, to electrically induce a conductive graphite-like filament. Research on carbon allotropes in this field is more mature than that of other emerging memories (such as Mott memory): starting in 1859 with the English chemist Benjamin Collins Brodie and his determination of the atomic weight of graphite, knowledge of this material has grown throughout the last century. Carbon Memory remains, however, a memory technology at an embryonic stage of development.

Macromolecular Memory Macromolecular technologies focus, as Redox memories do, on Metal-Insulator-Metal compounds. The material between the two electrodes is a polymer layer that must possess a resistive-switching ability. The term macromolecular is however quite general: ITRS reports that many polymers are currently being investigated, and they have shown different behaviors that could be used to build new memory technologies, some featuring ferroelectric properties, others the formation of metallic filaments. The status of these research efforts is, however, embryonic.


Molecular Memory Molecular memory technologies represent another research field still at a very early stage. Such a technology would be based on single molecules, or small clusters of them, used as resistive-switching elements to store the information bit. As in Redox memories, current would be used to switch the resistance of the molecule, and reading would not be destructive. The promises of this technology are very high, since in theory each memory cell could shrink to the size of a single molecule: the first studies on molecular memory report exceptional power efficiency and high switching speed. ITRS admits, however, that much research effort is still needed to gain an adequate understanding of this technology.

Other memories Besides those officially included in the ITRS taxonomy, other technologies are being investigated by researchers, laboratories and industry. Among them, Racetrack [114], Millipede [134] and Nanocrystal [18] memories are worth mentioning here.

1.3.3 From the memory cell to memories

Memory cells, such as those presented so far, are the building blocks of actual memories. Each memory cell can hold at least one bit of information, and MLC technologies allow the encoding of more bits, usually two or three. Memory cells are then assembled together to provide bytes, cache lines, pages, and so on. Since byte or block addressability depends on the way cells are linked together, as the Flash case shows, the engineers' decisions on this point are pivotal, because they shape how these technologies will be used: block-addressable memories fit naturally into I/O devices, whereas byte-addressable ones are well suited both to the memory bus and to I/O devices. Regarding the technologies presented above as prototypical and emerging, engineers appear to expect to connect the cells in such a way as to allow byte addressability. In fact, as explained later, one of the most widely shared expectations about these memories is that they will be attached to the memory bus, which requires byte addressability; the sketch below contrasts the two access styles from the software point of view.
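The following minimal sketch is purely illustrative: the device paths are hypothetical, error handling is omitted, and no specific driver or platform is implied.

/* Minimal sketch contrasting block-style and byte-style access from user space. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Block-addressable device: data moves through an I/O request; even to
     * inspect a few bytes, at least one block travels through the I/O stack. */
    char block[4096];
    int fd = open("/dev/sdb", O_RDONLY);
    pread(fd, block, sizeof block, 0);          /* read block 0 */
    printf("first byte via I/O:  %d\n", block[0]);

    /* Byte-addressable (memory-bus) device: once mapped, a plain load or
     * store touches exactly the bytes of interest, with no system call.     */
    int pmem = open("/dev/pmem0", O_RDWR);
    uint8_t *base = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, pmem, 0);
    printf("first byte via load: %d\n", base[0]);
    base[1] = 42;                               /* an ordinary store updates it in place */

    munmap(base, 4096);
    close(pmem);
    close(fd);
    return 0;
}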

One last remark concerns the effective performance of these technologies. Figures about performance are not always easy to find; moreover, semiconductor producers are sometimes reluctant to give extensive information about their products. They may indeed consider disclosing such data counterproductive, since the figures could reveal issues that they prefer to hide and to manage through software layers (firmware or FTLs, for example). Despite these remarks, table B.2 reports some figures about current, prototypical and emerging technologies that can be used for a first comparison. Based on those figures, I would only underline that FeRAM and STT-MRAM seem to suffer from scalability problems, as their cell area is excessively large (the goal for semiconductor producers is 4F²). Moreover, FeRAM seems to be produced with a very dated production process (180 nm), which could be a clue that this technology is stagnating. Among the prototypical technologies, Phase Change memories seem to be the only ones promising high densities. A rough comparison between the prototypical and the emerging technologies makes it apparent that the promises of the emerging ones are much higher: better scalability, density, performance and endurance. While these products are not yet in production or, at best, are produced only in low volumes, researchers are nonetheless wondering extensively how these new memories would influence the operation and design of operating systems; to allow this analysis, they have made the following assumptions: these memories will be persistent, byte addressable, denser than DRAM and faster than Flash. The impact of persistent memories on operating systems is therefore the topic of the next chapter.

Chapter 2

Operating Systems

Until now I have focused on the technical features of persistent memories, covering essentially the first part of the title. From now on, the focus shifts to operating systems, and I will try to present their design issues related to persistent memories as well as possible. All the examples made hereafter follow the UNIX paradigm and, specifically, are based on the Linux operating system: even if the same principles and approaches are used in other families of operating systems (e.g. Windows), a specific paradigm is necessary to maintain some concreteness; moreover, thanks to the open-source nature of Linux, access to its internals is easier, and I will take advantage of this. What follows are some preliminary observations about the models that could be influenced, or even changed, under the pressure of the new memory technologies. Afterwards, since persistent memories can be used either inside a fast SSD or directly attached to the memory bus, each of these approaches is presented, starting from the former, which is indeed the more conservative one.

Every operating system can be conceived both as an extended machine and as a resource manager [130]. In the former perspective, the operating system is responsible for hiding from the user all the complex details of the hardware, providing an abstract machine that is simpler to use and to program; in the latter, it is responsible for managing all the resources available on a specific computing system. In either view, the operating system is, most of the time, a software product acting as a glue between the hardware and the programs (and ultimately the user). The relation between hardware and software is somewhat porous: even if each represents a specific research domain, they are inseparably related, and it can easily happen that advances (or different approaches) in software engineering push changes in hardware design, and vice versa. However, since operating systems are conceived primarily to allow the most profitable use of hardware resources, it is not only legitimate but also rational to ask whether new hardware technologies have the potential to influence software, and to what extent. Scientists, researchers and developers claim that the technologies just presented will force deep changes in operating system engineering.

2.1 Reference models

Every science uses models, abstracting from the specific details of problem instances in order to describe the problems themselves synthetically and generically. Models are indeed valuable: they allow an elegant representation and resolution of problems, acting as a useful frame within which scientists, engineers and developers can build real solutions. Changes in the founding models usually trigger further changes in a sort of chain reaction: it happens in mathematics and physics, and computer science is no exception. In particular, operating systems rely on some fundamental reference models. Since the new memories have features that main memory never had, researchers are trying to understand to what extent such features will force changes in the current operating system models.

2.1.1 The Von Neumann machine

Von Neumann's model is probably one of the most important in computer science: it describes how computations are executed in computers. The model, shown in figure 2.1, is quite simple: memory ("the memory"), together with a processing unit ("the control" = Control Unit + ALU) and an input/output function ("the I/O"), all connected by a single bus, compose a complete computing system. Instructions are fetched from memory, then decoded by the control unit; operands are retrieved from memory, the execution is performed in collaboration with the ALU, and the results are finally stored back into memory (the loop below sketches this cycle).

Figure 2.1: The Von Neumann model
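As a purely illustrative rendering of the fetch-decode-execute cycle just described (the 16-bit instruction format, the tiny instruction set and the machine state are invented for the example and do not correspond to any real machine):

/* A toy fetch-decode-execute loop over a single "memory" holding code and data. */
#include <stdint.h>
#include <stdio.h>

enum { LOAD = 0, ADD = 1, STORE = 2, HALT = 3 };   /* toy opcodes */

int main(void)
{
    uint16_t mem[256] = {                /* "the memory": code and data together */
        [0] = LOAD  << 12 | 100,         /* acc  = mem[100]  */
        [1] = ADD   << 12 | 101,         /* acc += mem[101]  */
        [2] = STORE << 12 | 102,         /* mem[102] = acc   */
        [3] = HALT  << 12,
        [100] = 40, [101] = 2,
    };
    uint16_t pc = 0, acc = 0;            /* "the control": program counter, accumulator */

    for (;;) {
        uint16_t inst = mem[pc++];       /* fetch                        */
        uint16_t op   = inst >> 12;      /* decode: opcode ...           */
        uint16_t addr = inst & 0x0FFF;   /* ... and operand address      */
        switch (op) {                    /* execute (with the "ALU")     */
        case LOAD:  acc = mem[addr];        break;
        case ADD:   acc = acc + mem[addr];  break;
        case STORE: mem[addr] = acc;        break;
        case HALT:  printf("mem[102] = %d\n", mem[102]); return 0;
        }
    }
}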

While in the past the real implementations of computing devices were very close to the model (the hardware design of the PDP-8 minicomputer, for example, was very close to Von Neumann's model), today real computer systems no longer resemble it, and current architectures are much more complex: a standard workstation may use many input/output devices attached to many different buses, may have several CPUs, a single CPU may have many cores, and so on. Moreover, computing systems have evolved over time to offer an ever-increasing set of features: multitasking, multithreading, networking, parallel computation, virtualization, and many others. Despite this complexity, however, the founding models are still the same as when computing started to become a reality: CPUs still perform computations using a fetch-decode-execute cycle based on the Von Neumann machine model. In this model, each functional unit (control, memory, I/O) has a specific role and specific tasks, not shared with the others: as long as the execution model does not change and each functional unit remains distinct from the others, the model should hold fast. Over time, for performance reasons, some portions of memory have been brought closer to the control through the use of L1, L2 and L3 caches; nonetheless, even if closer, control and memory remain logically separate. As briefly outlined in the next paragraph, a different conclusion would apply if control were to merge with memory. As for persistent memories, the most challenging hypotheses of use occur when they are placed on the memory bus (see section 2.3). However, such a use is almost identical to that of common DRAM: if engineered this way, faster, denser, even persistent memories would not change the basics of the model. Studies on memristance and memristors, on the other hand, could have the potential to seriously strain the Von Neumann model (see section A.2). Memristor-like memory cells could be used to build reconfigurable processors or logic functional units, resulting in a merge between the memory and the control [15, 140]; some researchers are pursuing the use of memristive memories to build neuromorphic chips for neural-network studies [126]. These are, however, currently futuristic scenarios. With the intention of remaining concrete, this work will not investigate further the potential changes to the Von Neumann model, taking for granted its validity in the years to come; instead, the Von Neumann model will be kept in the background as a reference.

2.1.2 The "memory" and the memory hierarchy

It is useful to narrow the focus to the memory side: after all, the new memory technologies natively relate to it. Although not properly a model, I wish to recall here the memory hierarchy, since it is a neat and synthetic representation of how current computing systems implement memory. When talking about the memory hierarchy, the word memory has a meaning quite different from the one used in the Von Neumann model, hence the quotes for the latter. In this scope, memory generically represents a place where computing devices can store instructions and data, either temporarily or persistently. Consequently, the memory hierarchy covers both "the memory" (in its upper part) and a portion of "the I/O" (in its lower part) of the Von Neumann model. The memory hierarchy shows at a glance, as already stated, the fundamental relationship between speed and density; however, other relevant information remains hidden in it. In particular, the speed of each level is not apparent, nor is the point where volatility ceases and persistence starts. Moreover, a dynamic view of how the hierarchy changes over time would be useful. New solid-state memory technologies, such as those presented above, ideally continue an innovation path started with the arrival of Flash memories. Before them, the configuration of the memory hierarchy had remained almost the same for about thirty years (1950s–1980s): it was built of registers, caches, RAM, hard disks, and tapes (or punched cards). Although performance changed over time, such changes were in absolute terms, while the relative values and the structure itself remained almost unchanged. Figure 2.2 shows the memory hierarchy again. To carry more information, the following hints have been added:

- access time has been added on the right (as negative powers of ten of a second);

- a thick gray margin marks the border between volatility and persistence;

- the border between memory-bound and I/O-bound devices has been pinned on the left;

- a dashed line marks both:

  - the border between symmetrical and asymmetrical read/write timings;

  - the border between (nearly) infinite and limited endurance.

Some further facts about figure 2.2 should be stressed: firstly, the six orders of magnitude of gap between hard disks and RAM are apparent; secondly, the thick gray margin and the I/O border coincide: fast memory is volatile, whereas slow memory is persistent. Fast memory is accessed with load and store instructions from the CPU, whereas slow memory needs complex access mechanisms (I/O). Finally, fast memories have symmetrical performance and suffer no wearing, whereas slow memories suffer from limited endurance and asymmetrical performance.


Figure 2.2: The memory hierarchy with hints

The memory hierarchy also carries clues about the problems that arise at every level: in a perfect world, the memory hierarchy would be infinitely large but flat. From an operating system viewpoint, the perfect memory would offer:

- CPU speed;

- persistence;

- native byte addressability;

- symmetric read and write performance;

- technological homogeneity;

- infinite endurance.

Unfortunately, real memories cannot have all these desirable features: each layer of the memory hierarchy offers only a subset of them. As a result, each level of the memory hierarchy also depicts the problems that an operating system has to manage in order to use it effectively. The next paragraphs attempt a brief summary of the main techniques adopted over time by operating system designers to overcome the limits of each layer of the hierarchy.

Speed issues CPU speed is a luxury enjoyed only by registers. Descending from layer to layer, access time increases exponentially, which means that each access to a memory of a given level has a cost in time; a well-engineered operating system tries to minimize that cost and to maximize performance. Unfortunately, not only does the cost of memory accesses increase at each level downwards but, historically, the speed gap between "the memory" and I/O-driven memories, such as hard disks, tapes, punched cards and CDs, has always been very large; today it is milliseconds versus nanoseconds, a delta of six orders of magnitude. This has always been a problematic limit. To circumvent it, developers adopt a large set of strategies; among others, caching and interrupts will be analyzed.

Caching Every access to memories slower than RAM has a definite cost: to minimize it, it is better to keep as much data as possible in RAM. Used indiscriminately, this approach would be wasteful: the faster the memory, the lower its density, and the stronger the need to use it optimally. By exploiting spatial and temporal locality, however, operating systems can use several levels of caches efficiently, allowing the processor to work on data very fast: this way, both memory operations and very slow I/O operations benefit from an important speed boost. This approach is of paramount importance for performance: it is therefore used in countless software products (operating systems, database engines, applications, and so on) and hardware devices (routers, switches, hard disks, SSDs, GPUs, and so on). The technique is also fundamental in big data centers, as in the case of Facebook and its server clusters running Memcached (www.memcached.org), which serve requests from the web servers (front-end) faster by interposing between them and the databases (back-end) [58]. Another famous example is Redis (redis.io), used in the cloud services offered by Amazon [62]. The use of caching, however, has its own costs:

- software and hardware become more complex: code for cache management and accounting, or chip area devoted to cache management;

- data is copied from its original location, so more memory is consumed;

- data modified in a cache has, sooner or later, to be written back to its original location;

- multiple caches must comply with cache-coherency rules.

The benefit experienced on systems that use caching is easily calculated with the formula recalled in the famous article by Wulf and McKee [143], where the average access time is:

Tavg = hc + (1 − h)M

where h is the hit rate, (1 − h) the miss rate, c the time to access the cache and M the time to access memory. If needed, this simple formula can be extended:

Tavg = xc + yM + zH

where x is the cache hit rate, y the probability of a memory access, z the probability of an HDD access (with x + y + z = 1), and, as before, c is the cache access time, M the memory access time and H the access time of a given I/O device.
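As a numerical illustration of the extended formula (the timings and probabilities are round, assumed figures, not measurements):

/* Average access time with the extended cache formula, using assumed,
 * order-of-magnitude timings: cache 1 ns, DRAM 100 ns, HDD 5 ms. */
#include <stdio.h>

int main(void)
{
    double c = 1e-9, M = 100e-9, H = 5e-3;   /* access times in seconds      */
    double x = 0.90, y = 0.099, z = 0.001;   /* probabilities, x + y + z = 1 */

    double t_avg = x * c + y * M + z * H;
    printf("Tavg = %.2f us\n", t_avg * 1e6); /* ~5.01 us: the rare HDD accesses dominate */
    return 0;
}

Even with only one access in a thousand reaching the disk, the HDD term dominates the average: this is exactly why hiding the slow levels behind caches is so effective.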

As just mentioned, however, caching has a cost in complexity. While hardware caching (as with the L1, L2 and L3 CPU caches) has a cost expressed mainly in hardware complexity (higher price), software caching (such as that used in the Linux page cache) makes the software more complex. This added complexity translates into extra work usually performed by the operating system (database systems typically perform caching autonomously, bypassing the operating system), and hence into measurable extra cycles and, consequently, additional time (latency) and energy (power). The advantage of caching, with respect to its cost, is however tremendous in the case of standard HDDs: in a standard off-the-shelf Linux storage stack, software accounts for just 0.27% of the I/O operational latency (0.27% × 1 ms = 2.7 µs) [129]. The software portion responsible for cache management is only a part of the entire I/O stack: other parts deal with system-call management, device drivers, memory management, and so on. Even supposing that the cost of the entire path traversal were due to caching alone, caching would still be tremendously useful: microseconds versus milliseconds. Starting from the case of caching, the observation can be generalized to software as a whole: software layers built upon slow I/O devices generally have little impact on performance. As an example, surveys made on LVM in Linux highlight that it adds only 0.03% of software latency and 0.04% of energy consumption in a disk-based storage stack.

Interrupts, or asynchronous execution: interrupts optimize CPU usage by suspending processes that have requested slow operations. The wake-up of a suspended process is triggered by the hardware raising an interrupt, thereby signaling to the operating system that the operation has finished. This approach does not technically remove the slowness of I/O operations, but it takes it into account, allowing the whole computing system to be used much more efficiently. The technique is fundamental: almost every computer uses it. However, it too has a cost in complexity. The operating system usually has to perform at least one context switch to suspend the process, has to start the I/O and has to schedule another process. This sequence, here highly simplified, is complex and time (and energy) consuming: many sources estimate its weight at about 6 µs. Once again, this value is only a small percentage of the cost of waiting for an I/O request to an HDD to complete: microseconds versus milliseconds.
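The trade-off can be made explicit with a toy decision rule: blocking with interrupts pays off only while the device is much slower than the suspend/resume machinery, which is exactly what fast persistent memories put into question. The 6 µs overhead is the figure quoted above; the device latencies are assumed, illustrative values.

/* Toy comparison between busy-waiting (polling) and interrupt-driven waiting. */
#include <stdio.h>

int main(void)
{
    double overhead_us = 6.0;                 /* context switch + reschedule */
    struct { const char *dev; double lat_us; } devs[] = {
        { "HDD",            5000.0 },
        { "Flash SSD",        80.0 },
        { "persistent mem",    0.2 },
    };

    for (int i = 0; i < 3; i++) {
        /* If the device answers faster than the cost of sleeping and waking,
         * the CPU time "saved" by blocking is negative: better to poll.      */
        const char *choice = devs[i].lat_us > overhead_us ? "block on interrupt"
                                                          : "poll";
        printf("%-15s latency %8.1f us -> %s\n",
               devs[i].dev, devs[i].lat_us, choice);
    }
    return 0;
}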

Hardware, data and failure models I will now focus on some observations about the models derived from the memory hierarchy. While some of these considerations might seem obvious, the aim of this deepening is to bring out which assumptions are implicitly taken for granted in the design of current operating systems. Some of the following distinctions may seem useless when considered against the classical memory hierarchy: practically every operating system uses the same models, since the memory hierarchy on which they are founded is the same. Conversely, I find such distinctions valuable from the perspective of change: the arrival of persistence in the higher layers of the hierarchy offers many degrees of exploitation, not just one, and some classification tools help the analysis. Every operating system, at least implicitly, uses a hardware model and a data model: the former specifies (generically) the main functional units it can manage, whereas the latter specifies a series of choices about how data is managed. In particular, the data model describes, among other things, both the design choices related to volatility and those related to persistence. These last choices can be referred to as the persistence model: the part of the data model specifically related to persistence and its management. Another set of choices, transversal to all the previous models (hardware, data, persistence and volatility), concerns failures. Data inside memories is subject to a long series of potential threats: power losses, hardware failures, electrical disturbances, programming errors, memory leaks, crashes, unauthorized accesses, and so on. These problems are well known to operating system designers, who take countermeasures and decide which classes of problems are managed and which ones are ignored: these design choices can be described as the failure model. The current data model and the current failure model are deeply based on the properties extrapolated from the classical memory hierarchy, which have been taken for granted for many decades and still are: developers have always engineered operating systems accordingly. The current data model is quite simple: persistence is delegated to I/O devices, whereas registers, caches and RAM are volatile; data is stored on hard disks and SSDs in files located in file systems. Speaking rather generally, the classical failure model, while guaranteeing security both in memory and in persistent devices, focuses on safety in memory and on consistency in persistent devices. Safety is preferred in memory as a consequence of its speed and volatility: the goal is to set policies that avoid corruption, while accepting the risk that such events may happen. Consistency is instead required in persistent devices as a consequence of persistence itself: consistency allows the effective and correct preservation of data over time (errors that reach persistent memories become, precisely, persistent errors). Moreover, the slowness and complexity of I/O operations on current persistent memories (hard disks and SSDs) further exacerbate the need for consistency: slowness increases the likelihood of a power failure in the middle of an I/O operation, so it is important to design I/O operations so that data survives even such events. Strategies such as file system checks, journaling, logging and transactional semantics are all designed to minimize damage to data in persistent memories after a power failure.

semantics, are all designed to minimize problems on data in persistent memories after a power failure event. Continuing to refer to failures in computing devices, there is a substantial dierence between errors in memory and those into I/O devices: in memory the potential sources of errors could be many, whereas in I/O devices errors are almost always caused by power failures and hardware errors, not by software: as noted in Chen's article [19], memory has always been perceived as an unreliable place to store data. Firstly, this is due to its volatility, but secondly to the ease of access and modication of its content: people do know that operating system crashes can easily corrupt the memory. On the other hand, I/O driven memory devices have always been perceived as reliable places to store data not only because of their persistent behavior: since the I/O stack is slow and complex, it is unlikely that a faulting condition can successfully perform a correct I/O operation with wrong data.

Moreover, just because I/O operations are slow and complex, it is

easy to add additional security (in software) by means of transactional semantics or some other similar techniques. These observations hold still today: persistence has always been thought as being reliable, whereas the opposite happens with volatility.

Unfortunately, if persistence reaches the memory bus, this property

would not hold any more: this aspect too should be taken into account or at least, acknowledged.

53

2.1.3 A dynamic view in time

All the aspects just outlined refer to the classic memory hierarchy model: this model, however, started to change after the advent of Flash memories: a level was added, thus reducing part of the big gap between slow HDDs and fast RAM memories. Such an insertion led to the current memory hierarchy. The point is now to imagine the future configuration of the memory hierarchy, using as clues the promises of the new technologies presented in the first part. Rather generically speaking, the new technologies promise to be:

- Faster than Flash: slower than RAM or, eventually, as fast as RAM. The speed would anyway be closer to that of RAM than to that of Flash: the order of magnitude is still in tens of ns.

- Denser than RAM.

- Persistent.

- Longer-lasting than Flash: the endurance is better than that of Flash, but worse than that of RAM⁴.

- Natively byte addressable.

- Suffering from both read/write asymmetries and cell wearing.

Trying to imagine a next-generation memory hierarchy, these memories would natively sit between RAM and SSDs. The promise about density is coherent with the pyramid logic: the hypothesis of a taller pyramid is thus legitimate. Before trying to sketch a next-generation memory hierarchy, some other considerations are needed:

- about the future of RAM memory, caches and registers;

- about the use of byte addressing on the memory bus or block addressing on other paths.

4 Phase Change technology, the one suffering most from cell wearing, has an endurance four orders of magnitude better (10⁹ cycles) than that of Flash (10⁵ cycles).

Referring to RAM, the Hybrid Memory Cube technology has been described as a viable enhancement of current DRAM technology: it is thus conceivable that RAM too shifts up in the hierarchy, getting closer to caches and registers. However, these technologies are all volatile. It could be theoretically feasible to build registers and caches with FeFETs, thus transforming them into at least semi-persistent memories (FeFET transistors have been demonstrated to remember their status only for some days) [48]; this approach represents however a scenario far in time, taken into account in literature only a very few times [136]. The models presented in the next paragraphs will thus continue to take for granted the volatility of the higher layers of the memory hierarchy.

Regarding the addressing technique, the new non-volatile memories fit beautifully the byte addressing schema, thus their use on the memory bus seems the most natural choice. Conversely, Flash memories integrate natively into the block addressing schema, at least as far as NAND Flash is concerned. Even if it is feasible to adapt a block-native memory to a byte addressing schema, it is nonetheless intricate [59]. The opposite is however simpler: byte-addressable memories can be adapted to be block addressed at the cost of some added hardware complexity⁵. The question about the use of NVMs either as slower RAM or as faster SSDs is thus legitimate. The former approach will be referred to as Storage Class Memory: non-volatile memories placed on the memory bus.

All that said, the next-generation memory hierarchy might appear as in figure 2.3.

5 This approach is the one used with Moneta [16] and explained in section 2.2.2.

2.1.4 Viable architectures

Figure 2.3: A new memory hierarchy

This representation depicts at a glance all the possibilities that engineers might have in the future to build real computer systems: not all of these layers are indispensable. For example, the smallest devices would have no RAM and no HMC, but only a few registers, a tiny cache, and a very low-power non-volatile memory on the memory bus (to be used both as RAM replacement and as storage); such a configuration could permit engineers to build battery-less devices with advanced memory capabilities [39]. Another implementation might use the new memories just to build a faster SSD, maintaining all the other components of a classic memory hierarchy. Yet another implementation would use both HMC memory and Storage Class Memory to offer a dual-mode, fast-volatile and slow-persistent, hybrid memory.

The next paragraphs will present the main models of usage of the new memories taken into account in literature. Researchers are studying and trying to model the effects of the coming of persistent memories both on the I/O side and on the memory bus side: the models will follow either the former or the latter approach. Each scenario has the potential to greatly improve current computing performance, but there are also important issues, as will soon be explained. In particular, even the as-is use of either:

- a hypothetical SSD featuring a speed close to that of RAM;

- a hypothetical persistent and dense DIMM on the memory bus;

on a system running an off-the-shelf operating system would prove problematic. It has to be stressed that operating systems are developed as well-balanced systems resting on certain assumptions made by designers and developers: one of the main assumptions made in modern operating systems is that the memory hierarchy is configured like the classical pyramid, with its consequences, i.e. the data model and the failure model. Using a metaphor, operating systems behave as a weight-lever-fulcrum mechanical system in an equilibrium state as long as the assumptions hold. Compliance with the assumptions assures that the fulcrum stays in the right point; non-compliance would result in a fulcrum shift and, consequently, in a loss of equilibrium. Still following the metaphor, the efforts made by researchers on NVM-aware operating systems are similar to the re-equilibration of a mechanical system that lost its equilibrium, obtained by modifying the weights placed on both sides of the lever.

Firstly, the easier approach will be analyzed, i.e. the use of the new non-volatile memories as bricks to build a very fast SSD. Afterwards, the reader will be introduced to the various Storage Class Memory approaches proposed either by developers or by researchers. Before delving into the specificities of each approach, it is useful to note the inversely proportional relation that arises (in this context) between the ease of setting up a test environment and the amount of effort required to adopt changes in software. Ironically, whereas a fast SSD is tougher to engineer, develop, prototype and test [64], it is much simpler (although surely not trivial) to conceive and model software changes in order to drive it conveniently. On the other side, as the reader will be able to verify afterwards, in a SCM context it is much easier to set up a test environment (for example, non-volatile memories can be emulated using just normal DRAM); despite this ease of testing, it is much more complex to develop a complete and efficient solution to the raised challenges.

2.2 Fast SSDs

This approach is the more conservative one, since the only change in the memory hierarchy would be the presence of a new I/O device, running faster than common SSDs. Such a solution would influence neither the standard data model nor the standard failure model, as the only anomaly would be its speed. This approach, moreover, would be in continuity with the path started with SSDs⁶. The availability of new solid-state memory technologies such as Phase Change RAM would permit manufacturers to build and sell SSDs featuring much higher speeds than those of current NAND Flash-based SSDs. The speed of PCM is about 100 ns in write mode, whereas in read mode it is about 12 ns. This is 50 times worse than DRAM when writing and 6 times worse when reading. Despite the speed decrease in comparison with DRAM, these memories would however be very fast with respect to common NAND Flash memories (∼100 µs, see table B.2).

6 This approach is convenient also as a learning tool: in the effort to build, with a new technology, a device otherwise quite common, the focus can be fixed on gaining an adequate know-how about the peculiarities of the new technology.

2.2.1 Preliminary design choices

Before delving into the operating system issues related to faster-than-Flash SSDs, it is worthwhile to linger on some hardware issues such as:

- the I/O bus;

- the SSD choice vs. an MTD-like choice.

These aspects and the related choices establish a sort of framework that operating systems must take into account, thus influencing their internal design.

The I/O bus

Since SSDs, like HDDs, use an I/O bus to transfer data, engineers will have to make some choices about the bus used in those products. The speed of these new technologies justifies the concern about whether the bus is able to sustain the performance of the SSD. Driver design will subsequently follow the choices made by engineers. Table B.4 shows some figures about the data transfer speed of some of the most important buses; the first two are I/O buses, whereas the last ones are memory buses (it can be observed that there is a gap of one to two orders of magnitude in data transfer speed between the two bus classes).

Current alternatives for the I/O bus are SATA and PCI Express. In order to evaluate the two alternatives, figures about hardware features are fundamental, but they are not the only factors that must be taken into account. Other factors influencing the choice are:

- protocol overhead: for example, 8b/10b encodings are much less efficient than 128b/130b ones⁷;

- potential for further technical improvements;

- scalability;

- ability to adapt to virtualization schemas;

- ability to adapt to multi-core and multi-processor requests;

- quality of the I/O stack and the potential to improve it: hardware features decide how device drivers work.

7 XXb/YYb, where XX is the payload and YY is the transfer size (XX ≤ YY).

A well-built bus will permit the development of good drivers, whereas a problematic bus will force developers to bypass problems in software, thus raising the software complexity. Speaking for a moment only from a hardware standpoint, SATA was conceived to be used with standard HDDs, as an improvement over standard PATA: this fact still influences its behavior. In comparison to HDDs (given a very low 1 ms access time in both read and write), SATA can theoretically execute a 4K transfer in about 6.83 µs, that is, 146 times faster than the hard disk can service the same amount of data (table B.5). Such a difference makes SATA appear as an infinitely fast channel to transport data to the HDD. The transition from a SATA HDD to a well-performing SATA SSD presents different proportions though: some SSDs [96] offer 550 MB/s sequential read speed and 520 MB/s sequential write speed, thus getting very close to the theoretical speed of 600 MB/s. Supposing that a 4K chunk of data arriving at the SSD could just be written in bulk in one write cycle (about 0.1 ms, or 100 µs), this time would be only 14-15 times slower than the transfer time: the proportion is very different from that of HDDs. These observations alone would justify regarding SATA technology as not suitable for SSDs faster than Flash. The figure of 0.9 times presented in table B.4 confirms the same hypothesis. In case SATA were used as the bus for a PCM SSD, it would perform well only in the case of 4K writes performed byte per byte; in every other condition, it would perform near its limit, whereas in the case of 4K reads performed in groups of 64 bytes each, SATA would behave as a bottleneck.
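To make the bus arithmetic explicit, the following small C snippet reproduces the 4K-over-SATA estimate used above; it assumes the theoretical 600 MB/s SATA III payload rate and the approximate device service times quoted in the text.

```c
#include <stdio.h>

int main(void) {
    const double transfer_bytes  = 4096.0;   /* one 4K request               */
    const double sata_bytes_per_s = 600e6;   /* SATA III theoretical payload */

    /* Time needed to move 4 KB over the bus. */
    double bus_time_s = transfer_bytes / sata_bytes_per_s;

    /* Device service times quoted in the text (approximate figures). */
    double hdd_access_s  = 1e-3;     /* ~1 ms HDD access                */
    double flash_write_s = 100e-6;   /* ~100 us NAND Flash page write   */

    printf("bus time for 4K:   %.2f us\n", bus_time_s * 1e6);
    printf("HDD / bus ratio:   %.0f x\n", hdd_access_s / bus_time_s);
    printf("Flash / bus ratio: %.1f x\n", flash_write_s / bus_time_s);
    return 0;
}
```

The output (about 6.83 µs, 146x and 14.6x) matches the proportions discussed above: the slower the device, the more the bus looks infinitely fast.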

These figures are deduced from bus theoretical limits, but sometimes implementations are slower, and this would aggravate the problem. Finally, since the new memory technologies are still young, there is a high margin of improvement of their performance: a bus used at its limit from the beginning would thus waste every technological improvement. Even if these observations are quite rough, they follow the path already undertaken by scientists, researchers, storage manufacturers and technicians, who reached the same conclusions: SATA is being abandoned, preferring instead PCI Express as the bus for fast SSDs [104, 91]. The motivations of this choice are rooted not only in current hardware features but also in the other factors cited previously: PCI Express has higher speed and lower overhead, is scalable, has an appealing road-map towards future improvements, is usable efficiently by virtualized environments, multi-core and multi-processor systems, and so on. Engineers made this choice focusing on current NAND Flash technologies: the scenario of faster memory technologies is taken into account, but still as being quite far in time. This fact underlines that they evaluated SATA to be obsolete even for NAND Flash. New SSDs using PCI Express have already been released on the market, and this trend is expected to increase steadily in the next years (examples are the Fusion-io SSDs, the Apple SSD in the MacBook Pro, and the Plextor M6e). It is thus likely that next-generation PCM SSDs will appear as PCI Express SSDs. Finally, giving a last glimpse at tables B.4 and B.5, PCI Express does not show the same proportion to SATA when compared with HDDs: it could thus also end up acting as a bottleneck; this limit is however shifted forward in time thanks to its improvement potential, which is much better than that of SATA (for example, PCI Express generation 4 should appear this year).

SSD vs MTD-like choice

Similarly to what happened with Flash memories, a choice must be made about whether the internals of the memories are hidden or not from the other parts of a computing system and, ultimately, from the operating system. The same issue arose with Flash memories: used inside SSDs, all internals are completely hidden from the system, and they are employed just as common hard disks; on the other hand, Flash memories can also be connected directly to the computing system without any intermediation. In the former case, all issues related to Flash technology (wear leveling, cell erase before re-write, error checking) are managed by a controller acting as an interface between the bus and the Flash chips: this controller implements a Flash Translation Layer (FTL), in order to present to the system just a block device. In the latter case, those issues must be managed by the operating system, which must thus take charge of implementing in software all the functions of a Flash Translation Layer, as happens in Linux with MTD devices.

PCM technology actually has a better endurance than Flash. Moreover, PCM cells do not need to be erased before re-writes. Newer technologies will offer even higher endurance. These observations should permit the building of simpler and thus faster translation layers. However, the increased speed also requires the translation layer itself to be extremely fast, in order not to affect memory performance. The start-gap wear leveling technique is one of the suggested approaches to be used in these translation layers [37, 118] (a simplified sketch follows below). The need for an extremely fast translation layer suggests the path of a hardware implementation, thus supporting the choice of the SSD approach; this option would also permit to conveniently hide the read/write time asymmetry. The examples presented hereafter use this same approach: the MTD-like one does not seem to be currently investigated.
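To give an idea of how light such a translation layer can be, the following C sketch implements a start-gap-style rotation: a simplified illustration inspired by [118], not the exact scheme proposed there. One spare line and two registers (Start and Gap) provide the full logical-to-physical remapping without any mapping table; the gap is moved by one position every GAP_INTERVAL writes, so that writes slowly rotate over all physical lines.

```c
#include <stdint.h>
#include <string.h>

#define N_LINES      1024     /* logical lines exposed to the host        */
#define LINE_SIZE    64       /* bytes per line                           */
#define GAP_INTERVAL 100      /* move the gap once every 100 writes       */

static uint8_t  lines[N_LINES + 1][LINE_SIZE]; /* one spare physical line */
static uint32_t start_reg   = 0;               /* rotation offset         */
static uint32_t gap_reg     = N_LINES;         /* index of the unused line*/
static uint32_t write_count = 0;

/* Logical-to-physical mapping: rotate by Start, then skip the gap line. */
static uint32_t map(uint32_t logical)
{
    uint32_t p = (logical + start_reg) % N_LINES;
    return (p < gap_reg) ? p : p + 1;
}

/* Move the gap one position backwards; after a full revolution the
   rotation offset advances by one, so hot logical lines keep migrating. */
static void gap_move(void)
{
    uint32_t src = (gap_reg == 0) ? N_LINES : gap_reg - 1;
    memcpy(lines[gap_reg], lines[src], LINE_SIZE);
    gap_reg = src;
    if (gap_reg == N_LINES)                  /* wrapped around            */
        start_reg = (start_reg + 1) % N_LINES;
}

void nvm_write(uint32_t logical, const uint8_t *buf)
{
    memcpy(lines[map(logical)], buf, LINE_SIZE);
    if (++write_count % GAP_INTERVAL == 0)
        gap_move();
}

void nvm_read(uint32_t logical, uint8_t *buf)
{
    memcpy(buf, lines[map(logical)], LINE_SIZE);
}
```

The point of the sketch is that the per-access cost is a couple of arithmetic operations and, occasionally, one extra line copy: cheap enough to be placed in front of a memory with ~100 ns writes, and trivially implementable in hardware.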

2.2.2 Impact of the software I/O stack

Researchers and students of the Non-Volatile Systems Laboratory of the UCSD University have conducted a series of thorough and interesting studies about the consequences of fast SSDs on operating systems since at least 2010. In particular, their observations about operating systems were the offspring of the experience gained while developing two prototypes of fast PCIe-based SSDs, and of the efforts made to exploit their performance as much as possible. Both prototypes were built upon FPGA architectures: the first one used common DRAM to emulate the behavior of new-generation non-volatile memories [16], whereas the second one effectively used Phase Change memories [1]. The researchers claimed performances of 38 µs for a 4KB random read and 179 µs for a 4KB random write: these figures are in line with those of table B.4, given the fact that software time is included (here read and write are meant as complete operating system operations). Their experience is valuable: much of the considerations made about changes in operating system design to exploit fast SSDs come from their work. Two articles written by UCSD scholars in particular [17, 129] describe respectively the initial efforts (the former) and the final conclusions of their work (the latter).

The first article provides an accessible description of the various scenarios that they tested to evaluate the performance of new memory technologies such as PCM and STT-RAM. The description of the testing environment used to model the behavior of PCM and STT-RAM (at that time still not available on the market) is indeed very interesting: they used common DRAM along with a programmable memory controller to introduce latencies compatible with those of PCM and STT-RAM. The solution adopted is remarkable, since a programmable memory controller makes it possible to measure (as they did in their study) how performance is affected in the presence of read/write latency asymmetries and when those latencies increase. They evaluated and measured the performance and latencies of:

- a standard RAID solution;

- a state-of-the-art PCI Express Flash SSD;

- an NVM-emulated PCI Express SSD (this in particular became the basis for their Moneta and Onyx projects);

- a DRAM portion used as a ramdisk to emulate future NVM Storage Class Memory (this is rather an introductory approach to SCM, since it takes into account neither persistence nor safety; however, the first focus of this model is to describe the performance problems concerning the software I/O stack, not a complete discussion about SCMs).

The most important achievement of their work is the evidence of the huge impact on latency and throughput to be accounted to the software I/O stack: as the speed of the memory device rises, this impact rises. They underline that, using the ramdisk environment, the cost of the system calls, the file system and the operating system is steep: it prevents the ramdisk from utilizing more than 12% of the bandwidth that the DDR3 memory bus can deliver. When looking at the FPGA solution, they verified that the I/O software stack was responsible for an important performance drop; in particular, they verified how the file system was responsible for an important latency increase (about 6 µs per access). They also observed how the file system internal design influences throughput: they verified that the ext3 file system was responsible for a 74% reduction in bandwidth, whereas this impact was much lower when using XFS. These observations are discussed thoroughly in [129], and many other papers in literature use similar observations to analyze the software I/O stack. In particular, besides the description of the developed solutions, the two following charts, which easily summarize the increasing impact of software as device speed increases, are included.

Chart 2.1: I/O software stack impact on latency (percent)

  Hard drive   0.3
  SATA SSD    19.3
  PCIe-Flash  21.9
  PCIe-PCM    70.0
  DDR         94.1

The cost of software in latency jumps from around 20% in the case of a SATA or PCI Express Flash-based SSD, up to 70% for a PCI Express PCM-based SSD. In the case of a SCM, the cost would account for an impressive 94%.

Chart 2.2: I/O software stack impact on energy

  Hard drive   0.4
  SATA SSD    96.9
  PCIe-Flash  75.3
  PCIe-PCM    87.7
  DDR         98.8

Causes

Talking about the causes of such inefficiencies, as already stressed previously, the common assumption that has always been taken for granted by every operating system developer is that I/O devices are slow. This simple assumption induced developers to:

- focus on offering functionality in software to alleviate hardware deficiencies (this is the case, as already cited, of the page cache, the buffer cache, LVM, and so on). Such functionality, given a slow device, has a minimal cost in latency and bandwidth;

- not bother too much about the efficiency of the software layer, because software would account only for a minimal part of the I/O: since devices are slow, efforts to develop efficient software would result only in a small improvement; such improvements would thus have a highly unfavorable cost/benefit ratio. Instead, efforts on safety, correctness and security were better rewarded.

Unfortunately, these assumptions are the philosophical roots that cause the I/O software stack to perform so badly when the device gets faster. Indeed, the two charts just shown suggest that even with common SATA Flash SSDs software becomes a problem. Following these observations, researchers and developers agreed on the need to identify which parts of the I/O stack are mostly responsible for latency and energy cost. This analysis is the first step toward making decisions about the best strategy to enhance the I/O stack behavior. Researchers firstly made some of the conceptual observations that will now be presented.

Off-the-shelf I/O stacks are developed as modular stacks, usually employing a generic block driver that works in conjunction with a device-specific driver. This design permits to virtualize and standardize the access of programs and kernel to specific devices. However, most of the time it is the kernel that is responsible for all the aspects regarding both storage access and storage management. The kernel does not only set the access policy (space allocation, permission management), but it is also responsible for the policy enforcement. This extensive use inside the kernel of both policy setting and policy enforcement can lead to inefficiencies. The second general observation is bound to the generality of the I/O stack: whereas its design allows a great flexibility, its generality does not permit the implementation of all the optimizations that could improve it the most, thus sacrificing some opportunities. While, if the devices are slow, these opportunities are not so significant, they become valuable in the case of fast devices. These observations alone could suggest some improvement areas, such as the need for specific (not generic) I/O stacks devoted to fast memory devices, and the necessity to avoid, where possible, the intervention of the kernel in the management of I/O accesses. Investigating further these observations, researchers have identified the following hot areas:

- the I/O request schema can hide bottlenecks: I/O requests in Linux go through an I/O scheduler, which collects them and issues them at the proper time. This approach permits flexibility but adds latency (about 2 µs);

- interrupt management is expensive, especially when requests are small. Interrupts are intrinsically complex and expensive procedures (they add at least 6 µs of latency): in the case of a small request, the time between its issue and its servicing can be shorter than the time between a sleep and a wakeup. Moreover, a fast device using interrupts would issue them frequently: the higher the demand for small I/O operations, the more time is lost just for interrupt management. Finally, it is necessary to underline that, usually, interrupts must be managed. The presence of many interrupts due to a fast device can ironically sacrifice system responsiveness precisely because of the device speed;

- the file system is one of the causes of added latency for each I/O request (about 5 µs);

- the cost of entering and exiting the kernel is high (in the case of small requests, about 18% of the total cost).

Solutions

The insights just explained inspired the strategy, subsequently referred to as RRR, pursued by the researchers from UCSD while trying to optimize the Linux I/O stack to use their prototypes of PCI Express SSDs efficiently:

Reduce: eliminate redundant or useless features, and avoid those parts of code that perform badly with fast memories. As examples, they avoided the use of the standard Linux I/O scheduler, preferring direct requests; another solution they adopted is to prefer spinning (polling) over interrupts in the case of small requests [146] (see the sketch after this list).

Refactor: restructure the I/O stack in such a way that efforts are distributed among the actors (applications, operating system and hardware). For instance, separate policy management from policy enforcement, preferably assigning the former to the operating system and the latter to the hardware, where possible. As examples, in Moneta the following refactoring tasks have been implemented: development of a user-space driver to avoid entering and exiting the kernel, a virtualized hardware interface to permit each application to issue requests directly, and hardware permission checks in order to relieve the kernel from policy enforcement.

Recycle: reuse the parts of software already created, where feasible. For example, reuse some of the functionality offered by file system tools and by file systems themselves.
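As an illustration of the "spin instead of interrupt" idea, the sketch below polls a hypothetical memory-mapped completion flag for small requests instead of sleeping until an interrupt arrives; the register names and layout are invented for the example and are not taken from Moneta or any real device.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memory-mapped device registers (illustrative layout only). */
struct fast_ssd_regs {
    volatile uint64_t cmd;       /* command doorbell                    */
    volatile uint64_t status;    /* bit 0 set when the request is done  */
};

#define STATUS_DONE 0x1ULL

/* For a request expected to complete within a few microseconds it can be
   cheaper to busy-wait (spin) than to pay a sleep + interrupt + wakeup,
   which by itself costs on the order of 6 us. */
static bool wait_small_request(struct fast_ssd_regs *dev, uint64_t spin_limit)
{
    for (uint64_t i = 0; i < spin_limit; i++) {
        if (dev->status & STATUS_DONE)
            return true;          /* completed while spinning            */
    }
    return false;                 /* give up: fall back to interrupts    */
}
```

The trade-off is purely quantitative: spinning wastes CPU cycles proportional to the device latency, so it only pays off when that latency is comparable to, or smaller than, the cost of the interrupt path itself.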

The NVM Express group⁸ is currently pursuing a similar approach [91]: their goal is to develop a new standard host controller interface to be used on top of PCI Express, conceived to be adopted by PCI Express SSDs. While the PCI Express specification sets the standards for the lower layers of communication between the CPU and a compliant device, the NVM Express specification sits at a higher level of abstraction that compliant drivers must follow. Their focus is the development of an I/O stack able to support and exploit fast SSDs, thus deriving the maximum benefit from the PCI Express bus. Currently Windows Server, Linux and VMware already offer NVM Express drivers. The NVM Express documentation reports the latency and performance gains obtained using their I/O stack instead of a standard one: although the documentation never refers directly to the Reduce-Refactor-Recycle approach, their work seems to be following the same approach, somehow certifying its effectiveness.

8 www.nvmexpress.org.

2.3 Storage Class Memory: operating systems

The use of non-volatile memories directly attached to the memory bus represents both a big opportunity for next-generation computing systems and a tough challenge: persistence on the memory bus, along with a promised density better than that of DRAM, are the opportunities. Persistence in memory represents an opportunity because I/O operations, even if much faster than in the past (as happens in the case of NVM Express SSDs or of the Onyx prototype), are intrinsically slower than memory accesses. Persistence-related operations issued at memory speed would then permit an extremely fast storage (and retrieval) of data. Density too represents an opportunity: more density would permit both to lower the cost of storage in memory and to manage at high speed a bigger amount of data. The challenges, instead, are due principally to the side effects of persistence in memory: a thorough exploitation in the operating system is difficult, as it requires a complex re-design of some of its major parts. Other issues exist though. One of them is heterogeneity: in case SCM were placed along with common DRAM on the same memory bus, the operating system would have to decide how to use both memories at best. A design with SCM only would be architecturally simpler. Other challenges are the need for wear leveling along with the need to cope with r/w asymmetries.

2.3.1 Preliminary observations

An analysis and a classification of the proposed approaches to the use of SCM follows in the next paragraphs. However, before presenting each specific proposal, it is worthwhile to focus on some aspects shared by all the approaches subsequently described.

Wear leveling and r/w asymmetry

While the most important issues, namely persistence and heterogeneity, are managed differently in each specific approach, the issues that arise from cell wearing and r/w asymmetry are instead common whatever the approach.

About wear leveling: it is reasonable to forecast the intervention of hardware engineers on memory controllers [23]: as the new technologies have different memory switching mechanics, timings and electrical needs, it is likely that CPUs will need new memory controllers to drive them. As the need for new memory controllers already exists, it would be easier and cheaper than usual to add more functionality in hardware. As an example, a fast wear leveling schema such as the start-gap one already cited could be implemented in hardware. Other schemas are also proposed in [144, 46, 148, 20]. Further wear-leveling needs could then be covered with the support of software. Most of the articles about SCM do not deal with this issue, as most of the time it is assumed to be managed by hardware.

About the r/w asymmetry: a mitigation in hardware is feasible with the utilization of either a classic SRAM cache or even a cache built with FeFETs: either way the engineering must be careful, since a bad implementation would affect persistence. Anyway, this issue is often completely ignored in literature. As the r/w asymmetry is in any case a feature of the new memory technologies, I assume that r/w asymmetries are taken as exposed to the software. This approach is perhaps preferable: keeping sight of the whole panorama of NVMs, memristive memory devices promise to offer a r/w asymmetry much smaller than that of PCM, thus drastically mitigating this issue. This aspect would anyway be easy to test and analyze using DRAM along with a memory emulator to get projections about changes in latency and bandwidth upon changes in timing values [17].

From now on, both these issues will be ignored, as if they were managed by hardware or, effectively, ignored.

Background literature

It is now the right time to observe that, in the background of most of the solutions presented hereafter, stand some research efforts made in past years that influenced subsequent works more than others. These studies, focused on file systems, persistence and caching, were all carried out during the 90s: from a general computer science viewpoint they were not only interesting, but they also anticipated some issues that are pivotal in the SCM context. Among the articles most cited in literature, the following are worth mentioning:

The Rio File Cache: Surviving Operating System Crashes [19]: this article was written with the intent of describing a computer system that uses a RAM I/O file cache (Rio) with the aim to make ordinary main memory safe for persistent storage by enabling memory to survive operating system crashes [...], with reliability equivalent to a write-through file cache, where every write is instantly safe, and performance equivalent to a pure write-back cache. The intent of the authors was to execute every I/O operation directly on the in-memory cache facility, using the classical storage just as a backup facility. They proposed to protect the file cache with extensive memory protection and sandboxing in order to avoid its corruption upon operating system crashes, and to allow only warm reboots in order to avoid memory leaks when the system is re-booted. The article claims that the proposed approach would prove even safer against crashes than classical I/O to devices like HDDs and SSDs (their solution reached a probability of just 0.6% that a crash corrupts the file system). Interesting figures about the incidence of crashes on file system corruption are supplied, supporting the common sense according to which disks are more reliable than memory: their figures however show that the increase in corruption incidence in a memory without protection is only slightly greater than that of common HDDs (just a 1.5% probability instead of the 1.1% of a write-through cache). This article is evidence of the fact that, as early as the 90s, researchers set their sight on storage kept entirely in memory: this topic is today a primary need in big data centers and in large database systems. Moreover, this article contains ideas similar to those used in the Whole System Persistence approach (see section 2.3.3). The trials and tests made to measure system crash effects are still usable today to design a correct mechanism to use persistence in main memory, such as those proposed in the next paragraphs regarding file systems and applications (see respectively sections 2.3.4 and 2.4).

File System Design for an NFS File Server Appliance [35]: this article is a technical report issued by NetApp, and it explains the design choices made in their WAFL file system (Write Anywhere File Layout), used in their storage appliances. WAFL used shadow paging extensively to obtain data consistency and fault tolerance, while offering a robust snapshot facility to easily and efficiently manage backups. The ideas neatly described in this article are used extensively as a reference in many other articles about file system design, and they certainly anticipated the times, since newer file systems such as ZFS [111] or BTRFS [121] use approaches similar to those firstly developed for WAFL. Finally, NetApp's appliances running WAFL used a non-volatile (battery-backed) RAM to keep the operation log immediately available after crashes: this solution somehow implicitly uses the non-volatile memory paradigm.

The Design and Implementation of a Log-Structured File System [122]: this article proposed a new pattern to use the blocks of a file system as a continuous and cycling log. Following this pattern, every write operation triggers new block writes into free space, thus filling the free spaces of the disk, as happens in circular buffers. Besides the fact that this approach requires a garbage collection layer, the article proposed a genuinely new approach to file system design. It was inspired by the copy-on-write approach, and it permitted to avoid the need of dual writes for consistency reasons (the first one to the journal and the second to the effective data block). Finally, this approach implicitly enforces a wear leveling strategy: by writing different blocks each time, writes are distributed around the disk, and the endurance of memory cells is consequently raised. This design has thus inspired those of many Flash translation layers and many Flash file systems [85]. This approach could be valuable also in the context of SCMs, where the issue of cell wearing must be taken into account.

To a lesser extent, other articles from the 90s that anticipated the topics of non-volatility are [141] and [9]: the former describes the architecture of a computer that used a persistent memory (Flash + SRAM) on the memory bus, and the latter uses a non-volatile DRAM either as a cache or to speed up recovery times.

Choosing the hardware model

Recalling the observations made previously about the hardware model, the data model and the failure model, it is worthwhile to present here the two hardware models that will subsequently be considered. Each of the proposals presented afterwards necessarily uses one of them. Storage Class Memory can be used only in two configurations: either alone, as a replacement of DRAM, or in tandem with it. The first option, i.e. DRAM replacement, represents a simpler alternative, as it avoids the need to manage heterogeneity. However, there are also some problematic aspects: firstly, as persistent memories would be slower than DRAM, such a use would sacrifice performance. Moreover, this option would force operating system designers to necessarily manage the issues related to persistence: all the available memory would be persistent. The second option is more complex: standard DRAM would share the whole set of physical addresses with SCM. Therefore, a portion of the addresses would be volatile, while the remaining one would be persistent. Despite the need to manage memory heterogeneity, this approach is the preferred one in most implementations. This configuration, besides its complexities, permits mixing both technologies (DRAM and NVRAM) to achieve the best compromise between performance and storage needs. Moreover, it gives developers the faculty to decide when and to what extent they want to use persistent memories. As will soon be apparent, the hardware model, the data model and the failure model are tightly related: some data and failure models are achievable only on a given hardware configuration.

2.3.2 No changes to the operating system

A first and tempting approach to the use of SCMs could be the easiest one: to use SCM just as common DRAM under a standard operating system. This would correspond to the choice of maintaining the same data model and the same failure model currently used in operating systems. This approach would permit a standard operating system to benefit immediately from the density increase that SCM would offer: whatever the chosen hardware configuration, the SCM would anyway be used as a standard, volatile memory.

This solution would however be problematic, since a part or all of the memory (depending on the hardware configuration) would become persistent. Off-the-shelf operating systems use DRAM with the expectation of its volatility: the operating system enforces safety and security of the data inside it just as long as power is on, without the need to take care of it when power goes down, as it is taken for granted that it is erased. It has been shown in [32] that, even in the case of standard DRAM, data is not lost immediately, but gradually, within 1 to 5 minutes: this fact alone can be a source of security concerns. Even more so, if non-volatile memories were used with a standard off-the-shelf operating system, data security could be bypassed even more easily: every piece of data ever stored in it would persist as long as it is not re-written, and reading it would be much easier. As temporary data can hold passwords, encryption keys, sensitive data, and an infinite pattern of mixed information, it would be a terribly bad practice to expose all this data to potentially unauthorized accesses. To mitigate these problems, a change in the hardware configuration or in the data and failure model would be required: just with the intent of maintaining the same levels of safety and security currently enforced by standard operating systems, the SCM should be encrypted, either in hardware [26, 3] or in software [117]. Encrypting the memory with random keys set at each system reboot would emulate the volatility of the memory, thus permitting a safe use of persistent memories as common DRAM. This strategy could even increase the resistance of standard DRAM to security attacks such as those presented in [32].

Another criticality of this way of managing SCM arises from performance: since SCM performance is worse than that of DRAM, a normal operating system using SCM as DRAM would suffer from reduced performance, unless the density increase were so necessary as to compensate the performance loss. Moreover, in a hybrid hardware architecture, the operating system would use the whole memory as a single kind of memory, whereas the features of SCM and DRAM would be very different from each other: the timings of the operating system would show a problematic variability and unpredictability. Finally, this first approach is not studied in depth in literature, since it is practically a no-use strategy for persistence: the opportunities offered by persistence are simply ignored, whereas some problems arise and must be dealt with. This approach is however useful as a cognitive tool to gain the awareness that persistence is just a property of certain memories: it must be managed in order to benefit from it.

2.3.3 Whole System Persistence

This approach, proposed by researchers from Microsoft Research in [56], is mainly focused on the reality of large database systems and of large data centers, even if it could also be used in standard computers. The chosen hardware model uses only persistent memory. Since current SCM technologies are not mature enough to be used as a DRAM replacement, in this approach persistence is achieved using NVDIMMs⁹, currently available on the market. However, even if not originally conceived to be used with SCM, this approach is nonetheless perfectly suited to it. Here, persistence is exploited to achieve practically zero impact from power failures (failure model). The goal of this approach is to transform a commonly critical scenario such as power interruptions into a suspend/resume cycle: if achieved, this change would lead to systems completely resilient to power outages. The originating observations of this approach are:

- a consolidated trend in large databases is the storage of the entire dataset in main memory. The cloud paradigm further urged the development of caching servers keeping large datasets entirely in memory (see section 2.1.2);

- DRAM in servers can reach big sizes, currently around 6TB [69]: according to this figure, clusters of servers can manage tens or even hundreds of terabytes of memory each;

- power outages are expensive, especially in large environments. The cost of resuming a system increases in complex environments because, when recovering large datasets, many I/O requests are placed on storage back-ends, which are typically slow. This stress suffered by storage back-ends can itself be the cause of other critical events, such as that experienced at Facebook in 2010 [89];

- as the quantity of memory in servers rises, the cost of recovery from back-ends rises, since the amount of data necessary to re-build the entire state rises too.

These observations have inspired the search for a mechanism that relieves machines from the need of re-building the entire in-memory dataset when power outages happen. The key ideas of Whole System Persistence are to:

- retain all the content present in memory thanks to persistence;

- save the entire state of a server (registers, caches) into persistent memory when a power fail event is detected (flush-on-fail strategy);

- modify the hardware in order to provide a residual energy window long enough to permit saving the state into memory, by adding capacitance to the PSU using supercapacitors;

- restore the previously saved state into registers and caches automatically, as soon as power is restored.

These steps should appear to the operating system as transparent as possible, emulating just a suspend/resume cycle. Actually, some subtleties about the saved state and the real state of devices after the power cycle point out that the resume/restart process needs some further adjustment in order to be completely transparent. Even if some minor changes to the operating system may be needed, this approach has been successfully tested and shown in [110]. This proposal will certainly be of great advantage in large data centers: the focus is exactly on some of the major criticalities experienced in that context. However, some remarks must be made about it:

- although its use in everyday computing would allow the use of a computer immediately after switch-on (after the switch-off the system has managed a power fail event), the same security problems noticed previously would arise (see section 2.3.2);

- there is no specific data consistency management: it should therefore be entrusted to the operating system. This fact however suggests that, since current operating systems are designed to reload at each reboot, there must be some mechanism to detect degraded system functionality and force a system reboot;

- this approach uses persistence only as a temporary storage, not as a long-term memory. The rest of the system does not even have the perception of using persistence. This strategy is certainly simple, but some of the opportunities of persistence are not used.

9 Non-volatile DIMMs. NVDIMMs are standard DRAM DIMMs that use supercapacitors and NAND Flash modules to behave persistently.

2.3.4 Persistence awareness in the operating system

A step further toward the exploitation of SCMs is the intervention of the operating system: differently from what happens in the approaches shown before, in the ones presented hereafter the operating system is aware of the presence of some persistent memory device connected to the standard memory bus along with DRAM. The hardware model considered here is thus hybrid. The presence of SCM can be notified by the firmware at boot time, as normally happens for other hardware features in the BIOS, and the physical memory address space will then be divided into a DRAM portion and a SCM portion.

Almost all of the approaches presented here use persistent memory to store data into files, as happens today with common hard disks or SSDs: through a file system. This continuity with the well-known paradigm of file systems has the advantage of preserving software compatibility, being thus a viable and acceptable approach to persistence awareness. Anyway, while still keeping the persistence awareness only at the operating system level, file systems are not the only way to exploit it. As described at the end of this topic, some proposals conceive a scenario both more complex and more thorough. For the moment, however, my focus is on file systems. The typical services offered by file systems are:

- a high-level interface for applications to use storage;

- an interposing layer between applications and devices, whose job is arranging data to be safely stored and later safely retrieved;

- security enforcement and concurrency management on the contained data;

- a low-level interface to device drivers.

Despite all the features offered by persistent memories, the need for these services (at least the first three) is unchanged: the features of the new non-volatile memories do not influence the need for file system services, at least as long as the file/folder paradigm is used extensively as it is today. Instead, what may be subject to revision are the mechanisms used to offer these services: file systems are software devices that have always been developed keeping in mind the technical details of the underlying storage media (see section A.3); as persistent memories are so different from the other ones currently in use, it is thus advisable to re-design and adapt operating systems to their features.

Brief analysis of the current I/O path

In Linux, applications that need to operate on files stored on I/O devices follow a sequence of common steps involving many kernel layers, illustrated in figure 2.4.

Figure 2.4: The Linux I/O path. © 2014 Oikawa

This sequence is triggered by applications, it arrives (when needed) at the I/O devices and eventually returns to the applications themselves. Applications use file system services through filesystem-related system calls, such as open(), read(), write(), close(), and so on. All these system calls use in turn the services of a kernel layer called the Virtual File System (VFS). Its role is to hide from applications all the implementation details of each specific file system, thus exposing to them just a standard and well-documented interface. The VFS manages all the software objects required to use files, folders and links, and issues all the specific requests directly to each specific file system. In turn, each specific file system interacts with the page cache to check whether the data is cached in memory (see also section 2.1.2); in case it is not, the kernel issues the needed I/O request to the block device driver layer. This layer in turn issues the request to the right device driver. The device driver then executes the task along with the driven device. This sequence, even if rather roughly sketched, describes the impressive amount of operations carried out by the kernel (other fundamental tasks, such as security checking, have not even been considered).
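As a concrete reminder of the path just described, the following minimal C program (an illustrative example, not taken from any cited work) issues the system calls that traverse the VFS, the specific file system, the page cache and, if needed, the block layer.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[4096];

    /* open() resolves the path through the VFS and the specific file system. */
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* read() is served from the page cache when possible; otherwise the
       kernel issues a request to the block layer and the device driver. */
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(fd);
    return 0;
}
```

Each of these few lines of user code triggers, inside the kernel, the whole sequence of layers shown in figure 2.4.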

Caches anticipated the persistence shift to main memory

Following these steps, it is apparent how common it is to have file system data in memory, though used as cache; this cache can reasonably be seen as a (temporary, as it is volatile) storage location, whose backing store are the I/O devices. Moreover, these cache locations are used as the fundamental building block of the mmap() system call: with this system call, an application can map into its own virtual memory address range data stored in files of some file system. This mechanism is permitted by the page cache: if the data is not present, it is firstly retrieved from the I/O device; once the page cache has retrieved it, the memory pages containing that data can be mapped into the application's virtual address space. The mmap() system call can thus be seen as a sort of byte addressability on persistent data used in memory.

Developers, in order to exploit the byte addressability of NOR Flash memories, added during the 00s new functionality to the mmap() system call: XIP (eXecute-In-Place). These changes permit mmap(), in cooperation with a XIP-enabled file system and a XIP-enabled device driver, to connect the virtual address space of an application directly to the byte-addressable Flash chip, without using the page cache as middle ground. This feature was added to permit lowering the size of DRAM in portable devices [11], and to decrease their boot-up time [14]. XIP, in effect, allows processes to directly address data existing in persistent storage as if that data were in main memory.
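The following C fragment (a generic illustration, not tied to any XIP-enabled driver; the file name is an example) shows the mmap() usage pattern discussed above: once the file is mapped, the application reads and writes its content through plain pointers, i.e. with load/store semantics.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }

    /* Map the whole file: on a classic system the pages come from the page
       cache; with XIP (or a persistent-memory file system) they could point
       directly at the byte-addressable storage medium. */
    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Byte-granularity access: no read()/write() system calls involved. */
    if (st.st_size >= 5)
        memcpy(p, "hello", 5);

    /* Ask the kernel to write dirty pages back to the backing store. */
    msync(p, st.st_size, MS_SYNC);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```

With persistent memory on the bus, the interesting question becomes whether the msync() step (and the copy into the page cache it implies) is still needed at all.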

Consistency considerations As stated before, caches are used principally to raise the performances of devices that otherwise behave poorly. I/O performances are raised because caching moves I/O operations from the devices to the memory: here operations are faster, band-

78

width is high, latency is minimal. However, this mechanism has a further cost in reliability

10

. Data into caches is always up to date, but data into devices becomes

up to date at a slower rate: since transfers costs are high, and particularly in the case of small transfers, few bulk transfers are preferred against many little ones (write-back caching strategy). This raises throughput and lowers transfer costs. This need forces operating system to wait (usually for 30s) to transfer back to I/O devices those pages of cache that are marked as dirty. The idea behind this behavior is to raise performances at the cost of some uncertainty. Continuing to discuss about I/O transfers, there are some other further uncertainties.

Firstly,

data transferred to I/O devices is unsafe until when writes are eectively executed; secondly, often modications to data stored on I/O devices require more than just one write operation: if a modify operation is not completely executed in all its steps, there is the risk of le system corruption. So, just to summarize this topic, caches are useful, but they raise the risk of data loss; also, write operations are risky while in progress, as well as logically-connected multiple operations are, until they are completely executed. These issues have been studied extensively throughout the past decades: correct survival of data is strategic in both generic le systems and, even more, in database systems.

It is actually from the database world that the acronym ACID (Atomicity, Consistency, Isolation, Durability) comes, as well as the concept of transaction [31]. In order to tackle these issues, over the years many approaches have been proposed and successfully used both in databases and in file systems: the best known among them are transaction logging and copy-on-write techniques. Transaction logging is extensively used in journaling file systems (for example, ext3 and ext4), whereas copy-on-write is used in file systems based on shadow paging and in those that adopt the log-structured design. As a further remark, all of these techniques fit a fault model that tries to nullify the adverse effects of power outages; they do not take into account software faults, bugs and crashes, for the same assumptions explained before for RIO.
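To give a flavor of the logging idea, here is a heavily simplified write-ahead sketch; the record layout and the function are invented for the example and do not reproduce any specific journaling implementation.

#define _XOPEN_SOURCE 700
#include <stdint.h>
#include <unistd.h>

/* Illustrative log record: where the update goes and what to write. */
struct wal_record {
    uint64_t target_offset;
    uint64_t value;
};

/* Heavily simplified write-ahead logging: the record becomes durable in
 * the log before the data file is touched, so a crash between the two
 * steps can be recovered by replaying (or discarding) the log. */
int logged_update(int log_fd, int data_fd, uint64_t off, uint64_t value)
{
    struct wal_record rec = { .target_offset = off, .value = value };

    if (write(log_fd, &rec, sizeof(rec)) != sizeof(rec)) return -1;
    if (fsync(log_fd) < 0) return -1;           /* log is durable first */

    if (pwrite(data_fd, &value, sizeof(value), (off_t)off) != sizeof(value))
        return -1;
    return fsync(data_fd);                      /* then the data itself */
}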

Approaching design

Building on the observations just made, and still speaking in rather general terms, the design of a file system conceived specifically for SCM should at least take into account the following major differences between SCM and I/O devices:

- the access semantics is completely different (simple load/store versus complex I/O read/write);
- the access granularity is different (a byte or a 64-byte cache line versus blocks);
- the cost of accessing the persistent medium changes (low with SCM, high with I/O devices). The motivation that induced the widespread use of caches loses its relevance: cache worthiness could be reconsidered;
- the execution delay window is different (short with SCM, long with I/O devices) but still present. This fact can simplify the design of a file system resilient to power outages, but the risk of data loss must still be a concern. As an example, journaling can lose its appeal compared to logging and shadow paging, which seem to be patterns better suited to memory storage;
- ACID loses its last letter, Durability, as it is implicitly achieved through persistence. Atomicity, Consistency and Isolation are still needed to guarantee a long data lifespan.

Moreover, some new issues arise. Among them, at least the following should be considered:

- memory protection against operating system crashes and programming errors should become a primary goal: memory is less safe than I/O. The experience on this topic documented in [19] can be used as a reference;
- some subtle issues related to how atomicity is achieved arise (atomicity is fundamental in shadow paging, for example). Memory operations (load/store) effectively reach memory only after traversing the CPU caches in a write-back manner. Cache lines tagged as dirty must eventually be written back to memory, but this operation can be subject to reordering: this behavior optimizes stores and raises performance, but becomes a problem if the programmer relies on an exact order in which certain stores reach memory (as is the case of atomic writes, where an exact order is needed). While cache coherency keeps cache contents consistent among processors and cores, and the ordering of stores can be constrained with memory fences (for example, the mfence instruction of the x86-64 instruction set), there are currently no guarantees about the order in which cache lines are written back from cache to memory: such guarantees can only be obtained by explicitly flushing cache lines in tandem with fences, and this practice noticeably lowers performance (a minimal sketch of this flush-and-fence pattern follows the list).
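A minimal sketch of the flush-and-fence pattern just mentioned, using the SSE2 intrinsics available on x86; the record layout and the ordering requirement (payload before validity flag) are illustrative assumptions. The same pattern recurs, with different primitives, in several of the proposals reviewed below.

#include <emmintrin.h>  /* _mm_clflush, _mm_mfence (SSE2) */
#include <stdint.h>

/* Illustrative persistent record: the 'valid' flag must reach the
 * persistent medium only after the payload it refers to. */
struct record {
    uint64_t payload;
    uint64_t valid;
};

static void persist(const void *addr)
{
    _mm_clflush(addr);  /* evict the cache line holding addr */
    _mm_mfence();       /* wait until the flush is ordered globally */
}

void publish(struct record *r, uint64_t value)
{
    r->payload = value;
    persist(&r->payload);   /* force the payload out of the caches first */

    r->valid = 1;
    persist(&r->valid);     /* only then make the record visible */
}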

Personal hypotheses

Here I would like to present some ideas that came up while imagining a system that uses persistent memories along with a file system. The first one would probably lead to a bad design, whereas the second one is only sketched, even though it might prove quite interesting. I feel nonetheless that they are quite useful to underline the differences that emerge when compared with the approaches found in the literature. From a Recycle viewpoint, a persistent memory could be exploited by:

- recycling the page cache facility: a file system would be located completely inside a part of it. The page cache should be programmed to use that zone without a backing store;
- using a ramdisk block device along with the O_DIRECT flag of the open() system call, to avoid the page cache and not duplicate data in memory.

Both approaches focus on the fact that, if storage is placed on the memory bus, the page cache would just replicate the same data, thus wasting space and CPU cycles. So, my idea was to keep data either in the page cache only (the former approach) or in the persistent storage only (the latter). While this intent is paramount also in the approaches found in the literature, my thoughts were admittedly much vaguer.


The first approach would appear somewhat redundant, since both the TMPFS and RAMFS special file systems already exploit the page cache facility in the way I imagined: without a backing store. This approach would be somewhat similar to the proposals made in [19], but it has to be remembered that RIO is well suited only to systems whose entire memory address range is persistent, and this would not be the case: this could be a clue that this might not be a correct design choice. Moreover, similarly to what is objected in [142] about TMPFS and RAMFS, the page cache has been developed first and foremost as a cache: the focus was on speed, taking its volatility for granted. The page cache, in addition, should still be used as a volatile cache for the other, standard I/O devices. Finally, at present there is no easy way to instruct the page cache to use a given range of physical addresses: this information would be necessary to use the page cache on a persistent zone of memory. Each of these remarks stresses how a hypothetical redesign of the page cache would be non-trivial and would affect its core logic: simply put, it would probably be a bad design choice. A better choice would likely be the creation of a new facility devoted to SCM [142].

The second strategy is the opposite of the previous one: if the page cache cannot be used, the alternative way to not duplicate data is to avoid it. This approach might be better than the previous one: at least it does not affect the logic of other important facilities of the operating system. O_DIRECT is already used as an open() modifier, either to avoid double caching (as is the case for database engines, which have their own specific cache mechanisms) or to avoid cache pollution (if caching is for some reason not needed). This approach has already inspired other developers in the effort to build a solution suitable for persistent memory, as in the PRAMFS file system [95] and in the current efforts of the Linux community to develop DAX (a successor of XIP, see section 2.3.5). However, my idea now appears to me as somewhat limited. The approach chosen here is the use of a file system to let every application keep using the already developed persistence semantics without any need to rewrite application code: this means that if an application does not use O_DIRECT natively, reads and writes still pass through the page cache. Moreover, O_DIRECT influences only files accessed with open(), not files accessed through memory mapping. So, while my idea was certainly in the right direction, it was incomplete.
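For reference, a minimal sketch of how an application opts out of the page cache with O_DIRECT; the device path is a placeholder, and O_DIRECT requires buffers, offsets and lengths aligned to the device's constraints.

#define _GNU_SOURCE        /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* "/dev/ram0" is a placeholder for a ramdisk block device. */
    int fd = open("/dev/ram0", O_RDWR | O_DIRECT);
    if (fd < 0) return 1;

    /* O_DIRECT typically requires sector-aligned buffers and sizes. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

    /* The transfer goes straight to the device, bypassing the page cache. */
    ssize_t n = pread(fd, buf, 4096, 0);

    free(buf);
    close(fd);
    return n == 4096 ? 0 : 1;
}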


A gradual framework

In effect, to build a working solution like those found in the literature, some facilities must be developed:

- A manager: a means for reserving and managing persistent memory, so as to prevent the standard memory manager from using it as common volatile memory. The alternatives are: to build a specific driver that uses persistent memory as if it were a device; to embed the management capability in a file system developed specifically to be used on memory areas; to modify the standard memory manager so that it handles persistent memory as another type of memory; or to develop some other facility that manages persistent memory only.

- A translator: a means for changing the semantics used to access data. This change of semantics is necessary, but the problem is where to place the translation mechanism: the solutions proposed in the literature put it either inside the device driver, or in the file system, or in a library devoted to acting as a semantic translator.

Moreover, as solutions become more and more thorough, the following services can be offered:

- Efficiency: a means to avoid the page cache completely and access data directly.

- Safety: a means to enforce memory protection against operating system crashes.

- Consistency: a means to survive power failures during writes and to guarantee a long data lifespan.

- Exploitation: a space-management design that exploits the architecture of the memory, thus leaving behind the design approaches fit for hard disks.

- Integration: an elegant solution would permit the use of persistent memory both as storage and as memory for the kernel. Such a use is the most complex, as the kernel must be instructed about how to use persistent memory, and this can potentially expose the kernel to bugs. Solutions of this type are a sort of bridge between this class of approaches and those that propose to expose persistent memories directly to applications.

The different approaches are summarized in table B.6. Before presenting the details of each specific line of the table, a final remark: these file systems do use persistent memories as storage, but all the internal structures used to keep track of files, folders, open files, etc. are still placed in DRAM. This behavior is the same as that of common file systems: it simply reflects the fact that these approaches still do not use persistent memory as memory available to programs, but just as a place for storage.

2.3.5 Adapting current file systems

The simplest approach

The simplest approach would only permit the use of a standard file system on persistent memory: this would need only a manager and a translator. However, since the file system would run unchanged, a block device driver would be needed: developers should thus embed in it both the functions of the manager and those of the translator. A viable starting point could be the modification of the standard Linux brd ramdisk driver (implemented in brd.c). Such a solution would be functional even if inefficient: the page cache would be used normally.

Linux developer community

Linux kernel developers are however working toward a more thorough solution: DAX. The acronym stands for Direct Access (the X probably echoes the first letter of the XIP acronym, while Direct Access comes from the main function that compliant file systems must implement, direct_access), and it is being developed to permit the use of standard file systems on persistent memories with minor changes. It is the successor of XIP and is a sort of complete solution that offers both automatic O_DIRECT behavior for open() system calls and XIP functionality for mmap() system calls. Even if these efforts are not well documented in the scientific literature (to my knowledge there are only some slides and videos), they can be followed in the mailing lists and in the official Linux documentation about experimental features [99, 93]. To adopt this paradigm, modifications must be made in a driver (to become DAX compliant and to use persistent memory) and in a standard file system (which then uses the DAX driver through the DAX functions): currently, a file system subject to these modifications is ext4. This approach is the object of work by Linux kernel developers, and this fact should be considered a clue of its validity. Moreover, as documented in [73], efforts made in this direction induced developers to focus on refactoring the design of the I/O system calls to increase their efficiency with fast storage. This fact, as well as the challenges raised by the new NVM Express standard, could lead to a deep redesign of the mechanics of the I/O subsystem in Linux. The issues presented about fast SSDs apply also in this context: software is a big cause of loss in latency and throughput, and Reduce, Refactor, Recycle is still a valuable methodology.

Quill

Quill is a proposal documented in the literature [2] that has been developed to require even fewer modifications to common file systems. Like the previous approach, Quill is not focused on a specific file system; its aim is instead to be used with standard ones. Another of its goals is to involve the kernel as little as possible, to avoid expensive context switches: most of its code runs in user mode. It acts as a user-mode translator, developed as a service library that interposes between each I/O system call and its effective execution, adding a layer of indirection (see the interposition sketch after this list). Quill is a software facility built of three components:

- the Nib, which catches the I/O system calls and forwards each of them to the following component, the Hub;
- the Hub, which chooses the right handler for the system call, depending on whether the request refers to a XIP file system or not;
- the handlers, which effectively manage the requests: if a request concerns a standard file system, the handler selected by the Hub is the standard system call of the file system; if it concerns a XIP file system, a special handler performs an mmap() operation instead.
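To illustrate the general idea of user-mode interposition (this is not Quill's actual code: the helper names below are invented, and the dispatch logic is reduced to stubs), a preloadable library could catch read() and route it either to libc or to a memory-based handler.

/* Conceptual sketch of system-call interposition via LD_PRELOAD; only an
 * illustration of the interposition idea, not Quill's implementation. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

/* Stub helpers, standing in for the Hub's bookkeeping and a XIP handler;
 * names and behavior are invented for this example. */
static int quill_is_xip_fd(int fd) { (void)fd; return 0; /* stub */ }
static ssize_t quill_mmap_read(int fd, void *buf, size_t n)
{ (void)fd; (void)buf; (void)n; return -1; /* stub */ }

ssize_t read(int fd, void *buf, size_t count)
{
    /* "Nib": every read() call is caught here. */
    if (quill_is_xip_fd(fd)) {
        /* "Hub" routes XIP-backed files to a memory-based handler that
         * copies from an mmap()ed region instead of entering the kernel
         * read path. */
        return quill_mmap_read(fd, buf, count);
    }

    /* Default handler: forward to the real read() in libc. */
    ssize_t (*real_read)(int, void *, size_t) =
        (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");
    return real_read(fd, buf, count);
}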


Compared with the approach of the Linux community, Quill reduces the need for file system refactoring. However, a thorough comparison of the effective performance of the two approaches does not exist. While both approaches are certainly focused on guaranteeing the needed efficiency, the management of safety against operating system crashes is not well documented, and most probably is left to the design of the manager. As a further remark, both of the previous approaches rely on a driver that acts both as the manager and as the translator. This design is indeed a requirement, since common file systems expect to use a block device and, hence, a block device driver. Regarding consistency, the semantics depends on the file system effectively used. These solutions are nonetheless flexible and simple enough to be used as a springboard toward a quick adoption of persistent memories in operating systems.

2.3.6 Persistent-memory file systems

A personal conviction of the author is that, since software latencies are problematic at high speed, sooner or later a file system specifically designed to use memory storage will be needed in order to further increase performance; this would save the latency spent in optimizations tailored to spinning disks, and gain latency through a design suited to memory. The next approaches show the work done by researchers to develop file systems specifically designed for byte-addressable persistent memories: BPFS [23], PRAMFS [95], PMFS [25], SCMFS [142]. As a general observation, the following approaches are a step further toward the exploitation of persistent memories. Even if the specific features vary among them, these approaches are the result of efforts to offer a larger set of features, especially the ones needed in a file system used in production environments: safety and consistency.

BPFS

BPFS is an experimental file system developed by researchers at Microsoft. The literature about it focuses on the internal structure of the file system and on the proposal of two important hardware modifications that would permit a fast use of its features. This file system aims at providing high consistency guarantees using a design similar to that of WAFL. Like WAFL, BPFS uses a tree-like structure that starts from a root inode: this design permits updating an arbitrary portion of the tree with a single pointer write. However, the BPFS researchers argued that the mechanism WAFL uses to perform file system updates was too expensive (each update triggered a cascade of copy-on-write operations from the modified location up to the root of the file system tree). This remark led their work toward the proposal of short-circuit shadow paging, i.e. a technique that adaptively uses three different modification approaches: in-place writes, in-place appends, and partial copy-on-write. This technique permits efficient writes, along with a management of operations focused on consistency. The side effect of these choices, however, is the loss of the powerful snapshot management of WAFL. Another specificity of this approach is the proposal of two hardware modifications. Usually, requests of this type are to be avoided for the simple fact that they easily go unheard: hardware modifications are very expensive and happen only when profitability is certain.

However, one of the two proposed modifications is indeed very interesting: it tries to address the problem of the reordering of writes from the caches to memory. As previously described, the problem can currently be managed by flushing the cache in tandem with the mfence instruction: this approach is however limited, as it lowers performance considerably. In [23] it is proposed to add to the hardware (as new instructions, similar to mfence) a mechanism that allows programmers to set ordering constraints in the L1, L2 and L3 caches: epoch barriers. An epoch would be a sequence of writes to persistent memory issued by the same thread, delimited by a new form of memory barrier issued by software. An epoch that contains dirty data not yet reflected to BPRAM is an in-flight epoch; an in-flight epoch commits when all of the dirty data written during that epoch has been successfully written back to persistent storage. The key invariant is that when a write is issued to persistent storage, all writes from all previous epochs must already have been committed to persistent storage, including any data cached in volatile buffers on the memory chips themselves. As long as this invariant is maintained, an epoch can remain in flight within the cache subsystem long after the processor commits the memory barrier that marks the end of that epoch, and multiple epochs can potentially be in flight within the cache subsystem at any point in time. Writes can still be reordered within an epoch, subject to standard reordering constraints [23]. This behavior would be achieved through the insertion of two new fields into the caches: a persistence bit and an epoch identification pointer.
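To make the intended usage concrete, a purely hypothetical sketch follows: epoch_barrier() stands in for the new instruction proposed in [23] and does not exist on current hardware, and the record layout is illustrative.

#include <stdint.h>

/* Hypothetical intrinsic standing in for the epoch barrier proposed in
 * [23]; it does NOT exist on current hardware. On real x86 the same
 * ordering must be forced with cache-line flushes plus fences. */
static inline void epoch_barrier(void) { /* placeholder */ }

struct log_entry {          /* illustrative persistent log record */
    uint64_t payload;
    uint64_t committed;
};

void append(struct log_entry *e, uint64_t value)
{
    /* Epoch 1: write the payload. */
    e->payload = value;
    epoch_barrier();        /* everything above must reach BPRAM first */

    /* Epoch 2: only afterwards may the commit flag become durable; the
     * cache may keep both epochs in flight, but the hardware would
     * guarantee write-back in epoch order. */
    e->committed = 1;
    epoch_barrier();
}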

The other proposal made in [23] consists in adding capacitance to the RAM modules, in order to guarantee the effective completion of each write request already entrusted to the memory modules. This proposal is similar to those found in [56]. The BPFS proposal brings both lights and shadows. The lights are surely the care put into this design and the search for a solid consistency mechanism; moreover, the design is well adapted to memory access patterns. The shadows consist primarily in the fact that some features are not mentioned: security issues are not taken into account, and neither is a revision of the mmap() system call (XIP functionality). Moreover, BPFS relies on hypothetical hardware changes, and their concrete realization is uncertain. Another issue is a certain amount of opacity about some details of the implementation the authors built: [23] points out that the solution has been developed on the Windows operating system platform, even though a complete port for Linux FUSE exists, but no further details are given. The problem of the depth of documentation also arises in other proposals, such as the one that follows (PRAMFS), but in this case some information is completely missing: for example, it cannot be clearly identified where management and semantic translation happen. The most likely answer is in a driver, as in the previous scenarios, but that is indeed a guess.

PRAMFS

The next proposal, PRAMFS, comes from the open source community. It consists in a much more classical file system design, along with the required features: XIP functionality, direct I/O access and protection against operating system crashes. It is a simpler approach than BPFS, but it tries to address more issues. Memory protection is achieved using the current virtual memory infrastructure, marking the pages in the TLB and in the page tables as read-only and changing the permissions only when strictly needed. This method appears to be the same as that of [19]. Concerning management and semantic translation, this approach seems better than the previous ones: these two fundamental functions are executed directly in the file system itself, without the intervention of a block device driver. To be exact, the documentation in this regard is not completely clear, but some clues from the PRAMFS documentation and from [60] confirm what has just been claimed. Another clue of this behavior is the way this file system is mounted: directly, by specifying the starting physical address.

mount -t pramfs -o physaddr=...

(Example of mount command)

This behavior is thus similar to that of TMPFS and RAMFS, albeit adapted to persistent memory: it is a great advantage, as it avoids the overheads attributable to block device emulation, which is, at best, an unneeded layer (see section 2.3.6). Unfortunately, perhaps reflecting the very prototypical status of this proposal, the documentation makes no mention of consistency concerns and of the relative mitigation techniques.

PMFS

The PMFS file system was developed by kernel developers of the Linux community before they started to focus on the DAX approach. In its design, the direct interaction with persistent memory is clear: the file system directly manages the persistent memory, which is reclaimed from the kernel at mount time. Management and translation are executed by the file system. Although the internal design of PMFS differs from the other ones reviewed before, the intent of the developers was the same: to create a lightweight file system for SCM that gives applications support for the standard read, write and mmap operations, while offering consistency and memory optimizations to increase performance. XIP features are used to permit an efficient use of the mmap() system call. Standard I/O system calls are conveniently translated into memory operations while avoiding any data replication in the page cache. In order to offer high levels of consistency, metadata updates are executed through atomic in-place updates when feasible, or otherwise by means of an undo journal, whereas copy-on-write is used for data updates. As in other approaches, the PMFS developers remarked on the need for hardware features to help consistency, and they pointed out the same consistency issues about memory operations remarked in other papers. In order to solve these issues, they proposed the insertion of a new hardware instruction (pm_wbarrier) into the instruction set. This approach is similar to that proposed by the BPFS developers, though simpler (no cache structure is changed). Such an instruction would guarantee the durability (i.e. that the store has effectively reached persistent memory) of those stores already flushed from the CPU caches. An original approach to obtain the desired memory protection is also explained: instead of using the expensive RIO strategy of a continuous write-protect and write-unprotect cycle, it would be better to use an uninterruptible, temporary write window to protect virtual memory pages.

SCMFS

The last proposal reviewed here, SCMFS, comes from academic researchers at Texas A&M University and is presented in [142]. Proposed through a neat and thorough paper, their work is centered on a new file system developed specifically for SCMs (SCMFS stands indeed for Storage Class Memory File System). A major strength of their approach is the high integration of the file system with the current Linux memory subsystem. Such an integration could pave the way for a future use of persistent structures and data by the kernel. For this reason, this proposal really seems to represent a step toward the concepts presented in [60]. The paper describes how the team modified the BIOS, to advertise the presence of SCM to the operating system, and the Linux memory manager, in order to create a new memory zone (ZONE_STORAGE) that is used only through new non-volatile allocation calls (nvmalloc(), nvfree()). In turn, the file system uses these new calls to allocate its structures in the persistent memory zone. Following the concepts introduced earlier, the manager is the standard Linux memory manager, and the translator is the file system, which uses the allocated memory directly. Another major strength is the high integration with the existing virtual memory hardware infrastructure: each structure used in SCMFS relies extensively on the virtual memory concept and has been engineered to adapt easily to current page tables, TLBs and CPU caches. For example, each file is seen as a flat address range of the virtual address space, starting from zero up to a maximum address: this range is then remapped onto a non-contiguous set of physical memory pages, as normally happens with the application heap and stack. The whole file system space is managed within a range of virtual addresses. Superpages are used to avoid excessive TLB usage, and preallocation is used to save valuable time in complex memory allocation procedures. Like BPFS, SCMFS relies on some guarantees about the ordering of memory instructions but, contrary to BPFS, it uses the slower (yet viable) approach of mfence and cache flushing. In this case, the hardware modification proposed by the researchers from Microsoft would be of great help. It is claimed that this operation is performed every time a critical piece of information is changed, in order to achieve good consistency enforcement. However, since the section about consistency is only briefly sketched, it should be further investigated whether these consistency guarantees would be sufficient in production. Another interesting feature of this approach is the need for a garbage collection facility: since each file receives a large virtual address range, under stress it could be necessary to manage fragmentation (too many holes of unused virtual addresses). This need is similar to that of the file system proposed in [122]. Despite the depth of the proposal, some areas remain quite opaque: there is no description of the implementation of the I/O system calls and of the likely page cache bypass, nor is the source code available on the Internet. Moreover, while the focus on the extensive reuse of the virtual memory infrastructure can suggest that memory protection is enforced, the topic is not addressed.
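As an illustration of the allocation interface described above, a purely hypothetical sketch: nvmalloc() and nvfree() are the calls proposed in [142] and are not part of mainline Linux; the signatures, the structure and the stubs below are assumptions made only for the example.

#include <stdint.h>
#include <stdlib.h>

/* Stand-in definitions so the sketch compiles: in [142] nvmalloc() and
 * nvfree() allocate from the persistent ZONE_STORAGE zone; here they are
 * stubbed with the ordinary heap purely for illustration. */
static void *nvmalloc(size_t size) { return malloc(size); }
static void nvfree(void *ptr)      { free(ptr); }

/* Illustrative on-medium structure kept by a file system in the
 * persistent zone. */
struct scm_inode {
    uint64_t size;
    uint64_t start_vaddr;   /* start of the file's flat address range */
};

struct scm_inode *alloc_inode(void)
{
    /* The allocation lands in persistent memory instead of DRAM, so the
     * structure would survive power cycles; the ordinary allocator is
     * never involved for this zone. */
    return (struct scm_inode *)nvmalloc(sizeof(struct scm_inode));
}

void release_inode(struct scm_inode *ino)
{
    nvfree(ino);
}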

2.3.7 Further steps

Other strategies to take advantage of non-volatile memories

Until now, the classical dichotomy between data used by processes (heap, stack) and storage (files) has been respected: when they reserve a memory portion for file storage, file systems use it exclusively for file storage. However, persistent memory is byte addressable just as DRAM is, and its use through a file system is not the only possible pattern of use. Rather, as long as persistence awareness belongs to the operating system only, file systems are simply the only way to allow applications to use persistence seamlessly. File systems are in fact fundamental to the correct execution of applications, as interaction with the file system is embedded in a plethora of programs. However, an operating system could also use persistent memory for itself, instead of servicing applications. Applications would thus only benefit from the better job done by the operating system thanks to persistent memory, instead of using it directly, though unconsciously. An operating system could therefore:

- use persistent memory as a DRAM extension, to avoid the swapping of user virtual memory pages, to store part of its memory structures when DRAM runs low, or both;
- use persistent memory to store part of its data structures persistently, in order to speed up boot-up, reboots and restore cycles after power failures.

Concerning the DRAM extension, this use would expose data moved from DRAM to persistent memory to the same security issues remarked before. It should be noted, however, that in the case of swapping the same issues arise when process data is moved to hard disks. In the literature, a potential use of persistent memory as a DRAM extension has been proposed in [53] and in [38]. The second approach would use persistent memory to store (persistently, not as a DRAM replacement) part of the data structures used for the execution of the operating system. This approach somehow anticipates the issues presented in the next paragraphs, where the proposals to expose persistence to applications will be discussed. Here it should only be remarked that this approach, though very appealing, brings with it many potential programming bugs. While the risks will be analyzed later together with persistence in applications, the assumption made here is that kernel code is expected to be safe: correctness, safety and quality are achievable much more easily in kernel code than in user applications. The exploitation of persistent memory to decrease boot and restore times has not yet been studied thoroughly, and the scientific literature on this topic is scarce. The issues of boot and recovery time have been met before in the WSP approach, but that approach had limitations in the real exploitation of persistence. A better option could be a mixed strategy for boot and recovery: strategic structures are recreated at each boot, to preserve system health across reboots (and to exploit volatility too), while other structures and data can be left in persistent memory, ready to be used. The work needed at each startup would thus be smaller, saving time and increasing responsiveness. A study on this topic is presented in [41].

A step further: integration

If the approaches just shown were mere alternatives to the file system ones, it would be very likely that operating system developers would choose file systems: a fast file system is indeed appealing. However, those approaches are not alternatives: although integration would be a complex task, the memory architecture would permit the use of persistent memory both for a file system and for the other purposes just shown, all at the same time. Just as memory is divided into physical pages, and each physical page can be owned by a different process (through virtual address ranges), similarly some pages could be used for the file system, whereas others could be used to the operating system's advantage. Such an architecture would permit an operating system, when booting, to execute the kernel image immediately from persistent memory, and then to load the persistent file system at boot time. Researchers have not yet delved into this topic: it seems an area needing further investigation. To my knowledge, [60] is the only attempt at modeling integration. A key aspect of this topic is that, in order to use persistent memory following the integration concept, there must be a facility that somehow allocates persistent memory dynamically. Such a facility would manage requests from the file system and from the kernel, as well as garbage collection (if needed) and other related activities. A first approach to integration is made implicitly in the SCMFS proposal: the authors modified the standard Linux memory allocator to add the new nvmalloc() and nvfree() calls. This can indeed be one of the viable ways to achieve integration: the two calls permit dynamic allocation of persistent memory. Although in SCMFS those calls are used exclusively by the file system, the kernel itself could take advantage of them. In [60] a different choice is proposed: the management of persistent space should be up to the file system, which then assigns memory to the standard memory manager upon request. In Oikawa's proposal, persistent storage is used for temporary DRAM substitution, not for kernel structure persistence. Oikawa models the access management through three viable alternatives: direct, indirect and through mmap() operations. With the direct method, the Linux memory manager directly accesses the data structures of the file system to obtain memory pages; the indirect method instead uses a special file in the file system for the same purpose.

Issues about persistence awareness in operating systems

Some final remarks conclude this part, in which persistence is exposed at the operating system level. Articles proposing file system services to exploit persistence are the majority, and the efforts made by researchers are extensive. Nonetheless, the achievements on this subject, while promising, are still experimental: software solutions offering a complete set of features do not yet exist. The same situation is reflected by the level of persistence awareness of current operating systems: an operating system effectively persistence-aware is still far to come. Indeed, each of the big software players is currently investing in research on the future exploitation of the new memory technologies, but these are still intentions. Moreover, research has to be broadened and deepened in order to reach products suitable for real production scenarios: unfortunately, many issues are still not addressed properly or, worse, not even considered. For example, one issue still centered on file systems is the fact that the RRR approach is underutilized: it is only rarely applied in the articles analyzed. Surely, refactoring is a concern for the DAX implementors, but the other proposals do not mention the need to analyze thoroughly the efficiency of the software stack. The risk, for example, is that kernel execution could be over-used, thus increasing the need for expensive context switches. Another issue is the fact that the discussion, until now, has been entirely focused on systems working practically stand-alone, whereas concrete needs are actually different: the paradigms just seen must be proven to behave properly also in distributed and highly replicated environments, such as those of big data centers. A first effort in this direction is presented in [147], but the topic should still be faced in depth by researchers. Moreover, researchers must adapt the architectures conceived for persistent memories to modern computing trends: virtualization, multi-core architectures, concurrent and highly parallel computations, and so on. This branch of computer science research will have to mature over time: the process will eventually be stimulated by the effective release of the new memory technologies on the market.

2.4 Storage Class Memory and applications

A topic related to those just discussed is the level of awareness of persistence in applications and, in particular, the level of awareness of persistent-memory devices. Since operating systems are the foundations on which applications are built, many researchers have wondered whether the presence of persistent memories could be exposed to applications. Ordinarily, applications use their own virtual addresses to perform their tasks: it is therefore natural to question whether applications could use persistent memories directly, as they do with standard DRAM. Persistence is not a novelty for applications, which have always used files and folders to manage persistent data. Moreover, the generic concept of persistence in applications is a vast research domain: over the years, many efforts have been made to permit the seamless use of persistent data structures in applications. In particular, researchers concentrated on allowing object-oriented programming languages to use persistent objects through a database back-end. Some examples are ObjectStore [44], Thor [47] and the Java Persistence API (JPA) [81, 106]. Other examples, referring to persistence in applications in general [7] and in Java in particular [6], arise from the work of researchers at the University of Glasgow during the 90s. Such efforts were inspired principally by the fact that the data structures used in programming languages are badly suited to how persistence is managed by file systems, and applications usually perform expensive adaptations (an example is the complex process of serialization): a direct use of persistent data structures and persistent objects in programming languages would be much easier. However, the use of database back-ends brings with it some complex mapping issues. Researchers think that the new memory technologies could potentially remove those complexities, by removing both the need for a database back-end and the need for serialization: the topic thus moves from persistence in general to persistence in main memory, also in the context of user applications.

Exposing storage class memory to applications sounds attractive also for other reasons: if applications could directly address persistent-memory devices, they could use their full power without intermediation, thus removing unnecessary overheads. In turn, this approach could relieve applications from the burden of relying on the kernel to execute functions related to persistence (through a file system): this would permit large savings in terms of latency and energy. As applications naturally use memory through memory instructions, no translation would be needed, increasing the effectiveness of the approach. Moreover, this strategy would finally allow programmers to use all the research done in the past decades to optimize in-memory data structures: until now, since persistence has always been relegated to slow devices, such slowness has heavily discouraged the direct use of these data structures for persistent data. This new approach would allow programmers to use highly efficient data structures also when coping with persistence. While the benefits could be many, the issues would be many too.

A first observation is rooted in the fact that, as previously shown, the idea of using SCM as if it were just a slower DRAM has many contraindications: in order to be properly used, SCM must be used consciously. Part of this consciousness is achieved through consistency management and enforcement, as persistence is deeply related to data consistency (consistency is what makes persistence effective): if SCM were exposed to applications, applications would have to manage consistency issues. Another observation, related to the previous one, refers to the average quality of code in user applications compared with that of kernel code: user applications usually cannot be trusted as safe, while kernel code mostly can. This is not secondary: currently, by relying on file system services, applications delegate to the operating system all the tasks related to the needed level of consistency. This approach has the merit of letting the programmer concentrate on the main goal: the development of a functional application. Conversely, exposing persistence to applications would increase the number of issues developers have to manage: they would have to use persistence knowingly. Such an approach would require extensive code restructuring and rewriting in order to profit from the potentially available performance gains.


Another important claim made by scholars is that this approach would cause typical programming issues to reach the domain of persistence, yielding an even more complex scenario. For example, the management of pointers is central in many programming languages; the presence of SCM and DRAM together, however, would complicate their use, and would permit the following kinds of pointers:

- non-volatile (NV) to NV pointers;
- NV to volatile (V) pointers;
- V to NV pointers;
- V to V pointers.

Clearly, at shutdown only the NV memory areas would survive, exposing code that uses unsafe pointers to potential and subtle programming bugs (a minimal illustration follows below). Moreover, the risks of dangling pointers, memory leaks, multiple free()s and locking errors would be present as in every programming language: the risk is however that, in the absence of appropriate checks, such errors could persist in time, thus becoming persistent errors.
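A minimal sketch of the unsafe-pointer problem, assuming a hypothetical pm_alloc() that returns persistent memory (stubbed here so the example compiles): the persistent structure stores a pointer to a volatile buffer, which becomes dangling after a restart.

#include <stdlib.h>

/* Hypothetical allocator returning non-volatile (persistent) memory;
 * stubbed with malloc() so the sketch compiles. */
static void *pm_alloc(size_t size) { return malloc(size); }

struct nv_node {            /* lives in persistent memory */
    int  value;
    char *scratch;          /* NV-to-V pointer: dangerous */
};

struct nv_node *make_node(int v)
{
    struct nv_node *n = pm_alloc(sizeof(*n));   /* survives power-off */
    if (!n) return NULL;

    n->value   = v;
    n->scratch = malloc(64);                    /* volatile buffer */
    /* After a reboot, n->value is still there, but n->scratch now points
     * to memory that no longer exists: a persistent dangling pointer. */
    return n;
}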

Around 2011, two competing university research groups proposed two different approaches to expose SCM to applications: NV-Heaps [22] and Mnemosyne [135]. These two papers are considered among scholars as the two reference works on exploiting persistent memory in applications, a fact proved by the numerous citations both papers have in the literature. While the two proposals are quite different from each other, the goals of the two research groups were nonetheless quite similar: their intent was to build a framework for programming languages able to offer applications a safe use of persistence in main memory. A major goal undertaken by both groups has been that of guaranteeing high consistency and, at the same time, high performance: the former would relieve the programmer from the difficult and error-prone explicit management of consistency, while the latter is necessary to use persistent memories conveniently, without sacrificing their high performance excessively. A brief presentation of both approaches follows.


NV-Heaps: this proposal presents itself as the most complete one. The group that developed it intended to address most of the major problems that spring from the exposition of SCM to applications. As will soon be shown, however, this completeness is paid for with a loss of generality. NV-Heaps consists in a C++ library built upon a Linux kernel, focused on allowing the use of persistent, user-defined objects [. . . ] as an attractive abstraction for working with non-volatile program state. The system requirements are a XIP file system, along with cache epochs (the proposal comes from the same people who proposed epochs for the BPFS file system; see section 2.3.6). The services offered by the library are extensive: pointer safety through referential integrity, flexible ACID transactions, a familiar interface (based on common C++ syntax), high performance and high scalability. Each NV-Heap represents a sort of persistent domain for an application, in which only safe pointers can be used: cross-heap NV to NV pointers and NV to V pointers are avoided. Moreover, the library, through transactions, permits the correct storage of data over time while preserving performance. Concurrency-related primitives such as atomic sections and generational locks are supplied. Each NV-Heap, finally, is managed through a file abstraction: each heap is completely self-contained, allowing the system to copy, move or transmit it just like a normal file. Applications use NV-Heaps by recurring to the library services, using the file name as a handle: the library, then, in cooperation with the kernel, executes the mmap() through the XIP functionality, mapping the application's virtual address space onto the effective persistent memory area used by the NV-Heap. Interaction with the kernel occurs only when strictly necessary.

Mnemosyne: this proposal offers fewer features than the preceding one, but this simplicity preserves its generality. Mnemosyne too is developed as a library that offers user-mode programs the ability to use persistent memory safely. The design goals the developers decided to follow were: the primary need to maximize user-mode accesses to persistence, the need to implement consistent updates, and the need to use conventional hardware. Mnemosyne is therefore developed as a low-level, C-like interface that provides:

- persistent memory regions, allocatable either statically or dynamically with a pmalloc() primitive similar to malloc();
- persistence primitives to consistently update data;
- durable memory transactions.

Persistent memory regions are managed simply by extending the Linux functionality for memory regions, quite similarly to how the kernel is modified in the SCMFS proposal (see section 2.3.6). Consistent updates are offered through single-variable updates, append updates, shadow updates and in-place updates, whereas the implemented persistence primitives consist in a persistent heap and in a log facility. Write ordering is achieved through the simpler approach (mfence and flush). Finally, transactions are offered through a compiler facility that permits converting common C/C++ code into transactions (a loose illustrative sketch follows).
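To give a flavor of the programming model (a loose sketch only: the names below are invented stand-ins and do not reproduce Mnemosyne's actual API), a persistent counter might be handled as follows.

#include <stdlib.h>

/* Stand-ins so the sketch compiles: in Mnemosyne, pmalloc() allocates
 * from a persistent region and updates can be wrapped in durable
 * transactions; here both are stubbed purely for illustration. */
static void *pmalloc(size_t size) { return malloc(size); }
#define DURABLE_TRANSACTION        /* placeholder for the real construct */

struct counter {
    long hits;
};

struct counter *open_counter(void)
{
    /* Allocated in a persistent region: the object outlives the process. */
    return (struct counter *)pmalloc(sizeof(struct counter));
}

void bump(struct counter *c)
{
    /* A durable transaction would make this update atomic with respect
     * to crashes; the stub macro above only marks where it would go. */
    DURABLE_TRANSACTION {
        c->hits += 1;
    }
}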

Each of the two approaches presented above has its strength: while NV-Heaps is more thorough, Mnemosyne is more general. From the point of view of generality, the former approach is somewhat problematic, as it relies specifically on the C++ programming model. Perhaps a more general model such as the latter is preferable: its services could be used to build, on top of it, more specialized libraries, each aimed at serving a different programming language; in such a way, further levels of consistency could be offered. For example, referential integrity in the former approach is managed through operator overloading, but this feature is not common to all programming languages. Perhaps a layered approach would be easier to customize in order to achieve a finer-grained level of service. Looking at the possible weaknesses of these approaches, the most remarkable one is probably the compatibility with current applications: if this approach were the only means to exploit SCM, applications would need to be rewritten or restructured in order to benefit from such improvements. Without these modifications, applications would continue to use volatile memory as they always did. While this topic has only been sketched here, it is indeed a valuable part of the studies about persistence and would be worthy of further investigation. As a last remark, I have the feeling that these works represent just the first steps on a long path: these proposals remain somewhat limited, as they address only a part of the approach to persistent memories, which should finally become, when the times are mature, a complete approach.

Conclusions

In the first chapter, current memories have been presented, along with their limits, both economical and technical. Afterwards, the new persistent memories have been introduced, with some details about their internals. Then, my study moved from the devices to operating systems: the aim has been to understand as well as possible the extent of the changes that operating systems should adopt in order to make the best use of the new devices. These devices represent a real disruptive change in the field of computer memories, probably the most notable of the last decades: if exploited to their full potential, they would surely change the way storage is conceived today. To reach this appealing goal, however, operating systems should undergo a deep restructuring: the approaches seen in the preceding chapter are just the first steps in this direction. Even if those approaches have been experimental attempts, all the measurements that researchers made to verify the effective performance of their proposals confirmed the high potential gains in terms of latency, throughput and other performance metrics: the results were indeed promising in each of them. Approaching the conclusion of this work, my last intent is to leave here some personal considerations about the work that researchers and developers will probably have to do in the next years to prepare operating systems for persistent memories.

My personal conviction is that many efforts must still be made to achieve real persistent-memory awareness in operating systems: while each of the proposals presented here is a concrete step toward it, the goal is still far and, up to now, such an operating system does not exist yet. As these new memories are expected to reach a first degree of maturity within the next decade, this time window could give the scientific community a reasonable period to prepare this transition in operating systems as well.

A complete solution

While each of the approaches described tries to capture a subset of the potential benefits that persistent memories could offer, what is still lacking is a complete solution. Even if persistence awareness is valuable when achieved through a fast file system or, alternatively, through application awareness, it would be even better if users and developers did not have to choose one or the other, but could benefit both from the former and from the latter. I think this should be one of the long-term goals of operating system research: the implementation of a complete SCM solution. In the context of a stand-alone system, a complete approach should thus allow the seamless use of persistent memories through:

- file system services;
- kernel persistent objects and data structures;
- application persistent objects and data structures.

These services, however, if conceived to remain stand-alone, would turn out to be almost useless. Indeed, to reach an exhaustive level of completeness, a correct approach to persistence should also:

- be highly scalable;
- fit distributed environments;
- fit virtualized and cloud environments;
- adapt easily to new hardware architectures;
- behave adaptively depending on the scale at which it is used and on the metrics by which performance is measured.


Without doubt, these wishes represent, in their entirety, a tough target: there is enough material for many years of research in operating systems. I have nonetheless the feeling that, slowly, one step at a time, many research domains in computer science are converging. Perhaps, at the right time, the knowledge reached in each of them will represent the critical mass that will permit a complete product, able to exploit the new storage class memories thoroughly.

Converging research domains

In particular, from what I have read for this work, research about persistence and persistent memories is increasingly related to other research domains of computer science. These areas can contribute to the pursuit of a complete approach to exploiting persistent memories and, at the same time, they are factors that influence how this goal is going to be achieved:

Changing hardware architectures: not only the memory panorama is changing; current hardware architectures are also experiencing a slow but continuous change, evolving toward platforms that use many cores, possibly different from each other: operating system design is trying to follow these trends [10, 116]. New hardware approaches are being developed [103], and new operating systems, should these technologies succeed, would have to adapt to them. Moreover, it is quite likely that new computing architectures will be developed taking into account the recent achievements in memory technology: such efforts would thus represent a further opportunity in the search for thorough persistence awareness.

Database systems: the field of databases is quickly approaching main memory; as underlined before, keeping the entire dataset in memory is not new, but this trend is increasing with the use of NoSQL database systems, such as the key-value stores used in distributed caching systems. The knowledge gained in database systems has proven fundamental for managing the consistency requirements in the approaches to persistence just seen, and it is likely that each further achievement related to in-memory databases could also be used in the context of persistent memories. Database paradigms such as the key-value one have already been hypothesized as a way to exploit persistent memories [8].

Distributed systems: the need to scale software to large sizes, such as those needed in data centers, has already motivated the use of distributed paradigms in both databases and storage systems. Research in this branch is advancing further: currently, efforts such as RAMCloud represent an interesting approach to storage systems using only main memory [112, 128]. These efforts too could represent a valuable and useful contribution to a further exploitation of persistent memories.

File system design: while the log-structured approach has already been cited, researchers have proposed to use this approach also when managing DRAM [123]. These experiences could then be reused for persistent memories.

Transactional memories: transactional memory represents an important research field in computer science; in the past, this approach and, more generally, the need to provide implicit consistency when using main memory have been investigated deeply [34, 33]. The knowledge gained in this field has been used many times in the approaches previously presented and, quite likely, each new achievement in this field has the potential to influence the research about persistent memories and their use in operating systems as well.

A futuristic operating system

Thinking of a hypothetical next-generation operating system, I would picture it as being built similarly to the hypervisors currently used to achieve virtualization. The storage facility would be an important piece of the hypervisor, and would be the part conceived to use persistent memory. This hypothetical storage facility would:

- behave as a database, making extensive use of the fast key-value paradigm: such a management of its data would permit the use of variable-size data in distributed environments, abandoning the use of fixed-size blocks. Moreover, a database-like behavior would permit the storage facility to be used as a service by many software layers: operating systems, file systems, applications, and so on. Such an approach would permit the transparent movement of the stored data as needed, thus allowing, for example, replication, scaling and caching. The most fascinating hypothesis would be the ability to scale a local heap (possibly persistent) from a local process to a distributed one, in order to permit the concurrent use of that data from, for example, a server cluster in a data center;
- perform data allocation following a log-structured pattern: this would make it easy to manage memory wear, and such a pattern seems well suited to being used together with key-value databases. Data allocation should permit the concurrent use of many different services (persistent objects, file systems, kernel data, and so on);
- use a highly efficient snapshot facility similar to that used in the WAFL file system, or to that proposed in [8];
- use transactions and ACID semantics to guarantee reliability at the highest levels, in order to permit usage in production environments;
- implement the services necessary to fit multiple distributed environments, using adaptive technologies that change behavior depending on performance, traffic and other metrics.

Final salutation

Despite these personal thoughts, and whatever the future holds for memory technologies and operating systems, I hope that my work can prove to be a tool that helps the understanding of the topic of persistent-memory awareness in operating systems.

Appendix A

Asides

A.1 General

ITRS - International Technology Roadmap for Semiconductors

ITRS, acronym of International Technology Roadmap for Semiconductors, is an international organization built on the ashes of the former United States national organization NTRS, the National Technology Roadmap for Semiconductors. ITRS is currently sponsored by the five leading chip-manufacturing regions in the world: Europe, Japan, Korea, Taiwan and the United States. The sponsoring organizations are the semiconductor industry associations of each of those regions: ESIA, JEITA, KSIA, TSIA and SIA (respectively, the European Semiconductor Industry Association, the Japan Electronics and Information Technology Industries Association, the Korea Semiconductor Industry Association, the Taiwan Semiconductor Industry Association and the Semiconductor Industry Association). Its aim is to help the semiconductor industry as a whole to maintain its profitability, offering, among other services, the production every two years of a thorough report about the status of the semiconductor industry and its roadmap to maintain the exponential growth. This report in particular is a key document drafted by an international committee of scientists and technologists, conveying the most exhaustive and accurate assessment of the semiconductor industry and promoting a deep and vast analysis effort on current and future semiconductor technologies.


A.2 Physics and Semiconductors

Ferroelectricity

A property of matter, usually observed in some crystalline materials: these materials can be electrically polarized under the effect of an electric field, maintain the polarization when the electric field ceases, and reverse (or change) the polarization if the electric field reverses (or changes). The discovery of ferroelectricity has its roots in the studies of pyroelectric and piezoelectric properties conducted by the brothers Pierre and Paul-Jacques Curie around 1880; it was first noticed as an anomalous behavior of Rochelle salt in 1894 by F. Pockels (this salt was first separated in 1655 by Elie Seignette, an apothecary in the town of La Rochelle, France). Ferroelectricity was then named as such and identified as a specific property of matter in 1924 by W. F. G. Swann [75].

Ferromagnetism

From Encyclopedia Britannica: physical phenomenon in which certain electrically uncharged materials strongly attract others. Two materials found in nature, lodestone (or magnetite, an oxide of iron, Fe3O4) and iron, have the ability to acquire such attractive powers, and they are often called natural ferromagnets. They were discovered more than 2,000 years ago, and all early scientific studies of magnetism were conducted on these materials. Today, ferromagnetic materials are used in a wide variety of devices essential to everyday life, e.g., electric motors and generators, transformers, telephones, and loudspeakers. Ferromagnetism is a kind of magnetism that is associated with iron, cobalt, nickel, and some alloys or compounds containing one or more of these elements. It also occurs in gadolinium and a few other rare-earth elements. In contrast to other substances, ferromagnetic materials are magnetized easily, and in strong magnetic fields the magnetization approaches a definite limit called saturation. When a field is applied and then removed, the magnetization does not return to its original value; this phenomenon is referred to as hysteresis. When heated to a certain temperature called the Curie point, which is different for each substance, ferromagnetic materials lose their characteristic properties and cease to be magnetic; however, they become ferromagnetic again on cooling.

The magnetism in ferromagnetic materials is caused by the alignment patterns of their constituent atoms, which act as elementary electromagnets. Ferromagnetism is explained by the concept that some species of atoms possess a magnetic moment, that is, that such an atom is itself an elementary electromagnet produced by the motion of electrons about its nucleus and by the spin of its electrons on their own axes. Below the Curie point, atoms that behave as tiny magnets in ferromagnetic materials spontaneously align themselves. They become oriented in the same direction, so that their magnetic fields reinforce each other. One requirement of a ferromagnetic material is that its atoms or ions have permanent magnetic moments. The magnetic moment of an atom comes from its electrons, since the nuclear contribution is negligible. Another requirement for ferromagnetism is some kind of interatomic force that keeps the magnetic moments of many atoms parallel to each other. Without such a force the atoms would be disordered by thermal agitation, the moments of neighbouring atoms would neutralize each other, and the large magnetic moment characteristic of ferromagnetic materials would not exist. There is ample evidence that some atoms or ions have a permanent magnetic moment that may be pictured as a dipole consisting of a positive, or north, pole separated from a negative, or south, pole. In ferromagnets, the large coupling between the atomic magnetic moments leads to some degree of dipole alignment and hence to a net magnetization.

Since 1950, and particularly since 1960, several ionically bound compounds have been discovered to be ferromagnetic. Some of these compounds are electrical insulators; others have a conductivity of magnitude typical of semiconductors. Such compounds include chalcogenides (compounds of oxygen, sulfur, selenium, or tellurium), halides (compounds of fluorine, chlorine, bromine, or iodine), and their combinations. The ions with permanent dipole moments in these materials are manganese, chromium (Cr), and europium (Eu); the others are diamagnetic. At low temperatures, the rare-earth metals holmium (Ho) and erbium (Er) have a nonparallel moment arrangement that gives rise to a substantial spontaneous magnetization. Some ionic compounds with the spinel crystal structure also possess ferromagnetic ordering. A different structure leads to a spontaneous magnetization in thulium (Tm) below 32 kelvins (K).

Mott transition

The Mott transition describes the transition of a material from an insulating to a metallic state. It appears when the electron density, and therefore the electron screening of the Coulomb potential, changes. Normally we consider a material to be either a metal or an insulator, depending on the position of the Fermi energy within the band structure; but, due to screening, a transition can take place. To understand this, consider an electron in a finite quantum well: there is only a finite number of bound states inside the well. If its width is decreased, all states move up in energy and the highest ones move outside the well; therefore the number of bound states decreases until a critical value is reached, below which there are no more bound states. Now consider an insulating material with a certain lattice and long distances between the atoms: if the atoms are moved closer together, the electron density increases, screening of the Coulomb potential appears and the energy levels move up. After a certain point there are no more bound states for the outer electrons and the material becomes a metal [90].

Tunnel Junctions

As stated in the paper by Tsymbal and Kohlstedt, The phenomenon of electron tunneling has been known since the advent of quantum mechanics, but it continues to enrich our understanding of many fields of physics, as well as offering a route toward useful devices. A tunnel junction consists of two metal electrodes separated by a nanometer-thick insulating barrier layer, as was first discussed by Frenkel in 1930. Although forbidden by classical physics, an electron is allowed to traverse a potential barrier that exceeds the electron's energy. The electron therefore has a finite probability of being found on the opposite side of the barrier. A famous example is electron tunneling in superconducting tunnel junctions, discovered by Giaever, that allowed measurement of important properties of superconductors. In the 1970s, spin-dependent electron tunneling from ferromagnetic metal electrodes across an amorphous Al2O3 film was observed by Tedrow and Meservey. The latter discovery led Jullière to propose and demonstrate a magnetic tunnel junction in which the tunneling current depends on the relative magnetization orientation of the two ferromagnetic electrodes, the phenomenon nowadays known as tunneling (or junction) magnetoresistance. New kinds of tunnel junctions may be very useful for various technological applications. For example, magnetic tunnel junctions have recently attracted considerable interest due to their potential application in spin-electronic devices such as magnetic field sensors and magnetic random access memories. [132].

Tunnel junctions are thus electronic devices built from stacked layers of potentially different materials, acting as resistive switching elements and containing at least one tunnel barrier. The term tunnel refers to the mechanism by which electrons cross the barrier: direct tunneling as studied in quantum mechanics. The effective resistive switching depends on the underlying physical principle, but the common effect is a modulation of the electronic potential barrier between layers, resulting in a change in the resistivity of the tunneling layer(s).
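As a reminder of the quantum-mechanical mechanism behind direct tunneling (a standard textbook result, not taken from [132]), the transmission probability through a rectangular barrier of height $V_0$ and thickness $d$ for an electron of energy $E < V_0$ decays exponentially with the barrier thickness:

```latex
% Standard expression for tunneling through a rectangular barrier.
\[
T \;\propto\; e^{-2\kappa d},
\qquad
\kappa = \frac{\sqrt{2m\,(V_0 - E)}}{\hbar}
\]
```

This exponential dependence is the reason why the barrier layers of tunnel junctions must be only a few nanometers thick.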

Ferromagnetic Tunnel Junctions

A magnetic tunnel junction consists of a sandwich of two magnetic layers separated by a thin barrier. One of the two magnetic layers has a fixed magnetic polarization (the fixed layer), whereas in the other ferromagnetic layer (the free layer) the magnetization can be switched. A different magnetic polarization in the free layer interacts with the polarization of the fixed layer, changing the resistance across the tunneling layer (an effect related to the giant magnetoresistive effect and nowadays described as tunneling, or junction, magnetoresistance).
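The size of this resistance change is commonly quantified with Jullière's model (a standard result, reported here only for reference and not quoted from the sources above):

```latex
% Julliere's model for the tunneling magnetoresistance (TMR) ratio.
\[
\mathrm{TMR} \;=\; \frac{R_{AP} - R_{P}}{R_{P}}
            \;=\; \frac{2\,P_1 P_2}{1 - P_1 P_2}
\]
```

where $P_1$ and $P_2$ are the spin polarizations of the two ferromagnetic electrodes and $R_P$, $R_{AP}$ are the junction resistances for parallel and antiparallel magnetization alignment.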

Ferroelectric Tunnel Junctions

Again quoting Tsymbal and Kohlstedt, Yet another concept is the ferroelectric tunnel junction (FTJ), which takes advantage of a ferroelectric as the barrier material. Ferroelectrics possess a spontaneous electric polarization that can be switched by an applied electric field. This adds a new functional property to a tunnel junction, which may lead to novel, yet undiscovered electronic devices based on FTJs. The discovery of ferroelectricity goes back to 1921, approximately when the principles of quantum mechanical electron tunneling were formulated. The basic idea of a FTJ (called a polar switch at that time) was formulated in 1971 by Esaki et al. Owing to a reversible electric polarization, FTJs are expected to have current-voltage characteristics different from those of conventional tunnel junctions. The electric field-induced polarization reversal of a ferroelectric barrier may have a profound effect on the conductance of a FTJ, leading to resistive switching when the magnitude of the applied field equals that of the coercive field of the ferroelectric. Indeed, the polarization reversal alters the sign of the polarization charges at a barrier-electrode interface.

Each ferroelectric tunnel junction is thus a device in which two electrodes sandwich a tunnel barrier with ferroelectric properties [28]. The electric polarization of the barrier can be switched by applying an opposite electric field (sufficiently high to reach the coercive field of the ferroelectric), causing a change in the electronic potential barrier and, in turn, a different conductance by means of the giant electroresistance effect [149, 125].

Field Effect Transistor

Transistors are fundamental semiconductor devices featuring three electrodes. A potential difference applied to one electrode (the gate) influences the passage of a current between the other two electrodes (source and drain). Transistors are used either as switches or as amplifiers. There are two main types of transistors:

- bipolar junction transistors;
- field effect transistors.

A typical field effect transistor (FET) schema is shown in figure A.1 [57, p. 247]. If the potential difference applied to G is below the threshold, there is no conducting channel between S and D; otherwise, a conductive channel is established between S and D, thus allowing current to flow.
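As a worked complement to the threshold behavior just described, a first-order (square-law) model commonly found in textbooks (given here as a reference, not taken from [57]) expresses the drain current of an n-channel MOSFET as:

```latex
% First-order model of an n-channel MOSFET; the second case assumes
% operation in saturation (V_DS >= V_GS - V_th).
\[
I_D =
\begin{cases}
0 & V_{GS} \le V_{th} \quad \text{(no conducting channel)}\\[4pt]
\dfrac{1}{2}\,\mu_n C_{ox}\,\dfrac{W}{L}\,\bigl(V_{GS}-V_{th}\bigr)^{2} & V_{GS} > V_{th} \quad \text{(saturation)}
\end{cases}
\]
```

where $\mu_n$ is the carrier mobility, $C_{ox}$ the gate-oxide capacitance per unit area and $W/L$ the channel width-to-length ratio.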

Figure A.1: Field effect transistor, perspective (a) and front (b). © 2001 The McGraw Companies.

Memristor

In 1971 Leon O. Chua hypothesized the existence of these devices, a fourth basic type of electrical element alongside resistors, capacitors and inductors [21]. Fascinating studies on memristance have been undertaken since Chua's hypothesis, because this type of device could change the computing paradigm: networks of memristors could supersede transistors in the functional units of a processor and could be used to build a computing paradigm based on neural networks [140]. Researchers claim that, finally, the new persistent memories using the 2-terminal configuration are full-fledged memristors if their switching mechanics is implicitly embedded inside them, as is the case for redox memory cells [127].
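Chua defined the memristor as the element that links electric charge and flux linkage; a minimal statement of that definition (paraphrased from the concept introduced in [21]) is:

```latex
% Memristor definition: flux linkage as a function of charge.
\[
d\varphi = M(q)\,dq
\quad\Longrightarrow\quad
v(t) = M\bigl(q(t)\bigr)\, i(t)
\]
```

so that the memristance $M(q)$ acts as a resistance whose value depends on the history of the current that has flowed through the device, a property reminiscent of the resistive switching behavior discussed in this work.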

A.3 Operating systems

Hardware technology influences file system design

As already stated, file systems are one of the sources of added latency and reduced throughput. Moreover, it has been said that this impact differs among file systems, depending on their internal design. File systems are software components designed to serialize and maintain data persistently on a persistent memory device. However, for more than fifty years, these persistent memory devices have always been identified simply with hard disks. Flash, although it appeared in 1984, is still considered a sort of newcomer. During this long time, file systems have necessarily adapted to the features of hard disks: since they need to guarantee the safe endurance of data in time, and this goal can be achieved more effectively only when the internals of the memory medium are exploited (or, at least, known and taken into account). An example is the transition from the old traditional Unix file system (the one developed at Bell Labs) to the newer Unix Fast File System, then called UFS [51]:

- the new file system distributed inodes throughout the disk near the data blocks they pointed to, in order to drastically reduce seek time and the need to execute random reads;
- the new file system was organized into cylinder groups: one of the effects was the added redundancy used to replicate the superblock in such a way that it was distributed among cylinders and platters too (to obtain better resiliency upon a single-platter failure).

It is apparent that the physical structure of hard disks was thoroughly taken into account.

The same adaptation to the physical features of the memory media happened when Flash memories became widespread: many file systems have been specifically designed for Flash memories (JFFS, YAFFS, YAFFS2, UBIFS, and so on). As an aside, it is interesting to note that also in the case of common SSDs or common Flash USB sticks, even if the internal architecture is hidden, different file system settings can change performance because they adapt better or worse to the underlying architecture: this is the case of file system block sizes. Some file system block sizes are well suited to the dimensions of the erase blocks and of the internal blocks of the Flash chips, whereas others are not [87, 92]. These observations can easily explain why ext3 performs so badly when used with fast SSDs.
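As a small illustration of the block-size argument above, the following self-contained C sketch counts how many Flash erase blocks a single file-system block write touches; the 4 KiB and 512 KiB sizes are illustrative assumptions, not figures taken from the cited sources.

```c
/* Counts the erase blocks touched by a write, to show why file-system
 * block size and alignment matter on Flash. Sizes are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define FS_BLOCK  (4u * 1024u)        /* hypothetical file-system block */
#define ERASE_BLK (512u * 1024u)      /* hypothetical Flash erase block  */

/* Number of erase blocks spanned by a write of 'len' bytes at 'offset'. */
static unsigned erase_blocks_touched(uint64_t offset, uint64_t len)
{
    uint64_t first = offset / ERASE_BLK;
    uint64_t last  = (offset + len - 1) / ERASE_BLK;
    return (unsigned)(last - first + 1);
}

int main(void)
{
    /* An aligned 4 KiB block stays inside one erase block... */
    printf("aligned:    %u erase block(s)\n",
           erase_blocks_touched(8 * (uint64_t)ERASE_BLK, FS_BLOCK));
    /* ...whereas the same write straddling an erase-block boundary forces
     * the device to read-modify-write two of them. */
    printf("misaligned: %u erase block(s)\n",
           erase_blocks_touched(8 * (uint64_t)ERASE_BLK - FS_BLOCK / 2, FS_BLOCK));
    return 0;
}
```

A file system whose block size divides the erase block size, and whose partitions start on an erase-block boundary, avoids the second case entirely.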

Appendix B

Tables

Rank  Gartner [74]                                      IEEE [80]
 1    Computing Everywhere                              Wearable devices
 2    The Internet of Things (IoT)                      Internet of Anything
 3    3D Printing                                       Security into software design
 4    Advanced, Pervasive, Invisible Analytics          Software-defined Anything (SDx)
 5    Context-Rich Systems                              Cloud security and privacy concerns grow
 6    Smart Machines                                    3D Printing
 7    Cloud/Client Architecture                         Predictive Analytics
 8    Software-Defined Infrastructure and Applications  Embedded Computing security
 9    Web-Scale IT                                      Augmented Reality Applications
10    Risk-Based Security and Self-Protection           Smartphones: new opportunities for Digital Health

Table B.1: Top 10 technology trends for 2015

Parameters                 NAND Flash   FeRAM    STT-MRAM   PCM       DRAM (baseline)
Feature size (nm)          16           180      65         45        36
Cell area (F^2)            4            22       20         4         6
Read latency (ns)          100          40       35         12        < 10
Write/Erase latency (ns)   100/1000     65       35         100       < 10
Endurance                  10^5         10^14    > 10^12    10^9      > 10^16
Data retention             10 y         10 y     > 10 y     > 10 y    64 ms
Write energy (fJ)          0.4          30       2500       6000      4

(NAND Flash, FeRAM, STT-MRAM and PCM are the prototypical technologies, DRAM is the baseline; for the emerging memories, TCM, BNF, FTJ, VCM and ECM redox cells, see the ITRS tables ERD4a and ERD4b.)

Table B.2: Performance comparison between memories. Figures come from the ITRS 2013 Emerging Research Devices tables ERD3, ERD4a and ERD4b [82]. The figures about the power consumption of DRAM and Flash could contain some problems related to how the values are calculated (table ERD3).

Technology - operation     Latency (µs)   4K bit per bit (µs)   4K 64 bit (µs)
PCM - Write                0.1            409.6                 51.2
PCM - Read                 0.012          49.152                6.144
Emerging memory - Write    0.02           81.92                 10.24
Emerging memory - Read     0.005          20.48                 2.56

Table B.3: 4K transfer times with PCM and other memories
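Assuming the entries above are obtained by multiplying the per-operation latency by the number of transfers needed for a 4K block (4096 single-unit transfers, or 512 transfers of 64 bits), they can be reproduced directly; for a PCM write, for instance:

```latex
% How the 4K transfer times of table B.3 appear to be computed (PCM write).
\[
0.1\,\mu\mathrm{s} \times 4096 = 409.6\,\mu\mathrm{s},
\qquad
0.1\,\mu\mathrm{s} \times 512 = 51.2\,\mu\mathrm{s}
\]
```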

Bus               Transfers/s   bit/transfer   Payload     Bus transfer time   Read      Read 64 bit   Write      Write 64 bit
SATA III          6G            0.8            600 MB/s    6.83 µs             7.2x      0.9x          59.97x     7.5x
PCI Express gen3  8G            0.98           985 MB/s    4.16 µs             11.82x    1.48x         98.46x     12.31x
DDR3              1333M         64             10.6 GB/s   0.39 µs             126.03x   15.75x        1050.26x   131.28x
Intel QPI         6.4G          16             12.8 GB/s   0.32 µs             153.6x    19.2x         1280x      160x

Table B.4: Bus latency comparison. Bus transfer time is the time elapsed when transferring 4K at the theoretical speed of the bus. The last four columns are the ratios between the memory transfer times of table B.3 and the bus transfer time. (Years reported for the buses: 2005, 2007, 2008 and 2010.)

Bus               Bus transfer time   HDD latency (1000 µs) / bus   SSD latency (100 µs) / bus
SATA III          6.83 µs             146x                          14.64x
PCI Express gen3  4.16 µs             240x                          24x
DDR3              0.39 µs             2564x                         256x
Intel QPI         0.32 µs             3125x                         312x

Table B.5: HDD speed vs bus theoretical speed. The last two columns are the ratios between a typical HDD latency (1000 µs) or SSD latency (100 µs) and the bus transfer time.
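The bus transfer times and the ratios in tables B.4 and B.5 can be reproduced from the payload bandwidths and from the latencies of table B.3; taking SATA III and the PCM read figure as an example:

```latex
% Example: SATA III bus transfer time for 4 KiB and two of the derived ratios.
\[
t_{\mathrm{bus}} = \frac{4096\ \mathrm{B}}{600\ \mathrm{MB/s}} \approx 6.83\,\mu\mathrm{s},
\qquad
\frac{49.152\,\mu\mathrm{s}}{6.83\,\mu\mathrm{s}} \approx 7.2\times,
\qquad
\frac{1000\,\mu\mathrm{s}}{6.83\,\mu\mathrm{s}} \approx 146\times
\]
```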

Name    Type
Linux   Std FS + DAX
Quill   S.FS + XIP + library
PRAMFS  SCM FS
PMFS    SCM FS
SCMFS   SCM FS
BPFS    SCM FS

Table B.6: Persistence awareness through file systems. For each system the table compares the approach used (block driver, driver, translator, memory manager, file system, cache avoidance) and whether the integration, consistency (optional), safety (required) and efficiency requirements are met.

Linux and Quill are developed to use only standard file systems with minimal changes (DAX and XIP compliance respectively). SCM FS: storage class memory file system, i.e. the developers have built a file system specifically suited to the features of SCMs. Such a specific design usually translates into efficiency: the file systems are built to use SCMs efficiently; in these cases, cache avoidance is taken for granted.

Bibliography

[1] Ameen Akel et al. "Onyx: A Protoype Phase Change Memory Storage Array". In: Proceedings of the 3rd USENIX Conference on Hot Topics in Storage and File Systems (HotStorage'11). Portland, OR: USENIX Association, 2011. url: http://dl.acm.org/citation.cfm?id=2002218.2002220.
[2] Louis Alex Eisner, Todor Mollov, and Steven Swanson. Quill: Exploiting Fast Non-Volatile Memory by Transparently Bypassing the File System. 2013.
[3] Ross Anderson and Markus Kuhn. "Tamper Resistance: A Cautionary Note". In: Proceedings of the Second USENIX Workshop on Electronic Commerce (WOEC'96). Oakland, California: USENIX Association, 1996. url: http://dl.acm.org/citation.cfm?id=1267167.1267168.
[4] Dmytro Apalkov et al. "Spin-transfer Torque Magnetic Random Access Memory (STT-MRAM)". In: J. Emerg. Technol. Comput. Syst. 9.2 (2013). url: http://doi.acm.org/10.1145/2463585.2463589.
[5] Wolfgang Arden et al. More-than-Moore. Tech. rep. ITRS, 2010. url: http://www.itrs.net/ITRS%201999-2014%20Mtgs,%20Presentations%20&%20Links/2010ITRS/IRC-ITRS-MtM-v2%203.pdf.
[6] M. P. Atkinson et al. "An Orthogonally Persistent Java". In: SIGMOD Rec. 25.4 (1996), pp. 68–75. url: http://doi.acm.org/10.1145/245882.245905.
[7] Malcolm Atkinson and Ronald Morrison. "Orthogonally Persistent Object Systems". In: The VLDB Journal 4.3 (1995), pp. 319–402. url: http://dl.acm.org/citation.cfm?id=615224.615226.
[8] Katelin A. Bailey et al. "Exploring Storage Class Memory with Key Value Stores". In: Proceedings of the 1st Workshop on Interactions of NVM/FLASH with Operating Systems and Workloads (INFLOW '13). Farmington, Pennsylvania: ACM, 2013. url: http://doi.acm.org/10.1145/2527792.2527799.
[9] Mary Baker et al. "Non-volatile Memory for Fast, Reliable File Systems". In: Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V). Boston, Massachusetts, USA: ACM, 1992, pp. 10–22. url: http://doi.acm.org/10.1145/143365.143380.
[10] Andrew Baumann et al. "The Multikernel: A New OS Architecture for Scalable Multicore Systems". In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP '09). Big Sky, Montana, USA: ACM, 2009, pp. 29–44. url: http://doi.acm.org/10.1145/1629575.1629579.
[11] Tony Benavides et al. "The Enabling of an Execute-In-Place Architecture to Reduce the Embedded System Memory Footprint and Boot Time". In: JCP 3.1 (2008), pp. 79–89. url: http://dx.doi.org/10.4304/jcp.3.1.79-89.
[12] Keren Bergman et al. ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. Peter Kogge, Editor & Study Lead. 2008.
[13] R. Bez et al. "Introduction to flash memory". In: Proceedings of the IEEE 91.4 (2003), pp. 489–502. url: http://dx.doi.org/10.1109/JPROC.2003.811702.
[14] Tim R. Bird. "Methods to Improve Bootup Time in Linux". In: Proceedings of the Linux Symposium 2004. Vol. I. 2004, pp. 79–88.
[15] Julien Borghetti et al. "Memristive switches enable stateful logic operations via material implication". In: Nature 464.7290 (2010), pp. 873–876. url: http://dx.doi.org/10.1038/nature08940.
[16] Adrian M. Caulfield et al. "Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories". In: Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '43). Washington, DC, USA: IEEE Computer Society, 2010, pp. 385–395. url: http://dx.doi.org/10.1109/MICRO.2010.33.
[17] "Understanding the Impact of Emerging Non-Volatile Memories on High-Performance, IO-Intensive Computing". In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–11. url: http://dx.doi.org/10.1109/SC.2010.56.
[18] Ting-Chang Chang et al. "Developments in nanocrystal memory". In: Materials Today 14.12 (2011), pp. 608–615. url: http://www.sciencedirect.com/science/article/pii/S1369702111703029.
[19] Peter M. Chen et al. "The Rio File Cache: Surviving Operating System Crashes". In: Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII). Cambridge, Massachusetts, USA: ACM, 1996, pp. 74–83. url: http://doi.acm.org/10.1145/237090.237154.
[20] Sangyeun Cho and Hyunjin Lee. "Flip-N-Write: A simple deterministic technique to improve PRAM write performance, energy and endurance". In: Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on. 2009, pp. 347–357.
[21] L. O. Chua. "Memristor-The missing circuit element". In: IEEE Transactions on Circuit Theory 18.5 (1971), pp. 507–519. url: http://dx.doi.org/10.1109/TCT.1971.1083337.
[22] NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories. Vol. 39. ACM SIGARCH Computer Architecture News 1. 2011, pp. 105–118. url: http://dl.acm.org/citation.cfm?id=1950380.
[23] Jeremy Condit et al. "Better I/O through byte-addressable, persistent memory". In: Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. 2009, pp. 133–146. url: http://dl.acm.org/citation.cfm?id=1629589.
[24] R. H. Dennard. "Technical literature [Reprint of 'Field-Effect Transistor Memory' (US Patent No. 3,387,286)]". In: Solid-State Circuits Society Newsletter, IEEE 13.1 (2008), pp. 17–25. url: http://dx.doi.org/10.1109/N-SSC.2008.4785686.
[25] Subramanya R. Dulloor et al. "System Software for Persistent Memory". In: Proceedings of the Ninth European Conference on Computer Systems (EuroSys '14). Amsterdam, The Netherlands: ACM, 2014. url: http://doi.acm.org/10.1145/2592798.2592814.
[26] W. Enck et al. "Defending Against Attacks on Main Memory Persistence". In: Computer Security Applications Conference, 2008. ACSAC 2008. Annual. 2008, pp. 65–74. url: http://dx.doi.org/10.1109/ACSAC.2008.45.
[27] Michael Fitsilis and Rainer Waser. "Scaling of the ferroelectric field effect transistor and programming concepts for non-volatile memory applications". PhD thesis. Aachen: Fakultät für Elektrotechnik und Informationstechnik der Rheinisch-Westfälischen Technischen Hochschule Aachen, 2005. url: http://publications.rwth-aachen.de/record/62096.
[28] Vincent Garcia and Manuel Bibes. "Ferroelectric tunnel junctions for information storage and processing". In: Nat Commun 5 (2014). Review. url: http://dx.doi.org/10.1038/ncomms5289.
[29] Paolo Gargini. "The Roadmap to Success: 2013 ITRS Update". In: Seminar on 2013 ITRS Roadmap Update. 2014. url: http://www.ewh.ieee.org/r6/scv/eds/slides/2014-Mar-11-Paolo.pdf.
[30] Bharan Giridhar et al. "Exploring DRAM Organizations for Energy-efficient and Resilient Exascale Memories". In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13). Denver, Colorado: ACM, 2013. url: http://doi.acm.org/10.1145/2503210.2503215.
[31] Theo Haerder and Andreas Reuter. "Principles of Transaction-oriented Database Recovery". In: ACM Comput. Surv. 15.4 (1983), pp. 287–317. url: http://doi.acm.org/10.1145/289.291.
[32] J. Alex Halderman et al. "Lest We Remember: Cold-boot Attacks on Encryption Keys". In: Commun. ACM 52.5 (2009), pp. 91–98. url: http://doi.acm.org/10.1145/1506409.1506429.
[33] Lance Hammond et al. "Transactional Memory Coherence and Consistency". In: Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA '04). München, Germany: IEEE Computer Society, 2004, p. 102. url: http://dl.acm.org/citation.cfm?id=998680.1006711.
[34] Maurice Herlihy and J. Eliot B. Moss. "Transactional Memory: Architectural Support for Lock-free Data Structures". In: SIGARCH Comput. Archit. News 21.2 (1993), pp. 289–300. url: http://doi.acm.org/10.1145/173682.165164.
[35] Dave Hitz, James Lau, and Michael Malcolm. "File System Design for an NFS File Server Appliance". In: Proceedings of the USENIX Winter 1994 Technical Conference (WTEC'94). San Francisco, California: USENIX Association, 1994. url: http://dl.acm.org/citation.cfm?id=1267074.1267093.
[36] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Automi, linguaggi e calcolabilità. Prima Edizione Italiana. Pearson Education, 2003.
[37] H. Hunter, L. A. Lastras-Montaño, and B. Bhattacharjee. "Adapting Server Systems for New Memory Technologies". In: Computer 47.9 (2014), pp. 78–84. url: http://dx.doi.org/10.1109/MC.2014.233.
[38] Ju-Young Jung and Sangyeun Cho. "Dynamic Co-management of Persistent RAM Main Memory and Storage Resources". In: Proceedings of the 8th ACM International Conference on Computing Frontiers (CF '11). Ischia, Italy: ACM, 2011. url: http://doi.acm.org/10.1145/2016604.2016620.
[39] Tolga Kaya and Hur Koser. "A New Batteryless Active RFID System: Smart RFID". In: RFID Eurasia, 2007 1st Annual. 2007, pp. 1–4. url: http://dx.doi.org/10.1109/RFIDEURASIA.2007.4368151.
[40] Kyung Min Kim, Doo Seok Jeong, and Cheol Seong Hwang. "Nanofilamentary resistive switching in binary oxide system: a review on the present status and outlook". In: Nanotechnology 22.25 (2011), p. 254002. url: http://dx.doi.org/10.1088/0957-4484/22/25/254002.
[41] Myungsik Kim, Jinchul Shin, and Youjip Won. "Selective Segment Initialization: Exploiting NVRAM to Reduce Device Startup Latency". In: Embedded Systems Letters, IEEE 6.2 (2014), pp. 33–36.
[42] Young-Jin Kim et al. "I/O Performance Optimization Techniques for Hybrid Hard Disk-Based Mobile Consumer Devices". In: Consumer Electronics, IEEE Transactions on 53.4 (2007), pp. 1469–1476. url: http://dx.doi.org/10.1109/TCE.2007.4429239.
[43] B. T. Kolomiets. "Vitreous Semiconductors (I)". In: physica status solidi (b) 7.2 (1964), pp. 359–372. url: http://dx.doi.org/10.1002/pssb.19640070202.
[44] Charles Lamb et al. "The ObjectStore Database System". In: Commun. ACM 34.10 (1991), pp. 50–63. url: http://doi.acm.org/10.1145/125223.125244.
[45] Simon Lavington. "In the Footsteps of Colossus: A Description of Oedipus". In: IEEE Ann. Hist. Comput. 28.2 (2006), pp. 44–55. url: http://dx.doi.org/10.1109/MAHC.2006.34.
[46] Benjamin C. Lee et al. "Architecting Phase Change Memory As a Scalable Dram Alternative". In: Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09). Austin, TX, USA: ACM, 2009, pp. 2–13. url: http://doi.acm.org/10.1145/1555754.1555758.
[47] B. Liskov et al. "Safe and Efficient Sharing of Persistent Objects in Thor". In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD '96). Montreal, Quebec, Canada: ACM, 1996, pp. 318–329. url: http://doi.acm.org/10.1145/233269.233346.
[48] T. P. Ma and Jin-Ping Han. "Why is nonvolatile ferroelectric memory field-effect transistor still elusive?" In: Electron Device Letters, IEEE 23.7 (2002), pp. 386–388. url: http://dx.doi.org/10.1109/LED.2002.1015207.
[49] F. Masuoka et al. "A new flash E2PROM cell using triple polysilicon technology". In: Electron Devices Meeting, 1984 International. Vol. 30. 1984, pp. 464–467. url: http://dx.doi.org/10.1109/IEDM.1984.190752.
[50] Brian Matas and Christian De Suberbasaux. MEMORY 1997. Integrated Circuit Engineering Corporation, 1997. url: http://smithsonianchips.si.edu/ice/cd/MEMORY97/title.pdf.
[51] Marshall K. McKusick et al. "A Fast File System for UNIX". In: ACM Trans. Comput. Syst. 2.3 (1984), pp. 181–197. url: http://doi.acm.org/10.1145/989.990.
[52] Stephan Menzel et al. "Switching kinetics of electrochemical metallization memory cells". In: Phys. Chem. Chem. Phys. 15.18 (2013), pp. 6945–6952. url: http://dx.doi.org/10.1039/C3CP50738F.
[53] Jeffrey C. Mogul et al. "Operating System Support for NVM+DRAM Hybrid Main Memory". In: Proceedings of the 12th Conference on Hot Topics in Operating Systems (HotOS'09). Monte Verità, Switzerland: USENIX Association, 2009. url: http://dl.acm.org/citation.cfm?id=1855568.1855582.
[54] G. E. Moore. "No exponential is forever: but 'Forever' can be delayed! [semiconductor industry]". In: Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC. 2003 IEEE International. 2003, pp. 20–23. url: http://dx.doi.org/10.1109/ISSCC.2003.1234194.
[55] O. Mutlu. "Memory scaling: A systems architecture perspective". In: Memory Workshop (IMW), 2013 5th IEEE International. 2013, pp. 21–25. url: http://dx.doi.org/10.1109/IMW.2013.6582088.
[56] Whole-system persistence. Vol. 40. ACM SIGARCH Computer Architecture News 1. 2012, pp. 401–410. url: http://dl.acm.org/citation.cfm?id=2151018.
[57] Donald A. Neamen. Electronic Circuit Analysis and Design. 2nd ed. McGraw-Hill, 2000.
[58] Rajesh Nishtala et al. "Scaling Memcache at Facebook". In: Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). Lombard, IL: USENIX, 2013, pp. 385–398. url: https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/nishtala.
[59] S. Oikawa. "Virtualizing Storage as Memory for High Performance Storage Access". In: Parallel and Distributed Processing with Applications (ISPA), 2014 IEEE International Symposium on. 2014, pp. 18–25. url: http://dx.doi.org/10.1109/ISPA.2014.12.
[60] Shuichi Oikawa. "Non-volatile main memory management methods based on a file system". In: SpringerPlus 3.1 (2014), p. 494. url: http://www.springerplus.com/content/3/1/494.
[61] [online]. 3D NAND: Benefits of Charge Traps over Floating Gates. 2013. url: http://thememoryguy.com/3d-nand-benefits-of-charge-traps-over-floating-gates/.
[62] [online]. Amazon ElastiCache. Last accessed: 2015. url: http://aws.amazon.com/elasticache/.
[63] [online]. An idiosyncratic survey of Spintronics. Last accessed: 2015. url: https://physics.tamu.edu/calendar/talks/cmseminars/cm_talks/2007_10_18_Levy_P.pdf.
[64] [online]. BEE3. Last accessed: 2015. url: http://research.microsoft.com/en-us/projects/bee3/.
[65] [online]. Big Data. Last accessed: 2015. url: http://lookup.computerlanguage.com/host_app/search?cid=C999999&term=Big%20Data.
[66] [online]. Comparing Technologies: MRAM vs. FRAM. Last accessed: 2015. url: http://www.everspin.com/PDF/EST02130_Comparing_Technologies_FRAM_vs_MRAM_AppNote.pdf.
[67] [online]. DARPA Developing ExtremeScale Supercomputer System. 2010. url: http://www.darpa.mil/WorkArea/DownloadAsset.aspx?id=1795.
[68] [online]. Datacenter Construction Expected To Boom. 2014. url: http://www.enterprisetech.com/2014/04/17/datacenter-construction-expected-boom/.
[69] [online]. Dell PowerEdge R920 Data Sheet. Last accessed: 2015. url: http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/PowerEdge_R920_Spec-Sheet.pdf.
[70] [online]. European Exascale Software Initiative [Home Page]. 2013. url: http://www.eesi-project.eu/pages/menu/homepage.php.
[71] [online]. FRAM Structure. Last accessed: 2015. url: http://www.fujitsu.com/global/products/devices/semiconductor/memory/fram/overview/structure/.
[72] [online]. Fundamentals of volatile memory technologies. 2011. url: http://www.electronicproducts.com/Digital_ICs/Memory/Fundamentals_of_volatile_memory_technologies.aspx.
[73] [online]. Further adventures in non-volatile memory. Last accessed: 2015. url: https://www.youtube.com/watch?v=UzsPnw11KX0.
[74] [online]. Gartner Identifies the Top 10 Strategic Technology Trends for 2015. 2014. url: http://www.gartner.com/newsroom/id/2867917.
[75] [online]. History of ferroelectrics. Last accessed: 2015. url: http://www.ieee-uffc.org/ferroelectrics/learning-e003.asp.
[76] [online]. How Does Flash Memory Store Data? Last accessed: 2015. url: https://product.tdk.com/info/en/techlibrary/archives/techjournal/vol01_ssd/contents03.html.
[77] [online]. HP and SK Hynix Cancel Plans to Commercialize Memristor-Based Memory in 2013. 2012. url: http://www.xbitlabs.com/news/storage/display/20120927125227_HP_and_Hynix_Cancel_Plans_to_Commercialize_Memristor_Based_Memory_in_2013.html.
[78] [online]. Hybrid Memory Cube Consortium - Home Page. Last accessed: 2015. url: http://www.hybridmemorycube.org/.
[79] [online]. IBM 350 disk storage unit. Last accessed: 2015. url: http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html.
[80] [online]. IEEE-CS Unveils Top 10 Technology Trends for 2015. 2014. url: http://www.computer.org/web/pressroom/2015-tech-trends.
[81] [online]. Introduction to the Java Persistence API. Last accessed: 2015. url: http://docs.oracle.com/javaee/6/tutorial/doc/bnbpz.html.
[82] [online]. ITRS 2013 ERD TABLES. Last accessed: 2015. url: https://www.dropbox.com/sh/2fme4y0avvv7uxs/AAAB10oeC7wNtQkFp5XAcenba/ITRS/2013ITRS/2013ITRS%20Tables_R1/ERD_2013Tables.xlsx?dl=0.
[83] [online]. ITRS 2013 EXECUTIVE SUMMARY. Last accessed: 2015. url: http://www.itrs.net/ITRS%201999-2014%20Mtgs,%20Presentations%20&%20Links/2013ITRS/2013Chapters/2013ExecutiveSummary.pdf.
[84] [online]. ITRS ERD 2013 REPORT. Last accessed: 2015. url: https://www.dropbox.com/sh/6xq737bg6pww9gq/AAAXRzGlUis1sVUxurZnMCY4a/2013ERD.pdf?dl=0.
[85] [online]. Log-structured file systems. 2009. url: http://lwn.net/Articles/353411/.
[86] [online]. Magnetic Core Memory. Last accessed: 2015. url: http://www.computerhistory.org/revolution/memory-storage/8/253.
[87] [online]. Managing flash storage with Linux. 2012. url: http://free-electrons.com/blog/managing-flash-storage-with-linux/.
[88] [online]. Mechanical roadmap points to hard drives over 100TB by 2025. 2014. url: http://techreport.com/news/27420/mechanical-roadmap-points-to-hard-drives-over-100tb-by-2025.
[89] [online]. More Details on Today's Outage. 2010. url: https://www.facebook.com/notes/facebook-engineering/more-details-on-todays-outage/431441338919.
[90] [online]. Mott transition. Last accessed: 2015. url: http://lamp.tu-graz.ac.at/~hadley/ss2/problems/mott/s.pdf.
[91] [online]. NVM Express and the PCI Express SSD Revolution. 2012. url: http://www.nvmexpress.org/wp-content/uploads/2013/04/IDF-2012-NVM-Express-and-the-PCI-Express-SSD-Revolution.pdf.
[92] [online]. Optimizing Linux with cheap flash drives. 2011. url: http://lwn.net/Articles/428584/.
[93] [online]. [PATCH v10 11/21] Replace XIP documentation with DAX. Last accessed: 2015. url: http://lwn.net/Articles/610316/.
[94] [online]. PCM BECOMES A REALITY. 2009. url: http://www.objective-analysis.com/uploads/2009-08-03_Objective_Analysis_PCM_White_Paper.pdf.
[95] [online]. Protected and Persistent RAM Filesystem. Last accessed: 2015. url: http://pramfs.sourceforge.net/.
[96] [online]. Samsung 850 PRO Specifications. Last accessed: 2015. url: http://www.samsung.com/global/business/semiconductor/minisite/SSD/global/html/ssd850pro/specifications.html.
[97] [online]. Seagate preps for 30TB laser-assisted hard drives. 2014. url: http://www.computerworld.com/article/2846415/seagate-preps-for-30tb-laser-assisted-hard-drives.html.
[98] [online]. Solid Memory by Toshiba. Last accessed: 2015. url: http://www.toshiba-memory.com/cms/en/meta/memory_division/about_us.html.
[99] [online]. Supporting filesystems in persistent memory. Last accessed: 2015. url: http://lwn.net/Articles/610174/.
[100] [online]. The Discovery of Giant Magnetoresistance. 2007. url: http://www.nobelprize.org/nobel_prizes/physics/laureates/2007/advanced-physicsprize2007.pdf.
[101] [online]. The High-k Solution. 2007. url: http://spectrum.ieee.org/semiconductors/design/the-highk-solution.
[102] [online]. The Inconvenient Truths of NAND Flash Memory. 2007. url: https://www.micron.com/~/media/documents/products/presentation/flash_mem_summit_jcooke_inconvenient_truths_nand.pdf.
[103] [online]. The Machine: A new kind of computer. Last accessed: 2015. url: http://www.hpl.hp.com/research/systems-research/themachine/.
[104] [online]. The Transition to PCI Express for Client SSDs. 2012. url: http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2012/20120821_S102C_Huffman.pdf.
[105] [online]. Ultrastar He8. Last accessed: 2015. url: http://www.hgst.com/hard-drives/enterprise-hard-drives/enterprise-sas-drives/ultrastar-he8.
[106] [online]. Understanding JPA. 2008. url: http://www.javaworld.com/article/2077817/java-se/understanding-jpa-part-1-the-object-oriented-paradigm-of-data-persistence.html?null.
[107] [online]. Understanding Moore's Law: Four Decades of Innovation. 2006. url: http://www.chemheritage.org/community/store/books-and-catalogs/understanding-moores-law.aspx.
[108] [online]. Ushering in the 3D Memory Era with V-NAND. 2013. url: http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2013/20130813_KeynoteB_Elliot_Jung.pdf.
[109] [online]. WD Black - Mobile Hard Drives. Last accessed: 2015. url: http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771435.pdf.
[110] [online]. Whole System Persistence Computer based on NVDIMM. 2014. url: https://www.youtube.com/watch?v=gFuXn2QHXWo.
[111] [online]. ZFS THE LAST WORD IN FILE SYSTEMS. 2008. url: http://lib.stanford.edu/files/pasig-spring08/RaymondClark_ZFS_Overiview.pdf.
[112] John Ousterhout et al. "The Case for RAMCloud". In: Commun. ACM 54.7 (2011), pp. 121–130. url: http://doi.acm.org/10.1145/1965724.1965751.
[113] Stanford R. Ovshinsky. "Reversible Electrical Switching Phenomena in Disordered Structures". In: Phys. Rev. Lett. 21.20 (1968), pp. 1450–1453. url: http://link.aps.org/doi/10.1103/PhysRevLett.21.1450.
[114] Stuart S. P. Parkin, Masamitsu Hayashi, and Luc Thomas. "Magnetic domain-wall racetrack memory". In: Science (New York, N.Y.) 320.5873 (2008), pp. 190–194. url: http://www.ncbi.nlm.nih.gov/pubmed/?term=18403702.
[115] David A. Patterson. "Latency Lags Bandwith". In: Commun. ACM 47.10 (2004), pp. 71–75. url: http://doi.acm.org/10.1145/1022594.1022596.
[116] Simon Peter et al. "Arrakis: The Operating System is the Control Plane". In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). Broomfield, CO: USENIX Association, 2014, pp. 1–16. url: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/peter.
[117] P. A. H. Peterson. "Cryptkeeper: Improving security with encrypted RAM". In: Technologies for Homeland Security (HST), 2010 IEEE International Conference on. 2010, pp. 120–126. url: http://dx.doi.org/10.1109/THS.2010.5655081.
[118] Moinuddin K. Qureshi et al. "Enhancing Lifetime and Security of PCM-based Main Memory with Start-gap Wear Leveling". In: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42). New York, New York: ACM, 2009, pp. 14–23. url: http://doi.acm.org/10.1145/1669112.1669117.
[119] D. C. Ralph and M. D. Stiles. "Spin transfer torques". In: Journal of Magnetism and Magnetic Materials 320.7 (2008), pp. 1190–1216. url: http://www.sciencedirect.com/science/article/pii/S0304885307010116.
[120] Simone Raoux, Welnic Wojciech, and Daniele Ielmini. "Phase Change Materials and Their Application to Nonvolatile Memories". In: Chemical Reviews 110.1 (2010), pp. 240–267. PMID: 19715293. url: http://dx.doi.org/10.1021/cr900040x.
[121] Ohad Rodeh, Josef Bacik, and Chris Mason. "BTRFS: The Linux B-Tree Filesystem". In: Trans. Storage 9.3 (2013). url: http://doi.acm.org/10.1145/2501620.2501623.
[122] Mendel Rosenblum and John K. Ousterhout. "The Design and Implementation of a Log-structured File System". In: ACM Trans. Comput. Syst. 10.1 (1992), pp. 26–52. url: http://doi.acm.org/10.1145/146941.146943.
[123] Log-structured Memory for DRAM-based Storage. In: Proceedings of the 12th USENIX Conference on File and Storage Technologies. Santa Clara, CA: USENIX, 2014, pp. 1–16. url: https://www.usenix.org/conference/fast14/technical-sessions/presentation/rumble.
[124] K. Sakui. "Professor Fujio Masuoka's Passion and Patience Toward Flash Memory". In: Solid-State Circuits Magazine, IEEE 5.4 (2013), pp. 30–33. url: http://dx.doi.org/10.1109/MSSC.2013.2278084.
[125] Rohit Soni et al. "Giant electrode effect on tunnelling electroresistance in ferroelectric tunnel junctions". In: Nat Commun 5 (2014). Article. url: http://dx.doi.org/10.1038/ncomms6414.
[126] D. B. Strukov and H. Kohlstedt. "Resistive switching phenomena in thin films: Materials, devices, and applications". In: MRS Bulletin 37.02 (2012), pp. 108–114. url: http://journals.cambridge.org/article_S0883769412000024.
[127] Dmitri B. Strukov et al. "The missing memristor found". In: Nature 453.7191 (2008), pp. 80–83. url: http://dx.doi.org/10.1038/nature06932.
[128] Ryan Stutsman and John Ousterhout. "Toward Common Patterns for Distributed, Concurrent, Fault-Tolerant Code". In: Presented as part of the 14th Workshop on Hot Topics in Operating Systems. Santa Ana Pueblo, NM: USENIX, 2013. url: https://www.usenix.org/toward-common-patterns-distributed-concurrent-fault-tolerant-code.
[129] S. Swanson and A. M. Caulfield. "Refactor, Reduce, Recycle: Restructuring the I/O Stack for the Future of Storage". In: Computer 46.8 (2013), pp. 52–59. url: http://dx.doi.org/10.1109/MC.2013.222.
[130] Andrew S. Tanenbaum and Albert S. Woodhull. Operating Systems - Design and Implementation. 3rd international ed. Pearson, 2009.
[131] Junji Tominaga et al. "Large Optical Transitions in Rewritable Digital Versatile Discs: An Interlayer Atomic Zipper in a SbTe Alloy". In: Symposium G - Phase-Change Materials for Reconfigurable Electronics and Memory Applications. Vol. 1072. MRS Proceedings. 2008. url: http://journals.cambridge.org/article_S1946427400030414.
[132] Evgeny Y. Tsymbal and Hermann Kohlstedt. "Tunneling Across a Ferroelectric". In: Science 313.5784 (2006), pp. 181–183. url: http://www.sciencemag.org/content/313/5784/181.short.
[133] Julian Turner. "Effects of Data Center Vibration on Compute System Performance". In: Proceedings of the First USENIX Conference on Sustainable Information Technology (SustainIT'10). San Jose, CA: USENIX Association, 2010. url: http://dl.acm.org/citation.cfm?id=1863159.1863164.
[134] P. Vettiger et al. "The 'millipede' - nanotechnology entering data storage". In: Nanotechnology, IEEE Transactions on 1.1 (2002), pp. 39–55. url: http://dx.doi.org/10.1109/TNANO.2002.1005425.
[135] Mnemosyne: Lightweight persistent memory. Vol. 39. ACM SIGARCH Computer Architecture News 1. 2011, pp. 91–104. url: http://dl.acm.org/citation.cfm?id=1950379.
[136] Yiqun Wang et al. "A 3us wake-up time nonvolatile processor based on ferroelectric flip-flops". In: ESSCIRC. IEEE, 2012, pp. 149–152. url: http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6331297.
[137] Rainer Waser et al. "Redox-Based Resistive Switching Memories – Nanoionic Mechanisms, Prospects, and Challenges". In: Advanced Materials 21.25-26 (2009), pp. 2632–2663. url: http://dx.doi.org/10.1002/adma.200900375.
[138] H. A. R. Wegener et al. "The variable threshold transistor, a new electrically-alterable, non-destructive read-only storage device". In: Electron Devices Meeting, 1967 International. Vol. 13. 1967, p. 70. url: http://dx.doi.org/10.1109/IEDM.1967.187833.
[139] S. A. Wolf et al. "Spintronics: A Spin-Based Electronics Vision for the Future". In: Science 294.5546 (2001), pp. 1488–1495. url: http://www.sciencemag.org/content/294/5546/1488.abstract.
[140] C. David Wright, Peiman Hosseini, and Jorge A. Vazquez Diosdado. "Beyond von-Neumann Computing with Nanoscale Phase-Change Memory Devices". In: Advanced Functional Materials 23.18 (2013), pp. 2248–2254. url: http://dx.doi.org/10.1002/adfm.201202383.
[141] Michael Wu and Willy Zwaenepoel. "eNVy: A Non-volatile, Main Memory Storage System". In: Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI). San Jose, California, USA: ACM, 1994, pp. 86–97. url: http://doi.acm.org/10.1145/195473.195506.
[142] Xiaojian Wu, Sheng Qiu, and A. L. Narasimha Reddy. "SCMFS: A File System for Storage Class Memory and Its Extensions". In: Trans. Storage 9.3 (2013). url: http://doi.acm.org/10.1145/2501620.2501621.
[143] Wm A. Wulf and Sally A. McKee. "Hitting the Memory Wall: Implications of the Obvious". In: SIGARCH Comput. Archit. News 23.1 (1995), pp. 20–24. url: http://doi.acm.org/10.1145/216585.216588.
[144] Yuan Xie. "Modeling, Architecture, and Applications for Emerging Memory Technologies". In: Design Test of Computers, IEEE 28.1 (2011), pp. 44–51.
[145] J. Joshua Yang et al. "Metal oxide memories based on thermochemical and valence change mechanisms". In: MRS Bulletin 37.02 (2012), pp. 131–137. url: http://journals.cambridge.org/article_S0883769411003563.
[146] Jisoo Yang, Dave B. Minturn, and Frank Hady. "When Poll is Better Than Interrupt". In: Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST'12). San Jose, CA: USENIX Association, 2012. url: http://dl.acm.org/citation.cfm?id=2208461.2208464.
[147] Yiying Zhang et al. "Mojim: A Reliable and Highly-Available Non-Volatile Memory System". In: ASPLOS '15, March 14–18, 2015, Istanbul, Turkey. 2015.
[148] Ping Zhou et al. "A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology". In: Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09). Austin, TX, USA: ACM, 2009, pp. 14–23. url: http://doi.acm.org/10.1145/1555754.1555759.
[149] M. Ye. Zhuravlev et al. "Giant Electroresistance in Ferroelectric Tunnel Junctions". In: Phys. Rev. Lett. 94 (24 2005), p. 246802. url: http://link.aps.org/doi/10.1103/PhysRevLett.94.246802.


Acknowledgments

The idea for this work was conceived thanks to Professor De Paoli, who told me about the new memories that are the subject of this work: I would like to thank him for both the idea and the trust he placed in me. Professor Mariani was asked to help me as advisor of this final work: I would like to thank him for his advice and his helpfulness, which have been valuable. Many thanks to all the people who supported me during this tough time.
