A NOR Emulation Strategy over NAND Flash Memory

Jian-Hong Lin†, Yuan-Hao Chang†, Jen-Wei Hsieh§, Tei-Wei Kuo†, and Cheng-Chih Yang‡∗

† Graduate Institute of Networking and Multimedia, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan 106, R.O.C.
{r94944003, d93944006, ktw}@csie.ntu.edu.tw

§ Department of Computer Science and Information Engineering, National Chiayi University, Chiayi, Taiwan 60004, R.O.C.
[email protected]

‡ Product Development Firmware Engineering Group, Genesys Logic, Inc., Taipei, Taiwan 231, R.O.C.
[email protected]
Abstract

This work is motivated by a strong market demand for replacing NOR flash memory with NAND flash memory to cut down the cost of many embedded-system designs, such as mobile phones. Different from LRU-related caching or buffering studies, we are interested in prediction-based prefetching based on given execution traces of application executions. An implementation strategy is proposed for storing the prefetching information with limited SRAM and run-time overheads. An efficient prediction procedure is presented, based on information extracted from application executions, to reduce the read performance gap between NAND flash memory and NOR flash memory. With the behavior of a target application extracted from a set of collected traces, we show that NOR data accesses can be served effectively over the proposed implementation.
Keywords: NAND, NOR, flash memory, data caching
1 INTRODUCTION

While NAND flash memory (referred to as NAND for short) has become a popular alternative in the implementation of storage systems, NOR flash memory (referred to as NOR for short) is widely adopted in embedded system designs to store and run programs. Compared to NAND, NOR has good read performance and supports XIP (eXecute-In-Place) to run programs directly. Since NAND is much less expensive and performs better in writes, it is popularly adopted in storage system implementations [19, 20, 24, 27]. Because of the cost difference between NAND and NOR, it becomes increasingly attractive to replace NOR with NAND in embedded system designs, such as those of mobile phones. The demand keeps increasing significantly because the difference might become even wider in the near future (the cost of 8Gb NOR is already 5 times more than that of 8Gb NAND) [12]. Such an observation underlines the motivation of this research.

∗ Supported by the National Science Council of Taiwan, R.O.C., under Grants NSC-95R0062-AE00-07 and NSC-95-2221-E-002-094-MY3.
The management of flash memory is carried out either by software on a host system (as a raw medium) or by hardware circuits/firmware inside its embedded devices (as block-oriented devices). In the past decade, excellent research and implementation designs have been proposed for the management of flash-memory storage systems, e.g., [2, 3, 5, 6, 8, 9, 14, 28, 30, 31]. In particular, some researchers have exploited efficient management schemes for large-scale storage systems and/or considered different system architecture designs, e.g., [8, 9, 14, 28, 30, 31]. In industry, several vendors, such as Intel and Microsoft, have also started exploring the advantages of having flash memory in their product designs, e.g., the flash-memory cache of hard disks (known as the Robson solution) and the fast booting in Windows Vista [1, 4, 7, 29]. Besides, flash memory has also become a layer in the traditional memory hierarchy, such as NAND in a demand-paging mechanism (with compiler assistance): in such studies, the source code of applications and specific compilers are used to lay out compiled code at fixed memory locations [17, 18]. Among the approaches that try to improve the performance of NAND with a SRAM cache [13, 15, 16, 22, 23], OneNAND by Samsung presented a simple but effective hardware architecture to replace NOR with NAND and a SRAM cache [13, 22, 23]. Although the idea is intuitive and useful, little past work has reported on how to manage the system performance of NAND with a SRAM cache, and the resources, such as source code and specific compilers, involved in some studies [17, 18] are not always available during development. We must point out that the success of replacing NOR with NAND depends critically on an intelligent way of managing the SRAM cache in the product domains. Different from popular caching ideas adopted in the memory hierarchy and from OneNAND-related work [13, 15, 16, 22, 23], we are interested in application-oriented caching. Instead of adopting an LRU-like policy, we are interested in prediction-based prefetching based on given execution traces of applications. We consider the designs of embedded systems with a limited set of applications, such as a set of selected system programs in mobile phones or arcade games of amusement-park machines. In this paper, we propose an efficient prediction mechanism with limited SRAM-space requirements and an efficient implementation. The idea of prediction graphs is presented based on the working-set concept [10, 11], and an implementation strategy is proposed to reduce run-time and space overheads. A prefetch procedure is then proposed to prefetch pages from NAND based on the trace analysis
of application executions. A series of experiments was conducted based on realistic traces of computer games with different characteristics: "Age of Empires II (AOE II)", "The Typing of the Death (TTD)", and "Raiden". The experimental results are very encouraging: we show that the average read performance of NAND with the proposed prediction mechanism could be even better than that of NOR, by 24%, 216%, and 298% for AOE II, TTD, and Raiden, respectively. Furthermore, the cache miss rates were 35.27%, 4.21%, and 0.06% for AOE II, TTD, and Raiden, respectively. The rest of this paper is organized as follows: Section 2 describes the characteristics of flash memory and the research motivation. Section 3 proposes an efficient prediction mechanism. Section 4 summarizes the experimental results on read performance, cache miss rate, and extra overheads. Section 5 is the conclusion.
[Figure 1: the host accesses the emulated NOR through the Host Interface over the data and address buses; the Control Logic contains a Converter and a Prefetch Procedure, and moves data between a SRAM cache (byte access toward the host, 512-byte pages from flash) and the NAND flash memory]

Figure 1. An Architecture for the Performance Improvement of NAND Flash Memory.
2 Flash-Memory Characteristics and Research Motivation

There are two types of flash memory: NAND and NOR. Each NAND flash memory chip consists of many blocks, and each block contains a fixed number of pages. A block is the smallest unit for erase operations, while reads and writes are done in pages. A page contains a user area and a spare area, where the user area is for the data storage of a logical block, and the spare area stores the ECC and other house-keeping information (e.g., the LBA). Because flash memory does not allow in-place updates, data are not overwritten on each update; instead, data are written to free space, and the old versions of data are invalidated (or considered dead). This update strategy is called "out-place update". In other words, existing data on flash memory cannot be over-written (updated) unless the corresponding block is erased. The pages that store live data and dead data are called "valid pages" and "invalid pages", respectively. Depending on the design, blocks have different bounds on the number of erases; for example, the typical erase bounds of SLC and MLC×2 NAND flash memory are 10,000 and 1,000, respectively.¹ Each page of small-block (large-block) SLC NAND can store 512B (2KB) of data, and there are 32 (64) pages per block; the spare area of a small-block (large-block) SLC NAND page is 16B (64B). Each page of MLC×2 NAND can store 2KB, and there are 128 pages per block. Different from NAND flash memory, a byte is the unit for reads and writes over NOR flash memory.

¹ There are two major NAND flash memory designs: SLC (Single Level Cell) flash memory and MLC (Multiple Level Cell) flash memory. Each cell of SLC flash memory contains one bit of information, while each cell of MLC×n flash memory contains n bits of information.

                                 SLC NOR [25]    SLC NAND [21] (large-block, 2KB page)
Price (US$/GB) [12]              34.65           6.79
Read (random access, 8 bits)     40ns            25µs
Write (random access, 8 bits)    14µs            300µs
Read (sequential access)         23.842MB/s      15.33MB/s
Write (sequential access)        0.068MB/s       4.57MB/s
Erase                            0.217MB/s       6.25MB/s

Table 1. The Typical Characteristics of NOR and NAND.

NAND has been widely adopted in the implementation of storage systems because of its advantages in cost and write throughput (for block-oriented access) compared to NOR. 1GB of NOR typically costs US$34.65 in the market, compared to US$6.79 per GB for NAND, and the price gap between NAND and NOR will grow even wider in the coming years. However, due to the high read performance of NOR, as shown in Table 1, and its eXecute-In-Place (XIP) characteristic, NOR is adopted in various embedded-system designs, such as mobile phones and Personal Multimedia Players (PMPs). The characteristics of NAND and NOR are summarized in Table 1.

This research is motivated by a strong market demand for the replacement of NOR with NAND in many embedded-system designs. In order to bridge the performance gap between NAND and NOR, SRAM is a natural choice for data caching, such as in the simple but effective hardware architecture adopted by OneNAND [13, 22, 23] (see Figure 1). However, the most critical technical problem behind a successful replacement of NOR with NAND is the prediction scheme and its implementation design. Such an observation underlines the objective of this research: the design and implementation of an effective prediction mechanism for applications, with the characteristics of flash memory taken into account. Because of the stringent resource constraints of embedded systems, the proposed mechanism must also face challenges in restricted SRAM usage and limited computing power.
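To make the gap concrete, here is a back-of-envelope comparison derived from the Table 1 throughputs (our own arithmetic, not a figure from the paper), for fetching one 2KB page sequentially:

```latex
t_{\mathrm{NOR}} \approx \frac{2048\ \mathrm{B}}{23.842\ \mathrm{MB/s}} \approx 86\ \mu\mathrm{s},
\qquad
t_{\mathrm{NAND}} \approx \frac{2048\ \mathrm{B}}{15.33\ \mathrm{MB/s}} \approx 134\ \mu\mathrm{s}.
```

For random single-byte reads the gap is far larger: 40ns on NOR versus a 25µs page set-up on NAND, a factor of more than 600, which is precisely what a prediction-based cache must hide.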
3 An Efficient Prediction Mechanism

3.1 Overview
In order to bridge the performance gap between NAND and NOR, SRAM can serve as a cache layer for data access over NAND. As shown in Figure 1, the Host Interface is responsible for communication with the host system via the address and data buses. The Control Logic manages the caching activity and provides the service emulation of NOR with NAND and SRAM; it should implement an intelligent prediction mechanism to improve the system performance. Different from popular caching ideas adopted in the memory hierarchy, this research aims at an application-oriented caching mechanism. Instead of adopting an LRU-like policy, we are interested in prediction-based prefetching based on given execution traces of applications. We consider the designs of embedded systems with a limited set of applications, such as a set of selected system programs in mobile phones or arcade games of amusement-park machines. The design and implementation should also consider the resource constraints of a controller in SRAM capacity and computing power.

There are two major components in the Control Logic: the Converter and the Prefetch Procedure. The Converter emulates NOR access over NAND with a SRAM cache, where addresses must be translated from byte addressing (for NOR) to Logical Block Address (LBA) addressing (for NAND). Note that each 512B/2KB NAND page corresponds to one and four LBAs, respectively [26]; a sketch of this translation follows below. The Prefetch Procedure tries to prefetch data from NAND to SRAM so that the hit rate of NOR accesses over SRAM is high. The procedure should parse and extract the behavior of the target application via a set of collected traces. According to the access patterns extracted from the collected traces, the procedure generates prediction information, referred to as a prediction graph. In Section 3.2, we define the prediction graph and present its implementation strategy over NAND. An algorithm design for the Prefetch Procedure is then presented in Section 3.3.
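As an illustration of the Converter's byte-to-LBA mapping described above, the following C sketch (ours; the type and function names are hypothetical) translates a NOR byte address into a 512B LBA, a byte offset, and the containing 2KB NAND page:

```c
#include <stdint.h>

#define SECTOR_SIZE      512u  /* one LBA covers a 512B sector            */
#define SECTORS_PER_PAGE 4u    /* large-block NAND: 2KB page = 4 LBAs [26] */

typedef struct {
    uint32_t lba;     /* logical block address of the 512B sector */
    uint32_t page;    /* NAND page that contains this sector      */
    uint32_t offset;  /* byte offset within the sector            */
} nor_addr_t;

/* Map a NOR-style byte address to its NAND location. */
static nor_addr_t convert_nor_address(uint32_t byte_addr)
{
    nor_addr_t a;
    a.lba    = byte_addr / SECTOR_SIZE;
    a.offset = byte_addr % SECTOR_SIZE;
    a.page   = a.lba / SECTORS_PER_PAGE;
    return a;
}
```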
3.2 A Prediction Graph and Implementation
[Figure 2: a directed graph of nodes 0 through 13 linked by possible next-LBA transitions; the shaded nodes are branch nodes with more than one successor]

Figure 2. An example of a prediction graph
The access pattern of an application execution over NOR (or NAND) consists of a sequence of LBAs, where some LBAs are for instructions and the others are for data. As an application runs multiple times, a "virtually" complete picture of the possible access patterns of the application's executions emerges, as shown in Figure 2. Since most application executions are input-dependent or data-driven, there can be more than one subsequent LBA following a given LBA, where each LBA corresponds to one node in the graph. A node with more than one subsequent LBA is called a branch node (such as the shaded nodes in Figure 2), and the other nodes are called regular nodes. The graph that corresponds to the access patterns is referred to as the prediction graph of the patterns. If pages in NAND can be prefetched in an on-time fashion, and there is enough SRAM space for caching, then all data accesses can be done over SRAM.
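Conceptually, a node of the prediction graph can be represented as below. This is an illustrative C sketch of ours, not the paper's data structure; the actual design does not materialize the graph in memory but stores it with the pages themselves, as described next:

```c
#include <stdint.h>

typedef struct {
    uint32_t  lba;        /* the LBA this node represents                 */
    uint32_t  n_next;     /* 1 for a regular node, > 1 for a branch node  */
    uint32_t *next_lbas;  /* the subsequent LBA(s) observed in the traces */
} pg_node_t;

/* A branch node is exactly a node with more than one subsequent LBA. */
static int is_branch_node(const pg_node_t *n)
{
    return n->n_next > 1;
}
```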
[Figure 3: (a) Prediction Information — pages of regular and branch nodes with their data and spare areas, where the branch node's spare area points into the branch table; (b) A Branch Table — a starting entry holding the branch count (e.g., 3) followed by addr(b1), addr(b2), addr(b3)]
Figure 3. The Storage of a Prediction Graph

The technical problems are how to save the prediction graph over flash memory with minimized overheads and how to prefetch pages based on the graph in a simple but effective way. We propose to save the subsequent-LBA information of each regular node in the spare area of the corresponding page, because the spare area of a page has unused space in current implementations and the specification, and the reading of a page usually retrieves its data and spare areas simultaneously. In this way, accessing the subsequent-LBA information of a regular node incurs no extra cost. Since a branch node has more than one subsequent LBA, the spare area of the corresponding page might not have enough free space to store the information. We therefore propose to maintain a branch table that saves the subsequent-LBA information of all branch nodes. The starting entry address of the branch table that corresponds to a branch node can be saved in the spare area of the corresponding page, as shown in Figure 3(a). The starting entry records the number of subsequent LBAs of the branch node, and the subsequent LBAs are stored in the entries following the starting entry (see Figure 3(b)). The branch table can be saved on flash memory; at run time, the entire table can be loaded into SRAM for better performance, and if there is not enough SRAM space, parts of the table can be loaded in an on-demand fashion.
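The on-flash layout just described might look as follows in C. This is a minimal sketch under assumed field sizes; the paper does not specify an exact encoding:

```c
#include <stdint.h>

/* Prediction information kept in a page's spare area (illustrative). */
typedef struct {
    uint8_t  is_branch;  /* 0: regular node, 1: branch node                  */
    uint32_t next;       /* regular: successor LBA; branch: table entry index */
} spare_pred_t;

/* The branch table, loaded from NAND (possibly on demand). */
static uint32_t branch_table[1024];

/* Number of successors of the branch node whose starting entry is `start`;
 * the starting entry records the branch count (Figure 3(b)). */
static uint32_t branch_fanout(uint32_t start)
{
    return branch_table[start];
}

/* The i-th successor LBA, stored in the entries after the starting entry. */
static uint32_t branch_next_lba(uint32_t start, uint32_t i)
{
    return branch_table[start + 1u + i];
}
```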
3.3 A Prefetch Procedure

[Figure 4: a snapshot of the cyclic cache buffer; the slots hold pages corresponding to Nodes 1, 2, 3, 4, 5, 1, and 6, with current at the page of Node 2 and next at the page of Node 6]

Figure 4. A Snapshot of the Cache
The objective of the prefetch procedure is to prefetch data from NAND based on a given prediction graph such that most data accesses occur over SRAM. The basic idea is to prefetch data by following the LBA order in the graph. In order to look up a selected page in the cache efficiently, we propose to adopt a cyclic buffer in the cache management and let two indices, current and next, denote the pages currently accessed and most recently prefetched, respectively. When current = next, the caching buffer is empty. When current = (next + 1) mod SIZE, the buffer is full, where SIZE is the number of buffers for page caching. Consider the prediction graph shown in Figure 2: the page that corresponds to Node 2 is currently accessed, and the page that corresponds to Node 6 has just been prefetched (see Figure 4). The prefetch procedure works in a greedy way. Let P1 be the last prefetched page. If P1 corresponds to a regular node, then the page that corresponds to the subsequent LBA is prefetched. If P1 corresponds to a branch node, then the procedure prefetches pages by following all possible next LBA links on an equal basis, in an alternating way; that is, the prefetch procedure follows each LBA link in turn. For example, the pages corresponding to Nodes 4 and 5 are prefetched after the page that corresponds to Node 3, as shown in Figure 4; the next pages to be prefetched are those corresponding to Nodes 1 and 6. In order to properly manage the prefetching cost, the prefetch procedure stops following an LBA link when next reaches a branch node again along a link, or when next and current might point to the same page (both referred to as Stop Conditions). When the caching buffer is full (also referred to as a Stop Condition), the prefetch procedure stops temporarily. Take the prediction graph shown in Figure 2 as an example: the prefetch procedure should not prefetch the pages corresponding to Nodes 8 and 9 when the page corresponding to Node 7 is prefetched. When current reaches a page that corresponds to a branch node, the next page to be accessed (referred to as the target page) determines which branch the application execution will follow. The prefetch procedure should then start prefetching the page that corresponds to the subsequent LBA of the target page (or the pages that correspond to the subsequent LBAs of the target page if the target page itself corresponds to a branch node). The prefetch procedure resumes whenever it has stopped temporarily because of any Stop Condition. Note that all pages cached in SRAM between current and next stay in the cache after the target page (in the following of a branch) is accessed, because some of the
cached pages might be accessed shortly, even though the access of the target page has determined which branch the application execution will follow. Note that cache misses are still possible, e.g., when current = next; in such a case, data are accessed from NAND and loaded into the SRAM cache on demand.

The pseudo code of the prefetch procedure is shown in Algorithm 1. Two flags, stop and start_bch, track the prefetching state: they denote the satisfaction of any Stop Condition and the reaching of a branch node, respectively. Initially, both are set to FALSE. If any Stop Condition is satisfied when the procedure is invoked, the procedure simply returns (Step 1). The procedure prefetches one page in each iteration (Steps 2-19) until the cache is full (i.e., a Stop Condition) or a branch node is reached for the first time. First, if next would point to the same page as current, the prefetch procedure stops and returns (Steps 3-6). Otherwise, in each iteration, the procedure advances next, i.e., the location of the next free cache buffer (Step 7). The next LBA is obtained by looking up the most recently prefetched LBA (Step 8), and the page of that LBA is prefetched (Step 9). After prefetching a page, the procedure checks whether the prefetched page corresponds to a branch node (Steps 10-11). If so, the procedure loads the corresponding branch table entries and saves the subsequent LBA of each branch of the branch node (Steps 12-17). Because the prefetched page corresponds to a branch node, the procedure then starts prefetching pages by following each branch in an alternating way (Steps 20-36). The loop stops when the cache is full (Step 20), when every next LBA link of the branch node reaches the next branch node (Steps 31-35), or when next and current might point to the same page (Steps 22-25). In each iteration of the loop, if the LBA link indexed by idx_bch has not yet reached the next branch node (Step 21), the next LBA along that link is prefetched (Steps 26-28). Pages are prefetched by following all possible next LBA links on an equal basis, in an alternating way (Step 30). Note that stop should be reset to FALSE when the cache is no longer full or when next and current no longer point to the same page.² Moreover, stop and start_bch should both be reset to FALSE when current passes a branch node and meets the target page, or when a cache miss occurs (i.e., current = next). Once stop is set to FALSE, the prefetch procedure is invoked. When start_bch is FALSE in such an invocation, the prefetch procedure starts prefetching from the first loop (Steps 2-19); otherwise, it continues its previous prefetching job by following the next LBA links of the visited branch node in an alternating way (Steps 20-36).

² Performance enhancement is possible by deploying more complicated condition settings and actions.
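The cyclic-buffer bookkeeping and the buffer-related Stop Conditions described above can be summarized by the following C sketch (ours; SIZE and the variable names are illustrative):

```c
#include <stdint.h>

#define SIZE 8u  /* number of page buffers in the SRAM cache (illustrative) */

static uint32_t current_idx;  /* index of the page currently accessed       */
static uint32_t next_idx;     /* index of the page most recently prefetched */

/* current = next: the buffer is empty, so a request here is a cache miss. */
static int cache_empty(void)
{
    return current_idx == next_idx;
}

/* current = (next + 1) mod SIZE: the buffer is full (a Stop Condition). */
static int cache_full(void)
{
    return current_idx == (next_idx + 1u) % SIZE;
}
```

Keeping the buffer cyclic makes both checks O(1), which matters given the limited computing power of the controller.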
4 PERFORMANCE EVALUATION

4.1 Performance Metrics and Experiment Setup
The purpose of this section is to evaluate the capability of the proposed prefetch procedure and implementation in terms of read performance (Section 4.2) and prefetching overhead (Section 4.3). The read performance was evaluated against the number of game traces considered for the creation of a prediction graph. The prefetching overhead was evaluated as the percentage of redundant data that were prefetched unnecessarily. The performance of the proposed prediction mechanism was evaluated over a trace-driven simulation. The experimental traces were collected over a mobile PC in units of a sector (512B), consistent with the unit of data prefetching. Since NOR is mainly used to store programs, we conducted a series of experiments by running benchmark applications such as computer games.
Algorithm 1: Prefetch Procedure
Input: stop, next, current, lba, idx_bch, N_bch, lba_bch[], and start_bch
Output: null
 1: if stop = TRUE then return;
 2: while start_bch = FALSE and (next + 1) mod SIZE ≠ current do
 3:     if ChkNxLBA(lba) = cache(current) then
 4:         stop ← TRUE;
 5:         return;
 6:     end
 7:     next ← (next + 1) mod SIZE;
 8:     lba ← GetNxLBA(lba);
 9:     Read(next, lba);
10:     start_bch ← IsBchStart();
11:     if start_bch = TRUE then
12:         LdBchTable(GetNxLBA(lba));
13:         idx_bch ← 0;
14:         N_bch ← GetBchNum();
15:         for i = 0; i < N_bch; i = i + 1 do
16:             lba_bch[i] ← GetBchLBA(i);
17:         end
18:     end
19: end
20: while start_bch = TRUE and (next + 1) mod SIZE ≠ current do
21:     if IsBchCplt(idx_bch) = FALSE then
22:         if ChkNxLBA(lba_bch[idx_bch]) = cache(current) then
23:             stop ← TRUE;
24:             return;
25:         end
26:         next ← (next + 1) mod SIZE;
27:         lba_bch[idx_bch] ← GetNxLBA(lba_bch[idx_bch]);
28:         Read(next, lba_bch[idx_bch]);
29:     end
30:     idx_bch ← (idx_bch + 1) mod N_bch;
31:     if IsBchStop() = TRUE then
32:         stop ← TRUE;
33:         start_bch ← FALSE;
34:         return;
35:     end
36: end
Three games with different characteristics were considered in the experiments, and their execution traces were collected: Age of Empires II (referred to as AOE II), The Typing of the Death (referred to as TTD), and Raiden. Each game was played ten times in the trace collection. The characteristics of the games are summarized in Table 2. AOE II is a real-time strategy game, in which all players conduct their game actions simultaneously; compared to conventional turn-based strategy games, real-time strategy games progress in real time rather than turn by turn. In general, the access pattern of each execution is diversified and hard to predict. TTD is a game for English typing practice. A player can pick any stage to play; once a player clears a stage, some animation of that stage is displayed. Compared to AOE II, its program size over NOR is large, but its executions are more predictable. Raiden is a 3D vertical-shooter game, in which players clear stages one by one, and each enemy appears at a specific time and place. Once a stage is cleared, the data of the next stage are completely loaded to run. The execution of this game has good predictability, but data are loaded in bursts. We considered large-block NAND and NOR in the experiments, where there were 64 pages per block and 2KB per page for large-block NAND. The response time of a per-page read over NAND was 100µs, and the response time of a per-byte read over NOR was 40ns. The set-up time of NAND to read data from a page was 25µs, where the set-up time was for transferring data from the page cells to the internal page buffer; NOR has no set-up time. In the experiments, SRAM was used to store the branch table and to serve as the cache space, and the response time of a per-byte read over SRAM was set as 10ns.
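As a first-order sanity check of these parameters (our own model, not one given in the paper), the average read response time of the emulated NOR can be related to the cache miss rate m by

```latex
T_{\mathrm{avg}} \approx (1 - m)\, t_{\mathrm{SRAM}} + m\, t_{\mathrm{NAND}}.
```

With t_SRAM = 10ns and a miss cost of roughly 100µs per 2KB page, even m = 1% already yields an average of about 1µs per request, which is why the prediction graph must drive the miss rate close to zero.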
                                   AOE II            TTD               Raiden
Size                               small (438 MB)    large (812 MB)    small (467 MB)
Average number of branches         high              medium            low
Burst in reads                     low               medium            high
Temporal locality in data access   low               low               low
Randomness in data access          high              medium            low
Branch table size                  large (35.14 KB)  large (39.83 KB)  small (0.43 KB)

Table 2. The characteristics of games under investigation
We assume that the branch table was originally stored over NAND; the table was loaded into SRAM in an on-demand fashion so that the branch table could always fit in SRAM. In the experiments, the proposed prediction mechanism was evaluated against the number of traces. We must point out that the more traces were considered, the larger the average number of branches per branch node became, because more branches were observed when more traces were analyzed (even though the average number of branches per branch node might saturate once enough representative traces are analyzed). How the average number of branches per branch node grew with the given set of traces also depended on the characteristics of the games under consideration. As shown in Figure 5, the average number of branches per branch node was less than four and grew slowly, except for AOE II, because data accesses of AOE II were more randomized than those of the other games.
[Figure 6: read performance (MB/s) versus the number of traces (2 to 10) for AOE II, TTD, and Raiden, with SLC NOR as a flat reference line]

Figure 6. The read performance with different numbers of traces (4KB cache)
              AOE II   TTD     Raiden   Worst case   NOR     OneNAND
Read (MB/s)   29.57    75.24   94.44    8.76         23.84   68

Table 3. Comparison of the read performance (10 traces and 4KB cache in our approach)
[Figure 5: the average branch number (roughly 2 to 3.4) versus the number of traces (1 to 10) for AOE II, TTD, and Raiden]

Figure 5. Increment of average branch number
4.2 Read Performance
Figure 6 shows the read performance of the proposed approach for the three games with respect to different numbers of traces, where the cache size was 4KB. We found that a 4KB cache was sufficient for the games under consideration, because the read performance saturated once the cache size was no less than 4KB. The read performance of each game was better than that of NOR even when only two traces were used to generate a prediction graph. For example, the improvement ratios over NOR for AOE II, TTD, and Raiden were 24%, 216%, and 298%, respectively, when the number of traces for each game was 10 and the cache size was 4KB. When there were more than two traces, the read performance of Raiden showed almost no further improvement because its cache miss rate was already almost zero. For AOE II, the read performance improved only slowly as the number of collected traces increased, because the access pattern of AOE II was highly random; increasing the number of collected traces for the prediction graph could not reduce the cache miss rate significantly. For TTD, good improvement was observed with the inclusion of two more traces, because the last two traces were in fact collected as players advanced in the game by clearing more stages. Furthermore, we summarize the read performance of the proposed scheme and other existing products in Table 3. It shows that the read performance for specific applications with regular access patterns can be even better than that of OneNAND. On the other hand, in the worst case, i.e., a 100% miss rate, the desired data must be read from NAND flash memory on every read request. It is thus impractical to use NAND to replace NOR without any prediction mechanism, because the read performance gap between the emulated NOR and real NOR would be too large.
4.3 Cache Pollution Rate
The cache pollution rate is the fraction of data that are prefetched but never referenced during the program execution. Prefetching unnecessary data represents overhead and might even decrease the read performance, because the prefetching of unnecessary data might delay the prefetching of useful data. In addition, unnecessary data transfers lead to extra power consumption, which is critical to designs of embedded systems. Let N_SRAM2host be the amount of data accessed by the host, and N_flash2SRAM the amount of data transferred from NAND flash memory to SRAM. The cache pollution rate is defined as follows:

    Cache pollution rate = 1 − N_SRAM2host / N_flash2SRAM

As shown in Figure 7, the cache pollution rate increased as the number of traces for each game increased. That was because more traces led to a larger number of branches per branch node, while only one of the LBA links that follow a given branch node is actually referenced by the program. In summary, there is a trade-off between prefetching accuracy and prefetching overhead, even though the cache pollution rates were still lower than 10% in most cases.
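For instance, with hypothetical numbers of ours: if the host actually references 900KB out of 1000KB transferred from NAND to SRAM, then

```latex
\text{Cache pollution rate} = 1 - \frac{N_{\mathrm{SRAM2host}}}{N_{\mathrm{flash2SRAM}}}
 = 1 - \frac{900\ \mathrm{KB}}{1000\ \mathrm{KB}} = 10\%.
```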
[Figure 7: cache pollution rate (%) versus the number of traces (2 to 10) for AOE II, TTD, and Raiden]

Figure 7. The cache pollution rate (4KB cache)
5 Conclusions

This research proposes an application-oriented approach for the replacement of NOR with NAND. It is strongly motivated by a market demand for cutting down the cost of embedded systems that store and run applications over NOR. Different from previous work on caching and buffering and from OneNAND-related work [13, 15, 16, 22, 23], we consider the designs of embedded systems with a limited set of applications. We propose an efficient prediction mechanism with limited SRAM-space requirements and an efficient implementation. A prefetch procedure is proposed to prefetch pages from NAND based on the trace analysis of application executions. A series of experiments was conducted based on realistic traces of computer games with different characteristics: "Age of Empires II (AOE II)", "The Typing of the Death (TTD)", and "Raiden". The experimental results are very encouraging: we show that the average read performance of NAND with the proposed prediction mechanism could be even better than that of NOR, by 24%, 216%, and 298% for the three games, respectively. Their cache miss rates were 35.27%, 4.21%, and 0.06%, respectively, and the percentage of unnecessarily prefetched data was lower than 10% in most cases. For future research, we shall extend the proposed mechanism to explore on-line incremental mechanisms that adapt to dynamic changes in programs' access patterns. We also plan to incorporate the research results into the designs of adapters in storage systems. More research will be conducted to analyze the execution traces of different user applications.
References
[1] Flash Cache Memory Puts Robson in the Middle. Intel.
[2] Flash File System. US Patent 540,448. Intel Corporation.
[3] FTL Logger Exchanging Data with FTL Systems. Technical report, Intel Corporation.
[4] Software Concerns of Implementing a Resident Flash Disk. Intel Corporation.
[5] Flash-memory Translation Layer for NAND Flash (NFTL). M-Systems, 1998.
[6] Understanding the Flash Translation Layer (FTL) Specification, http://developer.intel.com/. Technical report, Intel Corporation, Dec 1998.
[7] Windows ReadyDrive and Hybrid Hard Disk Drives, http://www.microsoft.com/whdc/device/storage/hybrid.mspx. Technical report, Microsoft, May 2006.
[8] L.-P. Chang and T.-W. Kuo. An Adaptive Striping Architecture for Flash Memory Storage Systems of Embedded Systems. In IEEE Real-Time and Embedded Technology and Applications Symposium, pages 187–196, 2002.
[9] L.-P. Chang and T.-W. Kuo. An Efficient Management Scheme for Large-Scale Flash-Memory Storage Systems. In ACM Symposium on Applied Computing (SAC), pages 862–868, Mar 2004.
[10] P. J. Denning. The Working Set Model for Program Behavior. Communications of the ACM, 11(5):323–333, 1968.
[11] P. J. Denning and S. C. Schwartz. Properties of the Working-Set Model. Communications of the ACM, 15(3):191–198, 1972.
[12] DRAMeXchange. NAND Flash Contract Price, http://www.dramexchange.com/, Mar 2007.
[13] Y. Joo, Y. Choi, C. Park, S. W. Chung, E.-Y. Chung, and N. Chang. Demand Paging for OneNAND(TM) Flash eXecute-In-Place. CODES+ISSS, October 2006.
[14] A. Kawaguchi, S. Nishioka, and H. Motoda. A Flash-Memory Based File System. In Proceedings of the 1995 USENIX Technical Conference, pages 155–164, Jan 1995.
[15] J.-H. Lee, G.-H. Park, and S.-D. Kim. A New NAND-Type Flash Memory Package with Smart Buffer System for Spatial and Temporal Localities. Journal of Systems Architecture, 51:111–123, 2004.
[16] C. Park, J.-U. Kang, S.-Y. Park, and J.-S. Kim. Energy-Aware Demand Paging on NAND Flash-Based Embedded Storages. ISLPED, August 2004.
[17] C. Park, J. Lim, K. Kwon, J. Lee, and S. L. Min. Compiler-Assisted Demand Paging for Embedded Systems with Flash Memory. EMSOFT, September 2004.
[18] C. Park, J. Seo, D. Seo, S. Kim, and B. Kim. Cost-Efficient Memory Architecture Design of NAND Flash Memory Embedded Systems. ICCD, 2003.
[19] Z. Paz. Alternatives to Using NAND Flash White Paper. Technical report, M-Systems, August 2003.
[20] R. A. Quinnell. Meet Different Needs with NAND and NOR. Technical report, TOSHIBA, September 2005.
[21] Samsung Electronics. K9F1G08Q0M 128M x 8bit NAND Flash Memory Data Sheet, 2003.
[22] Samsung Electronics. OneNAND Features and Performance, Nov 2005.
[23] Samsung Electronics. KFW8G16Q2M-DEBx 512M x 16bit OneNAND Flash Memory Data Sheet, Sep 2006.
[24] M. Santarini. NAND versus NOR. Technical report, EDN, October 2005.
[25] Silicon Storage Technology (SST). SST39LF040 4K x 8bit SST Flash Memory Data Sheet, 2005.
[26] STMicroelectronics. NAND08Gx3C2A 8Gbit Multi-level NAND Flash Memory, 2005.
[27] A. Tal. Two Technologies Compared: NOR vs. NAND White Paper. Technical report, M-Systems, July 2003.
[28] C.-H. Wu and T.-W. Kuo. An Adaptive Two-Level Management for the Flash Translation Layer in Embedded Systems. In IEEE/ACM 2006 International Conference on Computer-Aided Design (ICCAD), November 2006.
[29] M. Wu and W. Zwaenepoel. eNVy: A Non-Volatile Main Memory Storage System. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 86–97, 1994.
[30] Q. Xin, E. L. Miller, T. Schwarz, D. D. Long, S. A. Brandt, and W. Litwin. Reliability Mechanisms for Very Large Storage Systems. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), pages 146–156, Apr 2003.
[31] K. S. Yim, H. Bahn, and K. Koh. A Flash Compression Layer for SmartMedia Card Systems. IEEE Transactions on Consumer Electronics, 50(1):192–197, February 2004.