The Journal of Systems and Software 82 (2009) 1447–1458

Contents lists available at ScienceDirect

The Journal of Systems and Software journal homepage: www.elsevier.com/locate/jss

Design and implementation of MLC NAND flash-based DBMS for mobile devices

Ki Yong Lee (a,*), Hyojun Kim (b), Kyoung-Gu Woo (b), Yon Dohn Chung (c), Myoung Ho Kim (a)

(a) Division of Computer Science, KAIST, 373-1 Guseong-Dong, Yuseong-Gu, Daejeon 305-701, Republic of Korea
(b) Software Laboratories, Samsung Electronics, 416, Maetan-3Dong, Yeongtong-Gu, Suwon, Gyeonggi-Do 443-742, Republic of Korea
(c) Department of Computer Science and Engineering, Korea University, Anam-Dong, Seongbuk-Gu, Seoul 136-701, Republic of Korea


Article history:
Received 27 August 2008
Received in revised form 5 January 2009
Accepted 5 March 2009
Available online 17 March 2009

Keywords:
Flash-based DBMS
MLC NAND flash memory
Transaction processing

Abstract

Recently, Multi-Level Cell (MLC) NAND flash memory has become widely used as storage media for mobile devices such as mobile phones, MP3 players, PDAs and digital cameras. MLC NAND flash memory, however, has some restrictions that hard disks or Single-Level Cell (SLC) NAND flash memory do not have. Since most traditional database techniques assume hard disk storage, they may not provide the best attainable performance on MLC NAND flash memory. In this paper, we design and implement an MLC NAND flash-based DBMS for mobile devices, called AceDB Flashlight, which fully exploits the unique characteristics of MLC NAND flash memory. Our performance evaluations on an MLC NAND flash-based device show that the proposed DBMS significantly outperforms the existing ones.

© 2009 Elsevier Inc. All rights reserved.

1. Introduction

NAND flash memory is increasingly being adopted as storage media for a variety of consumer and industrial devices. The growing popularity of NAND flash memory is due to its non-volatility, light weight, small size, low power consumption, and shock resistance (Douglis et al., 1994). NAND flash memory comes in two types: Single-Level Cell (SLC) and Multi-Level Cell (MLC). Since MLC NAND flash memory provides higher capacity at a lower cost than SLC NAND flash memory, it is widely used in many mobile devices such as mobile phones, MP3 players, PDAs, and digital cameras. MLC NAND flash memory, however, has some restrictions that hard disks or SLC NAND flash memory do not have. Table 1 compares SLC NAND flash memory, MLC NAND flash memory, and a hard disk. The restrictions of MLC NAND flash memory are as follows. First, the minimum unit of a write operation is a page, which is 2 or 4 KB in size depending on the product. Thus, writing only a portion of a page is not possible. Second, a page cannot be overwritten without erasing it. This characteristic is called erase-before-write. Furthermore, an erase operation can only be performed on a block unit, which is much larger (typically 64 or 128 pages) than a page. Third, pages must be written in sequential order within a block. After the i-th page in a block is written, the j-th (1 ≤ j < i) page in the block cannot be written until the block is erased. Fourth, if a write operation to a page is abnormally terminated due to a system failure, the content of some previously written pages in the same block may also be corrupted (Kim et al., 2007; Lasser and Yair, 2006). We call this phenomenon the paired page problem in this paper. Fifth, the total number of write/erase cycles for a block is limited. If a block is erased more than the specified number of times, it becomes worn-out and unreliable.

In spite of these restrictions, the market for MLC NAND flash memory continues to grow due to a rapid increase in capacity as well as a fast improvement in performance. Currently, the use of an embedded DBMS on a mobile device, such as a mobile phone or an MP3 player, is increasing rapidly with the increase in storage size (Pucheral et al., 2001; Bolchini et al., 2003; Sen and Ramamritham, 2005; Kim et al., 2006). For example, many mobile devices use embedded DBMSs to efficiently retrieve, navigate, and manage files stored in their storage. However, since most traditional database techniques assume hard disk storage, they may not provide the best attainable performance on mobile devices equipped with MLC NAND flash memory (Gal and Toledo, 2005; Lee and Moon, 2007).

* Corresponding author. Tel.: +82 11 243 0924; fax: +82 42 867 2255. E-mail addresses: [email protected] (K.Y. Lee), [email protected] (H. Kim), [email protected] (K.-G. Woo), [email protected] (Y.D. Chung), [email protected] (M.H. Kim).
0164-1212/$ - see front matter © 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.jss.2009.03.008

A traditional hard disk-based DBMS may run on flash memory without any modification by using an intermediate software layer, often called the Flash Translation Layer (FTL) (Kim et al., 2002; Kang et al., 2006; Lee et al., 2007). FTL emulates a random-access block device by hiding the write restrictions of flash memory. This approach, however, has some problems. First, due to the erase-before-write characteristic of flash memory, FTL writes the modified data into another empty place that has been erased in advance. Thus, the old data remains untouched in the original place. However, a traditional disk-based DBMS is not aware of this, and redundantly generates log records to store the old value of the updated data, which is already untouched in the original place. This causes


Table 1
Comparison of SLC NAND flash, MLC NAND flash and hard disk.

                                     SLC NAND (a)      MLC NAND (b)      Hard disk (c)
Minimum unit      Read               Page              Page              Sector
of operation      Write              Sector            Page              Sector
                  Erase              Block             Block             -
Access time       Read               25 μs (2 KB)      60 μs (2 KB)      12.7 ms (512 B)
                  Write              200 μs (2 KB)     800 μs (2 KB)     12.7 ms (512 B)
                  Erase              1.5 ms (128 KB)   1.5 ms (256 KB)   -
Sequential page write restriction    Yes               Yes               No
Paired page problem                  No                Yes               No
Number of write/erase cycles         10^5              5 x 10^3          -

(a) Samsung K9K8G08U0A. (b) Samsung K9G8G08U0M. (c) Seagate Barracuda 7200.7.
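To make these write rules concrete, the following toy simulator (our own illustrative sketch, not code from the paper; all names are ours) enforces page-unit writes, sequential page order within a block, and erase-before-write:

```python
# Illustrative sketch of the MLC NAND write rules from Table 1:
# page-unit writes, sequential page order, and erase-before-write.
PAGES_PER_BLOCK = 4  # real MLC blocks have 64 or 128 pages


class Block:
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK  # None = empty (erased)
        self.next_page = 0                     # sequential-write cursor

    def write(self, page_no, data):
        if page_no != self.next_page:
            raise ValueError("pages must be written sequentially")
        if self.pages[page_no] is not None:
            raise ValueError("erase-before-write: page already written")
        self.pages[page_no] = data
        self.next_page += 1

    def erase(self):
        # an erase works only on a whole block, never a single page
        self.pages = [None] * PAGES_PER_BLOCK
        self.next_page = 0
```

Attempting to rewrite or skip a page raises an error, mirroring the restrictions listed above.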

unnecessary writes to flash memory, resulting in performance degradation. Second, frequent small writes for log records cause significant overhead in MLC NAND flash memory. Since the minimum unit of a write operation is a page, each log write requires another new empty page. Third, most existing FTLs are optimized for file systems, not DBMSs. Unlike a file system, where write requests are concentrated on particular areas of a disk, e.g., the FAT, the inode table, or a portion of a file, write requests from a traditional DBMS are randomly scattered over a wide range of the logical address space. This pattern of random writes typically causes many erase operations in FTL (Lee and Moon, 2007; Lim and Park, 2006). Lastly, existing DBMSs do not consider the paired page problem of MLC NAND flash memory. Since even pages on which previous write operations have completed can be corrupted by a power failure during a write operation, the reliability of a database system can decrease significantly. As far as we are aware, we are the first to address this problem in the mobile embedded DBMS area.

In this paper, we design and implement an MLC NAND flash-based DBMS, called AceDB Flashlight, for mobile devices. Our proposed DBMS maximizes database performance on MLC NAND flash memory by fully exploiting the unique characteristics of MLC NAND flash memory. It accesses MLC NAND flash memory directly without using FTL. The contributions of our paper are summarized as follows:

- We review the characteristics of MLC NAND flash memory and describe the performance characteristics of the existing transaction processing schemes on MLC NAND flash memory. We then derive design principles for an MLC NAND flash-based DBMS on mobile devices.
- We propose a new transaction processing scheme optimized for MLC NAND flash-based DBMSs. The proposed scheme eliminates frequent small log writes, which cause significant overhead in MLC NAND flash memory. Also, the proposed DBMS converts randomly ordered write requests from a transaction into physically sequential writes within a block, which is an optimized write pattern for flash memory. The paired page problem in MLC NAND flash memory is also considered.
- We show the effectiveness of the proposed DBMS through an extensive set of experiments on a real MLC NAND flash-based device. The experimental results show that the proposed DBMS significantly outperforms the existing DBMSs in write performance.

The rest of the paper is organized as follows. In Section 2, we describe the characteristics of MLC NAND flash memory and review the traditional transaction recovery methods. We present related work in Section 3. In Section 4, we describe the proposed MLC NAND flash-based DBMS in detail. We present the results of our performance evaluation in Section 5. Finally, we conclude our work in Section 6.

2. Background

2.1. Characteristics of MLC NAND flash memory

As described in Section 1, NAND flash memory comes in two types: SLC and MLC. SLC NAND flash memory stores one bit per memory cell, whereas MLC NAND flash memory stores more than one bit per cell by using multiple charge levels in the floating gate. For example, two-bit MLC NAND flash memory stores two bits per cell by using four different charge levels in the floating gate. Although MLC NAND flash memory provides higher capacity at a lower cost, SLC NAND flash memory achieves higher speed and cell endurance. As a result, SLC NAND flash memory is mostly used in industrial applications where performance and reliability are highly critical, while MLC NAND flash memory is mostly used in consumer products such as mobile phones, MP3 players and flash memory cards.

MLC NAND flash memory is organized as a fixed number of blocks, each of which is typically composed of 64 or 128 pages. A page has a size of 2 or 4 KB depending on the product. It is expected that the page size of MLC NAND flash memory will increase to 8 KB in the near future (Takeuchi et al., 2007). For each page, a spare area is provided to store management information such as an error correction code (ECC). The spare area can be read or written at the same time as the page itself with virtually no additional overhead.

Typically, the bits in a single memory cell of MLC NAND flash memory belong to different pages in a block (Kim et al., 2007; Lasser and Yair, 2006). In two-bit MLC NAND flash memory, the two pages containing bits from the same memory cells are often called paired pages. If a power failure occurs while changing the state of a bit in a memory cell, the states of the other bits in the same cell can also change. Accordingly, if a power failure occurs during a write operation to a page, the content of the paired page can also be corrupted. We call this phenomenon the paired page problem in this paper. Fig. 1 illustrates the paired page problem.

2.2. Traditional transaction recovery

Traditionally, there have been two approaches to transaction recovery: log-based recovery and shadow paging. In this section, we describe the performance characteristics of both approaches on MLC NAND flash memory.

2.2.1. Log-based recovery

Log-based recovery is the most widely used recovery method in conventional DBMSs. Log-based recovery generates log records for each database modification. Log records are typically flushed to disk when a dirty page is evicted from the buffer pool or the log buffer becomes full. They may also be flushed when a transaction


Fig. 1. The paired page problem.
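The failure model behind Fig. 1 can be sketched as follows. The actual page pairing is device-specific and not given in the paper, so the pairing function below is a purely illustrative assumption:

```python
# Illustrative model of the paired page problem in two-bit MLC flash.
# Real page pairing is device-specific; here we simply ASSUME that
# page i is paired with page i + PAIR_OFFSET within the same block.
PAIR_OFFSET = 4


def paired_page(page_no):
    """Return the page sharing memory cells with page_no (toy pairing)."""
    if page_no < PAIR_OFFSET:
        return page_no + PAIR_OFFSET
    return page_no - PAIR_OFFSET


def corrupted_on_power_failure(writing_page):
    """Pages whose contents may be lost if power fails mid-write:
    the page being written AND its already-written pair."""
    return {writing_page, paired_page(writing_page)}
```

The key point the paper makes is visible in the return value: a pair member whose own write completed long ago can still be corrupted.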

commits or aborts, depending on the commit and abort policy for transactions. However, on MLC NAND flash-based storage, log-based recovery has the following problems. In log-based recovery, many small writes are often required to flush log records according to the write-ahead logging (WAL) protocol (Mohan et al., 1992), even though the size of each log record may be small. In MLC NAND flash memory, since the minimum unit of a write operation is a page, flushing even one log record requires a whole page to be written. Furthermore, since in-place updates are not allowed in flash memory, each log flush requires another new empty page. This produces many invalidated pages, which will eventually cause many block erase and page copy operations to reclaim them (see Fig. 2). All this work significantly degrades database performance and decreases the lifetime of flash memory. To make matters worse, the size of a page continues to increase as the capacity of MLC NAND flash memory increases (Takeuchi et al., 2007).

2.2.2. Shadow paging

Shadow paging is an alternative recovery technique that may not require log records (Silberschatz et al., 2005). When a database page is to be modified by a transaction, a new database page is allocated and all updates are done on this page. The old database page remains untouched as a shadow page. When the transaction commits, all the newly allocated pages become permanent. When the transaction aborts, all the newly allocated pages are simply discarded. On flash memory, old data is never overwritten, and new data is written elsewhere. Therefore, shadow paging is intrinsically performed on flash memory. Shadow paging is, however, not widely

Fig. 2. Log-based recovery on MLC NAND flash memory.


adopted in disk-based DBMSs due to the following disadvantages (Silberschatz et al., 2005). First, when a transaction commits, multiple database pages need to be written to disk. On disk, the cost of writing multiple database pages can be very high due to the seek time and the rotational delay. Flash memory, however, has no such performance penalty because it has no moving mechanical parts like disk heads. Second, shadow paging causes the locations of database items to change when they are updated. As a result, the spatial locality of data is lost. This may cause severe performance degradation on disk, where the seek time and the rotational delay are the dominant performance factors. On the contrary, in flash memory, all data can be accessed in the same amount of time regardless of its physical location. Thus this is not a problem in flash memory. Third, record-level concurrency control is known to be hard to support under shadow paging. However, in most mobile devices, page-level concurrency control is often found to be sufficient for users' demands in practice.
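The commit and abort behavior of shadow paging described above can be sketched as follows (a minimal illustrative model; class and method names are ours):

```python
# Sketch of shadow paging: updates go to newly allocated pages, the old
# (shadow) pages stay untouched; commit atomically installs the new map.
class ShadowPagedStore:
    def __init__(self):
        self.pages = {}        # physical page id -> data
        self.page_map = {}     # logical page -> physical page (committed)
        self.next_phys = 0

    def begin(self):
        self.txn_map = dict(self.page_map)  # working copy of the map

    def write(self, logical, data):
        self.pages[self.next_phys] = data   # never overwrite in place
        self.txn_map[logical] = self.next_phys
        self.next_phys += 1

    def commit(self):
        self.page_map = self.txn_map        # atomic map switch

    def abort(self):
        self.txn_map = dict(self.page_map)  # new pages simply discarded

    def read(self, logical):
        return self.pages[self.page_map[logical]]
```

An abort costs nothing but dropping the working map, which is exactly why the technique maps so naturally onto flash memory's no-overwrite behavior.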

3. Related work

In this section, we describe some previous approaches to flash-based DBMSs that attempt to overcome the write restrictions of flash memory.

3.1. FTL approach

As described in Section 1, by using FTL, a traditional disk-based DBMS may run on flash memory without any modification. When the DBMS issues a write request, FTL redirects the write request to an empty page in flash memory that has been erased in advance. For this purpose, FTL manages an internal mapping table between logical pages and physical pages. Various page mapping schemes have been proposed for an efficient FTL (Kim et al., 2002; Kang et al., 2006; Lee et al., 2007). Fig. 3a illustrates a typical state of flash memory after many write requests have been executed through FTL. Here, a data page is a page containing database data, while a log page is a page containing log records. Note that data pages and log pages are interleaved within each block without distinction. Recall, as mentioned before, that most existing FTLs are not optimized for database applications.

3.2. Log-structured approach

Another approach to flash-based DBMSs is for the DBMS to access flash memory directly, without the help of an extra software layer like FTL (Kim et al., 2006). Most systems taking this approach are based on the log-structured design (Aleph One Limited, 2006; Bityutskiy, 2005; Rosenblum and Ousterhout, 1992; Woodhouse, 2003). In this design, the whole database is regarded as a single large log. When a modification is made to the database, only the changes made to the database, i.e., log records, are written to the end of the database. Thus, writes are always done sequentially at the end of the database. Fig. 3b shows an example of how pages are written in the log-structured approach. It is easy to see that this approach does not violate any of the write restrictions of flash memory.

One of the most important advantages of this approach is that it always delivers good write performance regardless of the pattern of write requests. However, in this approach, many small log writes may also be inevitable. Another problem of this approach is that when a database page is read from the database, the current version of the database page has to be re-created by applying the changes in various log records to the previous version of that database page. Because the log records associated with the database


Fig. 3. Comparison of flash-based DBMS approaches. (a) FTL approach. (b) Log-structured approach. (c) In-page logging approach. (d) The proposed approach.

page may be scattered all over the log, it may be very costly to re-create the up-to-date version of the database page from the log-structured database (Lee and Moon, 2007). Besides, the log-structured approach does not consider the paired page problem. Since write requests from all transactions are directed to the last block of the database, a power failure during a write operation of one transaction can corrupt the contents of other pages in the last block written by other transactions.
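The read-cost argument can be made concrete with a toy log-structured store (our own sketch; the record format and names are assumptions):

```python
# Sketch of the log-structured design: every change is appended to one
# big log, so reading a page means replaying all of its records, which
# may be scattered over the whole log.
def append_change(log, logical_page, offset, value):
    # writes are always sequential: just append to the end of the log
    log.append((logical_page, offset, value))


def read_page(log, logical_page, page_size=8):
    page = [0] * page_size
    # a full scan stands in for chasing scattered records on flash
    for lp, offset, value in log:
        if lp == logical_page:
            page[offset] = value
    return page
```

Writes are cheap and sequential, but every read pays for reconstruction, which is the trade-off criticized above.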

3.3. In-page logging approach

To overcome the disadvantages of the log-structured approach, a new approach called in-page logging (IPL) was recently proposed for flash-based DBMSs (Lee and Moon, 2007). In this approach, all the log records associated with a database page are stored in the same block to which the database page belongs. Thus, the current version of the database page can be re-created efficiently by accessing the log records stored in the same block. In order to locate a database page and its log records in the same block, a fixed number of pages in each block are allocated as the log region. Fig. 3c shows how pages are written in the IPL approach. In addition, an in-memory log sector of 512 bytes is allocated to each database page in the buffer pool to store log records associated with the database page. When an in-memory log sector becomes full or the associated database page is evicted from the buffer pool, only the log sector is written to flash memory. Note that the dirty database page itself is not written back to flash memory. Although log records are scattered among blocks, writes can be performed without a performance penalty since flash memory has no mechanical latency.

However, in MLC NAND flash memory, the IPL approach also has several problems. First, since the minimum unit of a write operation is a page in MLC NAND flash memory, the cost of writing a log sector (512 bytes) is the same as the cost of writing a whole database page (e.g., 4 KB). Thus, the problem of frequent small log writes also arises in the IPL approach. Second, the IPL approach may exhibit bad performance when write requests are concentrated on specific blocks. When a block runs out of free pages in the log region, an erase operation and a number of write operations have to be performed to merge the database pages and log records in the block into a new block. Third, the IPL approach requires log records to be written to a pre-defined region in each block (i.e., the log region). Because of the sequential page write restriction of MLC NAND flash memory, many empty pages in a block may become unusable even though they have never been written. Finally, the paired page problem is also not considered in the IPL approach. Since write requests from different transactions can be directed to the same block, a power failure during a write operation of one transaction can corrupt the contents of other pages in the same block written by other transactions.
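The merge behavior can be sketched as an illustrative model of the IPL block layout (the page counts and all names are ours, not from Lee and Moon, 2007):

```python
# Sketch of in-page logging (IPL): each block reserves a few pages as a
# log region for the data pages of that same block; when the log region
# fills up, data pages and logs are merged into a fresh block.
DATA_PAGES, LOG_PAGES = 12, 4   # illustrative split of a 16-page block


class IPLBlock:
    def __init__(self, data):
        self.data = list(data)   # committed data pages (DATA_PAGES of them)
        self.log = []            # log entries, at most LOG_PAGES

    def add_log(self, page_no, value):
        if len(self.log) == LOG_PAGES:
            self.merge()         # erase + rewrite: the expensive case
        self.log.append((page_no, value))

    def merge(self):
        # models copying merged contents into a freshly erased block
        for page_no, value in self.log:
            self.data[page_no] = value
        self.log = []

    def read(self, page_no):
        value = self.data[page_no]
        for lp, v in self.log:   # replay logs co-located in this block
            if lp == page_no:
                value = v
        return value
```

Reads only touch one block, but a hot block repeatedly exhausts its log region and triggers merges, which is the second problem noted above.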

4. Our MLC NAND flash-based DBMS

In this section, we first present the design concepts of our proposed DBMS, AceDB Flashlight. We then describe the transaction processing scheme of AceDB Flashlight in detail.

4.1. Design guidelines

After analyzing the important characteristics of MLC NAND flash memory, we have derived the following design guidelines for an MLC NAND flash-based DBMS: (1) avoid frequent small writes incurred by log records; (2) provide consistent write performance regardless of the pattern of write requests; (3) take into consideration the paired page problem of MLC NAND flash memory. Based on these guidelines, we have developed an MLC NAND flash-optimized DBMS that employs the following two strategies for transaction processing.

4.1.1. MLC NAND flash-optimized transaction recovery

When a modification is made to a database page, the page is loaded into and updated in the buffer pool just as in a traditional DBMS. However, we do not generate log records for each update. Instead, when a dirty page is evicted from the buffer pool, we write it to another empty place in flash memory. For this purpose, we maintain a logical-to-physical page map in RAM. When a transaction commits, the dirty pages in the buffer pool updated by the transaction are written back to flash memory. When a transaction aborts, the logical-to-physical page map in RAM is restored to its previous state. Note that this mechanism is similar to shadow paging. Consequently, transaction recovery can be effectively implemented on MLC NAND flash memory without any significant logging overhead. Note also that reading a database page from flash memory can be performed very efficiently because we need not re-create the current version of the database page. In order to minimize the amount of data unnecessarily written to flash memory, we make the size of a database page the same as the page size of MLC NAND flash memory.


4.1.2. MLC NAND flash-optimized transaction writes

In all the previous approaches described in Section 3, when an empty physical page is allocated to a logical page for a write request, the transaction that made the write request is not considered. That is, a write request to a logical page is directed to the same physical page regardless of which transaction made the write request. In contrast, in our approach, physical pages are allocated on a per-transaction basis. When a transaction begins, a physical block is allocated to the transaction. Then, all the subsequent writes of the transaction are done sequentially in that block. Consequently, logically random write requests from a transaction are converted into physically sequential writes within a block. Fig. 3d shows an example of how pages are written by three transactions (Txn1, Txn2, and Txn3) in our approach.

This approach has two major advantages. First, the paired page problem can be overcome. As described in Section 3, due to the paired page problem, the previous approaches cannot ensure database consistency in MLC NAND flash memory because a power failure during a write operation of one transaction can corrupt pages written by other transactions. In our approach, on the other hand, a power failure during a write operation of one transaction cannot corrupt pages written by other transactions. This is because different transactions write into different physical blocks: a block written by one transaction has nothing to do with a block written by another transaction. When a power failure occurs during a write operation of a certain transaction, the transaction is aborted and the blocks allocated to the transaction are simply invalidated. In this way, database consistency can be ensured despite the paired page problem. Second, this approach provides consistent write performance regardless of the pattern of write requests. Because the writes are always done sequentially within a block allocated to the transaction, the write performance can always be guaranteed regardless of the pattern of write requests. Note that this way of writing shares a similar spirit with the log-structured approach, where writes are always done only at the end of the database. However, the log-structured approach cannot avoid the paired page problem.

4.2. Transaction recovery


Now we describe the proposed transaction recovery mechanism in detail. The proposed DBMS classifies blocks in flash memory into four categories: data block, transaction block, context block, and garbage block. Fig. 4 shows the relationship among these blocks.

A data block is a block that contains committed data only, i.e., data written by already committed transactions. Thus, data in data blocks represent a consistent database state. Data blocks are connected together in a list, called the data block list. When a new data block is created, it is appended to the end of the data block list. Each physical page in a data block stores its logical page number in the spare area. A page in a data block is invalidated if there is a page with the same logical page number in a newly appended data block. That is, the pages in the newly appended data block represent the current versions of the logical pages stored in the block.

A garbage block is a block that contains invalidated or empty pages only. A garbage block can be reused either as a transaction block or a context block upon request. Initially, all blocks in flash memory are considered garbage blocks except one block (i.e., the context block). Each garbage block is marked with a page offset, called the LPO, which indicates the last written page number in the block. A garbage block is not erased as long as there are free (empty) pages in the block. When a garbage block is reused as a transaction block, all pages in the block whose offsets are less than the LPO are simply ignored. Note that, by using the LPO, we need not erase a garbage block as long as there are free pages in the block.

A transaction block is a block that contains uncommitted data, i.e., data written by a currently active transaction. A write request from an active transaction is directed to a transaction block allocated to the transaction, instead of to a data block directly. When

Fig. 4. Relationship among blocks in flash memory.
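The block transitions of Fig. 4 can be sketched as follows (an illustrative model under our own naming; the real bookkeeping also records the LPO and persists the block map):

```python
# Illustrative model of the block life cycle in Fig. 4:
# garbage -> transaction block -> data block (commit) or garbage (abort).
class BlockManager:
    def __init__(self, num_blocks):
        self.garbage = list(range(num_blocks))  # initially all reusable
        self.data_block_list = []               # committed data, in order
        self.txn_blocks = {}                    # txn id -> owned blocks

    def begin(self, txn_id):
        # each transaction gets its own physical block to write into
        self.txn_blocks[txn_id] = [self.garbage.pop(0)]

    def allocate_more(self, txn_id):
        # called when the transaction's current block becomes full
        self.txn_blocks[txn_id].append(self.garbage.pop(0))

    def commit(self, txn_id):
        # all blocks of the transaction join the end of the data block list
        self.data_block_list += self.txn_blocks.pop(txn_id)

    def abort(self, txn_id):
        # blocks are simply invalidated and can be reused later
        self.garbage += self.txn_blocks.pop(txn_id)
```

Because blocks are never shared between transactions, a power failure mid-write can only ever hurt blocks of the failing transaction, which are discarded anyway.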


a dirty page updated by a transaction is evicted from the buffer pool, it is written to a transaction block allocated to the transaction. If a transaction commits, all transaction blocks allocated to the transaction become data blocks and are appended to the end of the data block list. If a transaction aborts, all transaction blocks allocated to the transaction become garbage blocks.

A context block is the block dedicated to the block map, which is a collection of block status records. Each block status record contains the status of one block in flash memory. The status of a block is represented by four bytes, as shown in Fig. 5. Thus, the overall size of the block map is calculated as follows: block map size = the number of blocks in flash memory × 4 bytes. One or more pages in flash memory may be used to store the block map. The block map is updated when a transaction commits or aborts, or when garbage collection is performed (which will be described in Section 4.4). When the block map is updated, only the updated part of the block map is written sequentially in the context block with a new version number. Then, the old part is invalidated. In order to guarantee an atomic write of the updated part of the block map, after the updated part is written in the context block, we write a special end mark in the spare area of the last written page. If there is no end mark in the spare area of the last written page of the context block, we ignore the last written part. If the context block has no free page, a garbage block is selected and erased to become the new context block. The old context block is then erased and becomes a garbage block.

Now we summarize the transaction recovery mechanism in the proposed DBMS.

- Transaction begin: A garbage block is selected and allocated to the transaction as a new transaction block. All subsequent writes of the transaction are done sequentially in the transaction block. If the transaction block becomes full, another garbage block is selected and allocated to the transaction as an additional transaction block.
- Transaction commit: All the dirty pages in the buffer pool updated by the transaction and not yet flushed to flash memory are written to the last transaction block allocated to the transaction. All the transaction blocks allocated to the transaction then become data blocks and are appended to the end of the data block list. The commit operation is completed by writing the updated part of the block map into the context block.
- Transaction abort: All the dirty pages in the buffer pool updated by the transaction are invalidated and removed from the buffer pool. All the transaction blocks allocated to the transaction are marked with the LPO, which indicates the last written page number

in each block, and then become garbage blocks. The abort operation is completed by writing the updated part of the block map into the context block.

The only side effect of our approach is that there may be many empty pages in data blocks because a physical block is allocated to each transaction. However, those empty pages can be efficiently reused without any erase operation. When the number of garbage blocks falls below a specified threshold, garbage collection is performed to reclaim free or invalidated space from data blocks. Note that, when a data block becomes a garbage block as a result of garbage collection, the LPO is used to indicate the last written page number in that block. Therefore, empty pages in data blocks can eventually be reused without any erase operation. Note also that, in our scheme, the updated pages of the block map are written to flash memory whenever a transaction commits or aborts. However, in log-based recovery and other recovery schemes as well, at least one page has to be written to flash memory to make the commit or abort of a transaction persistent. Lastly, the proposed transaction recovery may not be suitable for a highly concurrent environment because it is hard to support record-level concurrency control. However, as described in Section 2.2.2, in most mobile devices, page-level concurrency control is often found to be sufficient for users' demands in practice.

4.3. Page map management

Our proposed transaction processing scheme redirects all write requests from a transaction to its transaction blocks. For this purpose, our proposed DBMS maintains a logical-to-physical page mapping table in RAM, as shown in Fig. 6. In Fig. 6, the row Physical page number specifies, for each logical page, the physical page number that contains the latest version of the logical page written by a committed transaction.
The row Transaction page number specifies the physical page number that contains the latest version of a logical page written by a currently active transaction. The row Transaction ID specifies the currently active transaction that updated the logical page. When a read request to a logical page is issued, its Transaction page number is consulted to locate the physical address of the logical page if the transaction that updated the logical page is still active; otherwise, the Physical page number is consulted. When a write request to a logical page is issued, the requested data is written to the transaction block allocated to the transaction that issued the write request, and the Transaction page number is updated.
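The lookup and redirection logic above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and method names are our own assumptions, and each table row is modeled as a (physical page, transaction page, transaction ID) triple as in Fig. 6.

```python
# Illustrative sketch (not the paper's code) of the page mapping table:
# each logical page tracks its committed physical page, plus a tentative
# transaction page while an updating transaction is still active.

class PageMappingTable:
    def __init__(self):
        # logical page -> (physical_page, txn_page, txn_id)
        self.rows = {}

    def read(self, logical, active_txns):
        row = self.rows.get(logical)
        if row is None:
            return None
        phys, txn_page, txn_id = row
        # If the updating transaction is still active, serve its version.
        if txn_id in active_txns and txn_page is not None:
            return txn_page
        return phys

    def write(self, logical, txn_id, txn_block_page):
        # The data itself goes to the transaction block; here we only
        # record where the tentative version of the logical page lives.
        phys = self.rows.get(logical, (None, None, None))[0]
        self.rows[logical] = (phys, txn_block_page, txn_id)

    def commit(self, txn_id):
        # Promote pages written by txn_id to committed versions.
        for logical, (phys, txn_page, tid) in self.rows.items():
            if tid == txn_id:
                self.rows[logical] = (txn_page, None, None)

    def abort(self, txn_id):
        # Discard tentative pages; committed versions remain visible.
        for logical, (phys, txn_page, tid) in self.rows.items():
            if tid == txn_id:
                self.rows[logical] = (phys, None, None)
```

Because an aborted transaction only ever touched its own transaction blocks and its tentative mapping entries, discarding those entries is all the "undo" that is ever needed.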

Fig. 5. Block status record.

Fig. 6. An example of the page mapping table.

K.Y. Lee et al. / The Journal of Systems and Software 82 (2009) 1447–1458

The page mapping table is constructed in RAM at system startup time as follows. First, the spare area of the first page in each block is read to find the context block; the spare area of the first page in each block stores information indicating whether the block is the context block or not. The pages in the found context block are then scanned to restore the most recent version of the block map. Next, using the block map, the data block list is constructed. Finally, the page mapping table is constructed by scanning the spare areas of all pages in data blocks in the order of the data block list, using the logical page numbers stored in the spare areas. Note that we only need to read the spare areas to construct the page mapping table; the time for reading only a spare area is much less than the time for reading a whole page (Samsung Electronics, 2006). Note that the page mapping table can be lost upon a system failure since it resides only in RAM. However, it can always be reconstructed after a system failure using the above procedure. Note also that the size of the page mapping table is proportional to the number of pages in flash memory. However, in most practical applications, the page mapping table can be maintained entirely in RAM. Suppose that we use an MLC NAND flash memory chip whose capacity is 2 GB and whose page size is 4 KB. If the size of an entry in the page mapping table is 8 bytes, then the overall size of the page mapping table is as follows:

Page mapping table size = entry size × total number of pages
                        = entry size × (capacity of flash memory / size of a page)
                        = 8 bytes × (2 GB / 4 KB) = 4 MB

This size is quite practical in most modern mobile devices. Now we estimate the time to construct the page mapping table in RAM. Suppose that we have an MLC NAND flash memory of n blocks, each of which consists of m pages. Let t_p and t_s be the time to read a whole page and a spare area, respectively. The time to find the context block is at most n·t_s because we need to read only the spare area of the first page in each block. The time to restore the block map is at most m·t_p because we need to read only the pages in the context block. Finally, the time to scan the spare areas of all pages in data blocks to construct the page mapping table is at most n·m·t_s. Therefore, the total time to construct the page mapping table, T_total, is at most

T_total = n·t_s + m·t_p + n·m·t_s

Note that, in log-based recovery systems, we need to search the entire log to undo or redo transactions when a system failure occurs. Since this process is time-consuming, log-based recovery systems periodically perform checkpoints to limit the amount of log. Consequently, the recovery time of a log-based recovery system varies greatly depending on the amount of log and the frequency of checkpoints. In contrast, the proposed transaction recovery requires no undo or redo operations. Thus, the recovery time of the proposed DBMS is only the time to reconstruct the page mapping table in RAM.

4.4. Garbage collection

Since old data is not overwritten but invalidated, garbage collection is inevitable in flash memory. Garbage collection is a process that reclaims free or invalidated pages from blocks. To reclaim invalidated pages, garbage collection performs a number of block erase and page copy operations. Thus, it can affect the write performance of the system and result in potentially unpredictable run-time latency, depending on the number of block erases and page writes. Therefore, efficient garbage collection is important for the overall system performance.
Many algorithms have been proposed for efficient garbage collection (Chang and Kuo, 2004; Kwon and Koh, 2007). Although proposing a new garbage collection algorithm is not within the scope of this paper, in this subsection we discuss the impact of our transaction processing scheme on the performance of garbage collection.

Garbage collection can be performed either periodically or on demand. When garbage collection is triggered, it selects some number of victim blocks to clean and moves the valid pages in those blocks to other free space, so that the blocks contain no valid pages and can be reused. Therefore, the cost of garbage collection is proportional to the number of valid pages to be moved. A common technique to reduce the cost of garbage collection is to separate hot data (i.e., frequently updated data) from cold data (i.e., less frequently updated data) into different blocks (Chang and Kuo, 2004; Lim and Park, 2006). Hot data has a higher probability of being updated in the near future. If a data block stores hot data only, all pages in the block will eventually be invalidated, and the block can be erased without moving any valid pages. Consequently, the cost of moving valid data decreases.

In our transaction processing scheme, recently updated data, i.e., hot data, are written into transaction blocks and appended to the end of the data block list. As a result, frequently updated data are located toward the rear of the data block list, while infrequently or never updated data are located toward the front. That is, hot data and cold data are effectively separated into different blocks as time goes on. Therefore, garbage collection can be performed more efficiently under our transaction processing scheme.
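The cost argument above can be made concrete with a minimal greedy garbage-collection sketch. This is our illustration, not the paper's algorithm or any of the cited ones: each block is modeled as the set of its still-valid pages, and the collection cost is the number of valid pages that must be copied out before the block can be erased.

```python
# Minimal greedy garbage-collection sketch (illustrative only): pick the
# block with the fewest valid pages, copy those pages elsewhere, and
# erase the block. The cost is the number of page copies performed.

def collect(blocks):
    """blocks: dict block_id -> set of valid page ids.
    Returns (victim_id, pages_copied)."""
    # Greedy victim selection: fewest valid pages = cheapest to reclaim.
    victim = min(blocks, key=lambda b: len(blocks[b]))
    copied = len(blocks[victim])  # valid pages that must be moved out
    blocks[victim] = set()        # block erased and ready for reuse
    return victim, copied

# A block that held only hot data ends up fully invalidated, so it is
# reclaimed without copying a single page:
blocks = {"hot_block": set(), "cold_block": {1, 2, 3, 4}}
victim, cost = collect(blocks)   # victim == "hot_block", cost == 0
```

This is why clustering hot data into the same blocks, as the data block list ordering does over time, lowers the number of valid-page copies garbage collection must perform.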

5. Performance evaluation

In this section, we present the results of our performance evaluation based on a real implementation on an MLC NAND flash-based device.

5.1. Experimental setup

In order to evaluate the performance of the proposed DBMS, we fully implemented it on an MLC NAND flash-based mobile device. The mobile device used in the experiments has a 200 MHz ARM940T core CPU with 32 MB SDRAM, and is equipped with Samsung K9G8G08U0M 1 GB MLC NAND flash memory, where the page size is 2 KB and the block size is 256 KB. The mobile device is provided with a commercial FTL developed by Samsung Electronics (Samsung Electronics, 2004). We implemented the proposed DBMS based on AceDB, a commercial embedded DBMS developed by Samsung Electronics. AceDB is a conventional disk-based DBMS running on an FTL that uses a traditional log-based method for transaction recovery. To implement the proposed DBMS, we modified the transaction and storage managers of AceDB. The modified version of AceDB is called AceDB Flashlight. AceDB Flashlight fully implements the proposed transaction processing scheme described in Section 4 and runs directly on MLC NAND flash memory without using an FTL. In order to compare the performance of our approach with that of the IPL approach, we also implemented a simulator modeling the IPL approach: we modified the buffer and storage managers of AceDB to simulate the IPL buffer and storage managers. The performance of the IPL approach was estimated by counting the numbers of read, write, and erase operations issued by the IPL simulator. In the experiments, we compared the performance of three DBMSs: (1) AceDB(log), the original version of AceDB running on an FTL; (2) IPL, a flash-based DBMS employing the IPL approach; and (3) AceDB Flashlight, the proposed DBMS. The first and second DBMSs represent existing flash-based DBMSs.
Note that AceDB Flashlight further considers the paired page problem to ensure database consistency in MLC NAND flash memory. Since this issue concerns ensuring database consistency rather than improving database performance, we did not address it in the performance evaluation.

Fig. 7 shows the two sets of benchmark schemas and queries used in the experiments. The benchmarks in Fig. 7a and b represent typical real-world applications of two types of mobile consumer electronics: mobile phones and MP3 players, respectively. The tables in Fig. 7a and b represent the information used by a commercial mobile phone and MP3 player, respectively, to retrieve, navigate, and manage files stored in the device. We created seven and ten B+-tree indices on the two tables, respectively, as described in Fig. 7. In both benchmarks, the queries Q1, Q2, Q3, and Q4 are used to evaluate the write performance of the DBMSs, whereas Q5 and Q6 are used to evaluate the read performance. Note that the performance of queries is also strongly affected by the performance of the query processor. Since the design of a query processor for various types of queries is beyond the scope of this paper, we focused on these six basic representative queries to evaluate the performance of the proposed transaction processing scheme. In the experiments, we varied the number of records in both tables from 10,000 to 30,000. The average size of a record in the two benchmarks in Fig. 7a and b is about 200 bytes and 1 KB, respectively. We counted the numbers of write, erase, and read operations performed by the three DBMSs when executing the benchmark queries. We also measured the actual time spent by the DBMSs executing the benchmark queries.

5.2. Write performance

In this experiment, we compared the write performance of the three DBMSs: AceDB(log), IPL, and AceDB Flashlight. We ran the write queries Q1, Q2, Q3, and Q4 in the two benchmarks on the three DBMSs. We ran all the queries 100 times serially to ensure stable and reliable results. Across the series of experiments, we varied (1) the number of records in both tables (from 10,000 to 30,000), (2) the size of the database buffer pool (from 1 MB to 5 MB), and (3) the size of a database page (from 2 to 10 KB). In this experiment, we set the number of records in both tables to 10,000, the size of the database buffer pool to 1 MB, and the size of a database page to 2 KB. Recall that the IPL approach allocates a fixed number of pages in each block as the log region. In all the experiments, we set the size of the log region in each block to 16 KB (8 pages), as used in Lee and Moon (2007).

Fig. 8 shows the results of the performance evaluation of the three DBMSs for the write queries Q1, Q2, Q3, and Q4. As shown in Fig. 8, the numbers of write and erase operations are reduced in AceDB Flashlight compared to AceDB(log) and IPL. Accordingly, the execution time of AceDB Flashlight is shorter than those of AceDB(log) and IPL. Note that the execution times of IPL are estimated values calculated from the numbers of read, write, and erase operations issued by the IPL simulator, assuming that the execution time is dominated by I/O time:

Estimated execution time (IPL) = (number of read operations × 60 μs)
                               + (number of write operations × 800 μs)
                               + (number of erase operations × 1.5 ms)

The average read access time (60 μs), write access time (800 μs), and erase access time (1.5 ms) are taken from Table 1.
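The estimation formula above can be written directly as a small helper. The function name is our own; the timing constants are the Table 1 averages quoted in the text.

```python
# The IPL execution-time estimate above, expressed as code. The timing
# constants are the average access times quoted from Table 1.
READ_US, WRITE_US, ERASE_US = 60, 800, 1500  # microseconds

def estimated_time_us(reads, writes, erases):
    # Assumes execution time is dominated by flash I/O time.
    return reads * READ_US + writes * WRITE_US + erases * ERASE_US

# e.g. 1000 reads, 100 writes, and 2 erases:
# 1000*60 + 100*800 + 2*1500 = 143,000 us (143 ms)
```

The dominance of the write and erase terms over the read term is what makes reducing write and erase counts, as AceDB Flashlight does, the decisive factor in the estimated times.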

Fig. 7. Two benchmarks used in the experiments.


Fig. 8. The write performance of three databases.

Table 2
Statistics about log writes performed in the experiments in Fig. 8.

(a) Mobile phone benchmark                      AceDB(log)        IPL
Total number of pages written                   356,300           289,150
Total number of log pages written               140,500 (39.4%)   196,600 (67.7%)
Total number of log records generated           1,819,400         1,819,400
Average size of a log record                    8 bytes           8 bytes
Average number of log records per log flush     13.6              9.7

(b) MP3 player benchmark                        AceDB(log)        IPL
Total number of pages written                   1,135,000         850,600
Total number of log pages written               578,900 (50.9%)   722,225 (84.9%)
Total number of log records generated           4,902,000         4,902,000
Average size of a log record                    59 bytes          59 bytes
Average number of log records per log flush     8.4               6.8
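The wasted-space figure cited in the text can be recomputed from the Table 2 averages. The helper below is our own, and it assumes the 2 KB flash page size used in the experiments.

```python
# Recomputing wasted log-page space from the Table 2 averages, assuming
# the 2 KB flash page size used in the experiments.
PAGE_SIZE = 2048  # bytes per flash page

def wasted_fraction(avg_record_bytes, records_per_flush):
    used = avg_record_bytes * records_per_flush
    return 1 - used / PAGE_SIZE

# Mobile phone benchmark, AceDB(log): 8 B records, 13.6 per flush
# -> about 95% of each log page is wasted.
# MP3 player benchmark, AceDB(log): 59 B records, 8.4 per flush
# -> about 76% wasted; both exceed the 70% figure cited in the text.
```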

The reduction of write operations and, consequently, erase operations in AceDB Flashlight is primarily due to the elimination of frequent small log writes. Table 2 shows statistics about the log writes performed by AceDB(log) and IPL in the experiment in Fig. 8. Note that AceDB(log) and IPL generate exactly the same log records for the same query; they differ only in the way log records are written to and read from flash memory (i.e., the location of log records within a block and the unit of writing log records). As shown in Table 2, the average size of a log record is only about 8 bytes and 59 bytes in the two benchmarks, respectively. Moreover, the average number of log records per flush does not exceed 14 and 9 in the two benchmarks, respectively. Thus, more than 70% of log page space is wasted on average.

Fig. 9 shows the total numbers of write operations and the execution times of the three DBMSs when we varied the number of records in both benchmarks from 10,000 to 30,000. Here again, we ran the write queries Q1, Q2, Q3, and Q4 in both benchmarks 100 times. As shown in Fig. 9, the execution time is proportional to the number of write operations for all three DBMSs. We can also see that AceDB Flashlight always outperforms the other DBMSs in terms of both the number of write operations and the execution time.

5.3. Read performance

In this experiment, we compared the read performance of the three DBMSs. Fig. 10 shows the total number of read operations performed by the three DBMSs when executing the read-only queries Q5 and Q6 in both benchmarks 100 times. We varied the number of records in both benchmarks from 10,000 to 30,000. Note that no write or erase operations are performed in this experiment. In Fig. 10, we can see that IPL shows slightly lower performance than AceDB(log) and AceDB Flashlight. This is because IPL has to re-create the current version of a database page by scanning log records when the page is read from flash memory. Therefore, IPL has to read more pages than AceDB(log) or AceDB Flashlight. Note that AceDB Flashlight incurs no such performance penalty.

5.4. Varying buffer pool size

Since log records are flushed to flash memory when a dirty page is evicted from the buffer pool or the log buffer is full, the number of pages written is strongly related to the size of the database buffer pool. In this experiment, we examined the effect of varying the buffer pool size on the performance of the three DBMSs. We ran all the queries in both benchmarks 100 times, varying the size of the database buffer pool from 1 MB to 5 MB. We set the number of records in both benchmarks to 30,000 and the size of a database page to 2 KB. Obviously, the number of pages flushed from the buffer pool decreases as the size of the buffer pool increases. Fig. 11


(a) Mobile phone benchmark

(b) MP3 player benchmark

Fig. 9. Write performance of three DBMSs with varying numbers of records.

(a) Mobile phone benchmark

(b) MP3 player benchmark

Fig. 10. Read performance of three DBMSs with varying numbers of records.

shows the numbers of write operations and the execution times of the three DBMSs as the size of the buffer pool increases. As expected, the execution time decreases as the size of the buffer pool increases because the numbers of write and erase operations decrease. Here again, we observe that AceDB Flashlight always shows better performance than the other DBMSs.

5.5. Varying database page size

The performance of IPL is affected by the size of a database page, since the performance advantage of the IPL approach comes from writing log records in a smaller unit than a database page. In all the previous experiments, the size of a database page was set to the page size of the MLC NAND flash memory, so that the unit of writing log records was the same as a database page. In this experiment, we studied how the performance of the three DBMSs is affected by the size of a database page. We ran both benchmarks 100 times, varying the size of a database page from 2 to 10 KB. If the database page size is smaller than the minimum I/O unit of the underlying system, e.g., a page in MLC NAND flash memory, the I/O efficiency of the system can be reduced because reading or writing a database page unnecessarily causes the whole flash page containing the database page to be read or written. For this reason, most commercial DBMSs, as well as AceDB(log) and AceDB Flashlight, do not support a database page size smaller than the minimum I/O unit of the underlying system. Accordingly, we did not consider database page sizes smaller than the page size of the MLC NAND flash memory in this experiment. We set the number of records in both tables to 30,000 and the size of the database buffer pool to 1 MB.

Fig. 12 shows the performance of the three DBMSs with varying database page sizes. We can see that the relative performance of IPL improves as the database page size increases. This improvement is due to the relative reduction in the number of log pages written. In Fig. 12, the performance of AceDB(log) degrades more quickly than that of the other DBMSs as the size of a database page increases, because the number of log pages written increases rapidly due to the large database page size. The performance of AceDB Flashlight also degrades as the size of a database page increases. However, in this experiment as well, AceDB Flashlight shows better performance than the other DBMSs. Therefore, we can conclude that the proposed DBMS can be used effectively on MLC NAND flash-based mobile devices.
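The cost of large database pages can be quantified with a simple model. This is our simplification, not the paper's measurement model: flushing one dirty database page writes ceil(database page size / flash page size) flash pages, so larger database pages amplify the write cost of small updates.

```python
import math

# Our simplified model of the I/O-unit mismatch discussed above: each
# flush of a dirty database page writes ceil(db_page / flash_page)
# flash pages, regardless of how little of the page actually changed.
FLASH_PAGE = 2048  # 2 KB flash page, as in the experiments

def flash_writes_per_flush(db_page_bytes):
    return math.ceil(db_page_bytes / FLASH_PAGE)

# A 2 KB database page costs 1 flash write per flush; a 10 KB page
# costs 5, even if only a single record in the page changed.
```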


(a) Mobile phone benchmark

(b) MP3 player benchmark

Fig. 11. Performance of three DBMSs with varying buffer pool size.

(a) Mobile phone benchmark

(b) MP3 player benchmark

Fig. 12. Performance of three DBMSs with varying database page size.

6. Conclusions

In this paper, we designed and implemented an MLC NAND flash-based DBMS, AceDB Flashlight, that fully exploits the unique characteristics of MLC NAND flash memory. Our proposed DBMS uses a transaction recovery scheme optimized for an MLC NAND flash-based DBMS, which can significantly reduce the write overhead in MLC NAND flash memory. Moreover, the proposed DBMS converts randomly ordered write requests from a transaction into physically sequential writes on flash memory, which is an optimized write pattern for MLC NAND flash memory. Consequently, the write performance of the DBMS is substantially improved compared with existing DBMSs. We also presented the results of a performance evaluation based on a real implementation on a flash-based mobile device. The results show that our proposed DBMS outperforms the previous ones in terms of both operation count and execution time.

Acknowledgments

This work was done when Ki Yong Lee was at Samsung Electronics.

References

Aleph One Limited, 2006. Yet Another Flash File System (YAFFS).
Bityutskiy, A.B., 2005. JFFS3 Design Issues.
Bolchini, C., Salice, F., Scheriber, F.A., Tanca, L., 2003. Logical and physical design issues for smart card databases. ACM Transactions on Information Systems 21 (3), 254–285.
Chang, L., Kuo, T., 2004. A real-time garbage collection mechanism for flash-memory storage systems in embedded systems. ACM Transactions on Embedded Computing Systems 3 (4), 837–863.
Douglis, F., Caceres, R., Kaashoek, F., Li, K., Marsh, B., Tauber, J.A., 1994. Storage alternatives for mobile computers. In: Proceedings of the First Symposium on Operating Systems Design and Implementation (OSDI), pp. 25–37.
Gal, E., Toledo, S., 2005. Algorithms and data structures for flash memories. ACM Computing Surveys 37 (2), 138–163.
Kang, J., Jo, H., Kim, J., Lee, J., 2006. A superblock-based flash translation layer for NAND flash memory. In: Proceedings of EMSOFT'06, pp. 161–170.
Kim, J., Kim, J., Noh, S., Min, S., Cho, Y., 2002. A space-efficient flash translation layer for CompactFlash systems. IEEE Transactions on Consumer Electronics 48 (2), 366–375.
Kim, G., Baek, S., Lee, H., Lee, H., Joe, M.J., 2006. LGeDBMS: a small DBMS for embedded system with flash memory. In: Proceedings of VLDB, pp. 1255–1258.
Kim, J., Yoon, S., Woo, N., 2007. Nonvolatile Memory and Apparatus and Method for Deciding Data Validity for the Same. US Patent 2007/0189107.
Kwon, O., Koh, K., 2007. Swap-aware garbage collection for NAND flash memory based embedded systems. In: Proceedings of the Seventh IEEE International Conference on Computer and Information Technology, pp. 787–792.
Lasser, M., Yair, K., 2006. Flash Memory Management Method That is Resistant to Data Corruption by Power Loss. US Patent No. US 6,988,175 B2.
Lee, S., Moon, B., 2007. Design of flash-based DBMS: an in-page logging approach. In: Proceedings of ACM SIGMOD, pp. 1–10.
Lee, S., Park, D., Chung, T., Lee, D., Park, S., Song, H., 2007. A log buffer based flash translation layer using fully associative sector translation. ACM Transactions on Embedded Computing Systems 6 (3).
Lim, S., Park, K., 2006. An efficient NAND flash file system for flash memory storage. IEEE Transactions on Computers 55 (7), 906–912.
Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., Schwarz, P., 1992. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems 17 (1), 94–162.
Pucheral, P., Bouganim, L., Valduriez, P., Bobineau, C., 2001. PicoDBMS: scaling down database techniques for the smartcard. The VLDB Journal 10 (2–3), 120–132.
Rosenblum, M., Ousterhout, J.K., 1992. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems 10 (1), 26–52.

Samsung Electronics, 2004. eXtended Sector Remapper (XSR).
Samsung Electronics, 2006. NAND Flash-memory Datasheet and SmartMedia Data Book.
Sen, R., Ramamritham, K., 2005. Efficient data management on lightweight computing devices. In: Proceedings of IEEE ICDE, pp. 419–420.
Silberschatz, A., Korth, H.F., Sudarshan, S., 2005. Database System Concepts, fifth ed. McGraw-Hill.
Takeuchi, K. et al., 2007. A 56-nm CMOS 99-mm² 8-Gb multi-level NAND flash memory with 10-MB/s program throughput. IEEE Journal of Solid-State Circuits 42 (1), 219–232.
Woodhouse, D., 2003. JFFS2: The Journaling Flash File System Version 2.

Ki Yong Lee has been a research assistant professor in the Department of Computer Science at KAIST, Daejeon, Korea, since 2008. From 2006 to 2007, he was a senior engineer at Samsung Electronics Co., Korea. He received his B.S. and M.S. degrees in Computer Science from KAIST, Daejeon, Korea, in 1998 and 2000, respectively, and his Ph.D. degree in Computer Science from KAIST in 2006. His research interests include database systems, data warehousing, OLAP, and embedded software.

Hyojun Kim received the B.S. degree in electronics engineering from Sogang University, Korea, and the M.S. degree in electronics and computer engineering from Hanyang University, Korea. He is now a senior engineer in Software Laboratories, Samsung Electronics Co. Ltd., Korea. His research interests include flash storage systems and embedded, real-time systems.

Kyoung-Gu Woo received his B.S. in Computer Engineering from Seoul National University in 1996, and his M.S. and Ph.D. degrees in Computer Science from KAIST in 1998 and 2004, respectively. Currently, he is working as an R&D staff member of SAIT, Samsung Electronics. He enjoys solving problems in data processing areas such as indexing, transaction processing, data mining, and distributed systems.

Yon Dohn Chung received his B.S. degree in Computer Science from Korea University, Seoul, Korea in 1994, and his M.S. and Ph.D. degrees in Computer Science from KAIST (Korea Advanced Institute of Science and Technology), Daejeon, Korea in 1996 and 2000, respectively. He was an assistant professor in the Department of Computer Engineering at Dongguk University, Seoul, Korea from 2003 to 2006. He joined the faculty of the Department of Computer Science and Engineering at Korea University, Seoul, Korea in 2006, where he is currently an associate professor. His research interests include broadcast databases, XML databases, graph databases, and distributed/parallel processing of large-scale data. He is a member of IEEE and ACM.
Myoung Ho Kim received his B.S. and M.S. degrees in Computer Engineering from Seoul National University, Seoul, Korea in 1982 and 1984, respectively, and received his Ph.D. degree in Computer Science from Michigan State University, East Lansing, MI, in 1989. He joined the faculty of the Department of Computer Science at KAIST, Taejon, Korea in 1989 where currently he is a full professor. His research interests include database systems, data stream processing, sensor networks, mobile computing, OLAP, XML, information retrieval, workflow and distributed processing.
