Hybrid RAID: A Solution for Enhancing the Reliability of SSD-based RAIDs

S. Alinezhad Chamazcoti and S. G. Miremadi, Sharif University of Technology

Abstract—The failure probability of SSDs increases as the number of Program/Erase (P/E) cycles increases. Traditionally, a group of SSDs is protected with parity disks, forming an SSD-based RAID. It has been shown that the reliability of RAIDs depends on the distribution of parities among SSDs. There are two main policies for distributing parities among SSDs in RAIDs: evenly and unevenly. By distributing parities evenly, all SSDs wear out at the same rate, which can cause simultaneous failures of SSDs. By distributing parities unevenly, one of the SSDs in the RAID may fail much earlier than the others. Both of these drawbacks, i.e., the simultaneous failures of SSDs and the rapid first failure of one SSD, reduce the reliability of SSD-based RAIDs. This paper proposes a Hybrid RAID, called Hy-RAID, as a solution to mitigate the above-mentioned drawbacks of the pure even- and uneven-distribution RAIDs. Hy-RAID changes the policy of parity distribution when the number of P/E cycles of SSDs reaches a predefined limit. To evaluate the reliability of Hy-RAID, two new quantitative metrics are proposed. The results show that Hy-RAID enhances reliability as compared with traditional RAIDs by eliminating simultaneous disk failures and postponing the first disk failure.

Index Terms—Bit Error Rate (BER), Reliability, Program/Erase (P/E) cycle, RAID, Solid State Drive (SSD)
1 INTRODUCTION
The use of Solid State Drives (SSDs) in storage systems has increased due to their high performance and low power consumption [1][2], especially for multi-scale computing of big-data-driven applications. Due to the inherent characteristics of SSDs, which differ from those of Hard Disk Drives (HDDs), their reliability is handled in different ways. The reliability of SSDs mainly depends on the Bit Error Rate (BER), which is a function of the number of Program/Erase (P/E) cycles. The BER of SSDs increases exponentially when the number of P/E cycles increases [3]-[5]. Traditionally, to improve the reliability of storage systems, an array of disks is applied in these systems, called a Redundant Array of Independent Disks (RAID) [6]. The reliability of data disks in RAIDs is achieved by applying parity codes. Generally, there are two policies for distributing the parity code units in RAIDs: 1) distributing the parity units evenly among all disks; 2) distributing the parity units unevenly among all disks [7]. The use of SSDs in RAIDs leads to three main reliability challenges: 1) the number of P/E cycles is limited for each SSD, called the endurance limit; 2) the dependency between the number of P/E cycles and the BER in SSDs leads to variation in the reliability of the SSDs in a RAID (i.e., SSD-varying BER); and 3) depending on the workload, each SSD has a different number of P/E cycles at different instances of time, resulting in a time-varying BER. The above discussion shows that: 1) the limited endurance is the main concern that should be considered in the design of an SSD-based RAID system, and 2) the different BERs of the SSDs in a
————————————————
S. Alinezhad Chamazcoti is with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran. E-mail: Alinezhad@ce.sharif.edu. S. G. Miremadi is with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran. E-mail:
[email protected].
RAID system, together with the time-varying BER in each SSD, are the two main problems in modeling the reliability of SSD-based RAID systems. The above-mentioned challenges have been addressed in the previous literature in two ways: 1) in [8], an analytical model is introduced to evaluate the reliability of SSD-based RAIDs considering the time-varying BER in SSDs and the difference in the number of P/E cycles of SSDs (SSD-varying BER); 2) in [9] and [10], different structures for SSD-based RAIDs with respect to the reliability issue have been proposed as alternatives to the traditional RAIDs, i.e., RAID4 and RAID5. The main distinction among the structures of the existing RAIDs is the distribution of parity units among disks. Even distribution and uneven distribution are the two policies for distributing the parity units among SSDs. In the case of even distribution, the probability of simultaneous failures may increase when the number of P/E cycles reaches the endurance limit. In the case of uneven distribution, some of the disks may wear out faster than the other disks, resulting in failures with higher probability. This paper addresses the above two challenges of SSD-based RAIDs. The contributions of this work can be summarized as follows: 1) A reliable structure for the SSD-based RAID, called Hybrid RAID (Hy-RAID), is proposed. The key idea of Hy-RAID is to merge the benefits of both even and uneven parity distribution. In this way, Hy-RAID initially distributes the parity units among disks evenly. When the number of P/E cycles reaches a predefined number, Hy-RAID changes the parity distribution, switching to the uneven distribution. Thereafter, the parity units are distributed unevenly among disks. The point at which the system switches from even distribution to uneven distribution is called the switching point in this paper. 2) A quantitative
model is proposed to evaluate the reliability of Hy-RAID, using two proposed metrics: a) the safe time, i.e., the duration of time during which the RAID has no disk with a high number of P/E cycles (high BER); b) the variance of the dead time, i.e., the time between disk failures in the RAID. If a larger number of disks remains in the safe zone¹, and the variance of the dead times of the SSDs is larger, higher reliability is achieved for the SSD-based RAID. Hy-RAID provides higher reliability in comparison to the traditional RAIDs by mitigating their weaknesses in two ways: a) it postpones the first failure in the RAID by holding the disks in the safe zone for a longer time than the uneven-distributed-parity RAIDs, i.e., RAID4 and Diff-RAID; b) it prevents simultaneous failures by increasing the variance of the dead time among disks as compared with the even-distributed-parity RAID, i.e., RAID5. The proposed Hy-RAID is implemented in an SSD extension of the DiskSim simulator and is compared with three well-known RAIDs, i.e., RAID4, RAID5, and Diff-RAID, in terms of the duration of the safe time and the variance of the dead time of SSDs (i.e., the variance of the time of the first replacement of SSDs in the array). This comparison is done with both synthetic and real workloads with different time intervals, considering different numbers of disks. The remainder of this paper is organized as follows: Section 2 states the background on SSD-based RAIDs and the related work in this area. Section 3 provides a quantitative model for reliability estimation regarding the shortcomings of the previous models. The details of the proposed Hybrid RAID are presented in Section 4. In Section 5, the evaluation of Hy-RAID and its comparison with RAID5, RAID4, and Diff-RAID are reported. Finally, this paper is concluded, and the future work is stated in Section 6.
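To make the switching idea in contribution 1 concrete, the following minimal sketch shows one way the parity placement could change at the switching point. It is written in Python for illustration only; the class name, the threshold value, and the simple rotating/dedicated placement rules are assumptions of this sketch, not the implementation evaluated in this paper.

```python
# A minimal sketch of the Hy-RAID switching idea: even (rotating) parity placement
# until a P/E-cycle threshold is reached, then uneven (dedicated) placement.
# ParityPlacer and switch_threshold are hypothetical names used only here.

class ParityPlacer:
    def __init__(self, num_disks, switch_threshold):
        self.num_disks = num_disks                  # total SSDs in the array
        self.switch_threshold = switch_threshold    # P/E count of the switching point
        self.dedicated_parity = num_disks - 1       # parity disk used after switching

    def parity_disk(self, stripe_id, max_pe_cycles):
        """Return the disk index that stores the parity unit of a stripe."""
        if max_pe_cycles < self.switch_threshold:
            # Before the switching point: distribute parity evenly (RAID5-like),
            # so all SSDs wear at roughly the same rate.
            return stripe_id % self.num_disks
        # After the switching point: place new parity on one disk (RAID4-like),
        # so the remaining disks wear more slowly and fail at different times.
        return self.dedicated_parity

# Example: a 5-SSD array that switches once any SSD reaches 8,000 P/E cycles.
placer = ParityPlacer(num_disks=5, switch_threshold=8000)
print(placer.parity_disk(stripe_id=17, max_pe_cycles=2500))   # rotating parity
print(placer.parity_disk(stripe_id=17, max_pe_cycles=9100))   # dedicated parity
```

Before the switching point the placement behaves like RAID5, and afterwards like RAID4, which is the even-then-uneven policy described above.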
2 BACKGROUND AND RELATED WORK
2.1 SSD Specifications and RAID System
SSD is an emerging technology for storing digital data [1]. The high performance and low power consumption of this technology make it suitable for various applications. The inherent properties of SSD technology make it different from HDD technology [1]. Although SSDs provide a greater Mean Time To Data Loss (MTTDL) than HDDs because they have no moving/mechanical elements, the dominant failure in SSDs is the failure of the underlying bits of the flash chip due to read disturb, write disturb, and clean disturb [4][11]. Unlike HDDs, SSDs are threatened by permanent and transient failures when the number of P/E cycles increases. As a permanent failure, SSDs fail to store data with the desired reliability when they reach the endurance limit. SSDs endure only a limited number of P/E cycles that can be committed to the disk. This limitation is called write endurance, which is not a concern in HDDs. Moreover, as a transient failure in SSDs, the probability of read/write/erase disturbs increases when the number of P/E cycles increases [12]-[13]. There is a direct dependency between the BER and P/E cycles in SSDs: by increasing the number of P/E cycles, the BER increases exponentially [4].
¹ SSDs are in the safe zone during the safe time.
It is common for many applications to employ several SSDs in an array of disks, i.e., a RAID, to provide higher performance, reliability, and capacity. By striping data among several disks in a RAID system, read and write operations are performed faster, resulting in greater performance for the system. Furthermore, a group of SSDs in the RAID system is protected by the parity units to provide reliability. A RAID system consisting of n data disks and m parity disks can tolerate up to m disk failures. The parity units are placed on dedicated disks or distributed evenly among all disks. Two traditional structures for an array of disks are RAID4 and RAID5. These two structures consist of n data disks and only one parity disk; as a result, they can tolerate only one disk failure. As shown in Fig. 1, the parity units are stored on one dedicated disk in the array (e.g., RAID4), or distributed among all disks in the array (e.g., RAID5). In RAID5, by distributing parity units evenly among the data disks, the P/E cycles are distributed among the disks in almost the same ratio. However, in RAID4 all the parity units are stored on one disk, i.e., the parity disk, which increases the number of P/E cycles of that disk. Besides the benefits of applying SSDs in RAID systems, they impose new challenges. The main challenges of applying SSDs in RAID systems are performance and reliability concerns [7]. In terms of performance, the main concerns of SSDs are 1) the size of the stripe unit, 2) the size of the read/write unit in SSDs (i.e., a 4KB page) as compared to HDDs (i.e., a 512-byte sector), and 3) unbalanced read and write operations. In terms of reliability, specific reliability challenges of SSDs such as read/write disturbs, endurance limitation, and data retention threaten the reliability of SSDs [7]. Moreover, the different numbers of P/E cycles on the disks of a RAID system and the time-variant BERs of one disk are the key challenges in modeling the reliability of SSD-based RAIDs [8].
How the parity units are stored among the disks of the array affects the reliability of the RAID. By dedicating one disk to parity in RAID4, the parity disk experiences a higher number of P/E cycles; as a result, the reliability of the parity disk degrades faster than that of the other disks, and the parity disk wears out much faster. In other words, the vulnerability to failure of one disk, i.e., the parity disk, is much higher than that of the other disks. In contrast, by distributing parity units evenly among all disks in RAID5, all disks experience the same number of P/E cycles. The same P/E cycles on all disks is desirable when the number of P/E cycles is small, because when the number of P/E cycles is small, the BER is also low and therefore the reliability of all disks is high. However, as the number of P/E cycles on all disks of the array increases and approaches the endurance limit, the reliability of the array is severely threatened. In this case, the simultaneous failure of two disks is highly probable: it occurs when the second disk fails during the recovery process of the first disk.
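To illustrate this wear asymmetry, the short sketch below accumulates P/E cycles per disk under the two parity placements. The array size, number of writes, and uniform write pattern are illustrative assumptions, not the workloads used later in this paper.

```python
# A minimal sketch contrasting how parity placement drives per-disk P/E cycles.
# Each stripe write rewrites one data unit plus the stripe's parity unit.
import random

def simulate_pe(num_disks, num_stripe_writes, policy):
    """Accumulate P/E cycles per disk under a RAID4-like or RAID5-like placement."""
    pe = [0] * num_disks
    for _ in range(num_stripe_writes):
        stripe = random.randrange(10_000)
        if policy == "RAID5":
            parity = stripe % num_disks     # rotating parity, even wear
        else:                               # "RAID4"
            parity = num_disks - 1          # dedicated parity disk, uneven wear
        data = random.choice([d for d in range(num_disks) if d != parity])
        pe[data] += 1       # data unit rewritten
        pe[parity] += 1     # parity unit rewritten on every stripe update
    return pe

random.seed(0)
print("RAID5:", simulate_pe(5, 100_000, "RAID5"))  # roughly equal wear on all disks
print("RAID4:", simulate_pe(5, 100_000, "RAID4"))  # the parity disk wears much faster
```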
Fig. 1. The parity distribution in RAID4 (a), RAID5 (b), and RAID6 (c).

Fig. 2. NAND raw BER as a function of P/E cycle [23].
2.2 Estimating the Reliability of SSD-based RAID
There are several approaches to estimating the reliability of a disk array. Mean Time To Data Loss (MTTDL) is the primary and simplest approach, which has been used in various literature for reliability evaluation [14]. Although MTTDL is a traditional metric for estimating the reliability of storage systems, there are major problems with MTTDL in measuring storage system reliability, as discussed in [15]. Several metrics have been proposed in the literature as alternatives to MTTDL, such as Data Loss events per Petabyte Year (DALOPY) [16], Double-Disk Failures per 1000 Reliability Groups (DDFpKRG) [17], and Bit Half-Life (BHL) [18]. Due to the inherent characteristics of SSDs, the reliability of an array of SSDs is modeled differently from that of HDDs in several aspects. As mentioned before, SSDs suffer from the write endurance limit. Moreover, the different BERs of the SSDs in an array of disks are also a challenge for computing the reliability of SSDs. It is worth noting that the lifetime as well as the endurance of the RAID was considered in recent studies to compare the reliability of SSD-based RAIDs [19][20]. The main studies on the reliability of SSD-based arrays are summarized as follows. A comprehensive analysis of different disk arrays is performed in [15] to investigate whether MTTDL provides an appropriate estimation of disk array reliability. The results of the investigation show that MTTDL is a proper estimation of long-term reliability; however, it underestimates the short-term reliability. In [8], an analytical model of reliability for SSD-based RAIDs is proposed considering the time-variant bit error rate and the aging of SSDs. To consider the effect of parity distribution on the reliability of systems, a comparison of RAID5 and Diff-RAID is performed in [8]. The result of the comparison shows that RAID5 provides better reliability when the bit error rate is low, while Diff-RAID provides higher reliability in the case of similar error rates and recovery rates. In [21], heterogeneous devices with different BERs are considered. The authors developed a mapping algorithm for erasure codes among devices to provide the maximum reliability for the system. They used a Monte Carlo reliability simulator, called the High-Fidelity Reliability (HFR) Simulator [22], to simulate the reliability of (flat) XOR-based codes.
2.3 Reliability Challenges in SSD-based RAID
As mentioned before, the use of SSDs in RAID systems has increased to provide better performance and reliability. However, there are several challenges in SSD-based RAIDs due to the inherent characteristics of SSDs. One of these challenges, which this paper attempts to solve, is estimating the reliability of SSD-based RAIDs with regard to different BERs. The reliability of SSD-based RAIDs differs from that of HDD-based ones due to the dependency between the number of P/E cycles and the BER. As shown in Fig. 2, by increasing the number of P/E cycles, the BER increases exponentially [23]. Therefore, the number of P/E cycles of each disk affects the reliability of that disk [4]. In a system with several SSDs, each SSD may receive a different number of P/E cycles, i.e., have a different BER, during the operation of the system. Moreover, the number of P/E cycles of each SSD increases with the lapse of time. This means that, in an SSD-based RAID, there are several SSDs with different reliabilities, while the reliability of each disk also varies over time. There is a mapping between the number of P/E cycles and the bit error rate (BER). This mapping can be extracted from real data [23], and it follows an exponential function: BER = f(P/E) ≈ exp(P/E). In this paper, without loss of generality, only an example of the range of P/E cycles and their corresponding BERs is considered. As mentioned in Section 2.1, the parity units are distributed differently in RAID4 and RAID5, resulting in different BERs for the disks of the array. Unlike an HDD-based array, the parity distribution imposes different BERs on each disk of an SSD-based array. In RAID4, one disk, i.e., the parity disk, reaches the endurance limit much faster than the other disks and should be replaced. On the other hand, in RAID5, by wearing out all the disks at the same rate, when the number of P/E cycles is close to the endurance limit, simultaneous failures are more likely to occur.
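As a rough illustration of the mapping BER = f(P/E) ≈ exp(P/E), the sketch below uses a simple exponential model. The coefficients are placeholders chosen only for illustration; real values would have to be fitted to measured raw-BER data such as the curve in Fig. 2 [23].

```python
# A hedged sketch of the P/E-cycle-to-BER mapping BER = f(P/E) ≈ exp(P/E).
# The coefficients a and b are illustrative assumptions, not fitted values.
import math

def raw_ber(pe_cycles, a=1e-8, b=7e-4):
    """Exponential model: the raw BER grows exponentially with the P/E count."""
    return a * math.exp(b * pe_cycles)

for pe in (0, 1000, 5000, 10000):
    print(f"P/E = {pe:6d}  ->  raw BER ~ {raw_ber(pe):.2e}")
```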
3 THE PROPOSED RELIABILITY MODELS
This paper introduces two models for evaluating the reliability of SSD-based RAIDs: (1) a Quantitative Reliability Model (QRM), which uses two new metrics for reliability evaluation, and (2) an Interval Reliability Model (IRM), which computes the reliability of the system with regard to the BER of the blocks of the SSDs. These two models are introduced in
the following subsections in detail.
3.1 The Proposed Quantitative Reliability Model (QRM)
In this subsection, a quantitative model for estimating reliability is presented. This model estimates the reliability with regard to two concepts: 1) the time of the first disk failure in the RAID, and 2) the probability of simultaneous failures of disks. Both of these concepts affect the reliability of the RAID system. As discussed in Section 2.1, the reliability of the traditional RAIDs, i.e., RAID4 and RAID5, is threatened in terms of the first disk failure and the simultaneous disk failures, respectively. As there were no models that consider the above-mentioned concepts in the evaluation of reliability, this paper introduces two metrics for this purpose: a) the duration of the safe time, and b) the variance of the dead time. Before introducing these metrics, four different zones for the duration of an SSD's operation need to be defined. Due to the effect of P/E cycles on the reliability of SSDs, in this paper the duration of operation of each SSD is divided into four different zones with respect to the number of P/E cycles. These zones are defined as follows:
Safe zone: In this zone, the BER of the SSD is near zero, and as a result the reliability of the SSD can be assumed to be very close to one. When all the SSDs of a RAID system are placed in this zone, the reliability of the system can be assumed equal to 1.
Semi-safe zone: In this zone, the BER of the SSD is not zero, but still low enough to be acceptable, so the reliability of the SSD is high enough to tolerate read, write, and erase disturb errors. When at least one of the SSDs enters the semi-safe zone, the reliability of the system becomes less than one. As more SSDs leave the safe zone and enter the semi-safe zone, the reliability of the RAID system decreases.
Unsafe zone: In this zone, the BER of the SSD is considerable and the reliability of the SSD is lower than in the two above-mentioned zones. By increasing the number of SSDs in this zone, the reliability of the RAID system decreases rapidly. It is strongly recommended that, in the worst case, at most one SSD be placed in this zone at any moment. To minimize the failures caused by read, write, and erase disturbs, an SSD that has entered the unsafe zone should leave this zone as soon as possible. When an SSD leaves this zone, it enters the dead zone and should be replaced with a new one.
Dead zone: In this zone, the number of P/E cycles of the SSD has reached the endurance limit. An SSD in this zone is not reliable anymore and should be replaced with a new one. Due to the replacement penalty, it is important not to place more than one SSD in this zone simultaneously.
The state diagram of these zones is illustrated in Fig. 3.
Fig. 3. The state diagram of an SSD in the different zones (Safe, Semi-Safe, Unsafe, Dead), followed by Replacement.
These four zones are defined only to clearly distinguish four different reliability levels of the system. The boundaries of these zones and the separator points may differ for various applications. For example, based on the level of the reliability requirement, the unsafe zone can be considered as the dead zone (the time of replacement) in some applications. The boundary between the unsafe and semi-safe zones is not defined precisely in this paper, because these two zones are not used in the processes of the proposed Hy-RAID; they can be considered as a single zone placed between the safe and dead zones. With respect to the above-mentioned zones, this paper proposes a Quantitative Reliability Model (QRM) for comparing the reliability of SSD-based RAIDs. This model applies two metrics for evaluating the reliability of SSD-based RAIDs by considering the zones of their SSDs. These metrics are defined as follows.
1) Duration of Safe Time: this metric indicates the duration of time during which all SSDs of a RAID system are placed in the safe zone. The time when the first SSD leaves the safe zone marks the end of the safe time. As the reliability of an SSD in the safe zone is high (approximately one), as long as the SSDs of a RAID system stay in this zone, high reliability is achieved for the system.
2) Variance of Dead Time: this metric indicates the difference between the dead times of the SSDs. As the reliability of SSDs in the dead/unsafe zones is significantly low, the minimum number of SSDs should be placed in these zones at any time. The larger the number of SSDs placed in these zones, the higher the probability of simultaneous failures. If the difference between the dead times becomes large enough, the probability of simultaneous failures decreases.
The reliability evaluation policy of QRM in this paper is defined based on two claims using the above two metrics. The proof of these two claims is straightforward; Appendix A and Appendix B provide an example for Claim I and Claim II, respectively.
Claim I. The longer all SSDs of a RAID system remain in the safe zone, the more reliable the RAID system is. The reliability of a RAID with a longer duration of safe time is higher than that of a RAID with a shorter duration of safe time.
Claim II. The larger the variance among the dead times of the SSDs in a RAID, the smaller the number of SSDs placed in the unsafe zone, decreasing the probability of simultaneous failures.
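The two QRM metrics can be computed directly once each SSD's history is reduced to the time it leaves the safe zone and the time it enters the dead zone. The sketch below does this for hypothetical traces; the numbers and the helper names are assumptions for illustration, not measured results.

```python
# A minimal sketch of the two QRM metrics: duration of safe time and variance of dead time.
from statistics import pvariance

def duration_of_safe_time(safe_exit_times):
    """Time until the FIRST SSD leaves the safe zone (all SSDs still safe until then)."""
    return min(safe_exit_times)

def variance_of_dead_time(dead_times):
    """Spread of the replacement times; a larger spread means disk failures overlap
    less, i.e., simultaneous failures are less probable."""
    return pvariance(dead_times)

# Hypothetical traces (in hours) for a 4-SSD array under two parity policies.
even_policy   = {"safe_exit": [900, 910, 905, 915],    "dead": [2000, 2010, 2005, 2015]}
hybrid_policy = {"safe_exit": [980, 1400, 1800, 2100], "dead": [2100, 2600, 3100, 3600]}

for name, trace in (("even", even_policy), ("hybrid", hybrid_policy)):
    print(name,
          "safe time =", duration_of_safe_time(trace["safe_exit"]),
          "dead-time variance =", round(variance_of_dead_time(trace["dead"]), 1))
```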
3.2 The Proposed Model for Reliability of SSD-based RAID: Interval Reliability Model (IRM)
In this subsection, an approach for estimating the reliability
of SSD-based RAIDs, called IRM, is proposed. This model computes the reliability of an SSD-based RAID with regard to the BER of the blocks of its SSDs. IRM proceeds in three steps: first, the reliability of the blocks is computed with respect to the BER of each block; second, the reliability of the disks (each consisting of several blocks) is computed using the reliability of the blocks; and finally, the reliability of the RAID system (an array of disks) is computed with respect to the reliability of the disks and the level of protection of the RAID, i.e., the number of parity disks in the RAID system. To account for the number of P/E cycles of the blocks in the reliability estimation, k different BER-levels with different BERs are defined for each disk with regard to the number of P/E cycles of the blocks. As the P/E cycles are distributed among the blocks of a disk, each block may belong to any of the k different BER-levels. The number of blocks in the ith level is denoted by b_i, and all of the b_i blocks are considered to have the same BER. The reliability of each block in level i (blk_i) with constant BER_i is computed by the exponential distribution shown in Equation (1). Note that the BER in Equation (1) refers to the errors that are not corrected by the ECC, and the term BER(t) indicates the BER of a block at time t.
R_blk_i(t) = e^(−BER_i(t) · t)    (1)
The serial connection among the blocks of a disk is assumed, and also for the sake of simplicity, we do not consider the reserved blocks in our computation. Therefore, the failure of one block in the disk will result in the whole disk failure. Assuming a serial connection, the reliability of a disk with k different BER-levels is calculated by Equation (2):

R_disk = ∏_{j=1}^{b_1} R_blk_1 × ∏_{j=1}^{b_2} R_blk_2 × … × ∏_{j=1}^{b_k} R_blk_k = ∏_{i=1}^{k} ∏_{j=1}^{b_i} R_blk_i = ∏_{i=1}^{k} (R_blk_i)^{b_i}    (2)
The reliability of a RAID system with n data disks and one parity disk (i.e., RAID (n+1)) is computed as follows. A RAID system with one parity disk can tolerate the failure of one disk. An array consisting of n+1 SSDs works correctly if either of the two following conditions is met: 1) all n+1 disks of the array work correctly, or 2) exactly one disk of the array has failed, while the other n disks work correctly. Thus, the reliability of an array consisting of n data disks and one parity disk (i.e., R_RAID(n+1)) is computed by Equation (3):

R_RAID(n+1) = ∏_{i=1}^{n+1} R_disk_i + ∑_{f=1}^{n+1} [ (1 − R_disk_f) × ∏_{i=1, i≠f}^{n+1} R_disk_i ]    (3)

The reliability of the whole system is the union of the two conditions; these two conditions are mutually exclusive, so their intersection is zero (i.e., P(A ∩ B) = 0). In this equation, R_disk_i stands for the reliability of the ith disk, and R_disk_f stands for the reliability of the failed disk. Any of the disks may fail while the remaining disks work properly, which is captured by the term (1 − R_disk_f) summed over all disks.
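A minimal sketch of the IRM computation in Equations (1)-(3) is given below, following the exponential form reconstructed for Equation (1). The BER values, block counts, and time horizon are illustrative assumptions, and the per-block reliability expression should be read as an assumption of this sketch wherever the original equation differs.

```python
# A minimal sketch of the IRM steps: block reliability (Eq. 1), disk reliability (Eq. 2),
# and RAID reliability with one parity disk (Eq. 3). All inputs are illustrative.
import math

def block_reliability(ber, t):
    """Equation (1): exponential reliability of a block with (uncorrected) BER at time t."""
    return math.exp(-ber * t)

def disk_reliability(levels, t):
    """Equation (2): product over k BER-levels, with b_i blocks per level,
    assuming a serial connection among the blocks of the disk."""
    r = 1.0
    for ber_i, b_i in levels:
        r *= block_reliability(ber_i, t) ** b_i
    return r

def raid_reliability(disk_rels):
    """Equation (3): either all disks work, or exactly one disk fails (one parity disk)."""
    all_work = math.prod(disk_rels)
    one_fails = sum((1.0 - rf) * math.prod(r for j, r in enumerate(disk_rels) if j != f)
                    for f, rf in enumerate(disk_rels))
    return all_work + one_fails

# Example: 4 disks, each with two BER-levels given as (BER, number of blocks).
t = 1000.0
disks = [[(1e-9, 5000), (1e-8, 200)] for _ in range(4)]
rels = [disk_reliability(d, t) for d in disks]
print("per-disk reliability:", [round(r, 6) for r in rels])
print("RAID(n+1) reliability:", round(raid_reliability(rels), 6))
```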
[11] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, "Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis," in Proceedings of the Conference on Design, Automation and Test in Europe (DATE), 2012, pp. 521-526.
[12] A. Brand, K. Wu, and S. Pan, "Novel Read Disturb Failure Mechanism Induced by FLASH Cycling," in Proceedings of the 31st Annual International Reliability Physics Symposium, 1993, pp. 127-132.
[13] E. Rozier, "Understanding the Fault-Tolerance Properties of Large-Scale Storage Systems," Ph.D. Dissertation, University of Illinois at Urbana-Champaign, Department of Computer Science, Champaign, IL, USA, 2011.
[14] K. M. Greenan, J. S. Plank, and J. J. Wylie, "Mean Time to Meaningless: MTTDL, Markov Models, and Storage System Reliability," in Proceedings of the 2nd USENIX Conference on Hot Topics in Storage and File Systems (HotStorage), 2010, pp. 5-5.
[15] J. F. Pâris, T. J. E. Schwarz, D. D. E. Long, and A. Amer, "When MTTDLs Are Not Good Enough: Providing Better Estimates of Disk Array Reliability," in Proceedings of the 7th International Information and Telecommunication Technologies Symposium, 2008, pp. 140-145.
[16] J. L. Hafner and K. Rao, "Notes on Reliability Models for non-MDS Erasure Codes," Tech. Rep. RJ-10391, IBM, Oct. 2006.
[17] J. Elerath and M. Pecht, "Enhanced Reliability Modeling of RAID Storage Systems," in Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2007, pp. 175-184.
[18] D. S. H. Rosenthal, "Bit Preservation: A Solved Problem?" in Proceedings of the 5th International Conference on Preservation of Digital Objects (iPRES), 2008.
[19] J. Kim, J. Lee, J. Choi, D. Lee, and S. H. Noh, "Enhancing SSD Reliability Through Efficient RAID Support," in Proceedings of the Third ACM SIGOPS Asia-Pacific Conference on Systems (APSys), 2012, pp. 4-4.
[20] S. Lee, B. Lee, K. Koh, and H. Bahn, "A Lifespan-aware Reliability Scheme for RAID-based Flash Storage," in Proceedings of the ACM Symposium on Applied Computing (SAC), 2011, pp. 374-379.
[21] K. M. Greenan, E. L. Miller, and J. J. Wylie, "Reliability of Flat XOR-based Erasure Codes on Heterogeneous Devices," in Proceedings of the IEEE International Conference on Dependable Systems and Networks With FTCS and DCC, 2008, pp. 147-156.
[22] K. M. Greenan and J. J. Wylie, "High-fidelity Reliability Simulation of Erasure-Coded Storage," Technical Report, Hewlett-Packard Labs, 2008.
[23] http://www.mdpi.com/1424-8220/14/10/18851/htm.
[24] J. S. Bucy, The DiskSim Simulation Environment Version 4.0 Reference Manual, 2008. http://www.pdl.cmu.edu/DiskSim
[25] SSD Extension for DiskSim Simulation Environment, 2009. [Online]. Available: http://research.microsoft.com/enus/downloads/b41019e2-1d2b-44d8-b512-ba35ab814cd4
[26] J. Katcher, "PostMark: A New File System Benchmark," October 1997.
[27] Exchange Trace. SNIA IOTTA Repository. http://iotta.snia.org/traces/130, accessed Apr. 2010.
[28] Build Server Trace. SNIA IOTTA Repository. http://iotta.snia.org/traces/158, accessed Apr. 2010.
[29] W. D. Norcott, "IOzone." http://www.iozone.org.
Saeideh Alinezhad Chamazcoti received the B.Sc. degree from Sharif University of Technology (SUT) and the M.Sc. degree from Isfahan University of Technology (IUT), both in Computer Engineering, Iran, in 2007 and 2011, respectively. She is currently a Ph.D. student at Sharif University of Technology. Her research interests include the reliability of storage systems using Solid-State Drives (SSDs), erasure codes, and the endurance of SSDs.

Seyed Ghassem Miremadi is a Professor of Computer Engineering at Sharif University of Technology. As fault-tolerant computing is his specialty, he initiated the "Dependable Systems Laboratory" at Sharif University in 1996 and has chaired the Laboratory since then. The research laboratory has participated in several research projects which have led to several scientific articles and conference papers. Dr. Miremadi and his group have done research in Physical, Simulation-Based, and Software-Implemented Fault Injection, Reliability Evaluation Using HDL Models, Fault-Tolerant Embedded Systems, Fault-Tolerant NoCs, and Fault-Tolerant Real-Time Systems. He was the Education Director (1997-1998), the Head (1998-2002), the Research Director (2002-2006), and the Director of the Hardware Group (2009-2010) of the Computer Engineering Department at Sharif University. From 2003 to 2010, he was the Director of the Information Technology Program at Sharif International Campus on Kish Island. From 2010 to 2012, Dr. Miremadi was the Vice-President of Academic Affairs (VPAA) of Sharif University, and he is currently the VPAA of the university. He served as the general co-chair of the 13th Int'l CSI Computer Conference (CSICC 2008). He is currently the Editor of the Scientia Journal on Computer Science and Engineering. Dr. Miremadi received his M.Sc. in Applied Physics and Electrical Engineering from Linköping Institute of Technology and his Ph.D. in Computer Engineering from Chalmers University of Technology, Sweden, in 1984 and 1995, respectively. He is a senior member of the IEEE Computer Society and the IEEE Reliability Society.