MDS Product Code Performance Estimations under Header CRC Check Failures and Missing Syncs

Suayb S. Arslan, Member, IEEE, Jaewook Lee, Member, IEEE, Jerry Hodges, James Peng, Hoa Le and Turguy Goker
Abstract—Data storage systems that use removable media and are equipped with massive storage capacities (in the TB range) rely heavily on strong concatenated Error Correction Coding (ECC) architectures in order to guarantee very low target data loss rates. In particular, tape drives and compact disk players employ powerful ECC schemes based on the concatenation of an outer Maximum Distance Separable (MDS) code called C2 and an inner MDS code called C1 in order to achieve this performance. In addition to data, these storage systems employ header and synchronization appends (sync patterns) for the appropriate allocation of user information on the physical storage medium. Since headers and sync patterns are subject to channel errors as well, accurately retrieved data may be regarded as useless if an error occurs in either of these fields. In order to predict the very low target C2 failure rates (typically on the order of 10^{-17}) in the presence of header and synchronization errors, we propose a semi-analytical method in this paper that incorporates the effects of the header and synchronization errors into the output error rate expressions. We use our proposed model with Linear Tape Open (LTO) data examples to both show the effectiveness of the estimation results and draw some interesting conclusions.

Suayb S. Arslan, Jaewook Lee, James Peng, Hoa Le and Turguy Goker are with the Advanced Development Laboratory, Quantum Corporation, 141 Innovation Dr., Irvine, CA, 92617, USA, e-mail: (suayb.arslan@quantum.com, jw.lee@quantum.com, james.peng@quantum.com, hoa.le@quantum.com and turguy.goker@quantum.com). Jerry Hodges is with Red Digital Cinema Company, 34 Parker, Irvine, CA, 92618, USA, e-mail: [email protected]. The manuscript was prepared and submitted on May 19, 2014 and resubmitted on July 8, 2014. This study is fully supported by Quantum Corporation. Please direct all correspondence to Suayb S. Arslan should you have any concerns or questions. © 2014 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained by sending a request to [email protected]. The original (edited) version of this paper is published in the September issue of IEEE Transactions on Device and Materials Reliability. This is the version from one of the review cycles in the publication process.
Index Terms—Error-correcting codes, reliability, product codes, optical and magnetic recording, linear tape open.
I. INTRODUCTION

Reliability and endurance of storage devices that use removable media are usually impacted by a number of factors that differ from those of many conventional modern storage devices. Customers and system designers must understand the parameters that influence device longevity and reliability, such as media physics, the efficiency and robustness of the error correction mechanism, and how efficiently the data is spread across the recording medium, among other factors. The error correction capabilities of such devices usually constitute the major defense mechanism against system abnormalities and failures, thereby playing the key role in determining the impact of device-level failure rates. It is also important to accurately estimate device-level failure rates in order to predict the future reliability of large storage systems that have these devices as subcomponents. To lay the groundwork for understanding component and device-level reliability and endurance, we shall take a closer look in this paper at error correction capabilities and related intrinsic device properties that impact device durability.

Contrary to the Hard Disk Drive (HDD) industry [1], which conventionally uses a single channel and a single-stage Error Correction Coding (ECC) engine based on Reed-Solomon (RS) and/or Low Density Parity Check (LDPC) codes, storage systems such as tape and optical drives have to offer better error correction capabilities due to their highly susceptible mechanical nature and removable media wear. For these reasons, tape and optical drive designers use concatenated ECC engines known as product ECCs [2], [3]. A product code is a two-dimensional error correction code, constructed by encoding a rectangular array of user data with one code along rows and with another along columns. Receiving its major strength from multiple cross parity checks, product codes are an excellent way to form a long code with higher correction capability, particularly for small burst-type errors, using short constituent codes. For example, tape and optical storage systems use a product code in which both the horizontal and vertical encodings are performed in sequential order using high rate systematic shortened/punctured RS codes.
Here the systematic code refers to the input data appearing in the encoded data output, and shortening/puncturing is a known technique for deriving short codewords from long codewords [3]. For efficient and low complexity decoding of product codes, a modified version of generalized minimum distance decoding [4] is used in storage systems to obtain target unrecoverable error rates below 10^{-17}. In other words, the expected number of error-free reads is on the order of one hundred quadrillion before the user sees a defective hard read error. However, such low error rates are extremely hard, if not impossible, to observe in real-life simulations. Although other external factors such as environmental conditions lead to much higher failure rates in reality [5], error rates are still projected to be below some threshold to meet customer storage reliability expectations. Therefore, neat mathematical models are needed for predicting the low failure rates which might otherwise never be obtained using the available simulation tools based on real data.

Reliability analysis of ECC-protected storage devices is an active area of research. Recent studies usually focus on the reliability analysis of memory devices and their applications [6], [7]. The methods used for the analysis are based on a particular ECC structure and a relevant stochastic model. Straightforward stochastic models are shown to be of little use for the prediction of long-term reliability. In various applications, real-life failures are shown to follow more complicated models [8], [9]. It has been shown that defective reads and media wear might produce correlation between errors that such simple models cannot capture. For example, a dead track is a common problem for magnetic tapes in which many symbols belonging to the same codeword can be affected at the same time [10]. In addition, there are other error mechanisms that can trigger user data loss, such as missing syncs and/or corrupted header information, that need to be accounted for in the final performance predictions. In order to predict the reliability of such media-driven storage, accurate mathematical models are needed that incorporate as many real-life situations as possible. To do that, drive data captures might be used to extract real-life phenomena. Conventionally, real-life behavior is captured by processing the collected data waveforms. However, the suitability of the model is critical to be able to effectively use the information provided by the collected data. The main focus of this paper is therefore the presentation of a genuine approach to this problem by introducing a data-driven non-homogeneous stochastic process that can predict C2 failure rates in the presence of correlated data errors, latent defects [8], and bad header and sync information.
In other words, the presented model is able to capture, to some degree, the performance degradation due both to various correlated data errors and to defective header/sync information.

The rest of the paper is organized as follows. In Section II, we introduce the error correction code of interest along with the details of the encoding and decoding operations. The relationship between device-level failures and the reliability theory used to predict the long-term durability of large storage systems is discussed before the system model description is given. This section also includes a brief description of the data collection methodology. In Section III, we introduce renewal processes, which will be shown to be useful for the reliability prediction of the internal ECC mechanism of storage devices. Section IV outlines the procedure we use to estimate the decoding failure probabilities. We derive useful expressions for C2 decoder failure and byte error rates. Some numerical results related to Linear Tape Open (LTO) product reliability are provided in Section V. Finally, Section VI concludes the paper.

II. BACKGROUND AND SYSTEM MODEL

A. Product code fundamentals

In data storage applications, ECC is internal to the design to achieve very high data integrity. In many standards, including the LTO standard, a product code is used as the ECC mechanism of the read channel for providing highly reliable data protection. A product code can be constructed by concatenating two block codes: a code A with parameters (n1, k1, d1) and a code B with parameters (n2, k2, d2), where ni, ki and di (i = 1, 2) stand for the codeword length, the number of information symbols and the minimum Hamming distance of the code, respectively. A typical construction of the product code C = A × B is shown in Fig. 1 using two systematic block codes. A k1 × k2 data array is first encoded vertically using code B (coding k1 columns using code B) and the encoded data is then horizontally coded using code A (coding n2 rows using code A). The order of encoding operations does not matter if the constituent codes are linear. The parameters of this product code are given by (n1 n2, k1 k2, d1 d2). In general, any block code with parameters (n, k, d) satisfies the well-known Singleton bound d ≤ n − k + 1 [3]. Block codes that achieve this bound with equality are called Maximum Distance Separable (MDS) codes. If the constituent codes of the product code have the MDS property, the code is called an MDS product code, and Reed-Solomon (RS) codes [11] are one such class of constituent linear block codes.
Fig. 1. A construction of a product code using two block codes. Each horizontal row and vertical column is a codeword. Other types of constructions are possible to define a product code.
We use systematic RS codes in our study, where the code alphabet is chosen over the finite field F_q and q is a prime power. Since the minimum distance of the code is d1 d2, ideally this code can correct up to d1 d2 / 2 errors by using an appropriate decoding algorithm.

B. Decoding

The covering radius of a block code of length n is defined as the smallest integer ρ such that all words in the containing space F_q^n are within a Hamming distance ρ of some codeword. The covering radius of product codes is in general much larger than the minimum Hamming distance [12]. Therefore, the error correction capability of these codes is much greater than what the minimum distance suggests, i.e., it is possible to decode product codes beyond half the minimum distance. Although Maximum Likelihood (ML) decoding is optimal, it is usually too complex to exploit the full potential offered by product codes [13]. In this study, we consider low-complexity decoding alternatives. A specialized case of generalized minimum distance decoding is employed that is shown to give performance results close to ML decoding in some particular scenarios [4]. Although such an algorithm can decode beyond half the minimum distance of the code, a dual mode of operation [14] (error/erasure decoding) might be preferable over other modes/settings, particularly in bursty noisy environments, due to complexity reasons.
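To make the code parameters and the error/erasure trade-off concrete, the following minimal Python sketch computes the product code parameters and checks the classical error/erasure correctability condition 2e + s ≤ d − 1 for a single MDS component codeword. The (240, 230) and (96, 84) RS component codes are borrowed from the numerical examples of Section V and are used here purely as illustrative assumptions.

# Minimal sketch: product code parameters and MDS error/erasure correctability.
# The (240, 230) and (96, 84) component codes are illustrative assumptions
# (the same parameters appear in the numerical examples of Section V).

def mds_params(n, k):
    """Return (n, k, d) of an MDS code meeting the Singleton bound with equality."""
    return n, k, n - k + 1

def product_params(c1, c2):
    """Parameters of the product code built from two component codes."""
    n1, k1, d1 = c1
    n2, k2, d2 = c2
    return n1 * n2, k1 * k2, d1 * d2

def correctable(n, k, errors, erasures):
    """Bounded-distance error/erasure decoding succeeds iff 2e + s <= d - 1 = n - k."""
    return 2 * errors + erasures <= n - k

c1 = mds_params(240, 230)   # inner C1 code (illustrative)
c2 = mds_params(96, 84)     # outer C2 code (illustrative)

print("C1:", c1, "C2:", c2)
print("Product code (n, k, d):", product_params(c1, c2))
print("C1 with 3 errors + 4 erasures correctable?", correctable(240, 230, 3, 4))
print("C2 (erasure-only) with 12 erasures correctable?", correctable(96, 84, 0, 12))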
The C1 RS decoder is a low-complexity, hard-decision-based bounded distance decoder. In other words, the decoder recovers the correct codeword if and only if the received word falls within the sphere of radius ⌊(n−k)/2⌋ centered around the original codeword in F_q^n. The decoding operation can result in a successful decoding, a decoding failure or a decoding error, i.e., a misdetection. If a (n, k, n−k+1) RS codeword has more than ⌊(n−k)/2⌋ errors, and no other codeword has more than n − ⌊(n−k)/2⌋ symbols in common with the bad codeword, then the decoder declares a failure. In case of a decoding failure, the C1 codeword is regarded as “uncorrectable” and the associated symbols are subsequently labeled as “erasures” with probability one. On the other hand, the C2 RS decoder works in erasure correction mode: it takes the C1-provided symbol erasures as input and attempts decoding with increased performance and reduced complexity. Since erasures are defined to be errors whose locations are known, the decoding operation for C2 has low complexity because it is almost-all erasure correction and error locations need not be computed. Here we say “almost-all” erasures because C1 can miscorrect a codeword and those erroneous codewords will pass the decoder unnoticed, yet in general the probability of this is quite low. Due to the linearity of RS codes, the decoding can be performed in any order with the associated mode of decoding operations. Alternatively, decoder iterations might be allowed between the two decoding operations using either hard or soft information [15]. However, we keep the decoding procedure simple in this study to illustrate the main focus of the paper. In addition, such a simple approach is widely adopted by hardware developers due to its low implementation complexity and very good practical performance.

C. Relationship to Reliability Theory

Traditionally, Markov models are used to evaluate the reliability of erasure-coded storage systems [16]. These stochastic models have provided a great deal of insight into the sensitivity of the overall system reliability to device failures and repairs. Although these models generally capture an extremely simplistic view of an actual system, the improvements made to the original canonical Markov model are shown to be useful for predicting long-term reliability. One of these improvements includes the incorporation of hard or uncorrectable errors of storage devices [17], which are used as subcomponents of large storage systems.
Fig. 2. A coded entity is constructed by block interleaving ω C1 codewords with sync and header appends. The data consists of a concatenation of many coded entities separated by format-related patterns/bit streams.
Modern storage systems require maintenance and repair in order to maintain data integrity. In order to repair the data content of a failed device, the data stored in a subset of other devices must be accessed, read and processed. The random data read process increases the probability of hitting a hard or uncorrectable error, which must be incorporated into the system reliability model. The accuracy of these reliability models is therefore dependent upon the accuracy of the hard error estimations. The main focus and results of this paper concern the accurate estimation of these hard or uncorrectable error rates for storage devices that rely heavily on removable media.

D. System Model

Let us describe the system model we use based on the layout given for tape systems [10]. An encoded data block is generated by block interleaving the columns of ω n1 × n2 encoded data arrays, or subdata sets, of Fig. 1. Each row in the interleaved array is identified as a coded entity known as CodeWord Interleaved (CWI) in the tape community. To each coded entity of size ωn1 (ω interleaved C1 codewords, see Fig. 2), some header information is appended to make up a frame. Headers typically contain a header index, mechanics-related parameters, location information and error detection check sequences. A block of synchronization information is then appended to each frame to form the final bit stream that will later be line encoded (the line encoding corresponds to some form of run-length constrained coding that ensures some spacing between repeating zeros and/or ones for appropriate magnetic recording) and recorded.
A frame with header and synchronization appends is shown in Fig. 2. The error detection check sequence in the header field is a Cyclic Redundancy Check (CRC) code for identifying any possible changes in the header field made during the read operation. If the CRC check fails, there is no way to use the encoded data even if the decoding operations are successful. Drive systems also use dedicated bit patterns to sync to the beginning of each frame. When there is an error in the sync fields, the frames are missed and the data section is no longer useful. Header and sync information integrity is therefore crucial to the overall system reliability. In this study, neither the header field nor the sync patterns are protected by any type of ECC. If headers and/or sync information are made part of the ECC [18], the results in this paper will still be applicable, although the header/sync errors will in this case be correlated with the decoder failures, making the analysis a bit more complicated. The recorded data is read and processed, and the recorded bits are detected using advanced detection algorithms such as the list noise-predictive ML detector [19], [20]. The detected bit stream is rearranged and reformatted to form subdata sets for C1 and C2 decoding. This system model is briefly shown in Fig. 3.

E. Data Collection

We assume that either missing sync information and/or headers with CRC check failures render the entire frame useless, i.e., all ω codewords are assumed to have failed. If both sync and headers are error-free, the number of failed codewords can be 0, 1, . . . , ω, where each possibility occurs with some probability. Thus, there is a probability distribution underlying the number of uncorrectable codewords, denoted by ρR(r), which is usually a function of system parameters (for storage systems, for example, the operating density is one of the parameters that affects this distribution; others might be the head and media types). For convenient analysis, we collect and divide the actual data waveforms into five disjoint sets A1, A2, A3, B and C as shown in Table I. The main reason for having disjoint sets is that the error processes are assumed to be independent to make our data analysis tractable. The data formatter distributes the frames along and across the storage medium to achieve media-level decorrelation as much as possible. A minimal sketch of this frame classification and of the empirical estimation of ρR(r) is given below.
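The following Python sketch illustrates one way the classification of Table I and the empirical penalty distribution ρR(r) could be computed from per-frame decoder logs; the frame records and field names are hypothetical and only serve to make the bookkeeping explicit.

from collections import Counter

# Hypothetical per-frame records: (sync_ok, header_crc_ok, num_failed_c1_codewords).
# In practice these would come from drive read logs; here they are made-up examples.
frames = [(True, True, 0), (True, False, 0), (False, True, 2),
          (True, True, 3), (True, True, 0), (False, False, 1)]

OMEGA = 4  # number of interleaved C1 codewords per frame (illustrative)

def classify(sync_ok, header_ok, failed):
    """Map a frame to one of the disjoint sets of Table I."""
    if not sync_ok and not header_ok:
        return "A1"
    if sync_ok and not header_ok:
        return "A2"
    if not sync_ok and header_ok:
        return "A3"
    return "B" if failed == 0 else "C"

labels = [classify(s, h, f) for (s, h, f) in frames]
print(Counter(labels))

# Empirical penalty distribution rho_R(r): number of useless codewords per frame.
# Frames in A1/A2/A3 contribute OMEGA useless codewords; B and C contribute 'failed'.
penalties = [OMEGA if lab in ("A1", "A2", "A3") else f
             for lab, (_, _, f) in zip(labels, frames)]
rho_R = {r: penalties.count(r) / len(penalties) for r in set(penalties)}
print(rho_R)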
TABLE I
OUTCOMES IN DIFFERENT ERROR SCENARIOS. ✓: NO ERROR, ×: FAILURE

Case | Sync | Header | Number of useless CWs
A1   |  ×   |   ×    | ω
A2   |  ✓   |   ×    | ω
A3   |  ×   |   ✓    | ω
B    |  ✓   |   ✓    | 0
C    |  ✓   |   ✓    | {1, . . . , ω}
III. HEADER/SYNC ERRORS AS RENEWAL PROCESSES

Our device failure modeling is based on the observation that header and/or sync errors appear to arrive at the input of the C1 decoder at certain times, and they can therefore be regarded as “arrivals” of a stochastic process. As shown in Fig. 3, some of the frames are illegible for the drive read heads to extract intelligible information from and are therefore regarded as arrivals, i.e., failure occurrences from a decoding point of view. Since such arrivals have an impact on the overall decoding performance, it is of interest to model this arrival behavior so that an accurate decoding performance estimation can be performed.
Fig. 3. A rough illustration of the system block diagram. Signal processing details are not shown. The details of the tape data format (the Formatter) can be found in one of IBM's patents [10]. DS and SDS denote a data set and a subdata set, respectively.
Note that this frame-level faulty behavior is an extra source of error in addition to the codeword-level failures. Next, we shall discuss the basics of renewal processes and their immediate relevance to our study.

A. Renewal Processes

A renewal process is a special case of a broader class of stochastic processes called point processes, and it is used to model occurrences of events in which the time or space between any pair of occurrences can be approximated by independent and identically distributed (i.i.d.) random variables [21]. Let Y_j, j ≥ 1 be a sequence of i.i.d. random variables, also known as the inter-event times. For all n ≥ 0, we define S_n ≜ ∑_{j=1}^{n} Y_j. The number of renewals in the time interval [0, t] is given by the following counting function,

N([0, t]) = N(t) ≜ sup{n : S_n ≤ t} = ∑_{n=0}^{∞} 1_{[0,t]}(S_n)    (1)

where the indicator function for a set A, 1_A : R → {0, 1}, satisfies

1_A(x) = 1 if x ∈ A, and 1_A(x) = 0 if x ∉ A.    (2)

For example, if Y_j is distributed exponentially with rate λ, i.e., Y_j ~ exp(λ), then the count function is given by the Poisson process with rate or density λ, i.e.,

Pr{N_λ(t) = κ} = e^{−λt} (λt)^κ / κ!    (3)

For another example, if Y_j ~ Weibull(β, θ), i.e., the interarrival density is

f_{Y_j}(t) = (β t^{β−1} / θ^β) e^{−(t/θ)^β}    (4)

then the count function is more complicated and is studied in [24]. Note that an interarrival distribution that accepts more parameter arguments naturally has a better potential for modeling real-life situations. In our study, frames are chosen to be the minimum atomic unit over which the stochastic modeling is described, although this is not necessary. Thus, an arrival in our study is a frame that either contains a bad header, wrong sync information or at least one codeword that causes a C1 decoder failure. This stochastic process formulation can be used to model the data behavior by finding the appropriate parameters of the random process. The procedure for finding the model parameters will be explained in the next subsection.
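As a quick sanity check of the counting description above, the short Python sketch below draws i.i.d. exponential and Weibull interarrival times and counts the renewals N(t) falling in [0, t]; the parameter values are arbitrary illustrations, not estimates from drive data.

import random

def count_renewals(draw_interarrival, t, trials=20000):
    """Monte Carlo estimate of E[N(t)] for i.i.d. interarrival times (eq. (1))."""
    total = 0
    for _ in range(trials):
        s, n = 0.0, 0
        while True:
            s += draw_interarrival()
            if s > t:
                break
            n += 1
        total += n
    return total / trials

random.seed(1)
lam = 0.01                    # illustrative exponential rate (arrivals per frame)
beta, theta = 0.8, 120.0      # illustrative Weibull shape/scale

exp_mean = count_renewals(lambda: random.expovariate(lam), t=1000)
wbl_mean = count_renewals(lambda: random.weibullvariate(theta, beta), t=1000)

print("exponential: E[N(1000)] ~", exp_mean, "(theory:", lam * 1000, ")")
print("Weibull    : E[N(1000)] ~", wbl_mean)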
Since our atomic units are frames for the characterization of arrivals, we need to slightly generalize the original arrival process to encompass the effect of uncorrectable codewords. Let us assume that at each renewal time S_n, we are given a random penalty R_n, where in our study R_n ∈ {0, 1, . . . , ω} is the number of failed codewords in a given CWI. We further assume that the random variables {R_n, n ≥ 0} are independent and that the sequence R_1, R_2, . . . , R_n is i.i.d., i.e., the probability distribution ρ_R(r) does not change with time, implying that the overall process is stationary. The renewal penalty process is defined to be

R(t) ≜ ∑_{i=1}^{N(t)−1} R_i = ∑_{i=1}^{∞} R_i 1_{[0,t]}(S_i)    (5)

where R(t) is the number of uncorrectable codewords (penalty) up to time t. Thus, the C1 uncorrectable rate calculation is transformed into estimating the output limiting behavior

p_{C1} = lim_{t→∞} R(t)/t = E[R_1]/E[Y_1]    (6)

where E[.] is the expectation operator. We finally note that equation (6) follows from the known result in renewal theory that lim_{t→∞} N(t)/t = 1/E[Y_1] [21].
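A small simulation makes the renewal-reward estimator of (6) concrete: renewals carry a random penalty R drawn from an assumed ρR(r), and R(t)/t is compared against E[R_1]/E[Y_1]. All parameter values below are illustrative assumptions, not measured drive statistics.

import random

random.seed(2)

# Illustrative penalty distribution rho_R(r) over {0, 1, ..., omega} (assumed).
rho_R = {0: 0.90, 1: 0.06, 2: 0.03, 3: 0.01}
mean_R = sum(r * p for r, p in rho_R.items())

mean_Y = 150.0   # illustrative mean interarrival time, in frames
t_max = 2_000_000

# Simulate the renewal penalty process R(t) of eq. (5).
t, penalty = 0.0, 0
while True:
    t += random.expovariate(1.0 / mean_Y)                       # interarrival Y_i
    if t > t_max:
        break
    r = random.choices(list(rho_R), weights=list(rho_R.values()))[0]
    penalty += r                                                 # penalty R_i at renewal S_i

print("simulated R(t)/t        :", penalty / t_max)
print("renewal theory E[R]/E[Y]:", mean_R / mean_Y)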
B. Non-Homogeneous Poisson Process

Interarrival distributions are not necessarily exponential. In order to model arbitrary interarrival distributions, we need to extend the homogeneous Poisson process to account for non-homogeneous cases with a time-dependent rate or intensity. A non-homogeneous Poisson process is a generalization of the one-dimensional regular Poisson count process in which the intensity function λ(t) is time dependent. Thus, N(t) has a Poisson distribution with mean ∫_0^t λ(s)ds, i.e.,

Pr{N_{λ(t)}(t) = κ} = e^{−∫_0^t λ(s)ds} (∫_0^t λ(s)ds)^κ / κ!    (7)
As we shall observe later in our numerical results, in high-density recording the interarrival times are rarely exponentially distributed. However, if a simple transformation is applied to unit-rate exponentials, interarrival distributions that assume different functional forms can be obtained. In that transformation, appropriate adjustments of the measure µ(.) of time and of the count model allow us to quickly obtain non-homogeneous Poisson processes. (The concept of “measure” generalizes the time measure to multiple dimensions and is denoted by µ(.); for example, the measure of the time interval [0, t] is µ([0, t]) = t, called the Lebesgue measure.)
For example, let E_i denote a unit-rate (λ = 1) exponential random variable and N_1([0, t]) be the corresponding Poisson count process. Let also F : R → R be a linear transformation function defined by F(x) = x/λ. If N_λ([0, t]) is a Poisson count process with constant intensity λ, one can show that the following count processes are equivalent [21],

N_λ([0, t]) ≡ N_1([0, µ(F^{−1}([0, t]))]) ≡ N_1([0, λt])    (8)

where F^{−1} is the inverse of F and µ(.) is assumed to be the Lebesgue measure [21]. Finally, one can deduce the following equivalence for a non-homogeneous intensity function λ(t),

N_{λ(t)}([0, t]) ≡ N_1([0, ∫_0^t λ(s)ds])    (9)

Therefore, in order to get a non-homogeneous Poisson process, it is sufficient to obtain the renewals using unit-rate exponentials and then appropriately modulate the sampling speed based on the time-dependent rate or intensity function λ(t), as illustrated in the sketch below. This approach will be adopted in our numerical calculations for the accurate estimation of the time-dependent rate function.
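The sketch below is a minimal illustration of equations (8)-(9): unit-rate exponential arrivals are generated and mapped through the inverse cumulative intensity Λ^{−1}(u) = (u/b)^{1/c} for the Weibull-type intensity λ(s) = b c s^{c−1} used later in (11). The parameter values b and c are arbitrary stand-ins, not fitted values.

import random

# Non-homogeneous Poisson process via time transformation (eqs. (8)-(9)):
# arrivals of a unit-rate Poisson process are mapped through the inverse of
# the cumulative intensity Lambda(t) = b * t**c  (so lambda(s) = b*c*s**(c-1)).
b, c = 5e-3, 0.8            # illustrative intensity parameters (assumed)
T = 5000.0                  # observation window, in frames

def inverse_cumulative_intensity(u):
    return (u / b) ** (1.0 / c)

random.seed(3)
arrivals, s = [], 0.0
while True:
    s += random.expovariate(1.0)             # unit-rate exponential interarrival
    t = inverse_cumulative_intensity(s)      # warped arrival time
    if t > T:
        break
    arrivals.append(t)

expected = b * T ** c                         # E[N(T)] = Lambda(T)
print("arrivals observed:", len(arrivals), " expected Lambda(T):", round(expected, 2))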
C. Data fitting and parameter estimations

We obtain the parameters of the interarrival distributions using basic data fitting techniques [22]. The recorded data is assumed to be a concatenation of frames as shown in Figs. 2 and 3, and the time between two renewals is measured in terms of the number of frames. We could have chosen bytes or bits as our basic discrete time unit as well, but that may complicate our computations and would not change our conclusions. For data fitting purposes, we consider the natural logarithm of the survival or reliability function based on the estimated values,

Ψ_{Y_j}(t) ≜ Pr(S_1 > t) = Pr{N_{λ(t)}(t) = 0} = e^{−∫_0^t λ(s)ds}    (10)

and ln(Ψ_{Y_j}(t)) = −∫_0^t λ(s)ds.
In essence, we fit an appropriate function to the data points in the logarithm domain so that the error due to the approximation is minimized. More precisely, if Y_j ~ exp(λ), then ln(Ψ_{Y_j}(t)) = −λt is a linear function of time t. A linear regression [23] on the data points will yield an estimate of the rate. Similarly, if Y_j ~ Weibull(β, θ), then ln(Ψ_{Y_j}(t)) = −θ^{−β} t^{β} = −b t^{c}. This time we need a non-linear regression, and the data fitting process in this case will yield two estimates: b̂ and ĉ.
We can find the appropriate rate function as follows,

∫_0^t λ(s)ds = b̂ t^{ĉ}  ⟹  λ(s) = b̂ ĉ s^{ĉ−1}    (11)
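The regression step just described can be prototyped in a few lines: the exponential rate follows from a linear fit of ln Ψ(t) against t, while the Weibull parameters (b̂, ĉ) follow from a linear fit of ln(−ln Ψ(t)) against ln t. The survival values below are synthetic placeholders standing in for the empirical survival function of the measured interarrival times.

import numpy as np

# Synthetic empirical survival function Psi(t) at a few frame lags (placeholder data).
t = np.array([100, 300, 600, 1000, 2000, 4000], dtype=float)
psi = np.array([0.93, 0.81, 0.66, 0.52, 0.31, 0.12])

# Exponential model: ln Psi(t) = -lambda * t  -> slope of a linear fit gives -lambda.
lam_hat = -np.polyfit(t, np.log(psi), 1)[0]

# Weibull model: ln(-ln Psi(t)) = ln(b) + c * ln(t)  -> linear fit in the log-log domain.
c_hat, log_b_hat = np.polyfit(np.log(t), np.log(-np.log(psi)), 1)
b_hat = np.exp(log_b_hat)

print(f"exponential fit: lambda_hat = {lam_hat:.3e}")
print(f"Weibull fit    : b_hat = {b_hat:.3e}, c_hat = {c_hat:.3f}")
print("fitted intensity at s=1000 frames:", b_hat * c_hat * 1000 ** (c_hat - 1.0))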
Therefore, instead of using the complicated count models found in [24], one can transform the original non-homogeneous process and use unit-rate homogeneous Poisson processes to approximate the functional data behavior.

D. Normalizations

Since the recorded data in real life is processed in large chunks of finite-size frames and N_λ(t) usually has infinite support, the distribution function of N_λ(t) must be truncated and rescaled in order to increase the prediction accuracy. By truncation, we mean the distribution that arises when all realizations of a random variable which exceed some threshold τ are thrown out. Let F_{N_λ(t)}(x) be the cumulative distribution of the count of renewals N_λ(t). The truncated cumulative distribution function is given by

F^{(τ)}_{N_λ(t)}(x) ≜ F_{N_λ(t)}(x)/F_{N_λ(t)}(τ) for x < τ, and F^{(τ)}_{N_λ(t)}(x) = 1 for x ≥ τ    (12)

and the corresponding count function is referred to as N^{(τ)}_λ(t), where the superscript τ is the truncation threshold. In a clean read waveform, we note that there can be many consecutive processed frames without a bad sync, a bad header or a C1 decoder failure, i.e., Pr{R_i = 0} can be close to unity. This can lead to inaccuracies in estimating the rate of renewals. Thus, assuming a constant throughput during normal operation, although our smallest atomic unit is a frame, we use clean frames for estimating the time between renewals. Let us define the sequence of normalized random variables, denoted by R̄_i, such that Pr{R̄_i = 0} = 0 and, for j ∈ {1, . . . , ω},

Pr{R̄_i = j} = Pr{R_i = j}/(1 − Pr{R_i = 0}) = Pr{R_i = j}/∑_{k=1}^{ω} Pr{R_i = k}    (13)

Upon using a law of large numbers argument, the estimate of E[R̄_i] is given by the counts over coded entities with at least one C1 decoding failure. Statistical analysis tools such as Markov chains might yield better insight into the modeling of this sequence of random variables. Yet, we consider only averages in our model for simplicity.
Fig. 4. An illustration of C1 decoder output used by the C2 decoder as erasures for improved product code decoding performance.
IV. DECODER PERFORMANCE ESTIMATIONS

Before giving the details of our performance estimations, let us consider Fig. 4. The symbols of three C1 codewords are labeled as erasures due to three different reasons. If the data is not correctable, i.e., the number of errors goes beyond the correction capability of the code, all of the symbols in the codeword are labeled as erasures even though some of the symbols might be correct. This is because we do not assume any mechanism in our study that helps distinguish which C1 codeword symbol is in error. Also, header and sync errors cause missing codewords. Due to the format structure, these missing codewords can be identified by the system firmware, and the codeword locations are labeled as erasures. Thus, the C2 decoder works in erasure correction mode using the information provided by the C1 decoder and the Formatter.

A. C2 failure rate estimations

Let us consider a data set consisting of S equal-sized subdata sets where each subdata set contains m frames. Each frame has a specific subdata set allocation, and the allocation policy may vary from one application to another; for example, [10] proposes one for tape systems. Let us divide the frames (including the associated sync information) of any data set into the five disjoint sets A1, A2, A3, B and C based on the sync, header and C1 codeword failure statuses, as outlined in Table I (see Section II, Subsection E). Let us also define for convenience A ≜ ∪_i A_i, so that A, B and C constitute mutually disjoint sets. Based on the collection of frames in each of these sets, we can estimate the parameters of the model interarrival times and counting functions, assuming multiple independent renewal processes have acted on the data during the read/write operations.
Since the decoding operations are independent for each data set, the data sets are indistinguishable in our analysis, and we will use all the frames in all of the subdata sets as if we consider a specific data set that is recorded and read multiple times. This approach leads to the assumption that the error characteristics are independent across subdata sets, which is quite reasonable due to the genuine frame allocation policies used for storage devices with removable media [10].

For the time being, let us start by assuming that the interarrival times are accurately modeled by exponential distributions. We use all the frames present in set A to compute an estimated rate λ_A and set C to compute an estimated rate λ_C using one of the linear regression techniques summarized in Subsection C of the previous section. Since the total number of frames is Sm, the corresponding truncated count functions are denoted as N^{(Sm)}_{λ_A}(t) and N^{(Sm)}_{λ_C}(t), respectively. We use counting arguments to estimate the expected value of R_i. Using a renewal penalty process for set C based on the Poisson assumption, the C1 failure probability is given by p_{C1} = λ_C E[R_1]. As can be noticed, there are two non-overlapping (due to disjoint sets) and independent renewal processes that give rise to erasure information for C2 decoding.

For notational convenience, we define η^{(Sm)}_A(a) ≜ Pr{N^{(Sm)}_{λ_A}(t) = a} for a ≥ 0 if only set A is used for estimating the count function. Conditioning on the number of sync or header errors a, the probability that m of these belong to a specific subdata set is given by the hypergeometric distribution: for 0 ≤ m ≤ min{n_2, a},

Pr{L = m | N^{(Sm)}_{λ_A}(t) = a} = C(n_2, m) C(Sm − n_2, a − m) / C(Sm, a)    (14)

where C(n, k) denotes the binomial coefficient and L is the random variable used to represent the number of header and/or sync errors in a particular subdata set. The unconditional probability is then given by

Pr{L = m} = ∑_{a=0}^{Sm} η^{(Sm)}_A(a) C(n_2, m) C(Sm − n_2, a − m) / C(Sm, a)    (15)

for 0 ≤ m ≤ min{n_2, a}.
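A direct numerical evaluation of (14)-(15) is straightforward with exact binomial coefficients; the sketch below uses math.comb together with a made-up count distribution η in place of the fitted truncated count function, so all numbers are purely illustrative.

from math import comb

def hypergeom_pmf(m, a, n2, total):
    """Eq. (14): probability that m of a header/sync errors hit a specific subdata set."""
    if m > min(n2, a) or a - m > total - n2:
        return 0.0
    return comb(n2, m) * comb(total - n2, a - m) / comb(total, a)

def pr_L(m, eta, n2, total):
    """Eq. (15): unconditional distribution of header/sync errors per subdata set."""
    return sum(p_a * hypergeom_pmf(m, a, n2, total) for a, p_a in eta.items())

S, m_frames, n2 = 64, 96, 96                   # illustrative layout: Sm total frames
total = S * m_frames
eta = {0: 0.97, 1: 0.02, 2: 0.008, 3: 0.002}   # placeholder for eta_A^{(Sm)}(a)

dist = [pr_L(m, eta, n2, total) for m in range(4)]
print("Pr{L = m}, m = 0..3:", [round(p, 6) for p in dist])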
Let us assume that the C2 decoder is set to correct up to υ_2 ≤ d_2 − 1 erasures. The main reason for allowing such a margin is to handle possible C1 miscorrection issues. Note that there remain only υ_2 − m possible erasures due to C1 failures that can still be decoded successfully by the C2 decoder; otherwise, the C2 decoder is assumed to declare a failure. Conditioning on m yields the following C2 failure probability

∑_{j=υ_2−m+1}^{n_2−m} C(n_2−m, j) p_{C1}^j (1 − p_{C1})^{n_2−m−j}   if m ≤ υ_2,   and 1 otherwise.    (16)
Here the C1 failure probabilities are assumed to be independent. As mentioned before, this is a reasonable assumption because most storage systems that use removable media spread the frames across the magnetic medium to achieve media-level decorrelation and help improve the overall performance. Therefore, the unconditional C2 failure rate can be found by averaging over m, as given by equations (18) and (19):

p_{C2} = ∑_{m=0}^{n_2} ∑_{a=0}^{Sm} η^{(Sm)}_A(a) [C(n_2, m) C(Sm−n_2, a−m)/C(Sm, a)] ∑_{j=υ_2−m+1}^{n_2−m} C(n_2−m, j) p_{C1}^j (1 − p_{C1})^{n_2−j−m}    (18)

= ∑_{m=0}^{υ_2} ∑_{a=0}^{Sm} η^{(Sm)}_A(a) [C(n_2, m) C(Sm−n_2, a−m)/C(Sm, a)] ∑_{j=υ_2−m+1}^{n_2−m} C(n_2−m, j) p_{C1}^j (1 − p_{C1})^{n_2−j−m} + 1 − ∑_{m=0}^{υ_2} ∑_{a=0}^{Sm} η^{(Sm)}_A(a) [C(n_2, m) C(Sm−n_2, a−m)/C(Sm, a)]

= 1 − ∑_{m=0}^{υ_2} ∑_{a=0}^{Sm} η^{(Sm)}_A(a) [C(n_2, m) C(Sm−n_2, a−m)/C(Sm, a)] ∑_{j=0}^{υ_2−m} C(n_2−m, j) p_{C1}^j (1 − p_{C1})^{n_2−j−m}    (19)

where, in (18), for m > υ_2 the lower limit of the inner sum is taken as max{0, υ_2 − m + 1}, so that the inner sum equals one in accordance with (16). Using the renewal process argument described earlier, the model parameters of the system can be estimated and plugged into this equation to find the estimated C2 failure probability. These expressions can simply be extended to cover non-homogeneous Poisson process modeling as well, by assuming time dependence and estimating λ_A(s) and λ_C(s) using equation (10). Although the C2 decoder can be set to correct up to υ_2 < d_2 − 1 erasures, this does not eliminate the possibility of C1 decoder errors/miscorrections, and this is taken care of by the subsequent C2 decoder operation. However, we note that if υ_2 is chosen reasonably, this pathological case can be ignored and our expressions become quite accurate.
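Equation (18) (equivalently (19)) can be evaluated directly once η, p_{C1} and the code parameters are fixed. The following sketch does so with exact binomial coefficients, using the illustrative (96, 84) C2 code with υ_2 = 10 from Section V and a placeholder count distribution η; the Sm value and all probabilities are assumptions for illustration only.

from math import comb

def p_c2(eta, p_c1, n2, total, v2):
    """Numerical evaluation of eq. (18) (equivalently (19)) for the C2 failure rate."""
    result = 0.0
    for m in range(n2 + 1):
        # Pr{L = m}: hypergeometric mixture of eq. (15).
        pr_m = sum(p_a * comb(n2, m) * comb(total - n2, a - m) / comb(total, a)
                   for a, p_a in eta.items() if m <= a)
        if pr_m == 0.0:
            continue
        if m > v2:
            p_fail = 1.0                     # more erasures than the C2 margin allows
        else:                                # need more than v2 - m additional C1 failures
            p_fail = sum(comb(n2 - m, j) * p_c1**j * (1 - p_c1)**(n2 - m - j)
                         for j in range(v2 - m + 1, n2 - m + 1))
        result += pr_m * p_fail
    return result

n2, v2 = 96, 10                              # illustrative C2 code length and erasure margin
total = 64 * 96                              # Sm: total number of frames (placeholder)
eta = {0: 0.97, 1: 0.02, 2: 0.008, 3: 0.002} # placeholder for eta_A^{(Sm)}(a)

for p_c1 in (1e-4, 1e-3, 3e-3):
    print(f"p_C1 = {p_c1:.0e}  ->  p_C2 ~ {p_c2(eta, p_c1, n2, total, v2):.3e}")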
B. C2 byte error rate estimations

As mentioned earlier, if C1 decoding fails, it is possible that most of the bytes are still correct. Likewise, if a header CRC check fails or sync information is missed, most of the data bytes might still be correct.
Therefore, the byte error rate is expected to be lower than the estimated C2 failure rate p_{C2}. If there are m byte erasures due to sync or header errors and j byte erasures due to C1 decoder failures, then we have j + m bytes with erasure labels per C2 codeword. For example, a case with m = 2 and j = 1 is shown in Fig. 4. Since each byte is equally likely to be labeled as an erasure, the probability of a randomly chosen byte being labeled as an erasure is given by (j + m)/n_2. Out of the j + m erasures, assume that a fraction γ of them are actual byte errors. Therefore, the C2 byte error rate is given by

p_{C2byte} = 1 − ∑_{m=0}^{υ_2} ∑_{a=0}^{Sm} η^{(Sm)}_A(a) [C(n_2, m) C(Sm−n_2, a−m)/C(Sm, a)] ∑_{j=0}^{υ_2−m} [(j+m)γ/n_2] C(n_2−m, j) p_{C1}^j (1 − p_{C1})^{n_2−j−m}    (20)

The fraction γ is related to the C1 decoding performance. Let us assume independence and that the C1 decoder is set to correct up to υ_1 errors, where υ_1 is a fixed integer satisfying 2υ_1 ≤ d_1 − 1. If the codeword suffers υ_1 or fewer errors, the decoding will be successful. On the other hand, if the number of errors exceeds υ_1 but is less than d_1 − υ_1, then the decoding fails with probability one. Finally, if d_1 − υ_1 or more errors occur, either the decoder will fail or it will find a codeword other than the recorded codeword, i.e., a decoder error or miscorrection occurs. Thus, using an approximation given in [25], we can express the total failure probability as follows

p_{C1} ≈ ∑_{i=υ_1+1}^{d_1−υ_1−1} C(n_1, i) p_{byte}^i (1 − p_{byte})^{n_1−i} + Q ∑_{i=d_1−υ_1}^{n_1} C(n_1, i) p_{byte}^i (1 − p_{byte})^{n_1−i}    (21)

where Q = (q^{−d_1+1} − q^{−n_1}) ∑_{s=0}^{υ_1} C(n_1, s) (q−1)^s, p_{byte} is the byte error rate at the input of the C1 decoder and the C1 code is defined over F_q^{n_1}. Based on this expression, the decoder error probability can be approximated by p_{C1,e} ≈ (1 − Q) ∑_{i=d_1−υ_1}^{n_1} C(n_1, i) p_{byte}^i (1 − p_{byte})^{n_1−i}. From a C2 decoding perspective, if we assume that the byte error probability is the same for both C1 decoder failure and error scenarios, the probability of any byte being in error is given by

γ = ∑_{i=υ_1+1}^{n_1} (Z i / n_1) C(n_1, i) p_{byte}^i (1 − p_{byte})^{n_1−i}    (22)

where the scaling coefficient

Z = 1 / ∑_{i=υ_1+1}^{n_1} C(n_1, i) p_{byte}^i (1 − p_{byte})^{n_1−i}    (23)

is necessary because γ is conditioned on having a C1 decoder failure.
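The C1-side quantities in (21)-(23) are simple binomial tails and can be checked numerically. The sketch below evaluates p_{C1}, the error-related term and γ for the illustrative (240, 230) C1 code over F_256 with υ_1 = 5 (the same parameters used in Section V); the input byte error rate p_byte is an assumed value.

from math import comb

def binom_term(n, i, p):
    return comb(n, i) * p**i * (1 - p)**(n - i)

def c1_statistics(n1, d1, v1, q, p_byte):
    """Evaluate eqs. (21)-(23): C1 failure probability, decoder error term and gamma."""
    Q = (q**(-d1 + 1) - q**(-n1)) * sum(comb(n1, s) * (q - 1)**s for s in range(v1 + 1))
    # "Fails with probability one" region (empty when 2*v1 = d1 - 1, as for (240, 230)).
    mid = sum(binom_term(n1, i, p_byte) for i in range(v1 + 1, d1 - v1))
    tail = sum(binom_term(n1, i, p_byte) for i in range(d1 - v1, n1 + 1))
    p_c1 = mid + Q * tail                      # eq. (21)
    p_c1_err = (1 - Q) * tail                  # decoder error approximation from the text
    uncorrectable = mid + tail                 # Pr{more than v1 byte errors}
    Z = 1.0 / uncorrectable                    # eq. (23)
    gamma = sum(Z * i / n1 * binom_term(n1, i, p_byte)
                for i in range(v1 + 1, n1 + 1))   # eq. (22)
    return p_c1, p_c1_err, gamma

n1, k1, v1, q = 240, 230, 5, 256
d1 = n1 - k1 + 1                               # MDS: d = n - k + 1
p_c1, p_c1_err, gamma = c1_statistics(n1, d1, v1, q, p_byte=1e-3)
print(f"p_C1 ~ {p_c1:.3e}, p_C1,e ~ {p_c1_err:.3e}, gamma ~ {gamma:.3f}")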
TABLE II
DATA MODEL PARAMETER ESTIMATES USING MEDIA 1.

Density Elevations      | 25%     | 32%     | 39%     | 49%     | 55%
b̂                       | 7.7e-4  | 2.5e-4  | 8.7e-4  | 7.37e-3 | 2.7e-2
ĉ                       | 0.855   | 0.92    | 0.89    | 0.778   | 0.71
Fit accuracy            | 99.85%  | 99.98%  | 99.94%  | 99.91%  | 99.93%
Homogeneous Poisson (λ) | 1.85e-4 | 1.93e-4 | 3.47e-4 | 1.09e-3 | 2.8e-3

TABLE III
DATA MODEL PARAMETER ESTIMATES USING MEDIA 2.

Density Elevations      | 25%     | 32%     | 39%     | 49%     | 55%
b̂                       | 4e-4    | 2.5e-4  | 5.9e-4  | 2.3e-2  | 6.5e-2
ĉ                       | 0.92    | 1.02    | 1.01    | 0.74    | 0.75
Fit accuracy            | 99.93%  | 99.95%  | 99.96%  | 99.97%  | 99.92%
Homogeneous Poisson (λ) | 1.81e-4 | 3.13e-4 | 6.52e-4 | 3.2e-3  | 1.44e-2
Finding the right value of γ for correlated error scenarios might be challenging and will vary from one application to another. It may be a good idea to measure γ for correlated error scenarios and then use the derived expressions for the C2 decoder failure and byte error estimations.

V. NUMERICAL RESULTS

In our numerical results, we consider a set of real tape data captured using an LTO drive with two different media types, Media 1 and Media 2, operating at different recording densities. Media 1 can be regarded as a “good” magnetic recording medium relative to Media 2, as our test results will shortly reveal. We assume a set of elevated densities: 25%, 32%, 39%, 49% and 55%. That is to say, we quantify the test densities as the percentage increase over the nominal operating density of an LTO drive.
Fig. 5. Different interarrival times and corresponding Poisson processes at an elevated density of 49% using Media 1. (Main plot: ln Ψ_{Y_j}(t) versus header index t for the data, exponential and Weibull fits; inset: probability versus number of header errors for the data and the non-homogeneous Poisson model.)
Fig. 6. Different interarrival times and corresponding Poisson processes at an elevated density of 49% using Media 2. (Main plot: ln Ψ_{Y_j}(t) versus header index t for the data, exponential and Weibull fits; inset: probability versus number of header errors for the data and the non-homogeneous Poisson model.)
Note that using set C at a given density elevation, we can estimate an average C1 decoding failure probability. However, in order to see the effect of different C1 failure rates on the C2 decoding, we plot our results over a range of C1 failure rates instead. Set A is used to find the model parameters concerning header and sync errors. We assume the interarrival times are either exponentially distributed or Weibull distributed, although our non-homogeneous model description can approximate any form of interarrival distribution. Let us first establish the accuracy of the estimated model parameters described earlier.
Tables II and III show the estimated parameter values of the non-homogeneous Poisson model at different density values for both types of media. These tables also show the single estimated parameter of the homogeneous Poisson process for comparison. As can be seen, up to 39% elevated density, the arrival of drive header and/or sync errors can be well approximated by a standard Poisson renewal process. This can be seen by looking at the estimated non-homogeneous Poisson model parameters, i.e., ĉ ≈ 1 and b̂ ≈ λ. Since the Poisson distribution is the limiting distribution of the binomial, one can conclude that the header and/or sync errors at the output of the read channel can be regarded as random and independent. However, as we go to higher elevations, we notice that the standard Poisson process is no longer able to model the data behavior. As is apparent from the accurate fitting performances, a non-homogeneous Poisson model is a better approximation.

Let us plot ln(Ψ_{Y_j}(t)) and the regression results on top of each other to see pictorially how well the regression performs. Figs. 5 and 6 show ln(Ψ_{Y_j}(t)) for two different interarrival distributions at an elevated density of 49% using Media 1 and Media 2, respectively. This high density elevation is chosen so that some of the inaccuracies in the model are visible in our plots. As can be seen, the Weibull interarrival distribution models the data behavior adequately at this elevation. The corresponding non-homogeneous Poisson count process is also shown at the top right of the same figures. Yet, we observe that there seems to be an inconsistency with the data for this count process. This is probably because another interarrival distribution (perhaps not one of the known distributions) could have modeled the data count process and behavior better than the Weibull distribution. Finally, we remark that for lower elevations exact functional fits are observed, suggesting that the model developed in this paper is quite accurate in predicting product reliability over a large range of density elevations.

Next, we present the C2 decoder failure rate performance at various elevated densities. We assume a range of C1 failure probabilities p_{C1} and a target C2 decoding failure probability of 10^{-17} as our product specification. Let us use a (240, 230) C1 code defined over F_{256}^{240} with υ_1 = 5 and a (96, 84) C2 code defined over F_{256}^{96} with υ_2 = 10, although the C2 code can correct up to 12 erasures. Remember that this is done to minimize the effect of C1 decoding errors and error propagation to the next decoding stage. Figs. 7 and 8 show the decoding performance as a function of p_{C1} at various elevated densities. Although the density at which the drive is operating more or less determines the operational p_{C1}, we provide here the performance for the full range of p_{C1} for completeness. As expected, at higher elevated densities, the decoding performance degrades.
Fig. 7. Probability of C2 decoding failure as a function of pC1 at different elevated densities using Media 1.
Fig. 8. Probability of C2 decoding failure as a function of pC1 at different elevated densities using Media 2.
Another observation is that the header and sync pattern errors can significantly impact the C2 decoder failure probability, especially at higher elevated densities. We note that the measurable C2 decoder performance can only give reliable estimates at around 10^{-4} to 10^{-5} error/failure rates using the amount of collected data. At these C2 decoding failure rates, the recording channels are mostly close to useless, and various types of errors other than the ones we are interested in in this study impact the performance. It might be quite challenging to model and quantify these errors; thus, we have not included them in our figures. However, the amount of collected data
is shown to be sufficient to model the data behavior and hence to accurately estimate the failure rates of interest.

For reliability estimations, let us take a different presentation approach based on an LTO tape drive system used in conjunction with a cartridge of 2.5TB native user capacity. More specifically, we present the average number of cartridges before a customer sees a single decoder failure, based on (19) and the LTO format specifications [10]. This is perhaps a more relevant quantitative performance metric from a business standpoint. For this, we assume a set of operational C1 decoding failure rates under various elevations, as shown in Table IV. This choice is completely arbitrary and can change based on the product operating points for each vendor. Also included in Table IV is the average number of cartridges in an ideal scenario, i.e., independent and random errors with no header or sync failures. As can be seen, higher density elevations lead to a smaller average number of cartridges without a failure for both media types. We observe that Media 2 is seriously affected by density elevations, and its performance must be improved either through novel header and/or sync mechanisms or through more powerful error correction codes (or some combination of both). For both media selections, we can also compare the performance to the ideal performance and observe how dramatic the effect of higher elevations can be on the average number of cartridges before the customer sees a failure.

VI. CONCLUSION

Storage devices with removable media are advertised to provide long life expectancy while at the same time guaranteeing very reliable operation. However, the product specification failure rate of 10^{-17} is impractical to verify using current state-of-the-art computer simulations. Output error rates computed by using traditional methods, based on independence and straightforward probabilistic assumptions, are no longer accurate for the reliability estimation of data storage systems due to the many correlated processes taking place internally. In this study, we presented a useful method that is applicable to storage systems protected by a form of error correction coding. The presented method is able to find accurate expressions for system failure rates by incorporating header and sync errors in addition to correlated codeword failures. We used LTO data to show the effectiveness of a simplified version of the proposed model that makes use of two independent renewal processes.
TABLE IV
RELIABILITY OF A SET OF TAPE DRIVE AND MEDIA COMBINATIONS AT DIFFERENT ELEVATED DENSITIES.

Density Elevations                | 25%    | 32%     | 39%      | 49%     | 55%
log10 pC1                         | -3.5   | -3.2    | -3       | -2.7    | -2.5
Average # of Cartridges (Ideal)   | 1e13   | 5.4e9   | 3.5e7    | 1.9e4   | 1.3e2
pC2byte (Media 1)                 | 1e-23  | 5e-22   | 4e-19    | 3.2e-15 | 1.8e-12
Average # of Cartridges (Media 1) | 3e10   | 6.5e8   | 8.3e5    | 1e2     | 1.8e-1
pC2byte (Media 2)                 | 6e-24  | 5.2e-21 | 2.15e-18 | 3.3e-13 | 1e-8
Average # of Cartridges (Media 2) | 5.5e10 | 6.2e7   | 1.5e5    | 10e-1   | 3e-5
VII. ACKNOWLEDGEMENT We greatly acknowledge Pete Walsh of Samtech Corporation, UK, who provided us the perl script for data capture process from LTO tape drives. We would like to extend our appreciation to Chris Williams and Carl Hoerger of Hawlett Packard, UK, who had valuable comments that improved the accuracy and presentation of this paper. We are also extremely thankful for all the efforts of anonymous reviewers whose contribution was vital and necessary. R EFERENCES [1] R. Galbraith and T. Oenning, “Iterative detection read channel technology in hard disk drives,” Hitachi white paper, Nov. 2008. [2] H.-C. Chang, C. B. Shung, and C.-Y. Lee, “A Reed-Solomon product-code (RS-PC) decoder chip for DVD applications”, IEEE Journal of Solid-State Circuits, vol. 36, pp. 229–238, Feb. 2001. [3] S. Lin and D. J. Costello, Error Control Coding, 2nd Ed. Upper Saddle River, NJ: Pearson Education. [4] G. D. Forney, “Generalized minimum distance decoding, IEEE Trans. Inform. Theory, vol. IT-12, pp. 125–131, Apr. 1966. [5] B. Schroeder and G. A. Gibson. “Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?” In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST), Feb. 2007.
[6] P. Reviriego, J. A. Maestro, and C. Cervantes, “Reliability analysis of memories suffering multiple bit upsets,” IEEE Trans. on Device and Materials Reliability, vol. 7, no. 4, pp. 592–601, Dec. 2007.
[7] P. Reviriego, M. Flanagan and J. A. Maestro, “A (64,45) triple error correction code for memory applications,” IEEE Trans. on Device and Materials Reliability, vol. 12, no. 1, pp. 101–106, Mar. 2012.
[8] J. Elerath and M. Pecht, “Enhanced reliability modeling of RAID storage systems,” in Proceedings of the International Conference on Dependable Systems and Networks (DSN-2007), Edinburgh, United Kingdom, Jun. 2007.
[9] D. G. Feitelson, “Workload Modeling for Computer Systems Performance Evaluation,” online book, ver. 0.36, June 2002.
[10] R. D. Cideciyan, E. S. Eleftheriou, H. Matsuo, T. Mittelholzer, P. J. Seger and K. Tanaka, “Rewrite-efficient ECC/interleaving for multi-track recording on magnetic tape,” U.S. Patent 7,876,516 B2, Jan. 25, 2011.
[11] I. S. Reed and G. Solomon, “Polynomial codes over certain finite fields,” Journal of the Society for Industrial and Applied Mathematics, vol. 8, pp. 300–304, 1960.
[12] G. D. Cohen, M. G. Karpovsky, H. F. Mattson, and J. R. Schatz, “Covering radius – survey and recent results,” IEEE Trans. on Information Theory, vol. IT-31, no. 3, pp. 328–343, 1985.
[13] V. Guruswami and A. Vardy, “Maximum-likelihood decoding of Reed-Solomon codes is NP-hard,” IEEE Trans. on Inform. Theory, vol. 51, no. 7, pp. 2249–2256, Jul. 2005.
[14] Y. Han and W. E. Ryan, “LDPC Coding for Magnetic Storage: Low Floor Decoding Algorithms, System Design, and Performance Analysis,” PhD thesis, University of Arizona, Tucson, AZ, USA, 2008.
[15] R. Pyndiah, “Near-optimum decoding of product codes: Block turbo codes,” IEEE Trans. Commun., vol. 46, pp. 1003–1010, Aug. 1998.
[16] D. A. Patterson, G. Gibson, and R. H. Katz, “A case for Redundant Arrays of Inexpensive Disks,” ACM SIGMOD, June 1988.
[17] J. L. Hafner and KK Rao, “Notes on reliability models for non-MDS erasure codes,” Technical Report RJ10391, IBM, Oct. 2006.
[18] R. D. Cideciyan, H. Matsuo, T. Mittelholzer, K. Ohtani, P. J. Seger and K. Tanaka, “Integrated data and header protection for tape drives,” U.S. Patent 8,479,079 B2, Jul. 2, 2013.
[19] S. S. Arslan, J. Lee and T. Goker, “Error event corrections using List-NPML decoding and error detection codes,” IEEE Trans. Magn., vol. 49, no. 7, pp. 3775–3778, Jul. 2013.
[20] J. Lee, S. S. Arslan and T. Goker, “Bit error detection and correction with error detection code and list-npmld,” U.S. Patent Application 2014/0173381 A1, Jun. 19, 2014.
[21] S. I. Resnick, Adventures in Stochastic Processes, 1st ed., Birkhäuser, 1992.
[22] H. J. Motulsky and A. Christopoulos, Fitting Models to Biological Data Using Linear and Non-linear Regression: A Practical Guide to Curve Fitting, Oxford University Press, Oxford, 2004.
[23] S. Weisberg, Applied Linear Regression, 3rd Ed. New York: Wiley, 2005.
[24] B. McShane, M. Adrian, E. T. Bradlow and P. S. Fader, “Count models based on Weibull interarrival times,” Journal of Business and Economic Statistics, vol. 26, no. 3, pp. 369–378, 2006.
[25] R. J. McEliece and L. Swanson, “On the decoder error probability for Reed-Solomon codes,” IEEE Trans. Inform. Theory, vol. IT-32, pp. 701–703, Sept. 1986.