This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCSVT.2016.2637820, IEEE Transactions on Circuits and Systems for Video Technology 1
Realizing Low-Cost Flash Memory Based Video Caching in Content Delivery Systems Xuebin Zhang, Danni Xiong, Kai Zhao, Chang Wen Chen, and Tong Zhang
Abstract—To implement caching devices in content delivery systems, flash memory is preferable to hard disk drives from the performance perspective. Nevertheless, the higher bit cost of flash memory is one major obstacle to the wide real-life deployment of flash-based video caching. This paper presents a set of design solutions to address this cost issue. First, we present a flash memory error tolerance design strategy customized for video data storage, which enables the use of lower-cost, less-reliable flash memory chips for video storage. The cost challenge can also be addressed by reducing the video storage footprint through on-the-fly transcoding. However, direct transcoding suffers from high implementation cost. We propose two design techniques that can largely reduce the transcoding complexity at minimal storage overhead in flash memory. All the developed design solutions share the common feature of cohesively exploiting the characteristics of video coding and flash memory device physics. Their effectiveness has been well demonstrated through experiments with 20nm MLC NAND flash memory chips and extensive simulations with representative video sequences. Index Terms—Video caching, NAND flash memory, error correction, transcoding, content delivery network.
I. INTRODUCTION
As a widely studied topic [1]–[3], in-network caching is one of the most effective means to improve the quality of Internet and wireless video delivery while reducing the network data traffic workload. In conventional practice, hard disk drives (HDDs) are used by node servers to store the cached content. However, HDD-based content caching fundamentally cannot serve multiple concurrent requests well due to its very limited internal operation parallelism. With their very high internal operation parallelism, solid-state drives (SSDs) can serve multiple concurrent requests much better and hence achieve high-quality video service [4], [5]. In addition, SSD-based content storage/caching can noticeably reduce energy consumption. However, in spite of the steady bit cost reduction of NAND flash memory, commercial SSDs tend to be more expensive than HDDs, which may seriously limit or at least slow down the large-scale deployment of SSD-based in-network caching in the real world. This work aims to develop techniques that can reduce the cost of SSD-based video caching. Essentially, we only have
This work was supported in part by NSF under Grant No. 1405594 and No. 1406154. X. Zhang and T. Zhang are with Electrical, Computer and Systems Engineering Department at Rensselaer Polytechnic Institute (RPI), NY, USA (email:
[email protected];
[email protected]). D. Xiong is with Qualcomm, Beijing, China. K. Zhao is with SanDisk, MA, USA. C. Chen is with Computer Science Department at University at Buffalo, NY, USA.
two options in this regard: (i) reducing the bit cost of the underlying flash memory technology, and (ii) reducing the footprint of video content in SSDs. The bit cost of flash memory depends on the flash memory manufacturing technology (e.g., 20nm or 16nm) and the number of bits stored in each memory cell (e.g., 2 bits/cell MLC flash memory and 3 bits/cell TLC flash memory). However, technology scaling and aggressive use of MLC or TLC storage inevitably come with storage reliability degradation, which is reflected in an increased raw bit error rate and a reduced flash memory program/erase (P/E) cycling endurance. In the context of the second option above, runtime video transcoding [6] is clearly a very natural choice. Content publishers can apply transcoding to adapt the video content on the fly to various terminal devices, formats, spatial resolutions, temporal resolutions, and bitrates. Without transcoding, video content publishers have to prepare many different versions of the same video content. The use of transcoding in content delivery systems has been well studied [7]–[10], and transcoding will become even more beneficial as mobile video delivery becomes increasingly pervasive. However, as the industry keeps improving the video compression ratio, video encoding computational complexity tends to increase substantially. This results in a noticeable silicon cost and energy consumption penalty for supporting runtime transcoding on in-network caching devices. To reduce the transcoding computational complexity, one plausible solution is to complement the source video sequence with a control stream that can facilitate the transcoding process [11]. The most natural way of obtaining control streams is to simply remove the residual information from the target video sequence. Nevertheless, the control streams still require a considerable amount of storage space.
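As described above, a control stream is essentially the target bitstream with the residual information removed. The following Python sketch illustrates the idea on a toy, dictionary-based representation of macroblock syntax elements; the field names are hypothetical stand-ins, and no real H.264 bitstream parsing is involved.

```python
# Toy illustration of control-stream construction: keep the prediction-side
# syntax elements of each macroblock (mode, reference index, motion vector
# difference) and drop the residual coefficients. All field names here are
# hypothetical stand-ins, not actual H.264 syntax element identifiers.

def build_control_stream(target_stream):
    """Strip residual data from an already-encoded target sequence."""
    control = []
    for mb in target_stream:  # one dict of syntax elements per macroblock
        kept = {key: value for key, value in mb.items() if key != "residual"}
        control.append(kept)
    return control

target = [
    {"mb_type": "P_16x16", "ref_idx": 0, "mvd": (3, -1), "residual": [7, 0, -2]},
    {"mb_type": "P_8x8",   "ref_idx": 1, "mvd": (0, 2),  "residual": [1, 5, 0]},
]
ctrl = build_control_stream(target)
assert all("residual" not in mb for mb in ctrl)
assert ctrl[0]["mvd"] == (3, -1)  # motion information is preserved
```

Because the expensive motion-estimation results are preserved, a transcoder given this stream can skip re-deriving them and only recompute the residuals.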
Therefore, in order to reduce the cost of SSD-based video caching, we must effectively embrace the poor storage reliability of high-density low-cost flash memory and support control stream assisted transcoding at minimal storage overhead. To achieve these objectives, this paper presents a set of design techniques that share the common feature of cohesively exploiting the characteristics of video coding and flash memory device physics. As elaborated later in Section II-A, flash memory exhibits significant variation in raw storage reliability among different memory cells. In conventional practice, flash-based data storage devices aim at providing general-purpose storage service that ensures direct access to any individual 4kB sector. Nevertheless, such fine-grain random data accessibility is not necessary for video data storage, where data access is performed at least at the granularity of a GOP (group of pictures). Throughout this paper, we assume the use of closed-GOPs, each of which begins with an
1051-8215 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Fig. 1: Illustration of (a) data storage within each MLC memory cell, and (b) multi-page data organization of MLC memory cells on each wordline.
instantaneous decoding refresh (IDR) picture. Hence each closed-GOP can be decoded independently, because pictures will not use references prior to IDR pictures. We propose a technique, called adaptive intra-GOP data error correction, that naturally leverages the GOP-based video data accessibility to embrace the flash memory reliability variation and hence achieve much stronger error correction strength. This directly enables more aggressive flash memory technology scaling and the use of multi-bit per cell storage, leading to a lower bit cost for video data storage. To support control stream assisted transcoding at minimal storage overhead, we develop two design techniques: (i) approximate control stream construction, and (ii) reliability-aware control stream placement. Both techniques leverage the fact that, in contrast to normal video bitstreams, control streams are consumed within the content delivery systems and never reach end users. As a result, the control streams do not have to strictly follow the standard video coding syntax and/or demand stringent storage reliability. The first design technique exploits this feature to gracefully adjust the data storage vs. transcoding computation trade-off through two key ideas: (1) modifying the definition or context of certain types of syntax elements (including motion vector difference and reference index) in order to reduce the control stream size at a small increase of encoding computational complexity, and (2) modifying the control stream syntax element ordering to improve the entropy coding efficiency. The key idea of the second technique is to intentionally place control stream data in the least reliable flash memory portions, which cannot reliably store a video bitstream even with the strongest error correction and hence cannot be used for normal video bitstream storage anyway.
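The reliability-aware placement idea can be sketched as a simple page-classification policy. In the sketch below the BER thresholds are illustrative assumptions; in a real controller they would come from the deployed ECC strength and measured per-page error statistics.

```python
# A minimal placement-policy sketch under assumed numbers: a page whose raw
# BER exceeds what the strongest video-bitstream ECC can correct (threshold
# assumed here as 4e-3) is still usable for control streams, which tolerate
# a much higher decoding failure rate. Thresholds are illustrative only.

VIDEO_BER_LIMIT = 4e-3     # strongest video ECC can correct up to this raw BER
CONTROL_BER_LIMIT = 1e-2   # control streams tolerate occasional ECC failures

def classify_page(measured_ber):
    if measured_ber <= VIDEO_BER_LIMIT:
        return "video"      # reliable enough for normal bitstream storage
    elif measured_ber <= CONTROL_BER_LIMIT:
        return "control"    # too weak for video, but fine for control streams
    else:
        return "retired"    # beyond useful reliability

assert classify_page(1e-3) == "video"
assert classify_page(6e-3) == "control"
assert classify_page(5e-2) == "retired"
```

The middle tier is exactly the flash capacity that conventional practice would have to retire; placing control streams there recovers it at no cost to video storage.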
For normal video bitstream storage, the error correction code (ECC) of the storage device should be strong enough to ensure an extremely low decoding failure probability (e.g., 10^-15 and below), which can be practically considered as error free. In contrast, a control stream does not necessarily demand such error-free storage, because it merely aims to facilitate the realization of transcoding. In particular, if a certain portion of a control stream is lost due to storage failure, we simply perform the full transcoding for the affected video content without using the lost control stream. Thus, an ECC decoding failure on a control stream simply results in an increase of video encoding computational complexity without impacting the quality of the transcoded video. Therefore, a much higher storage device ECC decoding failure probability (e.g., 10^-3 or even 10^-2) is sufficient for retaining most of the gain of control stream assisted video transcoding. Hence,
such reliability-aware control stream placement can minimize the penalty on the effective storage capacity for video bitstream storage in flash memory. We carried out extensive experiments and simulations to evaluate the effectiveness of the developed design techniques. First, we performed measurements on 20nm MLC NAND flash memory chips to quantitatively reveal and demonstrate the significant reliability variation in flash memory. Based upon the measurement results, we show that the proposed adaptive intra-GOP data error correction scheme can tolerate an almost 2× higher raw flash memory bit error rate (BER) compared with conventional design practice. This higher error tolerance means that the proposed ECC scheme increases the flash memory P/E cycling endurance from 3,250 to 4,150, a lifetime increase of over 27% for the 20nm NAND flash memory used in this study. The proposed approximate control stream design solution can reduce the storage capacity overhead by over 62% at the cost of around 10% higher transcoding computational complexity. The proposed reliability-aware control stream placement can enable the use of flash memory portions cycled more than 2× longer than those used for storing normal video bitstreams, with almost negligible impact on the video transcoding computational complexity reduction. In addition, beyond H.264/AVC, the proposed techniques are also applicable to other video compression standards such as MPEG and HEVC. The results clearly demonstrate the promise of the cross-layer exploration strategy underlying this work.
II. BACKGROUND
A. Basics of NAND Flash Memory
Each flash memory cell is a floating-gate transistor whose threshold voltage can be configured (or programmed) by injecting a certain amount of electrons into the floating gate layer.
Non-volatile data storage in NAND flash memory is realized by programming the threshold voltage of each memory cell into two or more non-overlapping voltage windows. Before a memory cell can be programmed, it must be erased (i.e., all the charges are removed from the floating gate layer, which sets the threshold voltage to the lowest voltage window). Memory cells on each NAND flash memory die are organized in a plane → block → wordline hierarchy: each memory die contains a few independent planes, each plane consists of a large number (i.e., thousands) of blocks, each block contains a number (e.g., 64 or 128) of wordlines, and each wordline drives a very large number (i.e., tens of thousands) of memory cells. Since
memory cells driven by the same wordline can be programmed or read simultaneously, NAND flash memory handles data programming and reads in units of pages with a typical size of 4kB or 8kB. For high-density multi-bit per cell NAND flash memory, different bits within each memory cell belong to different pages. This is illustrated in Fig. 1 for MLC NAND flash memory, which stores 2 bits in each memory cell. The two bits within each memory cell belong to the lower and upper pages, respectively. The bits stored in a flash memory cell are determined by sensing the memory cell's threshold voltage. Ideally, the threshold voltage distributions of adjacent storage states should be sufficiently far away from each other to ensure high raw storage reliability. However, in practice, due to various effects such as background noise and interference, threshold voltage distributions may be very close to each other or even overlap, leading to non-negligible raw bit error rates. In addition, the upper page bit is more susceptible to errors than the lower page bit [12]–[14], which results in a remarkable error variance between upper pages and lower pages.
B. SSD-based Video Caching in Content Delivery Networks
Caching is widely adopted in video content delivery systems to provide faster service and reduce the overall network traffic [2], [15], [16]. Content delivery systems distribute caching nodes geographically and replicate the content (including whole videos or partial video clips [16], [17]) at the nodes that are closer to the clients. Clearly, content requests from clients can be served promptly if the requested content is available in a neighbouring caching node. Meanwhile, the network bandwidth pressure and the workload on the origin server are alleviated as well [18], [19]. Traditional caching nodes use HDDs as the storage medium, but scheduling and latency become a problem, especially when multiple requests arrive at the same time.
HDDs have to frequently reposition the read/write head to serve these concurrent requests, leading to huge response latency. In comparison, SSDs can achieve much better I/O performance and meanwhile lower energy consumption [4], [5]. Nevertheless, SSD-based caching is not always a perfect solution [20], [21] due to SSDs' limited lifetime and high cost. SSDs wear out after a certain number of program/erase (P/E) cycles, and this problem is even more severe in video caching storage because of the large data traffic in video delivery networks.
C. Flash Memory Storage Reliability and Error Correction
The operational noise margin of NAND flash memory cells gradually degrades with P/E cycling [12], [22]–[24]. This leads to an important reliability metric of NAND flash memory: the P/E cycling endurance, which directly determines the usable lifetime of flash memory. Even under the same P/E cycling, the reliability of flash memory pages exhibits a large degree of variation, mainly because of manufacturing process variation and the use of multi-bit per cell storage. Process variation results in non-uniformity of flash memory device characteristics and hence storage reliability variation among different
memory blocks. More importantly, as explained earlier in Section II-A, different types of pages (e.g., lower and upper pages in MLC NAND flash memory) have different amounts of noise margin, leading to different storage reliability. This will be quantitatively demonstrated using measurement results from 20nm MLC flash memory chips later in Section IV. Solid-state storage devices must use ECC to ensure overall data storage integrity. In NAND flash memory chips, all the pages have exactly the same storage capacity; e.g., in the 20nm MLC chips used in this study, each page is 9kB, aiming to store 8kB of user data and 1kB of ECC redundancy. Operating as I/O devices in computing systems where the I/O sector size is typically either 512B or 4kB, general-purpose solid-state storage devices must provide individual accessibility of each sector and minimize the sector access latency. Hence, each sector is protected by ECC independently from other sectors, and each sector and its associated ECC coding redundancy are always stored in the same flash memory page. As a result, all the flash memory pages store the same number of sectors, and all the sectors are protected by the same ECC with a fixed code rate (e.g., for a 9kB flash memory page size and 4kB sector size, each flash memory page stores two sectors and the ECC code rate is 8/9). Although such sector-based error correction can ensure the random data sector accessibility demanded by general-purpose data storage, it does not match the unequal storage reliability (hence unequal error characteristics) among different flash memory pages.
This mismatch results in non-optimal utilization of coding redundancy, leading to one fundamental drawback: short storage device lifetime. With equal error protection among all the pages, the worst-case flash memory pages essentially determine the storage device P/E cycling endurance and hence its lifetime, even though the majority of flash memory pages can sustain higher cycling endurance.
D. Control Stream Assisted Video Transcoding
Another effective approach to extend the lifetime of SSD caching is to reduce the amount of data written into it. Therefore, instead of storing tens of versions (with multiple resolutions, bitrates, formats, etc.) of the same video content, we can store only a few popular versions and employ transcoding to generate other versions on the fly when needed. Transcoding [6], [25] aims to convert one existing video sequence to another version with a lower spatial and/or temporal resolution or even a different compression format. Transcoding often employs a first-decode-then-encode procedure to minimize the size (or bitrate) of the output target video sequence, particularly for mobile video data delivery. This, however, results in a huge computational complexity due to the computation-intensive nature of video encoding. It is well known that video encoders spend different amounts of computation on obtaining different parts of the video stream. For example, motion estimation (i.e., the computations for determining macroblock partitioning modes, reference frames, and motion vectors) accounts for up to 95% of the total encoding time [26], while motion estimation results occupy less than 38% of the entire video stream. Intuitively, if we can get access to
these motion estimation results directly, nearly 95% of the encoding computation will be saved. An assisted transcoding design strategy was presented in [11] to reduce the video encoding computational complexity at the cost of data storage and/or transmission overhead. The key is to complement the source video sequence with a “control stream” that can facilitate the video encoding process. The control stream is obtained by removing the residual information from the target video sequence. Given the control stream, the encoding process in transcoding becomes much more computation-efficient. Of course, we need to pay extra storage/transmission overhead for the control stream. For example, our studies on representative video sequences with H.264/AVC show that full control streams account for on average 38.7% of the entire target video sequences. Meanwhile, as content delivery network infrastructure becomes increasingly heterogeneous, especially with the emergence of heterogeneous wireless networking [27], [28], different nodes (e.g., edge proxy servers) may have different resources in terms of computation and storage. Therefore, it is highly desirable for control stream assisted transcoding to reduce the control stream size and to gracefully explore the data storage vs. encoding computation trade-off space.
III. PROPOSED DESIGN SOLUTIONS
As pointed out in Section I, although flash-based video caching can noticeably outperform its HDD-based counterpart, flash memory is significantly more expensive than HDDs. Hence, we must minimize the flash memory bit cost for video storage in order to enable the wide real-life deployment of flash-based video caching in content delivery systems.
This section presents a set of design techniques that reduce the bit cost from two aspects: (i) enabling the use of low-cost flash memory by improving video data storage error tolerance, and (ii) enabling the more practical use of video transcoding to reduce the video data footprint in flash memory.
A. Improving Memory Error Tolerance for Video Storage
1) Motivation and Rationale: In modern video compression standards such as H.264 [29] and the latest HEVC (high efficiency video coding) [30], a video bitstream contains successive GOPs, and each GOP is a group of successive pictures in the video sequence. Each GOP always starts with an intra-coded picture (consisting of I-slices) followed by predictive-coded pictures (consisting of P-slices or B-slices). To support functions such as replaying, skipping, or rewinding, instantaneous decoding refresh (IDR) pictures are inserted into the video stream periodically. After an IDR picture is decoded, all the following pictures can be decoded without any pictures prior to the IDR picture. A GOP starting with an IDR picture is called a closed-GOP, which is commonly used in practice [31]. Thus, random video access and editing must be done in units of closed-GOPs. Therefore, the storage of a video bitstream can be naturally viewed as object-based storage where each object is a closed-GOP starting with an IDR picture. The size of one closed-GOP is typically at least a few tens of kB and can be up to hundreds of kB or even a few MB, depending upon various runtime factors
such as the picture resolution and the number of frames per GOP. However, the semantic information of the GOP structure is lost at the storage device I/O interface in current practice [32], because the entire video bitstream is sent to the storage device as successive 512B or 4kB I/O sectors, and the storage device protects and stores each sector individually and ensures sector-based random accessibility. Although such conventional practice can use commodity general-purpose storage devices for video data storage without any change to the I/O interface, it is subject to the non-optimal utilization of ECC coding redundancy, which further leads to the drawback of short storage device lifetime as pointed out in Section II-C. We note that the non-optimal utilization of ECC coding redundancy is essentially caused by the fact that general-purpose storage devices must ensure random accessibility of any sector with minimal access latency. Nevertheless, such sector-based random accessibility is completely unnecessary for video data storage, which only demands GOP-based random accessibility. The essential objective of this section is to study the potential of improving ECC coding redundancy utilization by supporting only coarse-grained GOP-based random accessibility for video data storage. Improved coding redundancy utilization directly leads to reduced storage capacity waste and enhanced storage device cycling endurance. Intuitively, we want to match the video data error correction strength to the unequal reliability of the underlying flash memory pages. Because each GOP is stored in a large number of flash memory pages and we do not need to support intra-GOP random data access, we can dynamically adjust the amount of coding redundancy used to protect the video data in adaptation to the storage reliability of the underlying flash memory pages.
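As a concrete illustration of this per-page adaptation, the following Python sketch picks, for each page type, the smallest error correction capability t that meets a target codeword decoding failure rate given that page's estimated raw BER. The BER values and codeword length are illustrative assumptions, not measured 20nm data, and the target is loosened to 1e-12 so plain double-precision arithmetic stays well away from cancellation (the paper's 10^-15 target would need higher-precision arithmetic).

```python
# Sketch of intra-GOP adaptive redundancy: because a GOP spans many flash
# pages and needs no intra-GOP random access, each page can be given just
# enough correction strength t for its own estimated raw BER. BER values,
# codeword length, and target failure rate are illustrative assumptions.
from math import comb

def required_t(n, ber, target=1e-12, t_max=100):
    """Smallest t such that P(more than t errors among n bits) <= target."""
    tail = 1.0  # P(> t errors), peeled down term by term
    for t in range(t_max + 1):
        tail -= comb(n, t) * ber**t * (1 - ber)**(n - t)
        if tail <= target:
            return t
    raise ValueError("t_max too small for this BER")

n = 8192                       # ~1kB codeword, in bits (illustrative)
t_lower = required_t(n, 1e-4)  # more reliable page type (e.g., lower page)
t_upper = required_t(n, 1e-3)  # less reliable page type (e.g., upper page)
assert 0 < t_lower < t_upper   # weaker pages receive more redundancy
```

With a BCH-style cost of r = m · t redundancy bits per codeword, the difference between t_lower and t_upper is exactly the redundancy that sector-based fixed-rate protection wastes on the more reliable pages.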
In summary, the key is to realize adaptive data error correction by migrating from conventional fine-grained sector-based random data accessibility to coarse-grained GOP-based random data accessibility.
2) Error Correction Codes in Current Practice: ECC is used in flash memory to ensure data storage integrity by adding extra redundancy to the information bits. BCH (Bose, Chaudhuri, and Hocquenghem) codes [33] are widely used as ECC in NAND flash memory. Let n and k denote the ECC codeword length and information length, and let p denote the raw bit error rate. We can estimate the decoding failure probability of a t-error-correcting ECC as

P_dec = Σ_{s=t+1}^{n} C(n, s) · p^s · (1 − p)^{n−s},  (1)

where C(n, s) denotes the binomial coefficient. For a BCH code constructed over GF(2^m), the codeword length n is less than 2^m and the coding redundancy is r = m · t. Therefore, given the total available redundancy space in each flash memory page, we can calculate the maximum allowable raw bit error rate p subject to the constraint P_dec < 10^-15. Accordingly, we can estimate the achievable P/E cycling endurance. All the parameters used throughout the paper are listed in Table I.
3) A Theoretical Study: To examine the potential of the proposed migration, we carried out a theoretical study under two
TABLE I: List of symbols and parameters.

Symbol    Description
Pdec      storage codeword decoding failure rate (e.g., below 10^-15)
n         ECC codeword length
k         ECC information length
r         ECC redundancy length
p         raw bit error rate of flash memory pages
t         error correction capability of ECC code
rl, ru    required redundancy length for flash memory lower and upper pages
tl, tu    error correction capability for flash memory lower and upper pages
pl, pu    raw bit error rate of flash memory lower and upper pages
rint      required redundancy length in the ECC interleaving case (Fig. 5)
ideal assumptions: (i) perfect knowledge of flash memory page reliability, and (ii) infinite ECC codeword length. We model flash memory data storage as a binary channel and estimate the distribution of channel parameters (i.e., bit error rate) based upon error statistics measured from 20nm MLC NAND flash memory chips under different P/E cycling. Given each set of channel parameters, we can estimate the required coding redundancy from the information-theoretical perspective [34], [35]. Accordingly, we can estimate the overall information-theoretical coding redundancy under different P/E cycling, where the channel entropy corresponds to the minimum redundancy for tolerating the bit error rate. As shown in Fig. 2, under the code rate of 8/9 (i.e., a redundancy of 11.1%), the theoretical limit of P/E cycling endurance is around 12,000. In comparison, current practice employs the same rate-8/9 ECC to protect each individual flash memory page. Assuming the use of a length-4kB, rate-8/9 BCH code and a target decoding failure rate of 10^-15, we estimate the achievable P/E cycling endurance to be about 3,250. The results clearly show the large gap between what is achievable under current practice and the information-theoretical limit.

Fig. 2: Estimated information-theoretical coding redundancy of 20nm MLC flash memory under different P/E cycling.

4) Adaptive Intra-GOP Data Error Correction: Motivated by the above information-theoretical study, we develop a simple design solution to reduce the gap from the theoretical limit. As pointed out in Section II-C, in current design practice geared to sector-based data accessibility, each 4kB sector and its associated ECC coding redundancy must reside in the same flash memory physical page. As a result, given the fixed physical page size (e.g., 9kB in the 20nm MLC NAND flash memory chips used in this study), all the sector data are protected by the same ECC throughout the entire flash memory lifetime. In contrast, for video data storage that only demands coarse-grained GOP-based data accessibility, we do not have to fit each complete ECC codeword into a single physical page. This fundamentally opens a large design space for improving flash memory cycling endurance, where the key is to adjust data error correction in adaptation to the flash memory reliability information, as illustrated in Fig. 3.

Fig. 3: Illustration of adaptive intra-GOP data error correction.

Clearly, the design of adaptive intra-GOP data error correction strongly depends on the granularity of the estimated flash memory reliability information. As discussed in Section II-C and experimentally demonstrated later in Section IV-A, flash memory reliability exhibits significant type-of-page variation. The flash memory reliability is quantified in terms of raw BER per physical page, and throughout the paper we use MLC flash memory as a test vehicle to present and evaluate the proposed design schemes. Given the P/E cycling, we have two worst-case per-page BER values: pl for lower pages and pu for upper pages. To realize adaptive ECC, the most straightforward option is to use different ECC code rates (and hence error correction strengths) for different types of pages. To simplify the practical implementation, we require that all the ECC codewords contain the same amount of information bits (e.g., 1kB or 2kB). We denote each ECC with three parameters (k, r, t), where k and r are the numbers of information bits and redundant bits in each codeword, and t is the number of bit errors that can be corrected. Given the target ECC decoding failure rate (e.g., 10^-15 and below), let (k, rl, tl) and (k, ru, tu) denote the ECC that can tolerate the worst-case BERs pl and pu, respectively. Therefore, we can protect the data being stored on lower and upper pages using these two ECC, as illustrated in Fig. 4. The two worst-case BERs pl and pu gradually increase with P/E cycling, and the flash memory controller can obtain accurate knowledge of the run-time bit error rate statistics by comparing the data before and after ECC decoding. Thus we can determine the required redundancy adaptively and correspondingly increase the parameters r and t throughout
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCSVT.2016.2637820, IEEE Transactions on Circuits and Systems for Video Technology 6
the flash memory lifetime.

In Eq. (2) below, $\sum_{\forall i+j=s} P_l(i)\,P_u(j)$ denotes the probability that $s$ errors in total occur in a codeword, where $P_l(i)$ denotes the probability of $i$ errors in the lower-page part and $P_u(j)$ the probability of $j$ errors in the upper-page part. With $n = (k + r_{int})/2$ denoting the number of bits a codeword places on each page, we have:

$$P_l(i) = \binom{n}{i}\, p_l^i\,(1 - p_l)^{n-i}, \qquad P_u(j) = \binom{n}{j}\, p_u^j\,(1 - p_u)^{n-j} \tag{3}$$
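As a concrete illustration of how Eqs. (2) and (3) determine the required error correction strength, the sketch below numerically finds the smallest t_int (and the corresponding BCH redundancy r_int) meeting a target decoding failure rate. This is our own minimal sketch, not the authors' implementation; the function names and the log-domain binomial evaluation are ours.

```python
import math

def binom_pmf(n, p):
    # Binomial(n, p) pmf of Eq. (3), computed in the log domain to avoid underflow.
    logp, log1p = math.log(p), math.log1p(-p)
    return [math.exp(math.lgamma(n + 1) - math.lgamma(i + 1)
                     - math.lgamma(n - i + 1) + i * logp + (n - i) * log1p)
            for i in range(n + 1)]

def decode_failure_rate(n, p_l, p_u, t):
    # P_dec of Eq. (2): probability that more than t errors fall in a codeword
    # whose two halves (n bits each) see raw BERs p_l and p_u.
    pl, pu = binom_pmf(n, p_l), binom_pmf(n, p_u)
    suf = [0.0] * (n + 2)            # suf[j] = P(upper-half errors >= j)
    for j in range(n, -1, -1):
        suf[j] = suf[j + 1] + pu[j]
    return sum(pl[i] * (1.0 if t - i < 0 else suf[min(t - i + 1, n + 1)])
               for i in range(n + 1))

def required_strength(k, p_l, p_u, p_dec=1e-15):
    # Smallest t_int, with BCH redundancy r_int = ceil(log2(k + r_int)) * t_int,
    # such that the interleaved codeword meets the target failure rate.
    for t in range(1, 300):
        r = math.ceil(math.log2(k)) * t
        for _ in range(8):           # settle the implicit r_int fixed point
            r = math.ceil(math.log2(k + r)) * t
        if decode_failure_rate((k + r) // 2, p_l, p_u, t) <= p_dec:
            return t, r
    raise ValueError("target failure rate unreachable")
```

For example, with k = 1024 information bits and worst-case BERs of 0.1% (lower) and 0.3% (upper), the search returns the t_int that drives the tail probability below 10^-15; as expected, a lower upper-page BER never increases the required strength.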
Fig. 4: ECC scheme I: Differentiated error correction for the two different types of pages (using multiple ECC code rates).

Another option for embracing the flash memory reliability variation is to use ECC codeword interleaving, as illustrated in Fig. 5. Instead of using two different ECC as above, we use only a single ECC, and each ECC codeword is evenly distributed over one lower and one upper page. In other words, each ECC codeword is interleaved across the two pages of the same wordline. The basic concepts of these two ECC schemes can be briefly described and compared as follows:
• In the first design strategy, each complete ECC codeword resides in either a lower page or an upper page, as in current practice. To accommodate the different raw bit error statistics of lower and upper pages, the ECC codewords in lower and upper pages have different code rates and hence different error correction strengths.
• In the second design strategy, each complete ECC codeword is distributed over a lower and an upper page. As a result, all ECC codewords experience similar raw bit error statistics, roughly the average of those of the lower and upper pages. Hence, we only need to support a single ECC code rate.
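As a toy illustration of the second strategy (our own byte-level sketch; a real flash controller would interleave at the bit or sub-page level in hardware), splitting each codeword evenly across the lower and upper pages of a wordline might look like:

```python
def interleave(codewords):
    # Scheme II: the first half of every codeword goes to the lower page,
    # the second half to the upper page of the same wordline.
    half = len(codewords[0]) // 2
    lower = b"".join(cw[:half] for cw in codewords)
    upper = b"".join(cw[half:] for cw in codewords)
    return lower, upper

def deinterleave(lower, upper, n_codewords):
    # Reassemble full codewords from the two page buffers before ECC decoding.
    half = len(lower) // n_codewords
    return [lower[i * half:(i + 1) * half] + upper[i * half:(i + 1) * half]
            for i in range(n_codewords)]
```

Each codeword thus sees half its bits at the lower-page BER and half at the upper-page BER, which is why a single code rate suffices; the price is the page-sized buffering noted below.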
Fig. 5: ECC scheme II: ECC codeword interleaving to embrace type-of-page reliability variation (using a single ECC code rate).

Given all the parameters (i.e., p_l, p_u and k), we can estimate the error correction strength t_int and coding redundancy r_int required to realize codeword interleaving. Similar to the approach in Section III-A2, assume we use a binary BCH code, let n = (k + r_int)/2 (i.e., the codeword length is 2n), and let P_dec denote the target decoding failure rate (e.g., 10^-15 and below). We can then calculate t_int and n (hence r_int) using the following formulations:

$$r_{int} = \lceil \log_2 (k + r_{int}) \rceil \cdot t_{int}, \qquad P_{dec} = \sum_{s=t_{int}+1}^{k+r_{int}} \; \sum_{\forall i+j=s} P_l(i)\, P_u(j) \tag{2}$$

where p_l and p_u denote the raw bit error rates of the flash memory lower and upper pages, respectively.

As demonstrated later in Section IV, the two adaptive ECC design strategies have almost the same coding efficiency (i.e., they incur the same amount of total coding redundancy under the same flash memory raw reliability). Nevertheless, they involve different trade-offs in ECC codec implementation: the first strategy demands that two different code rates be supported at the same time, which can complicate the ECC codec design; on the other hand, since flash memory program/read operations are carried out in units of pages, the second strategy demands a buffer to interleave/de-interleave ECC codewords across two physical pages, leading to on-chip memory overhead.

B. Low-Cost Control Stream Assisted Transcoding

In addition to improving flash memory error correction strength, reducing the video footprint in flash memory is another effective means of lowering the effective bit cost of flash-based video caching in content delivery systems. Nevertheless, run-time video transcoding can demand significant computational complexity and hence implementation cost, especially for the latest video compression standards such as HEVC. As pointed out earlier in Section II-D, a control stream assisted transcoding strategy was presented in [11] to trade modest storage overhead for a large reduction in computational complexity. However, as modern video compression standards continue to improve motion compensation efficiency and hence reduce the volume of residual information, the size of the control stream becomes relatively more significant.
In the following, we present two orthogonal techniques for reducing the control stream storage overhead: the first aims to reduce the control stream size, and the second aims to reduce the impact of control stream storage on the effective flash memory capacity available for video caching.

1) Approximate Control Stream Construction: We propose two schemes that largely reduce the control stream size at a small transcoding computational complexity penalty. The first scheme relaxes the completeness of syntax elements in control streams. A typical video stream consists of various syntax elements such as the skip flag, mode, motion vector difference (MVD), reference index (Ref), etc. Different types of syntax elements differ not only in size but also in importance from the perspective of computational complexity reduction. Aiming for a more graceful storage vs. encoding computation trade-off, we propose to appropriately relax the completeness of the syntax elements MVD and Ref. First, we note that, in normal video compression, each motion
vector difference and reference index element is represented with a variable-length binary string to better match the unequal probabilities of the possible values. Therefore, if we intentionally make each such syntax element incomplete and only indicate whether the most likely scenario occurs or not, we can reduce the control stream size while largely maintaining the reduction of encoding computational complexity. In particular, we propose to relax the completeness of the motion vector difference and reference index as described below:
• MVD-proximate control stream: We replace each motion vector difference (consisting of the differences in both the X and Y dimensions) with a 1-bit motion vector flag. As illustrated in Fig. 6, we define a motion vector mini-search window with a very small size (e.g., 3×3 or 5×5) around the predicted motion vector. If the best-matching motion vector falls into the mini-search window around the search center, we set the 1-bit motion vector flag to '1'; otherwise we set it to '0'. During control stream assisted transcoding, if the 1-bit motion vector flag is '1', the video encoder searches for the best-matching motion vector within the mini-search window around the search center; otherwise it searches within the much larger normal search window.

Fig. 6: Illustration of the use of the mini-search window in the proposed MVD-proximate control stream.

• Ref-proximate control stream: We replace each reference index element with a 1-bit flag as well. If the best-matching reference frame is the most likely reference frame, which is typically the first reference frame in the reference picture list, we set the flag to '1'; otherwise we set it to '0'. During control stream assisted transcoding, if the 1-bit reference frame flag is '1', the reference index is immediately available; otherwise the encoder needs to examine all possible reference frames to search for the best match.

The second scheme reduces the control stream size by appropriately re-ordering the syntax elements in control streams. The last step of video encoding is lossless compression of all syntax elements using entropy coding, e.g., context-adaptive variable-length coding (CAVLC) or context-based adaptive binary arithmetic coding (CABAC). As illustrated in Fig. 7, the input bitstream to the entropy encoder is organized in units of macroblocks (in H.264/AVC), i.e., all syntax elements associated with the same macroblock are consecutive in the bitstream. Regardless of the specific entropy coding algorithm, the compression efficiency increases as the correlation among adjacent data becomes stronger. As mentioned above, control streams are generated and consumed within the video storage and delivery infrastructure, hence we do not have to strictly follow the standard syntax. Intuitively, the same type of syntax element in adjacent macroblocks tends to have a stronger correlation. Therefore, we propose to re-order the input bitstream to the entropy encoder, as shown in Fig. 7, so that the same type of syntax element in each slice is grouped together. This increases the bitstream data correlation and hence improves the entropy coding efficiency, leading to a reduced control stream size. A few bits are added at the start of each group to distinguish the groups from each other. During decoding, we rely on the dependencies among syntax elements to reconstruct the original stream.

Fig. 7: Illustration of syntax element based re-ordering.
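The two ideas above can be sketched in a few lines. This is our own toy illustration (function names, tuple formats, and the group-recovery shortcut are ours): the first function derives the 1-bit MVD-proximate flag, and the second groups same-type syntax elements within a slice before entropy coding.

```python
from collections import defaultdict

def mvd_proximate_flag(best_mv, predicted_mv, win=1):
    # '1' if the best-matching motion vector lies inside the
    # (2*win+1) x (2*win+1) mini-search window centred on the predicted
    # motion vector (win=1 -> 3x3, win=2 -> 5x5), else '0'.
    return int(abs(best_mv[0] - predicted_mv[0]) <= win and
               abs(best_mv[1] - predicted_mv[1]) <= win)

def reorder_by_syntax_type(elements):
    # elements: (type, payload) pairs in macroblock order. Grouping the
    # same syntax-element type together raises local correlation for the
    # entropy coder; the original order would be recovered from the known
    # per-macroblock syntax dependencies (not modelled here).
    groups = defaultdict(list)
    order = []
    for etype, payload in elements:
        if etype not in groups:
            order.append(etype)
        groups[etype].append(payload)
    return [(etype, groups[etype]) for etype in order]
```

In a transcoder following this scheme, a flag of '1' restricts motion search to the mini-window, while '0' falls back to the normal search window; the re-ordered groups each carry a short type marker so the entropy decoder can separate them.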
2) Reliability-Aware Control Stream Placement: To guarantee end-user data integrity, storage devices must ensure essentially error-free (e.g., ECC decoding failure probability of 10^-15 and below) video bitstream storage. In contrast, we can relax this constraint to some extent for control stream storage. This is because control streams are used solely to reduce the video transcoding computational complexity; hence, the loss of a few kB of a control stream results only in a slight increase of transcoding computational complexity, while the end user's perceptual experience is not affected. Motivated by the unequal storage reliability requirements of the video bitstream and control stream, we develop a simple design technique that minimizes the impact of control streams on the effective storage capacity available for video caching. The key idea is to intentionally place control stream data in the least reliable flash memory portions. We partition all flash memory blocks into two regions, R_n and R_c, with sizes S_n and S_c, respectively. We can store either video bitstream or control stream in region R_n, but only control stream in region R_c. Let f_v and f_c (where f_v