
(19) United States
(12) Patent Application Publication — Yang et al.
(10) Pub. No.: US 2018/0067961 A1
(43) Pub. Date: Mar. 8, 2018

(54) ADAPTIVE CACHING REPLACEMENT MANAGER WITH DYNAMIC UPDATING GRANULATES AND PARTITIONS FOR SHARED FLASH-BASED STORAGE SYSTEM

(71) Applicant: Samsung Electronics Co., Ltd., Suwon-si (KR)

(72) Inventors: Zhengyu Yang, Boston, MA (US); Thomas David Evans, San Marcos, CA (US); Jiayin Wang, Dorchester, MA (US)

(21) Appl. No.: 15/400,835

(22) Filed: Jan. 6, 2017

Related U.S. Application Data

(60) Provisional application No. 62/384,078, filed on Sep. 6, 2016.

Publication Classification

(51) Int. Cl.
G06F 17/30 (2006.01)
G06F 9/455 (2006.01)

(52) U.S. Cl.
CPC .... G06F 17/30132 (2013.01); G06F 17/30233 (2013.01); G06F 2009/45579 (2013.01); G06F 2009/4557 (2013.01); G06F 2009/45583 (2013.01); G06F 9/45558 (2013.01)

(57) ABSTRACT

A method of adjusting temporal and spatial granularities associated with operation of a virtualized file system, the method including analyzing past workloads of a plurality of virtual machines associated with the virtualized file system, and adjusting the temporal and spatial granularities to be similar to average re-access temporal and spatial distances of data sets corresponding to the past workloads.
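As a rough illustration of the abstract's idea (and not the patented implementation), the sketch below estimates average re-access temporal and spatial distances from a simple IO trace and nudges the two granularities toward them. The trace format, the particular definitions of "temporal distance" and "spatial distance," and all function names are assumptions made for this example.

```python
def average_reaccess_distances(trace):
    """Estimate average re-access distances from an IO trace.

    trace: list of (timestamp_seconds, lba) tuples -- an assumed format.
    Temporal distance: time between successive accesses to the same LBA.
    Spatial distance: absolute LBA gap between successive requests.
    Both definitions are illustrative readings of the abstract.
    """
    last_seen = {}                      # lba -> timestamp of its previous access
    temporal_gaps, spatial_gaps = [], []
    prev_lba = None
    for ts, lba in trace:
        if lba in last_seen:
            temporal_gaps.append(ts - last_seen[lba])
        last_seen[lba] = ts
        if prev_lba is not None:
            spatial_gaps.append(abs(lba - prev_lba))
        prev_lba = lba
    avg_t = sum(temporal_gaps) / len(temporal_gaps) if temporal_gaps else None
    avg_s = sum(spatial_gaps) / len(spatial_gaps) if spatial_gaps else None
    return avg_t, avg_s


def adjust_granularities(trace, epoch_seconds, bin_blocks):
    """Move the temporal granularity (epoch length, seconds) and the spatial
    granularity (prefetching bin size, in blocks) toward the measured average
    re-access distances, keeping the old values when a distance cannot be
    measured."""
    avg_t, avg_s = average_reaccess_distances(trace)
    new_epoch = avg_t if avg_t is not None else epoch_seconds
    new_bin = max(1, round(avg_s)) if avg_s is not None else bin_blocks
    return new_epoch, new_bin
```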

[Front-page representative drawing: VM file system diagram; remaining detail lost in text extraction.]

[Drawing Sheets 1, 2, and 3 of 10: figure content not recoverable from text extraction.]

[FIG. 8 (drawing residue): a timeline of bin accesses at times T0-T7; the legible annotation reads "REUSE DISTANCE OF bin3 = 3 (bin2, bin0 AND bin4)".]
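To make the FIG. 8 annotation concrete, the short sketch below counts a bin's reuse distance as the number of distinct other bins accessed between two consecutive accesses to that bin (so touching bin2, bin0, and bin4 between two accesses to bin3 gives bin3 a reuse distance of 3). The function name and trace format are illustrative, not taken from the patent.

```python
def reuse_distance(access_trace, target_bin):
    """Reuse distance of target_bin: the number of distinct other bins
    accessed between its two most recent consecutive accesses in the
    trace (a list of bin IDs); returns None if the bin is accessed
    fewer than two times."""
    last_index = None
    distance = None
    for i, b in enumerate(access_trace):
        if b == target_bin:
            if last_index is not None:
                distance = len(set(access_trace[last_index + 1:i]))
            last_index = i
    return distance


# Example matching FIG. 8's annotation: bins 2, 0, and 4 are touched
# between the two accesses to bin 3, so its reuse distance is 3.
print(reuse_distance([3, 2, 0, 4, 2, 3], 3))   # -> 3
```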

""O

-

....

O'I

.... > ....

l,O

CIO 0 0 O'I ---l

0

.... ---

N

c rJJ

0

.... ....

CIO 0

.....

=('D ('D

rJJ

CIO

N

0

~CIO

:-:

~

~

0

~

(')

0

~

(')

'e

.... ..... .... = ""O = O" .... ..... .... =

-

t

('D

..... = .....

~

[Drawing Sheet 9 of 10: figure content not recoverable from text extraction.]
[0103] ... then the ACRM will fill the SSD 120 with paddingBins, which may be from long-term IO access history or from the 20-minute history of the workload monitor sliding window "Sw." Similarly, the ACRM may remove data from the SSD 120 by checking a victim data's "dirty flag."

[0104] The ACRM may treat the tier of the SSD 120 as a "unibody" cache tier (i.e., it may treat a tier of one or more SSDs 120 as a single SSD, and may treat a tier of the one or more HDDs 130 as a single HDD). That is, the ACRM may have multiple disks in each of the SSD tier and the HDD tier, and the ACRM may adopt both RAID and non-RAID mode disk arrays in each tier.

[0105] FIG. 9 is a block diagram of an ACRM including an ACRM main framework and an ACRM storage monitor daemon, according to an embodiment of the present invention.

[0106] Referring to FIG. 9, in the present embodiment, the ACRM 900 has an additional component that may be referred to as a "storage monitor daemon" 920 for conducting the following operations: periodically tracking disk information (e.g., spatial capacity, throughput capacity, life cycle, worn-out degree, etc.); retiring and replacing disks that are reaching preset performance lower bounds (e.g., heavily worn-out or otherwise defective disks); balancing the workload among disks; and triggering SSD-level wear leveling to balance write cycles or P/E cycles. Accordingly, the storage monitor daemon 920 may have a wear level manager 921, a load balance manager 922, a disk retire and replace manager 923, and a disk health states tracker 924.

[0107] During operation of the wear level manager 921, the ACRM 900 may use the second optimization framework to conduct wear leveling with minimized migration cost, and to balance the workload among disks.
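The storage monitor daemon of FIG. 9 can be pictured with a small skeleton like the one below. This is only a sketch of the responsibilities listed in paragraph [0106]; the class names, thresholds, and the DiskStats fields are assumptions for illustration, not the patent's implementation.

```python
from dataclasses import dataclass

@dataclass
class DiskStats:
    """Per-disk information tracked periodically (assumed fields)."""
    disk_id: str
    spatial_capacity_gb: float
    throughput_mbps: float
    remaining_pe_cycles: int
    worn_out_degree: float          # 0.0 (new) .. 1.0 (fully worn)


class StorageMonitorDaemon:
    """Sketch of the four components described in paragraph [0106]:
    disk health tracking, retirement/replacement, load balancing, and
    SSD-level wear leveling.  Thresholds are illustrative."""

    def __init__(self, disks, wear_out_limit=0.9):
        self.disks = {d.disk_id: d for d in disks}
        self.wear_out_limit = wear_out_limit

    def track_disk_health(self, fresh_stats):
        # Disk health states tracker: refresh capacity/throughput/wear info.
        for stats in fresh_stats:
            self.disks[stats.disk_id] = stats

    def retire_and_replace(self):
        # Disk retire/replace manager: flag disks past the wear bound.
        return [d.disk_id for d in self.disks.values()
                if d.worn_out_degree >= self.wear_out_limit]

    def balance_load(self, per_disk_throughput_demand):
        # Load balance manager: report disks whose demand exceeds their
        # throughput capacity, so partitions can be shifted away from them.
        return [disk_id for disk_id, demand in per_disk_throughput_demand.items()
                if demand > self.disks[disk_id].throughput_mbps]

    def wear_leveling_targets(self):
        # Wear level manager: order SSDs by remaining P/E cycles so new
        # write-heavy sub-partitions go to the least-worn disks first.
        return sorted(self.disks.values(),
                      key=lambda d: d.remaining_pe_cycles,
                      reverse=True)
```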

[0108] FIG. 10 depicts a comparison of content update and wear leveling processes using both an ACRM that is storage monitor daemon-enabled and an ACRM that has no storage monitor daemon, according to embodiments of the present invention.

[0109] Referring to FIG. 10, an upper portion (e.g., FIG. 10(a)) depicts an example of a partition solution for a next epoch 220 according to the aforementioned ACRM optimization framework. However, instead of a single SSD 120, as depicted in FIG. 1, there are four SSD disks 120a, 120b, 120c, and 120d in the SSD tier. FIG. 10(b) shows an allocation solution by an ACRM that has no storage monitor daemon, and that therefore does not consider the program/erase (P/E) cycle and lifetime of each disk. FIG. 10(c), however, depicts an example of a partition solution using the storage monitor daemon (e.g., the storage monitor daemon of FIG. 9) to perform SSD-level wear leveling. The ACRM may attempt to balance the P/E cycles of each of the SSDs 120a-120d. Because the SSDs 120 in the SSD tier can be either homogeneous or heterogeneous, the ACRM may use the remaining P/E cycles as the main target to balance. Instead of straightforwardly assigning partitions of each of the VMs 110a-110d for the next epoch based on information of one or more recent epochs (as performed by the ACRM shown in FIG. 10(b)), the storage monitor daemon 920 of the ACRM may further split each of one or more of the VMs 110a-110d into more fine-grained sub-partitions, and may assign the sub-partitions across the SSDs 120a-120d of the SSD array to balance a worn-out degree (P/E cycles).

[0110] In each of FIGS. 10(b) and 10(c), the VMs 110a-110d are partitioned and assigned to corresponding regions of the SSDs 120a-120d. In FIG. 10(b), the VMs 110a-110d are straightforwardly partitioned (i.e., simply placed in the SSDs 120a-120d in order, from beginning to end). In FIG. 10(c), the ACRM also considers the wear level in determining where to partition the VMs 110a-110d, and also uses another optimization framework to allocate the partitions to each VM. Accordingly, in FIG. 10(b), the partition corresponding to the second VM 110b occupies the first and second SSDs 120a and 120b. However, this may be a suboptimal allocation (e.g., perhaps the second VM 110b corresponds to a large amount of write IOs, or perhaps the second SSD 120b is relatively old/worn). Accordingly, in FIG. 10(c), the majority of the second VM 110b is partitioned into the first SSD 120a, while a smaller portion of the second VM 110b is partitioned into the second SSD 120b, thereby enabling balancing/leveling of wear across the SSDs. Accordingly, by using an objective function to decide an improved or optimal partitioning scheme, the ACRM considers multiple parameters (e.g., wear leveling, partitioning, remaining life cycle, estimated/expected life cycle balance, etc.).

[0111] It should be noted, however, that simply balancing the remaining P/E cycle may be insufficient, because the ACRM may also consider characteristics of the different SSDs 120a-120d (e.g., different write amplification functions (WAF) and total P/E cycles), and may also consider the cost associated with migrating some datasets for wear leveling, as well as throughput utilizations. The ACRM may use the following optimization framework to address balancing between these parameters.

[0112] The ACRM may minimize the following global equation:

α · CV(E(Cl_rem)) + β · Cst(Mig) + γ · CV(ThuptUtil)


[0113] While seeking to satisfy the following equations:

CV(E(Cl_rem)) = sqrt( (1/N_D) · Σ_{i=1..N_D} ( E(Cl_rem[i]) − E(Cl_rem)_avg )² ) / E(Cl_rem)_avg

Cl_next[i] = W[i] · WAF_i = W[i] · f_i( Σ_{j∈D[i]} ( W[i,j] · Seq[i,j] ) / Σ_{j∈D[i]} W[i,j] )

Σ_{i : j∈D[i]} W[i,j] = W[j],  for i ∈ [1, N_D] and j ∈ [1, N_VM]

where D[i] denotes the set of VM write streams assigned to disk i, N_D is the number of disks, and N_VM is the number of VMs.
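As a sketch of how a candidate allocation plan could be scored against the objective of paragraph [0112] and the CV term above: the function below takes per-disk estimates of remaining P/E cycles and throughput utilization together with a migration-cost estimate and returns the weighted sum, where lower is better. The default weights and the way Cst(Mig) and ThuptUtil are supplied are assumptions made for this example.

```python
def coefficient_of_variation(values):
    """CV = standard deviation / mean, reflecting how unbalanced a batch
    of values is (0 means perfectly balanced)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return variance ** 0.5 / mean


def acrm_objective(est_remaining_pe, migration_cost, throughput_utils,
                   alpha=1.0, beta=1.0, gamma=1.0):
    """Score a candidate partition/allocation plan as
    alpha * CV(E(Cl_rem)) + beta * Cst(Mig) + gamma * CV(ThuptUtil).
    est_remaining_pe and throughput_utils are per-disk lists computed for
    the plan being evaluated; migration_cost is its estimated Cst(Mig)."""
    return (alpha * coefficient_of_variation(est_remaining_pe)
            + beta * migration_cost
            + gamma * coefficient_of_variation(throughput_utils))
```

A plan that leaves the disks' estimated remaining P/E cycles and throughput utilizations nearly equal, while moving little data, scores lowest.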

[0114] In detail, by seeking to reduce or minimize the objective function, the ACRM aims to reduce or minimize the load unbalance of the estimated remaining P/E cycles of each of the SSDs 120a-120d, and also aims to reduce or minimize the corresponding migration cost, as well as the throughput utilization. As above, α, β, and γ are tunable parameters for datacenter managers to perform adjustments based on system requirements. "CV" may be referred to as a coefficient of variation, which reflects the degree of difference within a batch of inputs. Here, the ACRM may use the estimated remaining P/E cycle of each disk (i.e., E(Cl_rem)) under an iterated allocation and partition plan in the value range. E(Cl_rem[i]) is the estimated remaining P/E cycle of disk i, and E(Cl_rem)_avg is the average value of the estimated remaining P/E cycles of all of the disks. That is, lowering the objective function realized by the above global equation achieves greater balance. Furthermore, the global equation may allow adjusting of the α, β, and γ parameters to control weight. Although the global equation described above is linear, in other embodiments, the global equation may be more complicated (e.g., a logarithmic equation, etc.).

[0115] Accordingly, by monitoring/measuring different aspects associated with IO operations, the ACRM is able to determine which behaviors result in which types of physical effects, and the ACRM may thereby achieve finer-resolution feedback.

[0116] As shown above, the ACRM may calculate E(Cl_rem[i]) by subtracting the estimated physical write P/E cycles during the predicted next epoch (of all VMs 110 assigned to disk "i") from the remaining P/E cycle of disk "i."

[0117] Additionally, the ACRM may estimate Cl_next[i], noting that the physical write amount of the disk during a next epoch can be calculated by multiplying the total amount of logical writes to disk "i" during the next epoch (W[i]) by the WAF function of disk "i" (WAF_i). Furthermore, the WAF function is a piecewise function (f_i) of the overall sequential ratio of all write streams to the disk (e.g., Σ_{j∈D[i]} ( W[i,j] · Seq[i,j] ) / Σ_{j∈D[i]} W[i,j]).
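The per-disk wear estimate of paragraphs [0116]-[0117] might look like the following sketch. The piecewise WAF curve and the conversion factor from physical write amount to consumed P/E cycles are invented for illustration; only the overall structure (a write-weighted sequential ratio fed into a piecewise WAF, multiplied by the logical write volume and subtracted from the remaining cycles) follows the text.

```python
def overall_sequential_ratio(writes, seq_ratios):
    """Write-weighted sequential ratio of all streams assigned to a disk:
    sum_j(W[i,j] * Seq[i,j]) / sum_j(W[i,j])."""
    total = sum(writes)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in zip(writes, seq_ratios)) / total


def waf_piecewise(seq_ratio):
    """Illustrative piecewise WAF curve f_i: mostly-random write streams
    amplify more than mostly-sequential ones (breakpoints and values are
    invented)."""
    if seq_ratio >= 0.8:
        return 1.2
    if seq_ratio >= 0.4:
        return 2.0
    return 3.5


def estimated_remaining_pe(remaining_pe, writes, seq_ratios, pe_per_write_unit=1.0):
    """E(Cl_rem[i]) = Cl_rem[i] - Cl_next[i], where Cl_next[i] is the
    physical write amount W[i] * f_i(overall sequential ratio).  The
    pe_per_write_unit conversion is an assumption."""
    logical_writes = sum(writes)                          # W[i]
    physical_writes = logical_writes * waf_piecewise(
        overall_sequential_ratio(writes, seq_ratios))     # Cl_next[i]
    return remaining_pe - physical_writes * pe_per_write_unit
```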

The equations above demonstrate that an assigned partition of each VM can be split into different across-SSD sub-partitions.

[0118] According to the above, the ACRM of embodiments of the present invention is able to improve disk allocation in a virtualized file system by dynamically adjusting temporal and spatial granularities associated with IO requests, by monitoring metrics associated with the workloads of various virtual machines, and by determining a global best solution for the temporal and spatial granularities in consideration of the workload patterns of the virtual machines.

What is claimed is:

1. A method of adjusting temporal and spatial granularities associated with operation of a virtualized file system, the method comprising:
analyzing past workloads of a plurality of virtual machines associated with the virtualized file system; and
adjusting the temporal and spatial granularities to be similar to average re-access temporal and spatial distances of data sets corresponding to the past workloads.

2. The method of claim 1, further comprising adjusting partition sizes inside a fast drive of the virtualized file system for each virtual machine disk of the virtual machines.

3. The method of claim 2, further comprising detecting IO changes of each virtual machine disk, and wherein adjusting partition sizes is based on the detected IO changes.

4. The method of claim 1, wherein adjusting the temporal and spatial granularities comprises adjusting a time interval of a workload monitor sliding window for updating content of a fast drive used as a cache of the virtualized file system.

5. The method of claim 4, wherein the analyzing past workloads of the plurality of virtual machines occurs at an end of the workload monitor sliding window.

6. The method of claim 1, wherein adjusting the temporal and spatial granularities comprises adjusting a prefetching bin size corresponding to an amount of data retrieved from a fast drive or a slow drive of the virtualized file system.

7. The method of claim 1, wherein adjusting the temporal and spatial granularities comprises adjusting a cache size or an epoch updating frequency of each of the virtual machines based on the past workloads.

8. The method of claim 1, further comprising updating a fast drive as a cache of the virtualized file system by prefetching data in a most recent epoch based on the adjusted temporal and spatial granularities.

9. The method of claim 1, further comprising conducting load balancing, wear leveling, disk health monitoring, and disk retirement/replacement based on the past workloads.

10. The method of claim 9, further comprising:
separating the virtual machines into sub-partitions; and
assigning the sub-partitions across a plurality of fast drives of an array to balance wear of the fast drives.

11. A virtualized file system, comprising:
a plurality of virtual machines;
one or more slow drives;
a fast drive as a cache for the one or more slow drives;
a memory; and
a processor coupled to the memory, the processor executing a software component that is configured to:
analyze past workloads of a plurality of virtual machines associated with the virtualized file system; and
adjust temporal and spatial granularities to be similar to average re-access temporal and spatial distances of data sets corresponding to the past workloads.


12. The virtualized file system of claim 11, wherein the software component is further configured to adjust partition sizes inside the fast drive for each virtual machine disk of the virtual machines.

13. The virtualized file system of claim 12, wherein the software component is further configured to detect IO changes of each virtual machine disk, and wherein the software component is configured to adjust the partition sizes based on detected IO changes.

14. The virtualized file system of claim 11, wherein the software component is configured to adjust the temporal and spatial granularities by adjusting a time interval of a workload monitor sliding window for updating content of a fast drive used as a cache of the virtualized file system.

15. The virtualized file system of claim 11, wherein the software component is configured to adjust the temporal and spatial granularities by adjusting a prefetching bin size corresponding to an amount of data retrieved from a fast drive or a slow drive of the virtualized file system.

16. The virtualized file system of claim 11, wherein the software component is configured to adjust the temporal and spatial granularities by adjusting a cache size or an epoch updating frequency of each of the virtual machines based on the past workloads.

17. The virtualized file system of claim 11, further comprising updating a fast drive as a cache of the virtualized file system by prefetching data in a most recent epoch based on the adjusted temporal and spatial granularities.

18. The virtualized file system of claim 11, wherein the software component is further configured to conduct load balancing, wear leveling, disk health monitoring, and disk retirement/replacement based on the past workloads.

19. The virtualized file system of claim 18, wherein the software component is further configured to:
separate the virtual machines into sub-partitions; and
assign the sub-partitions across a plurality of fast drives of an array to balance wear of the fast drives.

20. A method of adjusting temporal and spatial granularities associated with operation of a virtualized file system, the method comprising:
determining whether an end of a workload monitor sliding window corresponding to a plurality of virtual machines of the virtualized file system is reached;
updating one or more of a prefetching granularity, a cache size, and an update frequency of a content update epoch sliding window for each of the virtual machines;
determining whether an end of the content update epoch sliding window of a corresponding one of the virtual machines is reached; and
updating content of the corresponding one of the virtual machines.

* * * * *