Conference: Ubiquitous Information Technologies & Applications, 2009. ICUT '09.
A probabilistic cost model considering real-time throughput for RFID jobs in Data Grids

Jih-Sheng Chang and Ruay-Shiung Chang
Network Innovation Technology Laboratory
Department of Computer Science and Information Engineering
National Dong Hwa University, Hualien, TAIWAN
{jschang, rschang}@mail.ndhu.edu.tw

Abstract
RFID is a rapidly developing technology for wireless identification. Speeding up the processing of RFID data is a critical issue for real-time RFID applications. Data Grids are a promising solution for the huge amount of real-time RFID data because of their high-performance computing capability and distributed storage. However, RFID systems and Data Grids still differ in protocols, programming interfaces, and security mechanisms. We therefore propose an intermediate architecture, called the RFID-Grid proxy, that sits between RFID networks and Data Grids and interconnects the two systems. Within this architecture, we evaluate a probabilistic cost estimation model based on the M/M/m queuing system for RFID jobs, taking system load balance, real-time throughput, and deployment cost into account.

Keywords: RFID, Data Grids, M/M/m queuing system
1. Introduction
With advances in wireless and contactless technology, RFID [5, 6] has become increasingly popular for wireless identification in recent years. The huge amount of real-time RFID data must be processed promptly in order to interact with downstream and upstream systems, such as an international warehousing system. Speeding up RFID jobs is therefore a critical issue for real-time RFID applications. Data Grids [1-4] are a promising solution for the huge amount of real-time RFID data because of their high-performance computing capability and distributed storage. The data grid is one of the most popular applications of grid computing and centers on data management and data replication technologies. With considerable storage capacity and effective data management, it can deal efficiently with large-scale data processing and distributed data access.
Nevertheless, RFID systems and Data Grids still differ in communication protocols, programming interfaces, and security mechanisms. We therefore propose an intermediate architecture, called the RFID-Grid proxy, between RFID networks and the Data Grids system to interconnect the two systems, as shown in figure 1. In this architecture, RFID jobs are delegated to RFID-Grid proxies, which recognize the jobs and pass them to computing resources in the data grid for further processing. For this intermediate architecture, sustaining real-time throughput without serious delay is decisive. On the other hand, if too many grid hosts are dedicated to serving as RFID-Grid proxies, overall system performance may suffer because fewer grid resources remain usable for other work. The tradeoff between real-time throughput and deployment cost for RFID jobs must therefore be considered carefully. In this paper, we put forward a probabilistic cost estimation model based on the M/M/m queuing system for RFID jobs that considers system load balance, real-time throughput, and deployment cost.
Figure 1. System Architecture (job arrivals, RFID-Grid proxies, Data Grids resources)
2. Related Work
In [12], the authors propose an analytical model of RFID networks based on an M/M/1/N queuing system. The main contribution of that paper is a method for optimizing the number of RFID sensors in an area. A numerical simulation validates the theoretical result, and the theoretical model also improves the efficiency of the RFID sensor network. [11] proposes a job scheduling algorithm for grid computing on RFID networks. It integrates a grid system into RFID EPC networks in order to handle the large amount of EPC data. Simulation results show that the proposed algorithm adapts well to the integrated RFID EPC networks.
3. The proposed model
3.1 Problem definition
We transform the intermediate architecture into an M/M/m queuing model [7, 10], as shown in figure 2. The problem then becomes how to decide the proper number of RFID-Grid proxies while accounting for system load balance, average real-time throughput, and deployment cost.
Figure 2. M/M/m queuing system

3.2. M/M/m queuing model
The terms used in the M/M/m queuing system are defined in table 1.

Table 1: Variables in the M/M/m queuing model
Term | Meaning
λ | Average arrival rate of RFID jobs
µ | Average service rate of RFID-Grid proxies
n | The number of jobs within the queuing system
Pn | The probability of n jobs in the queuing system
m | The number of servers in the queuing system

In accordance with the global balance equations, we obtain the following formulas:

$$\lambda p_{n-1} = n \mu p_n, \quad n \le m; \qquad \lambda p_{n-1} = m \mu p_n, \quad n > m \tag{a}$$

$$p_n = \begin{cases} p_0 \dfrac{(m\rho)^n}{n!}, & n \le m \\[2mm] p_0 \dfrac{m^m \rho^n}{m!}, & n > m \end{cases} \tag{b}$$

where ρ is given by ρ = λ/(mµ). From $\sum_{n=0}^{\infty} p_n = 1$, the formula for P0 can be obtained:

$$p_0 = \left[ 1 + \sum_{n=1}^{m-1} \frac{(m\rho)^n}{n!} + \sum_{n=m}^{\infty} \frac{(m\rho)^n}{m!\, m^{n-m}} \right]^{-1} = \left[ 1 + \sum_{n=1}^{m-1} \frac{(m\rho)^n}{n!} + \sum_{n=m}^{\infty} \frac{m^m \rho^n}{m!} \right]^{-1} = \left[ \sum_{n=0}^{m-1} \frac{(m\rho)^n}{n!} + \frac{m^m \rho^m}{m!} \, \frac{1}{1-\rho} \right]^{-1} \tag{c}$$

The probability that an arriving job has to be queued in the system is given below:

$$p_Q = \sum_{n=m}^{\infty} p_0 \frac{m^m \rho^n}{m!} = \frac{p_0\, m^m \rho^m}{m!} \sum_{n=m}^{\infty} \rho^{n-m} \tag{d}$$

For ρ < 1 the geometric series in Eq.(d) sums to 1/(1 - ρ).

3.3 Proposed model
We define the average real-time throughput as the following formula:

$$T(M = m) = \frac{\lambda\,(1 - p_Q)}{M} \tag{e}$$
The variable M denotes the number of RFID-Grid proxies. The function T therefore gives the average real-time throughput of incoming jobs that do not suffer queuing delay when m RFID-Grid proxies are dedicated. Because RFID applications demand real-time processing, the higher the real-time throughput of the system, the better the performance users obtain. On the other hand, from the point of view of system load balance, more dedicated RFID-Grid proxies leave fewer resources for the rest of the grid system. How to assign dedicated RFID-Grid proxies with all of these factors in mind therefore becomes a critical issue. First, we examine which factors affect the average real-time throughput significantly, using the simulation results in figure 3. The average arrival rate varies from 1000 to 15000 with a fixed service rate, and the figure shows how the average real-time throughput varies with the number of dedicated RFID-Grid proxies. Take the arrival rate of 10000 as an example: the throughput rises to 1200 as the number of proxies increases, but then gradually declines from that peak. The same observation holds for the other average arrival rates. These results show that increasing the number of RFID-Grid proxies without considering the average real-time throughput may lead to poor performance.
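The rise and fall of the throughput described above can be reproduced numerically. The following is a minimal sketch, assuming Eq.(c), Eq.(d) and Eq.(e) as stated; the function names `p0`, `p_queue` and `throughput` are illustrative and not taken from the paper. With an arrival rate of 2000 and a service rate of 2500 the printed values should agree with Table 2 up to rounding.

```python
from math import factorial


def p0(lam: float, mu: float, m: int) -> float:
    """Probability of an empty system, Eq. (c); needs rho = lam/(m*mu) < 1."""
    rho = lam / (m * mu)
    if rho >= 1:
        raise ValueError("unstable queue: rho must be below 1")
    a = m * rho  # offered load lam / mu
    head = sum(a**n / factorial(n) for n in range(m))
    tail = a**m / factorial(m) / (1 - rho)
    return 1.0 / (head + tail)


def p_queue(lam: float, mu: float, m: int) -> float:
    """Probability that an arriving job has to queue, Eq. (d)."""
    rho = lam / (m * mu)
    a = m * rho
    return p0(lam, mu, m) * a**m / factorial(m) / (1 - rho)


def throughput(lam: float, mu: float, m: int) -> float:
    """Average real-time throughput for m dedicated proxies, Eq. (e)."""
    return lam * (1 - p_queue(lam, mu, m)) / m


if __name__ == "__main__":
    # Sweep the number of proxies for the arrival/service rates of Table 2.
    for m in range(1, 11):
        print(m, round(p0(2000, 2500, m), 6),
              round(p_queue(2000, 2500, m), 6),
              round(throughput(2000, 2500, m), 3))
```

In this sweep the throughput peaks at m = 2 and then declines roughly as λ/m, because p_Q is already close to zero while the denominator M keeps growing.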
Figure 3. Average real-time throughput versus the number of dedicated RFID-Grid Proxies

Because the average real-time throughput is our main concern, we first select a candidate range on the basis of high throughput. In addition to throughput, however, system load balance must be considered. If we always dedicate a large number of RFID-Grid proxies to RFID jobs in order to gain high throughput, the performance of the overall grid system may decline. In other words, too many dedicated RFID-Grid proxies reduce the computing resources usable for other kinds of jobs. An appropriate number of dedicated RFID-Grid proxies can therefore not only enhance the average real-time throughput for RFID jobs but also improve system load balance and the usability of computing resources for other types of jobs. Here we suggest a candidate range (CRi) principle given below:

$$CR_i = (1 - load_i) \ast \text{All available hosts} \tag{f}$$

CRi indicates how many hosts we can make use of during period i in consideration of the current system load. loadi expresses the current load of the grid system during period i, which can be retrieved from the grid information service [4]. For example, if loadi is 60% and 50 hosts are available in total, CRi is 50 * (1-0.6) = 20. Accordingly, a heavier system load leaves fewer computing resources as candidates for RFID-Grid proxies, and vice versa. The candidate range (CRi) principle thus selects an appropriate range of candidates for RFID-Grid proxies by taking the system load into consideration. Returning to the example above, with 20 hosts available, we evaluate the average real-time throughput by Eq.(e) for m=1 to 20 with an arrival rate of 2000 and a service rate of 2500, as indicated in Table 2. We then use a Max function, shown in Eq.(g), to select the group in the top θ% of throughput, which is stored in an array SC (selective candidates). The value of θ is chosen by the administrator. Take Table 2 as an example. If θ is 30% while CRi is 20, SC is an array of the 20*30%=6 elements with the highest average throughput (m=2, m=3, m=4, m=1, m=5, m=6).

$$SC = \underset{\theta}{\mathrm{Max}}\,(CR_i) \tag{g}$$
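The candidate range of Eq.(f) and the top-θ% selection of Eq.(g) can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, the load and θ are passed as integer percentages to avoid rounding surprises, and an overloaded configuration (ρ ≥ 1) is assumed to yield zero real-time throughput.

```python
from math import factorial


def throughput(lam: float, mu: float, m: int) -> float:
    """Eq. (e), with p_Q taken from Eqs. (c) and (d)."""
    rho = lam / (m * mu)
    if rho >= 1:
        return 0.0  # assumption: an overloaded system serves no job without delay
    a = m * rho
    tail = a**m / factorial(m) / (1 - rho)
    p0 = 1.0 / (sum(a**n / factorial(n) for n in range(m)) + tail)
    return lam * (1 - p0 * tail) / m  # p_Q = p0 * tail


def candidate_range(load_percent: int, all_hosts: int) -> int:
    """Eq. (f): CR_i = (1 - load_i) * all available hosts."""
    return all_hosts * (100 - load_percent) // 100


def selective_candidates(lam, mu, cr, theta_percent):
    """Eq. (g): the top theta% of m = 1..CR_i, ranked by throughput."""
    ranked = sorted(range(1, cr + 1),
                    key=lambda m: throughput(lam, mu, m), reverse=True)
    return ranked[: cr * theta_percent // 100]


if __name__ == "__main__":
    cr = candidate_range(60, 50)  # the example from the text
    print(cr, selective_candidates(2000, 2500, cr, 30))
```

For the example in the text (load 60%, 50 hosts, θ = 30%, arrival rate 2000, service rate 2500) the sketch yields the same six candidates in the same order as listed above.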
We transform the problem of proxy selection into a minimum cost hypothesis test model [10]. Let H be a random variable with m hypotheses 0, 1, ..., m-1, and let S be an observation random variable with observed sample value s. The m hypotheses correspond to the situations of M = 0, 1, ..., m-1 RFID-Grid proxies. Our concern is how to make the decision among the possible hypotheses that minimizes the average cost when a sample value s is observed. p(H=i | S=s) is the posterior probability of hypothesis Hi conditional on s, in accordance with Bayes' theorem:
$$p(H = i \mid S = s) = \frac{p(s \cap H_i)}{p(s)} = \frac{p(s \mid H_i)\, p(H_i)}{p(s)} \tag{h}$$
The term Cij is defined as the cost of choosing hypothesis i while hypothesis j is true. The expected cost is defined as the following formula:

$$E[\text{Cost of choosing } H = i \mid S = s] = \sum_{j=0}^{m-1} p(j \mid s)\, C_{ij} \tag{i}$$
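As a small numerical illustration of Eq.(h) and Eq.(i), the sketch below computes the posterior over three hypotheses and the expected cost of choosing each one. All numbers (likelihoods, priors and the cost matrix) are hypothetical values chosen only for illustration, and the function names are not from the paper.

```python
def posterior(likelihoods, priors):
    """Eq. (h): p(H=j | S=s) for every hypothesis j, via Bayes' theorem."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(s), by the law of total probability
    return [x / evidence for x in joint]


def expected_cost(i, post, cost):
    """Eq. (i): expected cost of choosing hypothesis i given the posterior."""
    return sum(p_j * cost[i][j] for j, p_j in enumerate(post))


# Hypothetical inputs, for illustration only.
likelihoods = [0.1, 0.3, 0.6]   # p(s | H_j)
priors = [0.5, 0.3, 0.2]        # p(H_j)
cost = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 1.0],
        [2.0, 1.0, 0.0]]        # C_ij: cost of choosing i while j is true

post = posterior(likelihoods, priors)
costs = [expected_cost(i, post, cost) for i in range(len(post))]
best = min(range(len(costs)), key=costs.__getitem__)
print(post, costs, best)
```

With these inputs the hypothesis with the largest posterior (index 2) is not the cheapest choice; the cost matrix shifts the decision to index 1, which is the point of weighting the posterior by Cij.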
Table 2. Average Real-Time Throughput with an arrival rate of 2000
m | P0 | PQ | Throughput | Rank
1 | 0.2 | 0.8 | 399.999976 | 4
2 | 0.428571 | 0.228571 | 771.42857 | 1
3 | 0.447154 | 0.052033 | 631.978318 | 2
4 | 0.449102 | 0.009581 | 495.209581 | 3
5 | 0.449307 | 0.001461 | 399.415759 | 5
6 | 0.449327 | 0.000189 | 333.270412 | 6
7 | 0.449329 | 2.11E-05 | 285.708255 | 7
8 | 0.449329 | 2.08E-06 | 249.999481 | 8
9 | 0.449329 | 1.82E-07 | 222.222182 | 9
10 | 0.449329 | 1.45E-08 | 199.999997 | 10
Cij can be modeled by the following formula, where DCij is the deployment cost of proxy servers when hypothesis m=i is chosen while hypothesis m=j is true, and Rij is the revenue in the same situation. In other words, Rij is an accounting mechanism that determines how users are charged for computing services in the grid system. A lower deployment cost and a higher revenue therefore lead to a lower total cost. The deployment cost and accounting parameters depend on the system's policy and the administrator's strategy.

$$C_{ij} = DC_{ij} - R_{ij} \tag{j}$$
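Having examined the hypothesis cost in Eq.(i), we would like to find the minimal cost decision Hmc for the observed sample value s:

$$H_{mc} = \arg\min_i \sum_{j=0}^{m-1} p(j \mid s)\, C_{ij} = \arg\min_i \sum_{j=0}^{m-1} \frac{p(s \mid j)\, p(j)}{p(s)}\, C_{ij} = \arg\min_i \sum_{j=0}^{m-1} p(s \mid j)\, p(j)\, C_{ij} \tag{k}$$

The average cost, with Ri denoting the range of sample values assigned to H=i, is:

$$C = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} C_{ij} \int_{s \in R_i} \left[ p(s \mid j)\, p(j) \right] ds \tag{l}$$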
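Let m = 2; the following average cost is obtained in the form of binary hypothesis testing:

$$\begin{aligned}
C &= \sum_{i=0}^{1} \sum_{j=0}^{1} C_{ij} \int_{s \in R_i} \left[ p(s \mid j)\, p(j) \right] ds \\
&= C_{00} \int_{s \in R_0} p(s \mid H=0)\, p(H=0)\, ds + C_{01} \int_{s \in R_0} p(s \mid H=1)\, p(H=1)\, ds \\
&\quad + C_{10} \int_{s \in R_1} p(s \mid H=0)\, p(H=0)\, ds + C_{11} \int_{s \in R_1} p(s \mid H=1)\, p(H=1)\, ds \\
&= C_{00} \left[ p(H=0) - \int_{s \in R_1} p(s \mid H=0)\, p(H=0)\, ds \right] + C_{01} \left[ p(H=1) - \int_{s \in R_1} p(s \mid H=1)\, p(H=1)\, ds \right] \\
&\quad + C_{10} \int_{s \in R_1} p(s \mid H=0)\, p(H=0)\, ds + C_{11} \int_{s \in R_1} p(s \mid H=1)\, p(H=1)\, ds \\
&= \int_{s \in R_1} \left[ (C_{10} - C_{00})\, p(s \mid H=0)\, p(H=0) - (C_{01} - C_{11})\, p(s \mid H=1)\, p(H=1) \right] ds \\
&\quad + C_{00}\, p(H=0) + C_{01}\, p(H=1)
\end{aligned} \tag{m}$$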
Since C00 * p(H=0) + C01 * p(H=1) is the same for every hypothesis selection i, the only term we need to pay attention to is:

$$\int_{s \in R_1} \left[ (C_{10} - C_{00})\, p(s \mid H=0)\, p(H=0) - (C_{01} - C_{11})\, p(s \mid H=1)\, p(H=1) \right] ds \tag{n}$$
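The objective is to minimize the average cost. The smaller the value of Eq.(n), the smaller the cost, so an observation sample s should be placed in the range of H=1 (R1) whenever the integrand is negative. Eq.(n) therefore leads to the following decision rule, where D denotes the decision:

$$(C_{10} - C_{00})\, p(s \mid H=0)\, p(H=0) \;\underset{D=0}{\overset{D=1}{\lessgtr}}\; (C_{01} - C_{11})\, p(s \mid H=1)\, p(H=1)$$

$$\Rightarrow\; C_{00}\, p(s \mid H=0)\, p(H=0) + C_{01}\, p(s \mid H=1)\, p(H=1) \;\underset{D=0}{\overset{D=1}{\gtrless}}\; C_{10}\, p(s \mid H=0)\, p(H=0) + C_{11}\, p(s \mid H=1)\, p(H=1) \tag{o}$$

For two arbitrary hypotheses i and k, the rule becomes:

$$MC(i) = \sum_{j=0}^{m-1} p(s \mid j)\, p(j)\, C_{ij} \;\underset{D=k}{\overset{D=i}{\lessgtr}}\; \sum_{j=0}^{m-1} p(s \mid j)\, p(j)\, C_{kj} = MC(k) \tag{p}$$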
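The two forms of the binary rule in Eq.(o) and the MC comparison of Eq.(p) should give the same decision. The sketch below checks this numerically for m = 2. The likelihoods, priors and cost matrix are hypothetical, and the function names are illustrative only.

```python
def decide_binary(lik0, lik1, prior0, prior1, cost):
    """First form of Eq. (o): D=1 when
    (C10 - C00) p(s|0) p(0) < (C01 - C11) p(s|1) p(1), otherwise D=0."""
    lhs = (cost[1][0] - cost[0][0]) * lik0 * prior0
    rhs = (cost[0][1] - cost[1][1]) * lik1 * prior1
    return 1 if lhs < rhs else 0


def mc(i, liks, priors, cost):
    """Eq. (p): MC(i) = sum_j p(s|j) p(j) C_ij."""
    return sum(liks[j] * priors[j] * cost[i][j] for j in range(len(liks)))


def decide_by_mc(liks, priors, cost):
    """Second form of Eq. (o): D=1 when MC(1) < MC(0), otherwise D=0."""
    return 1 if mc(1, liks, priors, cost) < mc(0, liks, priors, cost) else 0


# Hypothetical cost matrix C_ij and equal priors, for illustration only.
cost = [[1.0, 4.0],
        [3.0, 0.5]]
priors = [0.5, 0.5]

for liks in ([0.7, 0.2], [0.2, 0.7]):
    d_rule = decide_binary(liks[0], liks[1], priors[0], priors[1], cost)
    d_mc = decide_by_mc(liks, priors, cost)
    print(liks, d_rule, d_mc)
```

When the sample is more likely under H=0 the rule selects D=0, and when it is more likely under H=1 it selects D=1; both forms agree in each case.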
With the minimal cost function MC defined in Eq.(p), a minimal cost decision between two arbitrary hypotheses i and k out of the m hypotheses can be determined. We next focus on how to make the best decision among the selected candidates with respect to cost. As discussed above and as figure 3 shows, continuously increasing the number of proxies without considering system load balance and cost does not guarantee the expected high average throughput. The next point to consider is how many proxies the system actually needs without wasting computing resources. Since system load balance and average real-time throughput are already taken into account by Eq.(f) and Eq.(g), we put forward a bubble-sort-like algorithm called minimal cost bubble sort (i.e. MC_BubbleSort), which uses Eq.(p) to make a minimal cost decision on an appropriate number of RFID-Grid proxies among the m possible hypotheses in SC.

int MC_BubbleSort( int SC[], int array_size )
{
  int t, u, temp;
  for (t = (array_size - 1); t >= 0; t--){
    for (u = 1; u < ...

$$MC(2) = \sum_{j=2}^{4} p(s=5 \mid j)\, p(j)\, C_{2j} = [P_{n=5,m=2} \ast p(j=2) \ast C_{22}] + [P_{n=5,m=3} \ast p(j=3) \ast C_{23}] + [P_{n=5,m=4} \ast p(j=4) \ast C_{24}]$$

$$\underset{D=3}{\overset{D=2}{\lessgtr}}$$

$$[P_{n=5,m=2} \ast p(j=2) \ast C_{32}] + [P_{n=5,m=3} \ast p(j=3) \ast C_{33}] + [P_{n=5,m=4} \ast p(j=4) \ast C_{34}] = \sum_{j=2}^{4} p(s=5 \mid j)\, p(j)\, C_{3j} = MC(3)$$
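The pairwise comparison of Eq.(p) over the candidates in SC can be sketched as follows. This is not the authors' MC_BubbleSort listing; it is a simplified single pass that keeps the cheaper hypothesis of each compared pair. The likelihood p(s | j) is taken as the probability of n = s jobs under m = j proxies from Eq.(b) and Eq.(c), following the P_{n,m} notation of the example above. The uniform prior and the cost values (a deployment cost of 10 per proxy and a revenue of 20 per usefully deployed proxy) are hypothetical.

```python
from math import factorial


def p_n(n: int, lam: float, mu: float, m: int) -> float:
    """Eq. (b) with p0 from Eq. (c): probability of n jobs under m proxies."""
    rho = lam / (m * mu)
    a = m * rho
    p0 = 1.0 / (sum(a**k / factorial(k) for k in range(m))
                + a**m / factorial(m) / (1 - rho))
    if n <= m:
        return p0 * a**n / factorial(n)
    return p0 * m**m * rho**n / factorial(m)


def mc(i, s, sc, prior, cost, lam, mu):
    """Eq. (p): MC(i) = sum over j in SC of p(s|j) p(j) C_ij."""
    return sum(p_n(s, lam, mu, j) * prior[j] * cost[(i, j)] for j in sc)


def minimal_cost_choice(s, sc, prior, cost, lam, mu):
    """Single pass over SC that keeps the cheaper of each compared pair."""
    best = sc[0]
    for k in sc[1:]:
        if mc(k, s, sc, prior, cost, lam, mu) < mc(best, s, sc, prior, cost, lam, mu):
            best = k
    return best


if __name__ == "__main__":
    sc = [2, 3, 4]                       # candidate numbers of proxies
    prior = {j: 1.0 / len(sc) for j in sc}  # hypothetical uniform prior p(j)
    # Hypothetical C_ij = DC_ij - R_ij, Eq. (j).
    cost = {(i, j): 10.0 * i - 20.0 * min(i, j) for i in sc for j in sc}
    print(minimal_cost_choice(5, sc, prior, cost, 2000, 2500))
```

A full bubble sort would order all of SC by MC; when only the cheapest hypothesis is needed, one pass of comparisons is sufficient.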