NON-LINEAR TRAFFIC MODELING OF VBR MPEG-2 VIDEO SOURCES
Anastasios D. Doulamis, Nikolaos D. Doulamis and Stefanos D. Kollias
National Technical University of Athens, Electrical & Computer Engineering Department
Email:
[email protected]
ABSTRACT
In this paper, a neural network scheme is presented for modeling VBR MPEG-2 video sources. In particular, three non-linear autoregressive (NAR) models are proposed to model the aggregate MPEG-2 video sequence, each corresponding to one of the three frame types (I, P and B frames). The optimal mean-squared error predictor of each NAR model is then implemented by a feedforward neural network with a tapped delay line (TDL) filter at its input. A novel algorithm is also introduced which handles the significant effect that the correlation among the I, P and B frames has on the estimation of the network resources. Furthermore, a new mechanism, based on a generalized regression neural network, is proposed to improve the modeling accuracy, especially at high bit rates. Experimental studies and computer simulations illustrate the efficiency and robustness of the proposed model as a predictor of the network resources, compared to conventional models.
1. INTRODUCTION
Digital video is expected to be one of the major traffic components of future communication networks [1], since the demand for video services will increase rapidly in the forthcoming years. Examples include high definition TV (HDTV) [2], videophone or video conferencing applications [3], and content-based retrieval from large video databases [4]. However, uncompressed video can hardly be transmitted, even over high-speed links, due to its large bandwidth requirements. For this reason, several coding algorithms have been proposed in the literature to perform efficient and effective video compression while maintaining an acceptable picture quality. Among the most popular video coding algorithms is the MPEG-2 standard, which now appears in many applications such as the distribution of digital cable TV or networked database services over ATM [2].
Variable Bit Rate (VBR) video transmission (variable rate, constant quality) presents many advantages compared to Constant Bit Rate (CBR) transmission (constant rate, variable quality). In VBR mode, the output rate of the encoder fluctuates according to video activity and scene complexity. Thus, if the allocated bandwidth is set equal to the source peak rate, the network is under-utilized. For this reason, several independent VBR video sources are multiplexed in a common buffer with constant output rate, so that the aggregate bit rate tends to smooth out around the average, as the law of large numbers indicates [1]. Losses then occur when the incoming amount of data exceeds the buffer size (buffer overflow). Since, for a given video quality, the loss probability should not exceed a certain limit, it is necessary to build some protection against losses so that an acceptable Quality of Service (QoS) level is guaranteed to the users. Consequently, statistical characterization and modeling of VBR video sources are
necessary for video traffic characterization and for the estimation of the network resources.
Several VBR video models have been proposed in the literature. In [1], a single video stream of a video telephone scene is analyzed and two models are shown to fit the experimental data well. A multi-state Markov chain is proposed in [5] as a source model for VBR teleconference streams, while scene modeling is performed by Frater et al. [6] for video films presenting long-range dependence. However, the previous models cannot be applied directly to VBR MPEG coded video sources, since the coding method used significantly affects the traffic characteristics [7]. Some statistical properties and basic characteristics of MPEG coded video streams have recently been analyzed, such as the higher rate of I frames compared to P and B frames and the periodicity of the MPEG autocovariance function.
In this paper, a non-linear model is used for estimating the network resources of VBR MPEG-2 video traffic. In particular, a feedforward neural network architecture with tapped delay line (TDL) inputs is adopted to approximate the rates of the I, P and B frames. However, if the generated rates are merged independently to construct the aggregate sequence, a severe underestimate of the network resources is observed, because the correlation among the I, P and B frames is not preserved. For this reason, a correlation mechanism is proposed to generate correlated I, P and B rates. Furthermore, a new mechanism, based on a generalized regression neural network (GRNN), is also introduced to improve the modeling accuracy, especially at high bit rates.
2. VBR MPEG VIDEO SOURCE CHARACTERISTICS
Video source modeling depends on the adopted compression scheme, since the latter significantly affects the statistical properties of the actual video traffic. Thus, before modeling VBR MPEG encoders, it is useful to briefly describe the MPEG coding algorithm and its characteristics. The MPEG algorithm uses three different types of frames: Intraframe (I), Predictive (P) and Bidirectionally-Predictive (B). These three frame types are organized in a Group of Pictures (GOP), defined by the distance L between I frames and the distance M between P frames. If the cyclic frame pattern is ...IBBPBBPBBPBBI..., then L=12 and M=3; these values of L and M are used throughout this paper. The average rate of Intra frames is much higher than the respective rate of Inter frames, since the former are coded using spatial information only. On the contrary, Inter frames present much higher fluctuations, especially when the motion compensation algorithm fails [2]. Furthermore, the rates of I, P and B frames are correlated, mainly due to the motion estimation algorithm and the continuity of the actual video content. In the following, three MPEG-2 coded video sources have been examined (Jurassic Park, James Bond and a TV series), which are referred to as Source1, Source2 and Source3, respectively.
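To make the frame organization concrete, the following minimal Python sketch (an illustration, not part of the paper) reproduces the frame-type pattern of a single GOP for given L and M; with L=12 and M=3 it yields the IBBPBBPBBPBB pattern assumed throughout.

```python
# Illustrative sketch: frame-type assignment within one GOP for a given
# I-frame distance L and P-frame distance M (assumed values L=12, M=3).

def gop_pattern(L: int = 12, M: int = 3) -> str:
    """Return the frame-type pattern of one GOP in display order."""
    types = []
    for k in range(L):
        if k == 0:
            types.append("I")      # one intra-coded frame per GOP
        elif k % M == 0:
            types.append("P")      # predictive frames every M positions
        else:
            types.append("B")      # bidirectional frames in between
    return "".join(types)

print(gop_pattern())               # -> IBBPBBPBBPBB
```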
3. MODELING VBR MPEG-2 VIDEO SOURCES
3.1 Error Modeling
Let us denote as $x^c(n)$, $c \in \{I, P, B\}$, the rate of the $n$th sample of the I, P or B frames. It should be mentioned that $x^c(n)$ does not refer to the $n$th sample of the aggregate video sequence, but to the $n$th sample of the $c$-stream. We estimate the rates of the I, P and B frames using NAR models of order $p_c$ (NAR($p_c$)), as

$$x^c(n) = g^c\big(x^c(n-1), x^c(n-2), \ldots, x^c(n-p_c)\big) + e^c(n) \qquad (1)$$
where $c \in \{I, P, B\}$ and $g^c(\cdot)$ is a non-linear function. The error $e^c(n)$ is an independent and identically distributed (i.i.d.) random variable. In the following analysis, we omit the superscript $c$ for simplicity, since it is involved in all equations. The main difficulty in implementing a NAR model is that the function $g(\cdot)$ is actually unknown. However, it has been shown in [8] that a feedforward neural network with a tapped delay line (TDL) filter at its input is able to approximate a NAR model to any desired accuracy. In this case, the network output provides a non-linear approximation $\hat{g}(\cdot)$ of the function $g(\cdot)$ and thus an estimate, say $\hat{x}(n)$, of $x(n)$,
$$y_{\mathbf{w}}(\mathbf{x}(n-1)) \equiv \hat{x}(n) = \hat{g}\big(x(n-1), x(n-2), \ldots, x(n-p)\big) \qquad (2)$$
where $y_{\mathbf{w}}(\mathbf{x}(n-1))$ represents the network output when $\mathbf{x}(n-1) = [x(n-1) \ \cdots \ x(n-p)]^T$ is fed as the network input, while the subscript $\mathbf{w}$ indicates the dependence of the network output upon its weights. A training set of N samples is used to estimate the neural network weights $\mathbf{w}$. The actual value of the signal $x(n)$ is used as the desired response of the network output, while the previous $p$ samples form the input vector. A modification of the Marquardt-Levenberg training algorithm has been adopted in our case to estimate the network weights [9], due to its efficiency and fast convergence. Furthermore, a cross-validation method is used to improve network generalization; 10% of the available data are used as a validation set.
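As an illustration of the TDL predictor described above, the following Python sketch builds the delayed-input training pairs and trains a small one-hidden-layer feedforward network. It only assumes NumPy and deliberately replaces the Marquardt-Levenberg variant used in the paper with plain gradient descent for brevity, so it is a sketch rather than the authors' implementation.

```python
# Minimal sketch of the TDL predictor (assumptions: NumPy, one hidden layer,
# ordinary backpropagation instead of Marquardt-Levenberg).
import numpy as np

def make_tdl_dataset(x, p):
    """Build (input, target) pairs: input = [x(n-1), ..., x(n-p)], target = x(n)."""
    X = np.array([x[n - p:n][::-1] for n in range(p, len(x))])
    y = np.array(x[p:])
    return X, y

class TDLPredictor:
    def __init__(self, p, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(p, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.1, size=hidden)
        self.b2 = 0.0

    def forward(self, X):
        self.H = np.tanh(X @ self.W1 + self.b1)   # hidden-layer activations
        return self.H @ self.W2 + self.b2         # linear output = estimate of x(n)

    def train(self, X, y, epochs=2000, lr=1e-2):
        for _ in range(epochs):
            out = self.forward(X)
            err = out - y                          # least-squares error
            # Backpropagate the mean-squared-error gradient.
            gW2 = self.H.T @ err / len(y)
            gb2 = err.mean()
            dH = np.outer(err, self.W2) * (1.0 - self.H ** 2)
            gW1 = X.T @ dH / len(y)
            gb1 = dH.mean(axis=0)
            self.W2 -= lr * gW2; self.b2 -= lr * gb2
            self.W1 -= lr * gW1; self.b1 -= lr * gb1
```

In the paper's setting, one such predictor would be trained separately for each of the I, P and B streams, with the order set to the corresponding $p_c$ and 10% of the data held out for validation.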
4. NETWORK RESOURCES ESTIMATION
Once the network has been trained to approximate the unknown function $g(\cdot)$, it is then used as a video generator for estimating the network resources, such as the buffer size, the corresponding delay and the cell or frame losses. In this case, however, the actual video traffic is not available and therefore the $p$ actual samples cannot be used as inputs to the network. Instead, the estimated data are fed back as input to the network, resulting in the following equation,

$$\hat{x}(n) = \hat{g}\big(\hat{x}(n-1), \hat{x}(n-2), \ldots, \hat{x}(n-p)\big) + r(n) \qquad (3)$$

where $r(n)$ is an i.i.d. process corresponding to the estimation error (noise) and presents the same statistics as $e(n)$ in equation (1). This means that, once the training procedure has been completed, the network operates in a recursive autonomous mode (closed-loop operation), generating a sequence of frame rates for each of the I, P and B streams, based on $p$ initial samples. Figure 1 presents the block diagram of the proposed neural network architecture.
The pdf of the error process $\{r(n)\}$ is estimated using the NAR filter relationship over all samples of the training set. In particular, experiments on several VBR MPEG-2 coded video sequences show that a Gaussian distribution provides an accurate approximation of the error pdf. Thus, two parameters are required to determine the pdf of $\{r(n)\}$, namely the mean value $\mu$ and the variance $b^2$. In particular, it holds that

$$\mu = E\{x(n)\} - E\{\hat{x}(n)\} \qquad (4a)$$
$$b^2 = E\{x(n)^2\} - E\{x(n)\hat{x}(n)\} \qquad (4b)$$

where $E\{\cdot\}$ denotes the expectation operator, while estimates of $E\{x(n)\}$, $E\{\hat{x}(n)\}$, $E\{x(n)^2\}$ and $E\{x(n)\hat{x}(n)\}$ are obtained from the actual and the predicted data over all samples of the training set. For convenience, in the following analysis, we normalize the error $r(n)$ so that it has zero mean and unit variance, i.e., $\varepsilon(n) = (r(n) - \mu)/b$.
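The closed-loop operation of equations (3)-(4) can be sketched as follows. The snippet reuses the hypothetical TDLPredictor from the previous sketch, estimates $\mu$ and $b$ on the training set, and adds an i.i.d. Gaussian error term to every generated sample; it is an illustration under those assumptions, not the paper's code.

```python
# Sketch of the autonomous (closed-loop) generator of equation (3),
# reusing the hypothetical TDLPredictor above.
import numpy as np

def error_statistics(x_true, x_hat):
    """Mean and standard deviation of the prediction error, eqs. (4a)-(4b)."""
    mu = np.mean(x_true) - np.mean(x_hat)
    b2 = np.mean(np.asarray(x_true) ** 2) - np.mean(np.asarray(x_true) * np.asarray(x_hat))
    return mu, np.sqrt(b2)          # b2 follows eq. (4b) as stated in the paper

def generate_stream(model, x_init, n_samples, mu, b, seed=0):
    """Generate frame rates recursively from p initial samples, adding Gaussian r(n)."""
    rng = np.random.default_rng(seed)
    buf = list(x_init)                                   # the p initial samples
    out = []
    for _ in range(n_samples):
        tdl = np.array(buf[-len(x_init):][::-1])         # [x̂(n-1), ..., x̂(n-p)]
        r = mu + b * rng.standard_normal()               # i.i.d. Gaussian error r(n)
        x_next = float(model.forward(tdl[None, :])[0]) + r
        out.append(x_next)
        buf.append(x_next)
    return np.array(out)
```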
5. MPEG-2 VIDEO SOURCE CONSTRUCTION
The sequences $\{\hat{x}^c(n)\}$ generated by the network model are deterministically merged, according to the L and M values of the GOP pattern, to form the aggregate VBR video sequence. If uncorrelated prediction errors $r^c(n)$, or equivalently $\varepsilon^c(n)$, are used as filter inputs, the aggregate MPEG-2 sequence will contain uncorrelated I, P and B components. In real cases, however, the I, P and B frames are correlated, as discussed in Section 2. Models that do not satisfy the correlation property severely underestimate the network resources. Since the prediction errors follow the same pdf (i.e., the Gaussian), one way to correlate $\hat{x}^c(n)$ for different $c$ is to correlate their respective errors. For this reason, a reference error is generated, following the Gaussian pdf with zero mean and unit variance. A simple approach is to consider $\varepsilon^B(n)$ as the reference error and to generate the prediction errors of the I and P frames from portions of the error of the B frames. In particular, the normalized error of the I frames, $\varepsilon^I(n)$, is related to $\varepsilon^B(n)$ through the following equation

$$\varepsilon^I(n) = \varepsilon^B(N_B \, n + 1) = \varepsilon^{B_2}(n) \qquad (5)$$

where $N_B$ indicates the number of B frames within a GOP period and is equal to 8 in our case (L=12, M=3). The $\varepsilon^{B_2}(n)$ corresponds to the prediction error of the second B frame within a GOP; this frame is also denoted as B2 in the following.
With a similar procedure, the error $\varepsilon^P(n)$ is created. In particular, let us denote as $N_P$ the number of P frames within a GOP. In our case, where L=12 and M=3, $N_P = 3$. Then, $\varepsilon^P(n)$ is split into $N_P$ error sequences, denoted by $\varepsilon^{P_i}(n)$, $i = 1, 2, \ldots, N_P$, each of which corresponds to the error of the $i$th P frame, say $P_i$, within a GOP. Each $\varepsilon^{P_i}(n)$ is related to the reference error $\varepsilon^B(n)$ as follows

$$\varepsilon^{P_i}(n) = \varepsilon^B(N_B \, n + \alpha) \qquad (6)$$

with $\alpha = 1$ for $i=1$, $\alpha = 3$ for $i=2$ and $\alpha = 5$ for $i=3$, where, without loss of generality, equation (6) has been expressed for the case of $N_P = 3$. Therefore, the P1, P2 and P3 frames of a GOP are generated using the same prediction errors as the B2, B4 and B6 frames, providing the correlation of the P and B frames. Correlation between the errors of the P and I frames is indirectly achieved through equations (5) and (6). The proposed correlation mechanism is illustrated in Figure 2. In this figure, the symbol $\downarrow 8$ denotes a downsampler by 8 samples and is used to illustrate equations (5) and (6) in a simple manner.
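A compact sketch of the correlation and merging steps of equations (5)-(6), under the GOP setting L=12 and M=3, is given below; array layouts and function names are illustrative rather than taken from the paper.

```python
# Sketch of the correlation mechanism of eqs. (5)-(6) and the deterministic
# merging of the three streams into the GOP pattern IBBPBBPBBPBB (L=12, M=3).
import numpy as np

N_B = 8   # B frames per GOP when L = 12 and M = 3

def correlated_errors(eps_B):
    """Derive eps_I and eps_P1..P3 from the reference B-frame error sequence."""
    n_gops = len(eps_B) // N_B
    eps_I = np.array([eps_B[N_B * n + 1] for n in range(n_gops)])      # eq. (5)
    eps_P = {i: np.array([eps_B[N_B * n + a] for n in range(n_gops)])
             for i, a in ((1, 1), (2, 3), (3, 5))}                      # eq. (6)
    return eps_I, eps_P

def merge_gop(x_I, x_P, x_B):
    """Interleave per-GOP I, P and B rates into the aggregate IBBPBBPBBPBB order."""
    aggregate = []
    for g in range(len(x_I)):
        b = iter(x_B[g * N_B:(g + 1) * N_B])
        p = iter(x_P[g * 3:(g + 1) * 3])
        for t in "IBBPBBPBBPBB":
            aggregate.append(x_I[g] if t == "I" else next(p) if t == "P" else next(b))
    return np.array(aggregate)
```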
6. IMPROVEMENT OF MODELING ACCURACY
In a VBR MPEG-2 transmission mode, the rates of the I, P and B frames present fluctuations, as observed in Section 2. This is particularly evident for the Inter frames, due to the motion compensation algorithm. However, the learning algorithm used to train the neural network is in fact based on least squares fitting. Consequently, if the network were forced to track the abrupt changes of the frame rates in the training set with high accuracy, it would generalize poorly to data outside the training set (network overtraining). To increase the model performance, especially at high frame rates, the following scheme is adopted. Let us assume that the neural network training has been completed and thus the estimates $\hat{x}^c(n)$ of $x^c(n)$ can be obtained. Let us also form a set containing the time indices of the high rates of the estimated I, P or B frames, i.e.,

$$I^c = \{n \in \{p+1, p+2, \ldots, K+p\} : \hat{x}^c(n) > Q^c\} \qquad (7)$$

where $Q^c$ is an appropriate threshold, while the time indices $\{p+1, p+2, \ldots, K+p\}$ correspond to the time instances of the initial training set. The threshold $Q^c$ expresses the level above which the rates of the I, P or B frames are considered high. Using the index set $I^c$, the sequence of estimated values $\{\hat{x}^c(j)\}_{j \in I^c}$ is mapped to the actual values $\{x^c(j)\}_{j \in I^c}$ through a non-linear function. A Generalized Regression Neural Network (GRNN) is used for this function approximation.
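Since a GRNN is essentially a Nadaraya-Watson kernel regressor, the high-rate correction can be sketched as follows; the threshold Q and the kernel spread are illustrative choices, not values taken from the paper.

```python
# Sketch of the high-rate correction of Section 6: estimated rates above a
# threshold Q are re-mapped to the corresponding actual rates with a GRNN
# (Gaussian-kernel Nadaraya-Watson regression).
import numpy as np

class GRNN:
    def __init__(self, sigma=1.0):
        self.sigma = sigma

    def fit(self, x_train, y_train):
        self.x_train = np.asarray(x_train, dtype=float)
        self.y_train = np.asarray(y_train, dtype=float)
        return self

    def predict(self, x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        d2 = (x[:, None] - self.x_train[None, :]) ** 2
        w = np.exp(-d2 / (2.0 * self.sigma ** 2))        # Gaussian kernel weights
        return (w @ self.y_train) / np.maximum(w.sum(axis=1), 1e-12)

def high_rate_correction(x_hat, x_true, Q, sigma=1.0):
    """Fit the GRNN on the high-rate index set I^c of eq. (7); return a corrector."""
    idx = np.where(np.asarray(x_hat) > Q)[0]             # index set I^c
    grnn = GRNN(sigma).fit(np.asarray(x_hat)[idx], np.asarray(x_true)[idx])
    def correct(x_gen):
        # Apply the correction only to generated samples exceeding the threshold.
        x_gen = np.asarray(x_gen, dtype=float).copy()
        mask = x_gen > Q
        if mask.any():
            x_gen[mask] = grnn.predict(x_gen[mask])
        return x_gen
    return correct
```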
7. EXPERIMENTAL RESULTS
20% of the data of Source1 and Source2 have been used to train the networks. Once training is completed, the generalization performance is evaluated using Source3, which is different from the sources used during the learning phase. A common buffer is used to smooth out the VBR rates of all sources. A FIFO (first-in first-out) policy is considered for the statistical multiplexing, i.e., frames/cells are stored and leave the buffer in the same order as they enter it. The starting times of the sources are uniformly distributed in the first 40 ms interval of the simulation time (40 ms is the frame period).
Figure 3 presents the frame loss probability obtained from the proposed traffic model (dotted line) versus the buffer size, along with that obtained by varying the rate of the real data of Source3 by +/-1%, for 20 multiplexed video sources. In Figure 3, we observe that the traffic model provides a good approximation of the frame loss probability at utilizations of 75%, 80% and 85%, respectively. In the same figure, we have also depicted the results obtained using linear models which merge the I, P and B frames independently (linear case). A significant underestimate of the network resources is noticed in this case. Figure 3 also depicts the loss probability obtained by the model without the algorithm described in Section 6; this is denoted as "model (No GRNN)" in the figure. In this case, a slight underestimate of the frame losses is observed, mainly due to a slight underestimate of the high frame rates.
Figure 4(a) presents the frame losses versus the buffer size for three different numbers of multiplexed sources (15, 20 and 25) at a utilization of 80%, using the data of Source3, along with the results obtained from the proposed model. We observe that the loss rates decay more rapidly as the number of multiplexed sources increases. Furthermore, the model provides a very good approximation of the frame losses in all cases. Similar results are presented in Figures 4(b,c), where the delay is plotted versus the number of multiplexed sources (Figure 4(b)) and the utilization degree (Figure 4(c)).
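The multiplexing experiment can be reproduced in outline with the following sketch, which aggregates the per-frame rates of several sources with uniformly shifted starting times and pushes the aggregate through a discrete-time FIFO buffer drained at a constant rate fixed by the target utilization; units and parameter names are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of the statistical-multiplexing experiment of Section 7, assuming a
# discrete-time fluid model of the common FIFO buffer.
import numpy as np

def multiplex(sources, seed=0):
    """Aggregate per-frame rates of several sources with random phase offsets."""
    rng = np.random.default_rng(seed)
    n = min(len(s) for s in sources)
    agg = np.zeros(n)
    for s in sources:
        shift = rng.integers(0, len(s))               # uniform starting time
        agg += np.roll(np.asarray(s, dtype=float), shift)[:n]
    return agg

def loss_probability(agg_rates, utilization, buffer_size):
    """Fraction of data lost in a FIFO buffer with constant output rate."""
    service = agg_rates.mean() / utilization          # output rate per frame slot
    backlog, lost = 0.0, 0.0
    for r in agg_rates:
        backlog = max(backlog + r - service, 0.0)     # Lindley recursion
        if backlog > buffer_size:
            lost += backlog - buffer_size             # overflowed data are dropped
            backlog = buffer_size
    return lost / agg_rates.sum()
```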
8. REFERENCES
[1] B. Maglaris, D. Anastassiou, P. Sen, G. Karlsson and J. D. Robbins, "Performance Models of Statistical Multiplexing in Packet Video Communication," IEEE Trans. on Communications, Vol. 36, pp. 834-843, 1988.
[2] T. Sikora, "MPEG Digital Video Coding Standards," in Digital Consumer Electronics Handbook, McGraw-Hill.
[3] N. Doulamis, A. Doulamis, D. Kalogeras and S. Kollias, "Low Bit Rate Coding of Image Sequences Using Adaptive Regions of Interest," IEEE Trans. on CSVT, Vol. 8, No. 8, pp. 928-934, December 1998.
[4] N. Doulamis et al., "Efficient Summarization of Stereoscopic Video Sequences," IEEE Trans. on CSVT, to appear in June 2000.
[5] D. Heyman, A. Tabatabai and T. V. Lakshman, "Statistical Analysis and Simulation Study of Video Teleconference Traffic in ATM Networks," IEEE Trans. on CSVT, Vol. 2, pp. 49-59, 1992.
[6] M. R. Frater, J. F. Arnold and P. Tan, "A New Statistical Model for Traffic Generated by VBR Coders for Television on the Broadband ISDN," IEEE Trans. on CSVT, Vol. 4, pp. 521-526, 1994.
[7] N. Doulamis, A. Doulamis, G. Konstantoulakis and G. Stassinopoulos, "Efficient Modeling of VBR MPEG-1 Video Sources," IEEE Trans. on CSVT, to appear in early 2000.
[8] J. Connor, D. Martin and L. Atlas, "Recurrent Neural Networks and Robust Time Series Prediction," IEEE Trans. on Neural Networks, Vol. 5, No. 2, pp. 240-254, 1994.
[9] S. Kollias and D. Anastassiou, "An Adaptive Least Squares Algorithm for the Efficient Training of Artificial Neural Networks," IEEE Trans. on Circuits and Systems, Vol. 36, pp. 1092-1101, 1989.
[Figure 1. The proposed neural network structure used for VBR MPEG-2 modeling: a feedforward neural network fed by a p-element TDL of delayed samples; in autonomous mode the output x̂(n) is fed back through the delay line and the error r(n) is added at the output.]

[Figure 2. Graphical representation of the correlation mechanism of the I, P and B frame streams: the reference error εB(n) is downsampled by 8 with lag shifts of 1, 3 and 5 to form εI(n) and εP(n), which drive the autonomous-mode generators of the I-, P- and B-frame streams.]

[Figure 3. Cell/frame loss probability versus buffer size [ms] using the data of Source3, along with that provided by the proposed model, the model without the GRNN improvement and the linear case. In all cases, 20 sources are multiplexed to the system. (a) U=75%. (b) U=80%. (c) U=85%.]

[Figure 4. The performance of the proposed model. (a) Frame loss probability versus buffer size [ms] for different numbers of multiplexed sources (N=15, 20, 25) at U=80%. (b) Delay [ms] versus number of multiplexed sources, along with the proposed model, for different frame losses at U=75%. (c) Delay [ms] versus utilization [%] for different numbers of multiplexed sources, along with the proposed model in the case of 10^-5 frame losses.]