Packet Loss Rate Prediction Using the Sparse Basis Prediction Model

Amir F. Atiya, Sung Goo Yoo, Kil To Chong, and Hyongsuk Kim
Abstract—The quality of multimedia communicated through the Internet is highly sensitive to packet loss. In this letter, we develop a time-series prediction model for the end-to-end packet loss rate (PLR). An estimate of the PLR is needed in several transmission control mechanisms, such as the TCP-friendly congestion control mechanism for UDP traffic. It is also needed to estimate the amount of redundancy for the forward error correction (FEC) mechanism. An accurate prediction would therefore be very valuable. We use a relatively novel prediction model called the sparse basis prediction model. It is an adaptive nonlinear prediction approach, whereby a very large dictionary of possible inputs is extracted from the time series (for example, through moving averages, nonlinear transformations, etc.). Only a few of the best inputs in the dictionary are selected and combined linearly. An algorithm adaptively updates the input selection (as well as the weights) in a computationally efficient way each time a new time sample arrives. Simulation experiments indicate significantly better prediction performance for the sparse basis approach compared with other traditional nonlinear approaches.

Index Terms—Adaptive nonlinear prediction, packet loss, packet loss prediction, sparse basis.

Manuscript received May 16, 2006; revised November 17, 2006; accepted January 5, 2007. This work was supported by the Ministry of Information and Communication (MIC), Republic of Korea, under the IT Foreign Specialist Inviting Program (ITFSIP) supervised by the Institute of Information Technology Assessment (IITA).

A. F. Atiya is with the Department of Computer Engineering, Faculty of Engineering, Cairo University, Giza 12211, Egypt (e-mail: [email protected]).

S. G. Yoo, K. T. Chong, and H. Kim are with the School of Electronics and Information Engineering, Chonbuk National University, Duckjin-Dong, Duckjin-Gu, Jeonju 561-756, South Korea (e-mail: [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TNN.2007.891681
I. INTRODUCTION

Packet loss in the Internet can severely degrade the quality of delay-sensitive multimedia applications. It usually occurs during periods of heavy congestion, when packets have to be discarded because the buffers of the intermediate routers become full. Therefore, any mechanism that lessens the effect of packet loss or reduces the packet loss rate (PLR) would be a significant achievement.

The motivation for predicting the PLR arises from several considerations. Generally, real-time applications use UDP-based transmission, as the TCP protocol is based on a complex retransmission algorithm that is not suitable for the real-time nature of such applications. The problem, however, is that if congestion occurs on a link carrying both TCP and UDP traffic, TCP will react by reducing its traffic rate, while UDP will not. This aggravates the congestion and also violates fairness. It has, therefore, been recommended that UDP traffic use traffic reduction mechanisms similar to TCP's, and several TCP-friendly rate control mechanisms have been developed (see [18]). The rate adjustment is based on two key quantities: the PLR and the round-trip time (RTT). Rather than using previously measured values of PLR and RTT, as is commonplace, a better approach is to use predictions of these quantities. Such a predictive approach tracks congestion conditions more quickly than the typically used reactive approach. The other motivation for predicting PLR is that, for real-time multimedia traffic transmitted using UDP, a common procedure to recover lost packets is to add redundancy using forward error correction (FEC). FEC packets are extra packets that can be used to reconstruct lost packets. They represent a bandwidth overhead, however, and it is, therefore, imperative to send only as many of them as needed. An accurate prediction of PLR is considerably useful in that regard.

Work on predicting PLR has been very scarce in the literature. There are some analytical approaches that model packet loss, but they focus on the level of a single link; estimating end-to-end PLR analytically is intractable. There are some empirical studies that investigate the relationship between PLR and quantities such as RTT. Moon [9] and Paxson [12] observed a causal effect of RTT on PLR: as anticipated, a rising RTT is indicative of congestion buildup and, hence, of possible impending packet loss. Su et al. [16] derived the packet loss probability, conditioned on past loss rates, assuming the Gilbert model, which is a simple two-state Markov model. Mehrvar and Soleymani [8] developed a neural network model to predict PLR at the level of the queue for general non-Poisson traffic; they extracted certain traffic descriptors from traffic statistics and used these as the inputs to the neural network. Salvo Rossi et al. [15] modeled end-to-end PLR for UDP traffic using a hidden Markov model. Roychoudhuri and Al-Shaer [14] developed an empirically determined formula that predicts end-to-end PLR as a function of available bandwidth, delay variation, and trend. We have found no paper in the literature that considers a time-series approach to predicting PLR. In [19] and [20], we considered a time-series prediction approach for predicting end-to-end PLR and RTT, specifically using a neural network prediction model. In this letter, we develop a new prediction model for end-to-end PLR using a novel prediction approach called the sparse basis prediction method, developed by Atiya et al. [2].
There is a general acknowledgment of the need for more data-driven approaches to many communication network problems (see, for example, [13]), and this letter is a contribution in that direction. The proposed approach is based on the premise of having a very large dictionary of inputs extracted from the time series, for example, through various moving averages, nonlinear transformations, and so on. An algorithm has been developed to adaptively select very few inputs out of the dictionary, as well as the
weights that combine these inputs linearly. It is an adaptive method, in which the input selection and the weights are updated online through a recursive algorithm. The adaptive approach is desirable for the PLR prediction problem in particular for the following reason. Every source/destination pair has its own time-series characteristics (which may also vary, for instance, with the time of day), and it is impractical to run lengthy offline training sessions supervised by the designer for every such pair.

This letter is organized as follows. Section II describes the sparse basis prediction approach. Section III presents the PLR prediction results, followed by a conclusion in Section IV.

II. OVERVIEW OF THE SPARSE BASIS PREDICTION MODEL

A. Introduction

The inspiration for the proposed prediction approach comes from a topic in the area of data compression called sparse basis selection, which has been a very active research topic over the last ten years. Instead of the traditional approach of representing signals as superpositions of sinusoids, a large dictionary of possible basis functions is considered, such as Gaussian functions, Gabor functions, wavelets, etc. From this large dictionary, only a few of the basis elements that most accurately represent the given signal (or signals) are selected. The selection is performed by algorithms (such as the forward greedy approach, the backward greedy approach, or the forward-backward approach), whereby elements are exchanged between the selected set and the dictionary in an attempt to minimize the error. The literature on sparse basis representation is vast; see, for instance, [1], [4], [6], [10], and [17].

Atiya et al. [2] proposed the sparse basis prediction model. It is an adaptation of the sparse basis selection methodology to a time-series prediction setting. (It also shares some aspects with Pao's functional link networks [11].) The typical approach for nonlinear time-series prediction has been to use a sophisticated nonlinear model such as a neural network or a support vector machine (SVM). A small number of inputs is usually selected, often by trial and error, in order to limit overfitting. In the approach of Atiya et al., the nonlinearities are deferred to the input stage and only a linear model is used. This means that a large dictionary of original inputs, and many nonlinear transformations thereof, is used. For example, if the original inputs are $u_i$, $i = 1, \ldots, I$, then nonlinear transformations of each of these inputs can also be included in the dictionary, such as
$$e^{au_i}, \qquad \sqrt{bu_i + c}, \qquad \frac{1}{1 + e^{-du_i}}, \qquad i = 1, \ldots, I \qquad (1)$$
where $a$, $b$, $c$, and $d$ are constants. From among the dictionary elements, only a few of the best inputs are selected. In [2], an efficient adaptive backward-forward algorithm is proposed that recursively updates the basis selection and the combination weights (of the linear model) as new data points arrive.

This approach has a number of distinctive advantages. In the traditional approach, the nonlinear model (such as a neural network, an SVM, etc.) is usually hard to train. Vulnerability to local minima (in the case of neural networks) or slow convergence (for example, for SVMs) necessitates some care and experimentation from the designer. It is, therefore, difficult to use these technologies unsupervised in an adaptive setting, as many things can go wrong (such as getting stuck in a local minimum, or too slow or too fast adaptation). The sparse basis prediction model does not suffer from these problems, so it is well suited to adaptive prediction. It is also very fast in training and in prediction (recall); the model output is linear in the weights, and there is no local minima problem. On the other hand, even though greedy subset (input) selection procedures, such as the forward greedy approach or the backward greedy approach, are not guaranteed to find a globally optimal choice, most simulation studies have established that they produce results close to the optimum (see the experimental comparison of Jain and Zongker [5]). This is even more the case for linear regression (as opposed to a linear classifier), as we have found in a simple experimental study.

B. Sparse Basis Prediction Method
Consider that we have a dictionary of $K$ inputs. Let $x^T(i) \in \mathbb{R}^{1 \times K}$ and $y(i) \in \mathbb{R}$ represent, respectively, the training set input vectors and the target outputs (the values to be forecasted), where $i$ indexes time. Let us arrange the input vectors as the rows of a matrix $X \in \mathbb{R}^{N \times K}$ ($N$ is the number of training set vectors), and let us arrange the target outputs $y(i)$ in a column vector $y$. As mentioned previously, the sparse basis time-series prediction method is based on selecting a small number of inputs from among the pool of available inputs. This selection is adaptive in the sense that it is updated as new data points arrive. Let $S \subset \{1, 2, \ldots, K\}$ denote the set of selected inputs and let $\bar{S} = \{1, 2, \ldots, K\} - S$ denote the remaining inputs that were not selected. Let $X_S$ denote the matrix constructed from $X$ by selecting the columns indexed by the set $S$. The problem can be posed as follows: find the weight vector $w$ and the set $S$ of size $J \ll K$ that minimize

$$E = \| X_S w - y \|^2 \qquad (2)$$
where the solution is to be updated recursively at every time step as new measurements arrive. The main idea of the algorithm is a forward-backward update, whereby one alternates between removing the worst input from the selected set $S$ and adding the best input to it. Specifically, the algorithm can be summarized in the following steps.
1) Start at time $n = n_0$. Initialize by randomly selecting any set $S$ of $J$ inputs. Compute the weight vector $w$ and some accompanying matrices that are needed for the solution.
2) Remove the element from $S$ whose removal increases the resulting error by the least amount. Do this by computing the new sum of squared errors $E$ recursively with the help of some matrix update identities. (By "new" we mean after removing one input from $S$.)
3) Add the element of $\bar{S}$ that reduces the resulting error by the largest amount. Again, this is done by computing $E$ (after each candidate input addition) recursively from the original value of $E$.
4) Repeat the last two steps until the desired convergence is achieved.
5) Now $S$ is fixed and the vector $w$ is obtained. A prediction is made for time step $n + 1$.
6) Increment the time $n$ by 1. A new measurement [i.e., $x(n)$ and $y(n)$] has arrived. Update all matrices recursively to reflect this new data point, and repeat the backward-forward steps 2)-4).
7) Keep moving forward in time by applying steps 2)-6) until the end of the time series is reached.
The detailed algorithm is presented in the Appendix. For a derivation of the algorithm, refer to [2].

III. PACKET LOSS PREDICTION RESULTS

A. Data Collection Setup

We designed a setup to collect data by installing a transmission processor at Chonbuk National University, Jeonbuk, South Korea, and a receiving-retransmission processor at Seoul National University, Seoul, South Korea. The transmission processor transmitted packets using the TCP-friendly mechanism (we used the transmission rate derivations from [7] as guidelines). The packet size is 625 B, including 64 B reserved for the probe header. The probe header
keeps track of the transmission time and the order of each packet. This information is used to estimate the PLR and the round-trip time (RTT). We used 2 s as the basic time unit, so we created time series of PLR and RTT values, each measured over every 2-s interval. The RTT value within each interval is the average of the individual packets' RTTs. (The RTT is used as an input in one of the developed prediction models.)
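As an illustration of this measurement step, the following sketch shows one way the per-interval PLR and average-RTT series could be computed from per-packet probe records. It is our own illustrative code, not the collection software used in the experiments; the record fields (send_time, seq_no, received, rtt), the function name build_series, and the assumption that the records are sorted by transmission time are all hypothetical.

from collections import namedtuple

# One record per transmitted probe packet (field names are illustrative).
Probe = namedtuple("Probe", ["send_time", "seq_no", "received", "rtt"])

def build_series(probes, interval=2.0):
    """Return per-interval PLR (in percent) and average RTT lists.

    Assumes `probes` is sorted by send_time and that `rtt` is valid
    whenever `received` is True.
    """
    if not probes:
        return [], []
    t0 = probes[0].send_time
    n_bins = int((probes[-1].send_time - t0) // interval) + 1
    sent, lost = [0] * n_bins, [0] * n_bins
    rtt_sum, rtt_cnt = [0.0] * n_bins, [0] * n_bins
    for p in probes:
        k = int((p.send_time - t0) // interval)
        sent[k] += 1
        if p.received:
            rtt_sum[k] += p.rtt
            rtt_cnt[k] += 1
        else:
            lost[k] += 1
    plr = [100.0 * lost[k] / sent[k] if sent[k] else 0.0 for k in range(n_bins)]
    rtt = [rtt_sum[k] / rtt_cnt[k] if rtt_cnt[k] else float("nan") for k in range(n_bins)]
    return plr, rtt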
TABLE I
OUT-OF-SAMPLE PREDICTION ERROR OF PROPOSED MODEL AND THE COMPETING MODELS
B. Experimental Setup

We created a time series of 13 158 data points (corresponding to several data transmission and collection sessions). We used the first 2000 data points as the training set and the remaining data as an out-of-sample test set. The training set is used to adjust some parameters of the algorithm, such as the adaptation rate (or forgetting factor) $\lambda$ and the number of selected basis elements $J$. Based on these runs, the parameters were selected as $\lambda = 0.99$ and $J = 5$. The number of iterations (NITER) of the backward-forward update was fixed at NITER = 30, as this number was generally found adequate for convergence in [2]. We developed the following two PLR prediction models: one using inputs extracted only from the PLR time series and the other using inputs extracted from both the PLR and the RTT time series.

To assess the comparative performance of the proposed method, we compared it with the typical prediction approach, whereby the prediction of the next data point is a nonlinear function of the previous data points in some moving window (of, for example, size $W$), i.e.,
$$\hat{y}_n = f(y_{n-1}, y_{n-2}, \ldots, y_{n-W}) \qquad (3)$$
where $y_n$ is the actual time-series value and $\hat{y}_n$ is the predicted value. We used the following three models to obtain the prediction function $f$: an artificial neural network (ANN) prediction model, an SVM regression approach, and a $K$-nearest neighbor (KNN) regression model. The training set for the three models is the same as that of the proposed method. To obtain the parameters of the competing models, such as the window size $W$, the number of hidden nodes for the ANN model, the parameter $\epsilon$ (of the $\epsilon$-insensitive tube) for SVM regression, and the number of neighbors $K$ for the KNN model, we used fivefold validation on the training set. Other parameters, such as the learning rate for the ANN model, were obtained by experimenting with a few choices on the training set. For the SVM, we used a Gaussian kernel. Any other parameters were set to the default values of the MATLAB toolbox we used (developed by Canu et al. [3]). The error measure used is the commonly used normalized mean square error (nmse), defined as
$$\text{nmse} = \frac{\sum_{n=n_0}^{N} (y_n - \hat{y}_n)^2}{\sum_{n=n_0}^{N} y_n^2} \qquad (4)$$
where $N$ is the length of the test set and $n_0$ is the start of the test set period.

C. Input Dictionaries

We created the following pool of inputs (let $n$ be the current time):
I1) the latest sample PLR(n);
I2) an input detecting large and small values of PLR (it equals 1 if PLR is larger than the mean by a constant, -1 if it is smaller than the mean by a constant, and 0 otherwise);
I3) the previous sample PLR(n-1);
I4) the sample PLR(n-2);
I5) the first backward difference b(n) = PLR(n) - PLR(n-1);
I6) a spike-detecting input: gives 1 (-1) for a big positive (negative) jump in PLR [when bigger in magnitude than mean(|PLR(n) - PLR(n-1)|)];
I7) PLR(n-1) - PLR(n-2);
I8) PLR(n-2) - PLR(n-3);
I9) the difference between two moving averages, one using a small window of size 1 and the other using a large window of size 3;
I10) the log of input I1), appropriately scaled;
I11)-I15) the difference between two moving averages of input I10) (i.e., the log time series), one using a small window of size 2, 4, 6, 8, and 10, and the other using a large window of size 6, 12, 18, 24, and 30, respectively;
I16)-I20) the exponential of inputs I11)-I15), appropriately scaled;
I21) the exponential of input I1), appropriately scaled;
I22)-I26) the difference between two moving averages of input I21) (i.e., the exponential of the time series), one using a small window of size 2, 4, 6, 8, and 10, and the other using a large window of size 6, 12, 18, 24, and 30, respectively;
I27)-I31) the log of inputs I22)-I26), appropriately scaled/shifted;
I32) the second backward difference b2(n) = PLR(n) - 2 PLR(n-1) + PLR(n-2);
I33) a smoothed estimate of the second derivative c(n): define a1(n) = MA2 - MA6 and a2(n) = MA4 - MA12; then c(n) = a1(n) - a2(n), where MAi is the moving average with window size i;
I34) the standard deviation of the past ten PLR values;
I35) the standard deviation of the past 20 PLR values;
I36) the absolute value of input I5);
I37)-I41) the exponential of inputs I32)-I36), appropriately scaled;
I42)-I46) the log of inputs I32)-I36), appropriately scaled/shifted;
I47)-I51) the sigmoid of inputs I32)-I36), appropriately scaled, where the sigmoid is the function 1/(1 + exp(-x)).
For the model that uses RTT inputs in addition to the PLR inputs, we created a similar dictionary applied to the RTT time series. For that model, the two dictionaries are combined, so the value of K (the dictionary size, i.e., the dimension of the x(i) vector) is 51 for the first model and 102 for the second. A code sketch illustrating how part of such a dictionary can be generated is given below.
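To make the dictionary construction concrete, the sketch below generates a few representative inputs (I1, I5, I9, I10, I11-I15, I32, I34, and one of the sigmoid inputs I47-I51) from a PLR series. It is our own illustration; the exact scaling and shifting constants for the "appropriately scaled" inputs are not specified in this letter, so the choices here (e.g., log(1 + x) for the log inputs and an unscaled sigmoid) are assumptions, and the function names moving_average and build_dictionary are hypothetical.

import numpy as np

def moving_average(x, w):
    """Trailing moving average of window w (uses only current and past samples)."""
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)
    for n in range(w - 1, len(x)):
        out[n] = x[n - w + 1 : n + 1].mean()
    return out

def build_dictionary(plr):
    """Return a dict of candidate inputs for predicting the next PLR value."""
    plr = np.asarray(plr, dtype=float)
    feats = {}
    feats["I1"] = plr                                            # latest sample PLR(n)
    feats["I5"] = np.r_[np.nan, np.diff(plr)]                    # PLR(n) - PLR(n-1)
    feats["I9"] = plr - moving_average(plr, 3)                   # MA(1) - MA(3)
    log_plr = np.log1p(plr)                                      # I10 (scaling assumed)
    feats["I10"] = log_plr
    for i, (small, large) in enumerate([(2, 6), (4, 12), (6, 18), (8, 24), (10, 30)]):
        feats[f"I{11 + i}"] = (moving_average(log_plr, small)
                               - moving_average(log_plr, large))  # I11)-I15)
    feats["I32"] = np.r_[np.nan, np.nan,
                         plr[2:] - 2 * plr[1:-1] + plr[:-2]]     # second backward difference
    feats["I34"] = np.r_[[np.nan] * 9,
                         [plr[n - 9 : n + 1].std() for n in range(9, len(plr))]]
    feats["I47"] = 1.0 / (1.0 + np.exp(-feats["I32"]))           # sigmoid of I32 (scaling assumed)
    return feats

Each generated array would form one column of the dictionary matrix X introduced in Section II-B, with only past and current samples used so that the inputs are available at prediction time.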
D. Prediction Results and Discussion

Table I summarizes the prediction results on the test period for the proposed methods as well as the competing methods (the test period covers samples 2001 to 13 158, the end of the time series). The notation SPARSE (PLR&RTT-INP) means the sparse basis method with PLR- and RTT-based inputs, while SPARSE (PLR-INP) means the same method with PLR-based inputs only. Fig. 1 depicts the predicted versus actual values for a portion of the test period.

Fig. 1. Prediction and actual time series for a portion of the out-of-sample period for the SPARSE PLR-INP model. The time axis unit is "sample number," while the PLR axis unit is percent lost packets.

One can see from the table that the proposed method outperforms all other methods. Not only the predictive performance, but also the speed is considerably better for the proposed method. We have not made a formal speed comparison, but it took around a day to train the ANN method, mainly because training had to be repeated five times in the fivefold validation setup. For the SVM, it took around two days, while, for the KNN, about an hour was needed. For the proposed method, it took around 10 min to tune the parameters $\lambda$ and $J$, and a few minutes to run the final model on the test data set. We also wish to mention that the proposed method is not very sensitive to variations in the tuning parameters. For any choice of $J$ between two and eight, the performance on the training set is not too different. Concerning $\lambda$, the range 0.99-0.999 generally gave the best results in most applications we tried. Overall, tuning the parameters is not time-consuming.

A surprising fact is that the model based only on the PLR time series slightly outperformed the one based on inputs from both the PLR and the RTT time series. This contradicts studies that indicate that RTT has a somewhat causal effect on PLR. Perhaps this effect is fast enough that the past few values of PLR already reflect a good portion of the RTT change.

As mentioned, the distinctive feature of the proposed approach is its adaptive nature. This can be useful when dealing with several source-destination combinations. Each route's traffic has its own characteristics, and it is advantageous to adapt the prediction model to these characteristics. Lengthy training sessions with the available prediction models, such as neural networks, would not be practical in this situation. Concerning the issue of model complexity versus training sample size, we note the following. The true complexity is determined by the number of selected inputs (i.e., $J$), which is usually chosen to be a small number; the size of the input pool is merely the "search space" (akin to the weight search space for neural networks). However, the method is recommended for applications that have thousands rather than hundreds of training data points; for smaller training sets, there is little point in using adaptation in any case.

IV. SUMMARY AND CONCLUSION

In this letter, we considered the problem of PLR prediction. An accurate prediction can improve the quality of real-time multimedia traffic,
as rate adjustment mechanisms could then be predictive rather than reactive. A relatively novel prediction model was used, based on dynamically selecting a few inputs from among a large dictionary of inputs. The proposed method outperformed all tested traditional approaches in terms of prediction accuracy. In addition, it is faster than the other methods. Its adaptive nature gives it robustness, as it can be left to train online by itself with little supervision by the designer. It is, therefore, worth investigating further the promise of this method for other forecasting problems.
APPENDIX I
STEPS OF THE ALGORITHM

The algorithm steps are given as follows; the detailed derivation can be found in [2].
1) Initialize: Start at time $n = n_0$. Choose the set $S$ randomly as any $J$ numbers in $\{1, \ldots, K\}$. Let $\bar{S} = \{1, \ldots, K\} - S$. Let $0 < \lambda < 1$ be the discounting factor (or forgetting factor).
2) Compute the following initial matrices:
$$C(n) = X^T R X \qquad (5)$$
$$d(n) = X^T R y \qquad (6)$$

where, as mentioned in Section II-B, $X$ is the matrix containing the training vectors and $y$ is the vector of target values, both up to time step $n$. The matrix $R$ is the $n \times n$ diagonal matrix whose $i$th diagonal element is $\lambda^{n-i}$. Define $C_{QZ}(n)$ as the submatrix of $C(n)$ consisting of the rows indexed by the set $Q$ and the columns indexed by the set $Z$, for any given sets $Q$ and $Z$ (they can also be scalars). Analogous definitions apply to the vectors $d_Q(n)$ and $x_Q(n)$. Define

$$D(n) = (C_{SS}(n))^{-1}. \qquad (7)$$
3) For $n = n_0 + 1$ to $N$, do the following.
a) Update the matrices to reflect the new measurements at $n$:

$$C(n) = \lambda C(n-1) + x(n)x^T(n) \qquad (8)$$
$$d(n) = \lambda d(n-1) + y(n)x(n) \qquad (9)$$
$$D(n) = \frac{1}{\lambda} D(n-1) - \frac{b(n)b^T(n)}{\lambda^2 + \lambda\, x_S^T(n) b(n)} \qquad (10)$$

where

$$b(n) = D(n-1)\, x_S(n). \qquad (11)$$
b) For a fixed number of cycles (NITER in Section III-B), repeat steps c) and d).
c) Backward recursion: Compute

$$\hat{k} = \arg\max_{k \in S} \frac{[D(n)\, d_S(n)]_k}{D_{kk}(n)}. \qquad (12)$$

Remove $\hat{k}$ from $S$: $S \leftarrow S - \{\hat{k}\}$ and $\bar{S} \leftarrow \bar{S} \cup \{\hat{k}\}$. Let us partition the matrix $D(n)$ as follows:

$$D(n) = \Pi \begin{bmatrix} H_{\hat{k}} & h_{\hat{k}} \\ h_{\hat{k}}^T & D_{\hat{k}\hat{k}}(n) \end{bmatrix} \Pi^T \qquad (13)$$

where $\Pi$ represents a permutation matrix that moves the $\hat{k}$th column and the $\hat{k}$th row to the last column and row, respectively, and $H_{\hat{k}}$ and $h_{\hat{k}}$ are of dimension $(J-1) \times (J-1)$ and $(J-1) \times 1$, respectively. Then, obtain the new $D(n)$ according to

$$D(n) \leftarrow H_{\hat{k}} - \frac{h_{\hat{k}}\, h_{\hat{k}}^T}{D_{\hat{k}\hat{k}}(n)}. \qquad (14)$$
d) Forward recursion: For all $k \in \bar{S}$, compute

$$A_{1k} = D(n)\, C_{Sk}(n) \qquad (15)$$
$$A_2 = D(n)\, d_S(n) \qquad (16)$$
$$g_k = C_{kk}(n) - C_{Sk}^T(n)\, A_{1k} \qquad (17)$$
$$k^* = \arg\max_{k} \frac{d_k(n) - C_{Sk}^T(n)\, A_2}{g_k}. \qquad (18)$$

Move $k^*$ from $\bar{S}$ to $S$: $\bar{S} \leftarrow \bar{S} - \{k^*\}$ and $S \leftarrow S \cup \{k^*\}$. Update $D(n)$ as follows:

$$D(n) \leftarrow \begin{bmatrix} D(n) + A_{1k^*} A_{1k^*}^T / g_{k^*} & -A_{1k^*} / g_{k^*} \\ -A_{1k^*}^T / g_{k^*} & 1 / g_{k^*} \end{bmatrix}. \qquad (19)$$
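To make the appendix concrete, the following sketch is our own illustrative implementation of the overall procedure, not the authors' code. For readability, it evaluates each candidate removal or addition directly through the discounted least-squares error E(S) = s - d_S^T (C_SS)^{-1} d_S (with s the discounted sum of squared targets), rather than through the rank-one identities (10)-(19), so it is equivalent in outcome but less efficient than the algorithm above. The function names, the warm-up length n0, the small ridge term, and the use of the first J columns (instead of a random set) as the initial selection are assumptions.

import numpy as np

def _sse_and_weights(C, d, s, S):
    """Discounted sum of squared errors and weights of the best fit on input set S."""
    S = list(S)
    A = C[np.ix_(S, S)] + 1e-9 * np.eye(len(S))   # tiny ridge for numerical safety (assumption)
    w = np.linalg.solve(A, d[S])
    return s - d[S] @ w, w

def sparse_basis_predict(X, y, J=5, lam=0.99, n0=50, n_iter=30):
    """One-step-ahead predictions with adaptive sparse basis selection.
    X is the (N, K) dictionary matrix; row x(n) must use only information
    available before y(n).  y is the (N,) target series."""
    N, K = X.shape
    S = list(range(J))                            # step 1: initial selection (random in [2])
    r = lam ** np.arange(n0 - 1, -1, -1)          # discount weights lam^(n-i)
    C = X[:n0].T @ (r[:, None] * X[:n0])          # C(n) = X^T R X, cf. (5)
    d = X[:n0].T @ (r * y[:n0])                   # d(n) = X^T R y, cf. (6)
    s = r @ (y[:n0] ** 2)                         # discounted sum of squared targets
    y_hat = np.full(N, np.nan)
    for n in range(n0, N):
        for _ in range(n_iter):                   # steps 2)-4): backward-forward exchange
            # backward step: drop the selected input whose removal increases the error least
            drop = min(S, key=lambda k: _sse_and_weights(C, d, s,
                                                         [j for j in S if j != k])[0])
            S.remove(drop)
            # forward step: add the currently unselected input that reduces the error most
            add = min((k for k in range(K) if k not in S),
                      key=lambda k: _sse_and_weights(C, d, s, S + [k])[0])
            S.append(add)
        _, w = _sse_and_weights(C, d, s, S)       # step 5: weights for the final set
        y_hat[n] = X[n, S] @ w                    # predict y(n) before observing it
        C = lam * C + np.outer(X[n], X[n])        # step 6: recursive updates, cf. (8), (9)
        d = lam * d + y[n] * X[n]
        s = lam * s + y[n] ** 2
    return y_hat

With the dictionaries of Section III-C, X would have K = 51 or 102 columns, and lam = 0.99, J = 5, and n_iter = 30 correspond to the settings reported in Section III-B.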
REFERENCES

[1] J. Adler, B. D. Rao, and K. Kreutz-Delgado, "Comparison of basis selection methods," in Proc. 30th Asilomar Conf. Signals Syst. Comput., Pacific Grove, CA, Nov. 1996, vol. 1, pp. 252-257.
[2] A. F. Atiya, M. Aly, and A. G. Parlos, "Sparse basis selection: New results and application to adaptive prediction of video source traffic," IEEE Trans. Neural Netw., vol. 16, no. 5, pp. 1136-1146, Sep. 2005.
[3] S. Canu, Y. Grandvalet, and A. Rakotomamonjy, "SVM and kernel methods Matlab toolbox," Perception Systèmes et Information, INSA de Rouen, Rouen, France, 2003 [Online]. Available: http://asi.insa-rouen.fr/arakotom/toolbox/
[4] G. Davis, S. Mallat, and M. Avellaneda, "Greedy adaptive approximation," J. Constructive Approx., vol. 13, pp. 57-98, 1997.
[5] A. Jain and D. Zongker, "Feature selection: Evaluation, application, and small sample performance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 2, pp. 153-158, Feb. 1997.
[6] K. Kreutz-Delgado, J. Murray, B. Rao, K. Engan, T.-W. Lee, and T. Sejnowski, "Dictionary learning algorithms for sparse representation," Neural Comput., vol. 15, no. 2, pp. 349-396, Feb. 2003.
[7] M. Mathis, J. Semke, J. Mahdavi, and T. Ott, "The macroscopic behavior of the TCP congestion avoidance algorithm," ACM Comput. Commun. Rev., pp. 67-82, 1997.
[8] H. Mehrvar and M. R. Soleymani, "Packet loss prediction using a universal indicator of traffic," in Proc. IEEE Int. Conf. Commun. (ICC), Helsinki, Finland, Jun. 11-14, 2001, vol. 3, pp. 647-653.
[9] S. B. Moon, "Measurement and analysis of end-to-end delay and loss in the Internet," Ph.D. dissertation, Dept. Comput. Sci., Univ. Massachusetts, Amherst, MA, 2000.
[10] B. K. Natarajan, "Sparse approximate solutions to linear systems," SIAM J. Comput., vol. 24, no. 2, pp. 227-234, Apr. 1995.
[11] Y.-H. Pao, Adaptive Pattern Recognition and Neural Networks. Reading, MA: Addison-Wesley, 1989.
[12] V. Paxson, "Measurements and analysis of end-to-end Internet dynamics," Ph.D. dissertation, Comput. Sci. Div., Univ. California, Berkeley, CA, 1997.
[13] A. G. Parlos, C. Ji, T. Parisini, M. Baglietto, A. F. Atiya, and K. Claffy, "Guest editorial: Introduction to the special issue on adaptive learning systems in communication networks," IEEE Trans. Neural Netw., vol. 16, no. 5, pp. 1013-1018, Sep. 2005.
[14] L. Roychoudhuri and E. Al-Shaer, "Real-time packet loss prediction based on end-to-end delay variation," IEEE Trans. Netw. Service Manage., vol. 2, no. 1, Nov. 2005.
[15] P. S. Rossi, G. Romano, F. Palmieri, and G. Iannello, "Joint end-to-end loss-delay hidden Markov model for periodic UDP traffic over the Internet," IEEE Trans. Signal Process., vol. 54, no. 2, pp. 530-541, Feb. 2006.
[16] Y.-C. Su, C.-S. Yang, and C.-W. Lee, "The analysis of packet loss prediction for Gilbert-model with loss rate uplink," Inf. Process. Lett., vol. 90, pp. 155-159, 2004.
[17] J. Tropp, "Topics in sparse approximation," Ph.D. dissertation, Dept. Comput. Appl. Math., Univ. Texas, Austin, TX, 2004.
[18] J. Widmer, R. Denda, and M. Mauve, "A survey on TCP-friendly congestion control," IEEE Network, vol. 15, no. 3, pp. 28-37, 2001.
[19] S. G. Yoo, K. T. Chong, and S. Y. Yi, "Neural network modeling of transmission rate control factor for multimedia transmission using the Internet," in Lecture Notes in Computer Science, vol. 3399. Berlin, Germany: Springer-Verlag, 2005, pp. 851-862.
[20] S. G. Yoo, K. T. Chong, and H. S. Kim, "Development of predictive TFRC with neural network," in Lecture Notes in Computer Science, vol. 3606. Berlin, Germany: Springer-Verlag, 2005, pp. 193-205.