NETWORK FRIENDLY INTERNET STREAMING VIDEO WITH VARIABLE FRAME RATE ENCODING AND INTERPOLATION

Hwangjun Song, Yon Jun Chung, Tienying Kuo, Jongwon Kim and C.-C. Jay Kuo

Integrated Media Systems Center and Department of Electrical Engineering-Systems
University of Southern California, Los Angeles, California 90089-2564

ABSTRACT

An Internet modem video transmission scheme based on the H.263+ recommendation is presented in this research. The proposed system continually adapts its bit stream size in response to changing network conditions. It consists of multiple components: congestion control and available channel bandwidth estimation, encoding frame rate control at the encoder, and quality recovery tools such as motion-compensated frame interpolation at the decoder. These components are designed to meet a low computational complexity requirement so that the whole system can operate in real time. It is demonstrated that the enhanced network adaptivity mitigates packet loss and network bandwidth fluctuation, resulting in a smoother video experience at the receiver.

1. INTRODUCTION

Traditionally, video conferencing systems have been deployed on transmission media such as ISDN (integrated services digital network), ATM (asynchronous transfer mode) and POTS (plain old telephone service), where the compressed video bit stream can be delivered without serious concern about bandwidth fluctuation and bit stream corruption. With the advent of the Internet, however, a video conferencing market based on packet-based video transmission is emerging, and a new video conferencing standard under the ITU-T H.323 umbrella is being actively studied. There have been intense and diverse research activities on transmitting delay-sensitive video through limited and error-prone Internet channels, conducted mostly by proponents in the video conferencing and streaming industry with proprietary techniques. In this work, as an attempt to support the transmission of standard-conforming video over the Internet, a video transmission system based on ITU-T H.263+ is investigated.

Key requirements for handling real-time video over the Internet are identified as follows. Like all forms of data transmission, streaming video packets undergo the best-effort delivery of the Internet, which causes packet loss during transmission. To cope with lost packets efficiently, we should approach the problem from two aspects. One is the network congestion control viewpoint, where we should minimize the number of future packets that are likely to be dropped. In addition, the transmitted video should behave in a network friendly manner, i.e., using only a fair share of bandwidth relative to the majority of TCP data streams [1]. The other is the video quality control viewpoint, where we should consider the spatio-temporal quality tradeoff to keep the transient degradation of visual quality to a minimum. Under the low latency video transmission scenario, lost and late packets are unusable to the decoder. Thus, in order to control the propagation of visual artifacts, error resiliency and error concealment techniques should be applied in conjunction. Also, in order to restore the visual quality degraded by coarse quantization and temporal frame skipping, an adequate visual quality recovery technique should be applied with emphasis on the perceived spatio-temporal quality. The adoption of popular low cost deblocking/deringing filters as a spatial tool and temporal repetition as a temporal tool is a general trend.

One can view video streaming as a relaxed form of video conferencing, since streaming allows more room than the conferencing scenario, which requires nearly symmetric encoding and decoding and low latency. Key technologies for each module in video streaming, such as real-time compression with video rate control, network friendly transmission and congestion control, and real-time decompression with post-processing enhancement, are investigated in this research.

2. OVERVIEW OF PROPOSED INTERNET VIDEO STREAMING SYSTEM

The proposed Internet video streaming system is illustrated in Fig. 1, where the entire Internet is treated as one massive IP cloud with inherent delay and loss, and the transmitter and the receiver are well-defined entry and exit points for the end-to-end video transmission scenario. In this context, the real time streaming protocol (RTSP) employs the RTP/UDP/IP data setup and conveniently comes with a ready-made TCP control connection which can serve as a feedback channel. By utilizing the feedback channel, a rate-controlled video encoder with real time executability can be designed. Such encoders perform video rate control to accommodate a directive, usually in the form of a bit budget. This bit budget directive has to come from a real time Internet video congestion control mechanism; in the current context, it is fed back from the recipient to the transmitter. The purpose of this feedback is to avoid congestion or, in other words, to minimize future packet loss.

Figure 1: Illustration of the roles of the transmitter and the receiver in an Internet environment: the server captures, encodes and transmits packets at rate r; the RTP packets traverse the Internet with variable delay and loss; the client observes the received packets (r(k), tR(k)), estimates the available bandwidth (abw), decodes, conceals errors and displays; the abw estimate is fed back to the server over TCP.

The feedback instructs the encoder about the amount of reduction required in bandwidth consumption in the face of current packet loss resulting from an increase in network traffic. By minimizing packet loss, the end users not only assure an intact video transmission, albeit temporarily at a lower resolution, but also behave in a network friendly manner. Another potential of the feedback-driven solution lies in the interactive design of error resilient video transmission, where the encoder and the decoder work in conjunction, adapting to the changing network condition to recover from the error effects of packet loss.
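To make the feedback path concrete, the following is a minimal sketch of how a receiver might report its bandwidth estimate over the TCP control connection and how the transmitter might turn it into a per-frame bit budget. The message format and helper names are illustrative assumptions, not part of the H.263+ or RTSP specifications.

```python
import socket
import struct

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a TCP socket (TCP may deliver fewer per call)."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("feedback channel closed")
        buf += chunk
    return buf

def send_feedback(sock: socket.socket, abw_bps: float) -> None:
    """Receiver side: report the estimated available bandwidth (bits/s)."""
    sock.sendall(struct.pack("!d", abw_bps))

def receive_feedback(sock: socket.socket) -> float:
    """Transmitter side: block until the next bandwidth report arrives."""
    (abw_bps,) = struct.unpack("!d", recv_exact(sock, 8))
    return abw_bps

def per_frame_budget(abw_bps: float, frame_rate: float) -> float:
    """Turn the reported bandwidth into a bit budget for the next frame."""
    return abw_bps / frame_rate
```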

3. NETWORK FRIENDLY TRANSMISSION OF COMPRESSED VIDEO

3.1. Receiver-based congestion control and available bandwidth estimation

The recent batch of end-to-end congestion control protocols that enforce network friendly behavior appears to be unduly influenced in its design approach by existing TCP-based congestion control mechanisms. As a result, a transmitter-based mechanism [2] is adopted to estimate the round trip time (RTT) and the congestion status of the network. One consequence is that, while such a protocol may force video streams to exhibit network friendly behavior, the resulting video possesses less than desirable visual quality. As an alternative, we propose a receiver-based congestion control mechanism as a successor of our previous model-free LMS (least mean square) bandwidth controller [3]. Because this congestion control scheme is model-free and receiver-based, it allows us to bypass the difficulties of Internet modeling and yet feed back the current network status in an accurate and timely manner. Through efficient reaction to network dynamics, the proposed scheme exhibits fast convergence to the optimal bandwidth while maintaining fairness with other data streams. The task of any bandwidth estimator is to instruct the encoder whether to increase or decrease the bit stream size according to network traffic. The pivotal point for this task is the socket at the receiver, which essentially acts as our observation portal. The information we relay back to the transmitter must be based on the observations made at this locale. That is, the receiver collects the loss and delay information and sends the next optimal transmit rate back to the transmitter. We can observe the size and the transmission time of successfully transmitted RTP packets. From these two pieces of information, we can gauge the bandwidth that the network is capable of sustaining at a particular instant in time. Also, by examining the sequence numbers of arriving packets, we can determine which packets were lost and consequently the packet loss rate.
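As an illustration of the receiver-side observations just described, the sketch below accumulates the size and arrival time of received RTP packets to estimate the sustained bandwidth, and infers losses from gaps in the 16-bit RTP sequence numbers. It is a minimal sketch of the measurement step only; the model-free LMS controller of [3] that maps these observations to the next transmit rate is not reproduced here.

```python
from typing import Optional

class ReceiverStats:
    """Collects loss and throughput observations at the receiving socket."""

    def __init__(self) -> None:
        self.last_seq: Optional[int] = None
        self.received = 0
        self.lost = 0
        self.bytes_seen = 0
        self.window_start: Optional[float] = None

    def on_packet(self, seq: int, size_bytes: int, arrival_time: float) -> None:
        """Update counters from one successfully received RTP packet."""
        if self.window_start is None:
            self.window_start = arrival_time
        if self.last_seq is not None:
            # Packets missing between consecutive sequence numbers are lost
            # (modulo the 16-bit RTP sequence number wrap-around).
            self.lost += (seq - self.last_seq - 1) & 0xFFFF
        self.last_seq = seq
        self.received += 1
        self.bytes_seen += size_bytes

    def throughput_bps(self, now: float) -> float:
        """Bandwidth the network sustained over the observation window."""
        elapsed = max(now - (self.window_start or now), 1e-6)
        return 8.0 * self.bytes_seen / elapsed

    def loss_rate(self) -> float:
        total = self.received + self.lost
        return self.lost / total if total else 0.0
```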

3.2. Matching variable frame rate encoding to the available bandwidth

The encoder needs a rate controller capable of quickly responding to changes in the updated available bandwidth. However, the large complexity and execution latency of the video encoder make rate control at a micro scale almost impossible. It is thus assumed that short term (instantaneous) fluctuation effects can be compensated by the smoothing effect of the deployed encoder/decoder buffers (e.g. the network de-jitter buffer). Longer time scale variations can then be successfully modeled as time varying CBR channels (including the feedback VBR and the renegotiated CBR channel), where the available bandwidth is time varying and well modeled by a piecewise constant function. One may think of a layered or aggregated scalable coding scheme to tackle the bandwidth fluctuation problem in a more systematic manner. However, switching between bit streams is a difficult task even for aggregated scalable bit streams.

Here, we adopt a fast rate controller for the H.263+ video codec proposed in [4]. While most rate control algorithms examine bit allocation at the macroblock (MB) layer, this method handles it at the frame layer. Although technically challenging, it yields a greater degree of control over the temporal quality, resulting in precise control of the bit stream to satisfy the low latency requirement of CBR video. Also, the quantization parameter (QP) control, which is the MB layer rate control of the H.263+ test model TMN8, has been incorporated into the proposed video rate control as a component. This scheme allows us to select frames to be encoded and to allocate the optimal rate for each selected frame according to a specific cost criterion (in terms of the available channel bandwidth and the underlying motion of the coded video) with low computational complexity and minimal added delay. To achieve this, we consider a frame layer rate-distortion (R-D) model with respect to the averaged QP of all MBs in each frame. To be more specific, the quadratic rate model and the affine distortion model are employed, and the rate control results of the MB layer are used to determine the coefficients of the frame layer R-D model. Thus, the additional computational complexity required for frame layer R-D modeling is negligible. Based on the frame layer R-D model, we can estimate the anticipated distortion of the current frame. Note that the anticipated distortion increases when a fast motion change occurs or when the channel bandwidth decreases suddenly. If the spatial quality falls below a tolerable level due to a fast motion change or a sudden channel bandwidth decrease, we should reduce the temporal quality and improve the spatial quality in order to reduce the flickering artifact, while still controlling the degradation of the temporal quality. Conversely, if the spatial quality is above a certain level, we should increase the temporal quality. By adopting this rate control scheme, we can avoid abrupt changes of the encoding frame rate and improve the spatial quality accordingly.
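The frame-layer decision can be sketched as follows, assuming the quadratic rate model R(q) = a/q + b/q^2 and the affine distortion model D(q) = alpha*q + beta in the average QP q, as named in the text. The model coefficients would come from the MB-layer rate control results; the distortion threshold below is an illustrative placeholder, not a value from [4].

```python
def rate_model(q: float, a: float, b: float) -> float:
    """Quadratic frame-layer rate model: R(q) = a/q + b/q^2 (bits)."""
    return a / q + b / (q * q)

def distortion_model(q: float, alpha: float, beta: float) -> float:
    """Affine frame-layer distortion model: D(q) = alpha*q + beta."""
    return alpha * q + beta

def qp_for_budget(budget_bits: float, a: float, b: float) -> float:
    """Invert R(q) = budget: solve budget*q^2 - a*q - b = 0 for q > 0."""
    disc = a * a + 4.0 * budget_bits * b
    return (a + disc ** 0.5) / (2.0 * budget_bits)

def frame_decision(budget_bits: float, a: float, b: float,
                   alpha: float, beta: float, d_max: float = 80.0):
    """Encode the frame at the affordable QP unless the anticipated
    distortion exceeds the tolerable level d_max, in which case the
    frame is skipped so later frames receive a larger share of bits."""
    q = qp_for_budget(budget_bits, a, b)
    anticipated = distortion_model(q, alpha, beta)
    return anticipated <= d_max, q
```

A symmetric rule would raise the encoding frame rate again once the anticipated distortion stays well below the threshold, which is what keeps the frame interval changing gradually.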

4. QUALITY RECOVERY FROM DEGRADED VIDEO

4.1. Recovery from packet loss

The low bit rate condition of the Internet modem prevents the full adoption of forward error correction, and the low latency requirement precludes the use of a retransmission mechanism. The remaining choice is error resilient techniques, which help us limit error propagation. We can apply error resilient schemes with and without feedback. Schemes applicable regardless of feedback comprise intra-MB refresh, slicing, independent segment decoding and data partitioning. Feedback-based schemes such as adaptive intra-MB refresh with error tracking and enhanced reference frame selection offer more potential, at the cost of their combined complexity. Although the detection of errors is quite straightforward in the packet transmission case, the size of a transmitted packet is a controlling factor in the applicability of active error concealment techniques. This is because error concealment is of very limited use when a large portion of a video frame is missing due to the loss of a single packet. Since the combined overhead of RTP/UDP/IP is 40 bytes per packet, the size of a packet often approaches the 1500-byte maximum transfer unit (MTU), which usually results in the inclusion of multiple P-frames in a packet. However, depending on the content of the lost packet, there remains a need for error concealment to cope with the partial loss of I-frames and subsequent P-frames. To handle this, the technique described in the H.263+ test model, known as the TCON model, can be applied with the following considerations. Error concealment is based on a motion criterion. For a missing MB, if the motion vector of the corresponding MB in the previous frame is relatively small, simple replication can be employed. If the motion is fast, replication can result in jagged and abrupt edge discontinuities, and a better error concealment technique is needed in that case. Usually, a simple error concealment scheme based on spatial interpolation makes the concealed GOBs blurred and of lower resolution than their surrounding GOBs. An alternative error concealment scheme based on the interpolation of DCT coefficients offers improved resolution at low complexity [3].
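The motion criterion above can be sketched as follows. The motion threshold and the fallback used for fast motion are assumptions chosen for illustration; the paper's actual choices are the TCON model's concealment and, as an alternative, the DCT-coefficient interpolation of [3].

```python
import numpy as np

MV_THRESHOLD = 4.0  # pixels; hypothetical cutoff separating slow from fast motion

def conceal_mb(prev_frame: np.ndarray, mb_row: int, mb_col: int,
               prev_mv: tuple, mb_size: int = 16) -> np.ndarray:
    """Conceal one lost macroblock from the previous decoded frame."""
    mvx, mvy = prev_mv
    y, x = mb_row * mb_size, mb_col * mb_size
    if (mvx * mvx + mvy * mvy) ** 0.5 < MV_THRESHOLD:
        # Slow motion: simple replication of the co-located MB suffices.
        return prev_frame[y:y + mb_size, x:x + mb_size].copy()
    # Fast motion: replication would leave jagged edges, so follow the
    # previous MB's motion vector instead (a stand-in for the better
    # concealment the text calls for).
    sy = min(max(int(round(y + mvy)), 0), prev_frame.shape[0] - mb_size)
    sx = min(max(int(round(x + mvx)), 0), prev_frame.shape[1] - mb_size)
    return prev_frame[sy:sy + mb_size, sx:sx + mb_size].copy()
```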

4.2. Spatial quality recovery via deblocking/deringing

The main issue in implementing a practical postprocessing algorithm is the removal of blocking and ringing artifacts. Other artifacts, such as blurring, are only visible upon close inspection. Various postprocessing techniques have been proposed before. While some of them are designed only for deblocking with filtering, other approaches try to remove all major artifacts in a unified framework by adopting image restoration techniques. However, restoration approaches generally demand high complexity due to iterative computation. Thus, after formulating the problem within the robust estimation framework, a sub-optimal solution may be pursued by using a nonlinear filtering technique of low complexity [5]. The data fidelity constraints are imposed purely in the spatial domain instead of the transform domain, so that the computational cost associated with forward and inverse transforms is avoided.
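A minimal sketch of a spatial-domain postfilter with a data fidelity constraint is given below. The 3x3 smoothing kernel and the clipping range T are assumptions for illustration; the actual nonlinear filter of [5] is more elaborate. The point is that the fidelity constraint is enforced by clipping in the pixel domain, so no forward or inverse transform is needed.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def postfilter(decoded: np.ndarray, T: float = 8.0) -> np.ndarray:
    """Smooth blocking/ringing artifacts, then clip each pixel to within
    +-T of its decoded value: a purely spatial data fidelity constraint."""
    smoothed = uniform_filter(decoded.astype(np.float64), size=3)
    return np.clip(smoothed, decoded - T, decoded + T)
```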

4.3. Temporal quality recovery via interpolation

The fast motion-compensated interpolation (FMCI) scheme is implemented in the decoder as a video post-processing unit, which is cascaded with the standard H.263+ decoder without changing the bit stream syntax [6]. FMCI consists of three main units: motion preprocessing, segmentation and MCI prediction. The motion preprocessing unit modifies the block-based motion field to achieve a better frame interpolation result. Once the post-processed motion field is obtained, we map it to a pixel-based motion field for MCI prediction, adopting the deformable block transform for the mapping. The second unit of FMCI performs object segmentation of decoded frames, which is useful in providing the moving object location to MCI. We do not use any complicated segmentation procedure, partly because we do not want to increase the computational load at the decoder and partly because the segmentation result is rough due to the use of the block-based motion field only. For the third unit, the classification of regions into stationary, covered and uncovered backgrounds and the moving object, commonly used in standard MCI, is adopted here. If FMCI does not deform the block, it is easy to map the moving object from decoded frames to the interpolated frame; that is, we can assign each pixel the motion vector of the associated block. However, blocking artifacts are observed in some cases due to the translation of a rigid block. In order to remove such blocking artifacts, we should also consider block rotational motion and the mapping from a rectangle to a parallelogram. A deformable mapping can actually be derived from the orthographic projection of the 3-D rigid motion of a planar surface. The affine (or perspective) transform is a common tool to achieve this functionality. Due to complexity concerns, we only consider the affine transform, whose polynomial order is lower than that of the perspective transform.
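To illustrate the deformable (affine) mapping from the block-based to the pixel-based motion field, the sketch below interpolates three control-point motion vectors across a block, so that a rectangle maps to a parallelogram as described above. How the control-point vectors are derived from neighboring block motion is simplified away here and may differ from the procedure of [6].

```python
import numpy as np

def affine_pixel_motion(block_size: int, v00, v10, v01) -> np.ndarray:
    """Interpolate three corner motion vectors (top-left, top-right,
    bottom-left) over a block via the affine model
        v(x, y) = v00 + x*(v10 - v00) + y*(v01 - v00),
    with x, y normalized to [0, 1] across the block. Returns an
    (N, N, 2) array of per-pixel (dx, dy) vectors for MCI prediction."""
    n = block_size
    v00, v10, v01 = (np.asarray(v, dtype=float) for v in (v00, v10, v01))
    ys, xs = np.mgrid[0:n, 0:n] / float(n - 1)
    return v00 + xs[..., None] * (v10 - v00) + ys[..., None] * (v01 - v00)

# Setting v00 == v10 == v01 reduces to rigid block translation; unequal
# corner vectors rotate/shear the block, removing the blocking artifacts
# caused by pure translation.
```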

5. SYSTEM INTEGRATION AND EXPERIMENTAL RESULTS

Experiments were performed not only with talking-head sequences but also with footage containing fast motion. The receiver-based congestion control and available bandwidth estimator relays the available Internet channel bandwidth, in a time-varying CBR form, to the real-time video rate controller at the transmitter. The real-time rate controller then employs the time-varying CBR bandwidth approximation as a budget limit and performs the necessary rate adjustment. The proposed frame rate control algorithm increases the average PSNR and reduces the standard deviation of the PSNR, as shown in Fig. 2. Furthermore, the degradation in motion smoothness caused by encoding at a variable frame rate is not obvious, since the encoded frame interval changes very gradually. Finally, the proposed FMCI noticeably improves motion smoothness, even though the PSNR gain is modest, as shown in Fig. 3.


Figure 3: PSNR performance of frame replication (dotted line) versus FMCI (solid line) for the Suzie sequence.






Figure 2: Variable frame rate control performance for the QCIF Foreman sequence: (a) the bandwidth variation, (b) the video rate and (c) the PSNR. In (b) and (c), 'o' denotes the proposed algorithm and '*' denotes TMN8.

6. CONCLUSION

The system demonstrated in this work was conceived as a total solution to the problems associated with time varying CBR video transmission over a non-centralized, loss prone, single packet class network such as the Internet. The receiver-based congestion control and available bandwidth estimator instructs the transmitter, in a network friendly manner, of the bandwidth consumption level that minimizes packet loss. The real-time frame rate controller at the encoder responds by tailoring the bit stream to fit the instructed available bandwidth. Finally, the deformable block version of fast motion-compensated frame interpolation, along with the other postprocessing units at the receiver side, restores good visual quality by interpolating frames.

7. REFERENCES

[1] S. Floyd and K. Fall, "Promoting the use of end-to-end congestion control in the Internet," submitted to IEEE/ACM Trans. on Networking, 1998.

[2] J. Padhye, J. Kurose, D. Towsley, and R. Koodli, "TCP-friendly rate adjustment protocol for continuous media flows over best effort networks," Technical Report 9847, UMASS CMPSCI, Oct. 1998.

[3] Y. J. Chung, J. Kim, and C.-C. J. Kuo, "Real-time streaming video with adaptive bandwidth control and DCT-based error concealment," submitted to IEEE Trans. on Circuits and Systems II, Oct. 1998.

[4] H. Song, J. Kim, and C.-C. J. Kuo, "Real-time encoding frame rate control for H.263+ video over the Internet," submitted to Signal Processing: Image Communication, 1998.

[5] M. Shen and C.-C. J. Kuo, "Real-time postprocessing for compression artifact reduction in low-bit-rate video coding," in Proc. SPIE International Symposium on Optical Science, Engineering and Instrumentation, July 1998.

[6] T. Kuo and C.-C. J. Kuo, "Motion-compensated interpolation for low-bit-rate video quality enhancement," in Proc. SPIE Visual Communications and Image Processing '99, vol. 3653, Jan. 1999.
