CONTENT-AWARE STREAMING OF LECTURE VIDEOS OVER WIRELESS NETWORKS
Tiecheng Liu and Chekuri Choudary
Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208
[email protected]

ABSTRACT

Video streaming over wireless networks is becoming increasingly important in a variety of applications. To accommodate the dynamic changes of wireless networks, Quality of Service (QoS) scalable video streams need to be provided. This paper presents a system for content-aware wireless streaming of lecture (instructional) videos for e-learning applications. A method for real-time analysis of instructional videos is first provided to detect video content regions and classify video frames; a "leaking video buffer" model is then applied to dynamically compress video streams. In our content-aware video streaming, instructional video content is detected and different QoS levels are selected for different types of video content. Our adaptive feedback control scheme is able to transmit properly compressed video streams to video clients based not only on wireless network bandwidth, but also on video content and client feedback. Finally, we demonstrate the scalability and content awareness of our system and show experimental results on two lecture videos.

1. INTRODUCTION

With the popularity of mobile devices such as PDAs and smart phones, there is increasing demand for transmitting videos over wireless networks. Wireless video streaming, however, is still challenging because of the dynamic changes of wireless networks, limited bandwidth, packet loss, and the limited computing capacity of mobile clients. Considering the diversity of video clients and the variability of wireless networks, scalable video coding and adaptive video transmission techniques are required. Extensive research has been conducted on scalable video coding [1, 2, 3, 4], real-time video compression [5], network bandwidth allocation [6, 7], and video streaming [8]. Scalable coding approaches [1, 2, 3, 4] provide error-resilient and quality-scalable video encoders to overcome two main difficulties in wireless transmission, namely network fluctuation and packet loss. A major improvement in scalable
video coding techniques in recent years is the introduction of Fine Granularity Scalable (FGS) video coding in the MPEG-4 standard [4]. The FGS coding scheme encodes a video into different layers, usually consisting of a base layer and many enhancement layers; each enhancement layer provides increasingly more detailed video data to enhance the base layer. Although these existing techniques [1, 2, 3, 4] provide scalability in video streaming, some issues are still not fully addressed. Firstly, most previous scalable video coding approaches require computationally expensive off-line encoding and transcoding. This computational complexity makes them unsuitable for scalable transmission of live video streams, such as sports and lecture videos. Secondly, content-based temporal scalability has not been thoroughly investigated. Most previous research provides temporal scalability based exclusively on signal-level image quality parameters; temporal compression of instructional videos at the semantic level has not been investigated. Thirdly, the QoS desired by clients in video streaming depends not only on frame rate, image quality, and granularity, but also on video content. In wireless video streaming, video content is an indispensable factor in determining the desired QoS. Different scenes in a video may require different QoS. For example, in wireless streaming of a lecture video, it is acceptable to transmit classroom discussion segments at low quality, but blackboard presentation segments require high image quality because the content on the blackboard is essential for understanding the lecture. Under a given wireless network bandwidth, selecting an appropriate set of quality parameters depends on both the video content and the client's preference. For example, in wireless streaming of a sports video, if the network bandwidth allows either a high-quality 15 fps video sequence or a low-quality 30 fps video sequence to be transmitted, which scheme to select depends on the scenes in the sports video and the viewer's preference. Previous research [9, 10, 11, 12] provides content adaptation methods for universal media delivery. However, most
of these approaches provide only limited video content analysis, and none of them has been specifically designed for instructional videos. For instructional videos in particular, we develop methods for content analysis and temporal compression. Since wireless video streaming is essential for developing e-learning and mobile learning systems [13, 14, 15], we combine computer vision techniques for content analysis with wireless video streaming to provide QoS-scalable video streams to mobile clients. This paper presents a framework for content-aware wireless video streaming and shows its application to instructional videos. First, we present a real-time content analysis method and an approach to dynamically compressing lecture videos without the need for off-line transcoding. We further provide an adaptive video transmission scheme that adjusts video quality based not only on network conditions, but also on instructional video content and client preferences.
2. CONTENT-AWARE STREAMING

As shown in Figure 1, the framework of our wireless video streaming system consists of four main modules: "content analysis", "temporal compression", "QoS adjustment", and "adaptive control". The content analysis module analyzes the current video content so that different QoS adjustment schemes can be applied. The temporal compression module dynamically compresses the video sequence. The adaptive control scheme changes the video streaming QoS based on the requirements and feedback of the video client. The system is able to dynamically change image quality and frame rate in response to changes in video content and network bandwidth. The goal of our system is to provide QoS-scalable lecture videos over wireless networks. In the current system, each frame is independent and no inter-frame compression technique is applied. Although this is not the most efficient coding scheme, its computational simplicity allows real-time encoding and decoding, and in the case of network errors and packet loss, errors are not propagated to other frames. The framework shown in Figure 1 is based on unicast video streaming: when the video server receives a request from a user, it initiates a new process to serve that client. Extension to multicast is possible by providing video streams of different levels of QoS.
Figure 1: Framework of content-aware wireless video streaming.
An incoming video stream, S, is first processed by a "content analysis" module, which detects video scenes and segments content regions in frames. Then a "leaking buffer" model is used to select content-significant key frames. An image quality control subsystem further changes the image quality of output frames by locating the content region, resampling the images, and using different color depths. Finally, a "data compression" subsystem reduces data redundancy based on the requirements and computational capacity of mobile clients. At initialization, the video server first estimates the network bandwidth and the computing capacity of the video client. Since wireless network bandwidth changes dynamically, the video client updates its QoS request during video streaming. When the server receives a new request, it translates the request into control parameters, depending on the current instructional video content and network bandwidth. The QoS adjustment module then outputs the video sequence of the desired QoS to the client.

2.1. Instructional Video Content Analysis

The content analysis module analyzes video content in real time. It classifies video content into different scenes and segments content regions. For lecture videos, the content analysis module estimates the paper/board background color and its variation, and segments content regions from irrelevant areas in each frame. An unsupervised clustering method is applied to retrieve the background color of handwritten slides and the blackboard. To improve computational efficiency for real-time analysis, the content background color is recomputed only when the histogram difference between the current frame and its previous frames is larger than a pre-determined threshold. To estimate the background color of the board and slides, dominant color analysis is applied in the YUV color space, which has advantages over other color spaces in two respects: YUV color components can be retrieved directly from MPEG-encoded videos, and the effect of chalk dust is separable in YUV color space. All pixels in a video frame are clustered using an unsupervised clustering technique, and the main cluster is retrieved by ranking all the clusters. The clustering is repeated to remove outliers for a better estimate. Finally, the centroid of the main cluster is taken as the estimate of the background color, and the standard deviation of the background colors is also calculated. We further use a block-based approach to detect content (text) regions in lecture videos. Each frame is divided into blocks of size 16 × 16. All blocks are classified into content blocks and non-content blocks by comparing the mean and standard deviation of the color of each block with those of the background. All content blocks are merged together to form the content region, and topology refinement is applied
to ensure a continuous content region without holes. Finally, the bounding box of the content text region is estimated, as shown in Figure 2. Our block-based approach is fast enough for real-time content analysis while achieving results that are adequate for video streaming purposes. Since accurate boundaries of text regions are not critical in this application, more sophisticated image segmentation algorithms would offer little benefit for real-time content analysis.
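The background estimation and block-based detection described above can be sketched roughly as follows, assuming OpenCV and NumPy. In this sketch, k-means stands in for the unsupervised clustering step, a single morphological closing stands in for cluster re-ranking, outlier removal, and topology refinement, and the deviation factor is an illustrative threshold; only the 16 × 16 block size and the YUV color space come directly from the method above.

```python
import cv2
import numpy as np

def estimate_background(frame_bgr, k=4):
    """Estimate board/slide background color in YUV via dominant-color clustering."""
    yuv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YUV)
    pixels = yuv.reshape(-1, 3).astype(np.float32)
    # Unsupervised clustering; the largest cluster is taken as the background.
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, _ = cv2.kmeans(pixels, k, None, criteria, 3, cv2.KMEANS_PP_CENTERS)
    main = np.bincount(labels.ravel()).argmax()
    bg_pixels = pixels[labels.ravel() == main]
    return bg_pixels.mean(axis=0), bg_pixels.std(axis=0)

def detect_content_region(frame_bgr, bg_mean, bg_std, block=16, dev_factor=2.5):
    """Classify 16x16 blocks as content/non-content and return a bounding box."""
    yuv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YUV).astype(np.float32)
    h, w = yuv.shape[:2]
    mask = np.zeros((h // block, w // block), dtype=np.uint8)
    for by in range(h // block):
        for bx in range(w // block):
            blk = yuv[by*block:(by+1)*block, bx*block:(bx+1)*block]
            # A block whose color statistics deviate from the background is "content".
            if (np.abs(blk.mean(axis=(0, 1)) - bg_mean) >
                    dev_factor * (bg_std + 1e-3)).any():
                mask[by, bx] = 1
    # Crude stand-in for topology refinement: close small holes, then bound.
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((3, 3), np.uint8))
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return (xs.min() * block, ys.min() * block,
            (xs.max() + 1) * block, (ys.max() + 1) * block)   # (x0, y0, x1, y1)
```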
Figure 3: Different types of scenes in a lecture video: content video frames and non-content video frames. These scenes have different QoS requirements in video streaming.
Figure 2: Real-time content analysis of lecture videos. The left image shows the original frame, and the right image shows the processed image with content region detected and enhanced. For low bit-rate video streaming, only the content region is delivered to clients.
Since different types of video content require different QoS in video streaming, the content analysis module also classifies frames into two types: content frames and non-content frames. Content frames are close-up shots of handwritten slides and the blackboard. Non-content frames, on the other hand, include frames of classroom discussion, close-up shots of the instructor, and other irrelevant frames, as shown in Figure 3. Content frames require high image quality because the written text on the board/slide is essential for understanding the instructional content, while for non-content frames, frame rate is more important. The ratio of paper/board background pixels to all pixels in each frame is calculated, and frames are classified by comparing this ratio with a pre-determined threshold.

2.2. Dynamic Temporal Video Compression

To accommodate changes in network bandwidth, the video server provides both temporally and spatially scalable video streams. Temporal scalability is achieved by dynamically compressing video at any designated compression ratio; spatial scalability is implemented by adjusting image resolution. Inspired by the human short-term memory recognition process, we provide a "leaking buffer" model to select key frames dynamically. A "leaking video buffer" consists of n slots, each of which holds one video frame. A video sequence enters one end of the video buffer at a frame rate of λin and leaves at the other end at λout. The output frame rate is determined by video content and network bandwidth. An
evaluation function, φi(·), is applied to the frames in the buffer. All frames in the buffer are evaluated, and comparatively insignificant frames are "leaked" from the buffer. The output key frames are those "pulled out" at the other end of the video buffer. A frame-leaking decision is made each time there is no empty slot in the buffer. For a video sequence S, let F be the ordered set of all frames in the leaking video buffer, ordered by frame number. Denote the number of elements in F as C(F). Let di,j be the content difference between frames fi and fj. Define the compression ratio r as the ratio of the number of all frames to the number of output key frames; r determines how fast key frames are pulled out of the video buffer. For a given compression ratio r, the output frame rate is λout = λin/r. We first initialize F as an empty set, F = ∅, then fill the empty slots of the video buffer with incoming frames: F = {fn, fn−1, · · · , f1}. At each fixed temporal interval of 1/λout, we output the frame at the end of the video buffer and select it as a key frame, then shift all frames in the video buffer, leaving one empty slot at the other end of the buffer, as illustrated in step D of Figure 4. At the same time, the video sequence enters the leaking video buffer at a fixed frame rate of λin: at each fixed temporal interval of 1/λin, a new frame enters the video buffer. If there are empty slots in the buffer, the frame occupies the empty slot at the end of the video buffer. If the video buffer is full, i.e., C(F) = n, one frame needs to be deleted (leaked) from the video buffer to make one empty slot. The process of frame leaking is as follows.

1. Build a doubly linked list of all distances between adjacent frames. Two frames are adjacent in the list only when they are temporally adjacent in the video buffer.
2. Find the minimum difference between any two adjacent frames in the video buffer and delete one of the two frames. Suppose dj,k is the minimum of all adjacent frame differences, fj is adjacent to {fp, fk}, and fk is adjacent to {fj, fq}, as indicated in Figure 4. We then decide which frame to drop: if dp,j < dk,q, we delete frame fj from the buffer; otherwise we delete frame fk.
3. Update F. Suppose fj is the frame leaked from the video buffer; then F = F \ {fj}. The frames before fj in the video buffer are shifted to occupy the empty slot, leaving one empty slot at the beginning.

4. After frame fj is leaked, fp and fk become adjacent in the video buffer; we measure their content difference and update the linked list of frame distances.

Figure 4 illustrates the process of leaking frames, filling the video buffer, and outputting key frames. The algorithm above is greedy in nature. For a video of N frames, the memory usage of the leaking buffer is a fixed n slots, and the computational complexity is O(N).

Figure 5: A server software tool that monitors the process of dynamic temporal compression of video streams.
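A compact sketch of this greedy selection loop is given below. It assumes at least two buffer slots, holds frames as opaque objects, takes the content difference d as a caller-supplied function, and uses a plain Python list in place of the doubly linked list of adjacent-frame distances (so each leak costs O(n) rather than O(1)); the class and method names are illustrative.

```python
class LeakingBuffer:
    """Greedy key-frame selection with a fixed-size "leaking" video buffer."""

    def __init__(self, n_slots, frame_diff):
        self.n = n_slots             # number of buffer slots (n >= 2 assumed)
        self.diff = frame_diff       # d(f_i, f_j): content difference between frames
        self.frames = []             # newest frame at index 0, oldest at the end

    def push(self, frame):
        """Called every 1/lambda_in seconds when a new frame arrives."""
        if len(self.frames) == self.n:
            self._leak()             # buffer full: drop the most redundant frame
        self.frames.insert(0, frame)

    def pop_key_frame(self):
        """Called every 1/lambda_out seconds: output the oldest buffered frame."""
        return self.frames.pop() if self.frames else None

    def _leak(self):
        # Content differences between temporally adjacent frames in the buffer.
        d = [self.diff(self.frames[i], self.frames[i + 1])
             for i in range(len(self.frames) - 1)]
        j = min(range(len(d)), key=d.__getitem__)     # smallest adjacent gap
        # Of the two frames forming the smallest gap, drop the one whose other
        # neighbor is closer, so the retained frames stay maximally spread out.
        left = d[j - 1] if j - 1 >= 0 else float("inf")
        right = d[j + 1] if j + 1 < len(d) else float("inf")
        del self.frames[j if left < right else j + 1]
```

A driver loop would call push() once per input interval 1/λin and pop_key_frame() once per output interval 1/λout = r/λin.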
Figure 4: Illustration of the process of temporal compression using a leaking buffer model. (A): the state of the video buffer; (B): frame fj is leaked from the buffer, leaving one empty slot; (C): a new frame fm enters the video buffer; (D): frame fq is selected as a key frame and pulled out of the video buffer.

For content video frames of slide/board presentation, we define the difference between two frames as the number of differing ink/chalk pixels in the frames [15]. Under this evaluation function, the leaking buffer maximizes the minimum content difference between adjacent frames in the buffer, which minimizes the content redundancy of the selected key frames.

All video clients share the same leaking video buffer. Each client keeps a record of frame leakage and of the key frames in the video buffer. The activity of the leaking video buffer and the process of dynamic key frame selection can be monitored using our developed software tool, as shown in Figure 5: the bottom row of images shows the frames in the video buffer, the image at the upper left shows the input video frame, and the upper-right image shows the output key frame. The compression ratio is also displayed dynamically.

3. CONTENT-AWARE CONTROL SCHEME

Our control scheme for wireless streaming is able to change the QoS of lecture videos based on the client's request, the network bandwidth, and the instructional video content. Temporal scalability is provided at the server side by adjusting the compression ratio dynamically, while spatial scalability is achieved by resampling and scaling content regions.

3.1. Initialization and Override Control
Wireless network bandwidth is estimated by transmitting k test frames at initialization; in our experiments, we use k = 10. Since mobile devices usually have limited computing capacity, the time needed to process and display video frames on the client side is not negligible, and estimating the client's computing capacity is essential for selecting appropriate data compression modes in wireless video transmission. At the beginning of network transmission, the video client either sets the video streaming QoS parameters explicitly or asks the server to choose them, as shown in Figure 6. The package of QoS parameters includes frame rate, resolution, color bits, compression mode, and other relevant quality parameters. At any time during streaming, the server allows clients to "override control" of the QoS of video streaming. After receiving a fixed number of frames, the video client sends a feedback package that includes an
"override control" token, the video streaming QoS parameters, and feedback information. The feedback information comprises the times at which individual video frames were received at the client side and the measured wireless signal strength. When the server receives the package, if the "override control" token is set, it sets the QoS parameters as described in the package; otherwise it determines the QoS based on video content and bandwidth, as described in the following.
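As a rough illustration of this initialization and override mechanism, the following is a minimal sketch of what the feedback package and its server-side handling might look like. The field names, the dataclass layout, and the bandwidth estimate (total bytes divided by the span of client-side arrival times) are illustrative assumptions rather than the actual packet format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class QosParams:
    frame_rate: float = 30.0                # fps
    resolution: Tuple[int, int] = (320, 240)
    color_bits: int = 16
    compression: str = "raw"                # "raw", "rle", or "jpeg"

@dataclass
class Feedback:
    override: bool                          # "override control" token
    requested: Optional[QosParams]          # client-chosen QoS (used only if override)
    recv_times: List[float]                 # client-side arrival time of each frame (s)
    frame_bytes: List[int]                  # size of each received frame (bytes)
    signal_strength: float                  # measured wireless signal strength

def estimate_bandwidth(fb: Feedback) -> float:
    """Rough bandwidth estimate (bytes/s) from client-side arrival timestamps."""
    elapsed = fb.recv_times[-1] - fb.recv_times[0]
    return sum(fb.frame_bytes) / max(elapsed, 1e-6)

def adapt_qos(current: QosParams, bandwidth: float, content_frame: bool) -> QosParams:
    # Placeholder: the content-aware adjustment rules of Section 3.3 go here.
    return current

def handle_feedback(fb: Feedback, current: QosParams, content_frame: bool) -> QosParams:
    if fb.override and fb.requested is not None:
        return fb.requested                 # client explicitly overrides the QoS
    return adapt_qos(current, estimate_bandwidth(fb), content_frame)
```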
3.2. QoS Parameters

The video server adjusts the video streaming QoS to fit the network bandwidth and to accommodate content changes. The following parameters are tunable for video streaming; additional QoS parameters, such as the subjective quality of video frames, could also be included.

Frame Rate. Adjustment of frame rate depends not only on bandwidth, but also on video content. When it is necessary to change the frame rate, we set a threshold δ to avoid excessive adjustment: only when the target frame rate is below (1 − δ)λin/r or above (1 + δ)λin/r does the server update the frame rate, and the new compression ratio is updated as r = λin/λ. This small amount of intentional "delay" in adjusting the frame rate avoids oscillation when the network bandwidth fluctuates frequently on a small scale.

Image Resolution. Different image resolutions are provided. After key frames are retrieved from the leaking video buffer, their image quality is changed to fit the network bandwidth. Considering the resolution constraints of mobile devices, we provide several resolution levels. The default image resolution is 320 × 240; the minimum viewable size, based on subjective evaluation, is 160 × 120. We provide different frame resolution levels rather than DCT residue levels because JPEG compression may significantly degrade the quality of text images in lecture videos.

Color Depth. We provide several image color formats: 24-bit true color, 16-bit color, 8-bit color (using a color palette), and 4/2/1-bit gray levels. The default mode is 16-bit color. Color depth affects the image quality of different types of frames in instructional videos in different ways: when the color depth decreases, board/slide content frames are not affected as severely as non-content frames.

Data Compression. Data compression modes include uncompressed raw data, run-length lossless compression, and JPEG compression. The compression mode is mainly determined by the computing ability of the mobile device and is usually fixed for the whole streaming session. The default mode is uncompressed data. More aggressive data compression techniques could be introduced, but the limited computing power of mobile devices needs to be considered.

Figure 6: The user interface of the mobile video client for initiating wireless video streaming. The video client may set the QoS requirements explicitly or ask the video server to decide the QoS.

3.3. Adaptive Feedback Control

The server adjusts the QoS parameters of video streaming to accommodate fluctuations in network bandwidth and changes in the instructional video content. The network bandwidth ω and the average frame rate λ are estimated from the client's feedback information. Suppose a frame has size µ at a given set of QoS parameters and λout is the frame transmission rate; then ω = µλout = µλin/r. For a fixed incoming frame rate λin, to accommodate a change in bandwidth ω, the tunable factors are the frame size µ and the compression ratio r. The frame size µ is in turn determined by the image resolution qr, the color depth qd, and the image compression mode qc.
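To make the relation ω = µλin/r concrete, the following sketch shows how a new compression ratio could be derived from the estimated bandwidth and the current frame size, with the hysteresis threshold δ applied before changing the frame rate, and how one upward adjustment step could follow the content-dependent priority order described below. The specific values (δ = 0.15, a 5 fps step, a 30 fps cap) and the dictionary-based QoS representation are illustrative assumptions; the resolution and color-depth ladders follow the levels listed in Section 3.2.

```python
RESOLUTIONS = [(160, 120), (240, 180), (320, 240)]   # minimum viewable to default
COLOR_BITS = [1, 2, 4, 8, 16, 24]

def _step_up(levels, value):
    """Move one level up a quality ladder, saturating at the top."""
    i = levels.index(value)
    return levels[min(i + 1, len(levels) - 1)]

def target_output_rate(bandwidth, frame_bytes):
    """omega = mu * lambda_out  =>  lambda_out = omega / mu."""
    return bandwidth / frame_bytes

def update_compression_ratio(r, lambda_in, bandwidth, frame_bytes, delta=0.15):
    """Change the frame rate only when the target leaves the (1 +/- delta) band."""
    current_rate = lambda_in / r
    target_rate = target_output_rate(bandwidth, frame_bytes)
    if (1 - delta) * current_rate <= target_rate <= (1 + delta) * current_rate:
        return r                        # small fluctuation: keep the current ratio
    return lambda_in / target_rate      # new ratio r = lambda_in / lambda

def adjust_for_more_bandwidth(qos, content_frame):
    """One upward adjustment step; the priority order depends on the frame type."""
    if content_frame:
        # Content frames: image quality first (resolution, then color depth),
        # frame rate last.
        if qos["resolution"] != RESOLUTIONS[-1]:
            qos["resolution"] = _step_up(RESOLUTIONS, qos["resolution"])
        elif qos["color_bits"] != COLOR_BITS[-1]:
            qos["color_bits"] = _step_up(COLOR_BITS, qos["color_bits"])
        else:
            qos["frame_rate"] = min(qos["frame_rate"] + 5, 30)
    else:
        # Non-content frames: frame rate first, then image quality.
        if qos["frame_rate"] < 30:
            qos["frame_rate"] = min(qos["frame_rate"] + 5, 30)
        elif qos["resolution"] != RESOLUTIONS[-1]:
            qos["resolution"] = _step_up(RESOLUTIONS, qos["resolution"])
        else:
            qos["color_bits"] = _step_up(COLOR_BITS, qos["color_bits"])
    return qos
```

A corresponding downward pass would apply the same ladders in the inverse order when the estimated bandwidth decreases.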
Figure 7: Illustration of adaptive control scheme for wireless video streaming.
Different control schemes are applied to different types of video frames. For content frames, image quality has a
higher priority. When the estimated bandwidth increases, the video server first increases image quality. The order of adjustment of QoS parameters for content frames is resolution, color depth, and then frame rate: the server increases the image resolution first; when the resolution reaches its maximum, it adjusts the image color depth, and then the frame rate. When the network bandwidth decreases, the server uses a similar control scheme but changes the QoS parameters in the inverse order: it decreases the frame rate first, then the color depth, and finally the image resolution. For non-content frames, frame rate has a higher priority: when the network bandwidth increases, the control scheme first raises the frame rate, then improves image quality once the frame rate reaches a pre-determined threshold. The current system provides two control schemes for these two content types; it may be extended to handle more content types.

4. EXPERIMENTAL RESULTS

We conducted experiments on two lecture videos of different presentation formats. The first video is a 45-minute blackboard presentation and the second is a 40-minute handwritten slide presentation. The video server transmits the instructional videos over wireless networks to the mobile client, a Dell Axim 5 PDA running the Pocket PC 2003 operating system. The current system guarantees correct transmission of the information and feedback packages, but does not provide an error-resilience mechanism for the packages carrying video frames. We simulate a live video stream by fetching video frames from video files at a fixed incoming frame rate of 30 frames per second (fps). The server processes video frames dynamically and transmits them over an 802.11b wireless network to the video client through a wireless router. During wireless video transmission, we move the mobile device around a university campus to simulate a realistic mobile education scenario. Figure 8 shows the fluctuation of the output frame rate λout, image resolution, and color depth for the instructional video of blackboard presentation; Figure 9 shows the results for the instructional video of handwritten slides. Since λout is inversely proportional to the compression ratio r, Figures 8 and 9 indicate that the server is able to dynamically adjust QoS parameters based on network bandwidth and video content. The abrupt increases of frame rate in the figures correspond to times when the video sequence changes from content frames to non-content frames and the wireless signal gets stronger; in these cases the adaptive control scheme increases the frame rate to achieve better video quality. Figure 10 shows captured screen images of different qualities under different network conditions. The server estimates the computing capacity of the mobile device at the initialization step, and chooses to use the uncompressed image format because the time for decoding and
Figure 8: Fluctuation of video qualities in wireless streaming of a 45 minute instructional video of blackboard presentation. The server is able to dynamically adjust frame rate, image resolution, and image color depth based on video content and wireless network conditions. The unit of image resolution is “K pixels”.
displaying frames on the client side exceeds a pre-determined threshold. Subjective evaluation of the transmitted frames shows no loss of instructional visual content in these two instructional videos: all instructional content written on the blackboard/slides is properly selected and transmitted by the video server.

5. CONCLUSIONS AND DISCUSSION

In this paper, we presented an approach to content-aware scalable transmission of instructional videos over wireless networks. Real-time content analysis and temporally scalable video compression for instructional videos are provided, together with an adaptive feedback control scheme that provides different QoS for different video content. The computing capacity of mobile devices is also considered in the control scheme. The system is being extended to multicasting by providing video streams of different quality levels. A human feedback system that enables the server to dynamically change control schemes and provide personalized video QoS is being developed, and more subjective evaluation of video transmission quality is underway. The effectiveness of the video streaming system on the learning experience of students, and its social effects, need to be further investigated.
Figure 10: Images of different qualities are transmitted under different network conditions. From left to right: full-screen (320 × 240) images of blackboard and slide presentations, and images of reduced resolution (240 × 180) showing only the content regions.
Figure 9: Fluctuation of video qualities in wireless streaming of a 40 minute instructional video of handwritten slide presentation. The server is able to dynamically adjust frame rate, image resolution, and image color depth based on video content and wireless network conditions. The unit of image resolution is "K pixels".

6. REFERENCES

[1] Amy R. Reibman, Leon Bottou, and Andrea Basso, "Scalable video coding with managed drift," in IEEE Trans. on Circuits and Systems for Video Technology, Feb. 2003, pp. 131–140.

[2] Ulrich Benzler, "Spatial scalable video coding using a combined subband-DCT approach," in IEEE Trans. on Circuits and Systems for Video Technology, Oct. 2000, pp. 1080–1087.

[3] Zhou Wang, Ligang Lu, and A. C. Bovik, "Foveation scalable video coding with automatic fixation selection," in IEEE Trans. on Image Processing, Feb. 2003, pp. 243–254.

[4] Feng Wu, Shipeng Li, and Ya-Qin Zhang, "A framework for efficient progressive fine granularity scalable video coding," in IEEE Trans. on Circuits and Systems for Video Technology, March 2001, pp. 332–344.

[5] James E. Fowler, Kenneth C. Adkins, Steven B. Bibyk, and Stanley C. Ahalt, "Real-time video compression using differential vector quantization," in IEEE Trans. on Circuits and Systems for Video Technology, 1995, pp. 14–24.

[6] Chun-Ting Chou and Kang G. Shin, "Analysis of adaptive bandwidth allocation in wireless networks with multilevel degradable quality of service," in IEEE Trans. on Mobile Computing, 2004, pp. 5–17.

[7] Arnaud Legout, Jorg Nonnenmacher, and Ernst W. Biersack, "Bandwidth-allocation policies for unicast and multicast flows," in IEEE/ACM Trans. on Networking, Aug. 2001, pp. 464–478.

[8] Abhik Majumdar, Daniel Grobe Sachs, Igor V. Kozintsev, Kannan Ramchandran, and Minerva M. Yeung, "Multicast and unicast real-time video streaming over wireless LANs," in IEEE Trans. on Circuits and Systems for Video Technology, Jun. 2002, pp. 524–534.

[9] Kun Tan, Richard Ribier, and Shih-Ping Liou, "Content-sensitive video streaming over low bitrate and lossy wireless network," in ACM Multimedia, 2001, pp. 512–515.

[10] Nicola Cranley, Liam Murphy, and Philip Perry, "User-perceived quality-aware adaptive delivery of MPEG-4 content," in ACM NOSSDAV, 2003, pp. 42–49.

[11] Mulugeta Libsie and Harald Kosch, "Content adaptation of multimedia delivery and indexing using MPEG-7," in ACM Multimedia, 2002, pp. 644–647.

[12] Tiecheng Liu and John R. Kender, "Time-constrained dynamic semantic compression for video indexing and interactive searching," in IEEE Conference on Computer Vision and Pattern Recognition, 2001.

[13] Chitra Dorai, Parviz Kermani, and Avare Stewart, "ELM-N: e-learning media navigator," in ACM Multimedia, 2001, pp. 634–635.

[14] Chi-Hong Leung and Yuen-Yan Chan, "Mobile learning: a new paradigm in electronic learning," in 3rd International Conference on Advanced Learning Technologies, 2003, pp. 76–80.

[15] Tiecheng Liu and John R. Kender, "Rule-based semantic summarization of instructional videos," in International Conference on Image Processing, 2002.