Surveillance Video Coding via Low-Rank and Sparse Decomposition

Chongyu Chen
School of Electronic Engineering, Xidian University, No.2 South TaiBai Road, Xi’an, Shaanxi, China
[email protected]

Jianfei Cai
School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore
[email protected]

Weisi Lin
School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore
[email protected]

Guangming Shi
School of Electronic Engineering, Xidian University, No.2 South TaiBai Road, Xi’an, Shaanxi, China
[email protected]
ABSTRACT
Surveillance videos usually have a static or gradually changing background. The state-of-the-art block-based codec, H.264/AVC, is not sufficiently efficient for encoding surveillance videos because it cannot exploit the strong temporal redundancy of the background in a global manner. In this paper, motivated by recent advances in low-rank and sparse decomposition (LRSD), we propose to apply LRSD to the compression of surveillance videos. In particular, LRSD is employed to decompose a surveillance video into a low-rank component, representing the background, and a sparse component, representing the moving objects. We then design different coding methods for the two components. We represent the frames of the background by very few independent frames based on their linear dependency, which dramatically removes the temporal redundancy. Experimental results show that, for the compression of surveillance videos, the proposed scheme can significantly outperform H.264/AVC, with a PSNR gain of up to 3 dB, especially at relatively low bit rates.

Categories and Subject Descriptors: I.4.2 [Image Processing and Computer Vision]: Compression

Keywords: Surveillance video compression, low-rank and sparse decomposition, CUR decomposition
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM’12, October 29–November 2, 2012, Nara, Japan. Copyright 2012 ACM 978-1-4503-1089-5/12/10 ...$15.00.

1. INTRODUCTION
With the growing needs for public security, traffic control, and remote healthcare monitoring, the use of surveillance cameras has dramatically increased over the last decade. Efficient compression and fast transmission of large amounts of surveillance video are required in practice. A static or gradually changing background is a common characteristic of surveillance videos, and it leads to considerable temporal redundancy. Efficient compression of surveillance videos should therefore be possible if such redundancy can be removed.

Existing codecs [5, 7], including the state-of-the-art video coding standard H.264/AVC [7], are typically block-based and designed for general videos. H.264/AVC achieves high efficiency in the compression of general videos by exploiting both temporal and spatial redundancy. However, it is not sufficiently efficient for encoding surveillance videos with a static or gradually changing background. This is mainly because H.264/AVC partitions each video frame into blocks and cannot exploit the strong temporal redundancy of the background in a global manner. Another straightforward way to encode surveillance videos is background subtraction, which works well only when the background is identical in every frame. This is often not the case in practice: most surveillance videos contain background perturbations such as illumination changes, moving escalators, and swaying trees.

Recently, a few low-rank and sparse decomposition (LRSD) tools [1, 4, 2] have been developed, which can decompose a surveillance video into a low-rank component and a sparse component, approximately representing the background and the foreground moving objects, respectively (see Fig. 1). LRSD has been successfully applied to a few applications such as moving object detection. In this paper, we propose to apply LRSD to the compression of surveillance videos, since the extracted background component contains strong temporal redundancy and can be compressed very efficiently. To the best of our knowledge, the idea of applying LRSD to video compression has not been reported before.

In particular, we represent the frames of the background component by very few independent frames based on their linear dependency, which dramatically removes the temporal redundancy. The remaining part, consisting of the sparse component and the residual component, can be efficiently compressed by the existing block-based coding scheme. Experimental results show that, for the compression of surveillance videos, the proposed scheme can significantly outperform H.264/AVC, with a PSNR gain of up to 3 dB, especially at relatively low bit rates.

The rest of this paper is organized as follows. Section 2 introduces the mathematical tool that separates perturbations from the low-rank background of video frames, based on which a novel scheme for the compression of surveillance videos is proposed in Section 3. In Section 4, we test the proposed scheme on several representative surveillance videos. Finally, Section 5 concludes this paper.
2. LOW-RANK AND SPARSE DECOMPOSITION
In matrix theory, linear dependency among the columns of a matrix is referred to as the low-rank property. If we stack many linearly dependent frames as the columns of a matrix L, then L is exactly low-rank and its rank equals the number of its linearly independent columns. Matrices converted from surveillance videos are expected to be low-rank because of the static backgrounds. In this case, perturbations of such videos can be seen as other matrices that are added to L. The emerging theory of robust principal component analysis (RPCA) [1, 4, 2] provides a suitable formulation for the separation of perturbations and background. That is,

$$A = L + S, \tag{1}$$

where A is the original matrix that contains the low-rank and sparse components, L is the low-rank matrix described above, and S is a sparse matrix. Given a matrix A, L and S can be found by RPCA algorithms such as the augmented Lagrange multiplier (ALM) method [4] and the principal component pursuit (PCP) [2], provided that the low-rank component L is not sparse and the sparse component S is not low-rank. For a matrix A constructed by stacking the frames of a surveillance video as columns, the low-rank component is often the static background and thus is not sparse, while the sparse component often represents moving objects that are linearly independent and thus is not low-rank. An example of the separation of a surveillance video via ALM [4] is given in Fig. 1, which shows the ability of RPCA algorithms to handle sparse perturbations caused by moving objects.

Existing RPCA algorithms often concentrate on finding more meaningful decompositions. However, their complexity is often uncontrollable due to their automatic and iterative solving procedure, which makes them unsuitable for video coding. Recently, the GoDec algorithm [8] was proposed for separating the low-rank and sparse components of matrices. The formulation of GoDec can be seen as noisy RPCA, i.e.,

$$A = L + S + N, \tag{2}$$

where the matrix N is the noise component. Besides controllable complexity, GoDec also provides a controllable rank of L and a controllable sparsity of S. These characteristics make GoDec a good choice for video coding, so we choose GoDec in our proposed scheme.
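The alternating structure behind Eq. (2) can be illustrated with a minimal NumPy sketch. It is a simplified, naive variant rather than the GoDec implementation of [8]: it assumes a plain truncated SVD for the low-rank update, where GoDec relies on randomized projections for speed. The function name `naive_godec` and the synthetic test matrix are hypothetical; the rank of 2, the cardinality of 0.15mn, and the limit of 12 iterations are the settings reported later in this paper.

```python
import numpy as np

def naive_godec(A, rank, card, n_iter=12):
    """Split A into L (rank <= rank), S (at most `card` non-zeros) and residual N."""
    A = np.asarray(A, dtype=float)
    S = np.zeros_like(A)
    L = np.zeros_like(A)
    for _ in range(n_iter):
        # Low-rank update: best rank-r approximation of A - S (truncated SVD).
        # Assumption: the real GoDec replaces this SVD with random projections.
        U, s, Vt = np.linalg.svd(A - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Sparse update: keep only the `card` largest-magnitude entries of A - L.
        R = A - L
        S = np.zeros_like(R)
        if card > 0:
            idx = np.argpartition(np.abs(R), -card, axis=None)[-card:]
            S.flat[idx] = R.flat[idx]
    N = A - L - S  # dense residual with small entries
    return L, S, N

# Synthetic stand-in for a surveillance video: static background plus sparse spikes.
rng = np.random.default_rng(0)
m, n = 200, 30                                   # pixels per frame, number of frames
A = np.outer(rng.uniform(50, 200, m), np.ones(n))
mask = rng.random((m, n)) < 0.05                 # about 5% "moving object" pixels
A[mask] += rng.uniform(-100, 100, mask.sum())

L, S, N = naive_godec(A, rank=2, card=int(0.15 * m * n))
```

Each of the two updates minimizes the Frobenius norm of A − L − S over one variable while the other is fixed, so the residual norm cannot increase from one iteration to the next; this is what makes a fixed iteration budget a reasonable way to bound the complexity.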
3. PROPOSED SCHEME
In this section, we propose a scheme to improve the coding efficiency of existing block-based codecs for surveillance videos based on the low-rank and sparse decomposition (LRSD). For simplicity, we take H.264/AVC as an example of block-based codecs and only consider the compression of grayscale videos.

Figure 1: Different components separated by ALM [4]. (a) Original: the first frame of the original video. (b) Low-rank: the background restored from the first column of L. (c) Sparse: the foreground converted from the first column of S.

Given a surveillance video sequence of resolution H × W, the proposed scheme consists of the following steps:
1. Stack a set of frames of the video as columns of a matrix $A \in \mathbb{R}^{m \times n}$, where m = HW and n is the number of frames;
2. Separate the components of A using GoDec, so that A = L + S + N, where L is a rank-r matrix, S is a sparse matrix, and N is a dense residual matrix that has small entries;
3. Compute a low-rank decomposition of L, so that L = CX, where the m × r matrix C contains some columns of L, representing the principal components of the background, and X is an r × n matrix, storing the coefficients to recover each background frame from the principal components;
4. Construct Ŝ by normalizing the entries of S + N so that the entries of the dense matrix Ŝ range from 0 to 255;
5. Convert C and Ŝ to two video sequences, denoted as V_C and V_S respectively, and compress them separately using H.264/AVC.

Fig. 2 shows the diagram of the proposed codec. It can be seen that the compressed video sequence consists of four parts: the bit streams of V_C and V_S, the r × n matrix X (“coefficient 1”), and the denormalization coefficients (“coefficient 2”) for restoring S + N from Ŝ. Based on the observation that GoDec often converges in about 10 iterations, we set the maximum number of iterations to 12 in the proposed scheme. In the rest of this section, we describe the steps of the encoding scheme in detail and explain our choices of parameters by showing some experimental results. Unless otherwise specified, the “Hall” video is used by default in the following examples.
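Steps 1 and 3 above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the implementation used in the paper: the CUR decomposition of [3] selects columns and rows adaptively, whereas this sketch substitutes a simple greedy pivoting rule to find r independent columns and rows. All function names and the synthetic rank-2 background are hypothetical.

```python
import numpy as np

def frames_to_matrix(frames):
    """Step 1: stack n frames of shape (H, W) as columns of an (H*W) x n matrix."""
    n = frames.shape[0]
    return frames.reshape(n, -1).T.astype(float)

def matrix_to_frames(M, height, width):
    """Inverse of frames_to_matrix: each column becomes one (H, W) frame."""
    return M.T.reshape(-1, height, width)

def pick_independent_columns(M, r, tol=1e-10):
    """Greedy pivoting (an assumption, not the sampling rule of [3]):
    repeatedly take the column with the largest residual norm."""
    residual = np.array(M, dtype=float)
    picked = []
    for _ in range(r):
        norms = np.linalg.norm(residual, axis=0)
        j = int(np.argmax(norms))
        if norms[j] <= tol:                       # M has rank < r, stop early
            break
        picked.append(j)
        q = residual[:, j] / norms[j]
        residual -= np.outer(q, q @ residual)     # project out the chosen direction
    return picked

def cur_factorize(L, r):
    """Step 3: return C (m x r) and X = U R (r x n) so that L is close to C X."""
    cols = pick_independent_columns(L, r)
    C = L[:, cols]                                # r selected columns of L
    rows = pick_independent_columns(C.T, len(cols))
    W = L[np.ix_(rows, cols)]                     # intersection of C and R
    U = np.linalg.pinv(W)                         # pseudo-inverse of the intersection
    R = L[rows, :]                                # r selected rows of L
    return C, U @ R

# Synthetic rank-2 background: two base images mixed with slowly varying weights.
rng = np.random.default_rng(1)
H, W, n = 12, 16, 20
base = rng.uniform(0, 255, (2, H, W))
gain = 1.0 + 0.1 * np.sin(np.arange(n))           # mild illumination change
mix = np.linspace(0.0, 1.0, n)
frames = gain[:, None, None] * base[0] + mix[:, None, None] * base[1]

L = frames_to_matrix(frames)
C, X = cur_factorize(L, r=2)
V_C = matrix_to_frames(C, H, W)                   # the r frames that would be coded
```

When L has rank exactly r and the selected columns and rows are independent, the product CX reproduces L up to floating-point error, so only r frames plus the small r × n matrix X need to be kept for the background.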
3.1 Encoding the low-rank component
In this paper, we propose to compress L by exploiting its low-rank property. In particular, we factorize the m × n matrix L into two small matrices by computing the CUR decomposition [3] of L. That is,

$$L = CUR, \tag{3}$$

where the m × r matrix C consists of r adaptively selected columns of L, the r × n matrix R consists of r adaptively selected rows of L, and the r × r matrix U is the pseudo-inverse of the intersection of C and R. In this way, L is divided into two small matrices, C and X = UR. Matrix C is used to restore the r independent frames of the background and to construct a short video V_C, which has only r frames. Then, we compress V_C via H.264/AVC and directly store X without compression, since the amount of data in X is small. At the decoder side, C is recovered by stacking the decoded frames of V_C as columns, and L is then restored by multiplying C and X.

Figure 2: Overview of the proposed (a) encoding scheme and (b) decoding scheme. (a) Encoder: video frames → LRSD → low-rank component → column-row-based decomposition → independent frames → encoding via H.264/AVC → bit stream 1, with coefficients 1 stored separately; sparse and residual components → normalization → encoding via H.264/AVC → bit stream 2, with coefficients 2 stored separately. (b) Decoder: bit stream 1 → decoding via H.264/AVC → independent frames → multiplying (with coefficients 1) → low-rank component; bit stream 2 → decoding via H.264/AVC → restoration (with coefficients 2) → sparse and residual components; adding the two → decoded video frames.

The low-rank component L can also be directly converted to a video V_L that represents all the background frames. As the frames of V_L are highly correlated, directly compressing V_L via H.264/AVC is also expected to be efficient. Thus, for encoding L, we conduct experiments to compare directly encoding V_L with the proposed scheme of encoding V_C and directly storing X. We use identical quantization parameters for both methods, and the distortion of the decoded video is measured by the average peak signal-to-noise ratio (PSNR) of the luminance component. As shown in Fig. 3, the proposed scheme is more efficient than directly compressing V_L via H.264/AVC whether the rank of L is 3, 5, or 7. This is mainly because the block-based coding scheme is inefficient in exploiting the global redundancy existing in the background frames of surveillance videos. It can also be seen that the proposed scheme tends to be less efficient as the rank increases, since the size of C increases. Our studies show that a rank of 2 is usually a good choice for representing the background of common surveillance videos. Thus, we set the target rank of L to 2 when using GoDec in the proposed scheme.

Figure 3: A comparison of coding the low-rank background via H.264/AVC and the proposed scheme. Axes: average Y-PSNR (dB) versus bit rate (kbps). Curves: code V_L via H.264/AVC (rank = 3, 5, 7); code V_C and store X (rank = 3, 5, 7).

Figure 4: A comparison of coding V_S and the original video via H.264/AVC when the cardinality of S changes. Axes: average Y-PSNR (dB) versus bit rate (kbps). Curves: 10% non-zero entries in S; 35% non-zero entries in S; original video.

3.2 Encoding the sparse and residual components
To guarantee sufficiently high quality of the decoded video, both the sparse component S and the residual component N need to be encoded. In the proposed scheme, considering that the entries of S + N can be positive or negative, we first normalize S + N to a matrix Ŝ with the value range [0, 255]. As a result, the maximum and minimum entries of S + N must be stored; they constitute “coefficient 2” shown in Fig. 2. Then, Ŝ is converted into a video, denoted as V_S, which is directly encoded. Existing block-based codecs such as H.264/AVC are expected to be efficient in compressing V_S, because each frame of V_S contains many nearly flat blocks, which become exactly flat after moderate quantization. Fig. 4 compares compressing the original video with compressing V_S as the cardinality of S changes, where “cardinality” refers to the number of non-zero entries of S. These comparisons indicate that this compression scheme tends to be less efficient as the cardinality of S grows. Thus, we empirically set the target cardinality of S to 0.15mn when using GoDec in the proposed scheme. In addition, the comparison between compressing the original video and compressing V_S shows that the proposed scheme might not be suitable for high-fidelity or lossless video coding, since its coding efficiency becomes lower at very high PSNR ranges.
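The normalization and its inverse can be sketched as follows. The paper states only that S + N is normalized to [0, 255] and that its maximum and minimum are stored, so the linear min-max mapping and the rounding to 8-bit values used here are assumptions; the function names and the random test matrix are hypothetical.

```python
import numpy as np

def normalize_residual(D):
    """Map the entries of D = S + N linearly onto [0, 255] and round to 8 bits.

    Returns the normalized matrix and the (min, max) pair, which plays the
    role of the stored denormalization coefficients.
    """
    lo, hi = float(D.min()), float(D.max())
    scale = 255.0 / (hi - lo) if hi > lo else 1.0   # guard against a constant matrix
    S_hat = np.round((D - lo) * scale).astype(np.uint8)
    return S_hat, (lo, hi)

def denormalize_residual(S_hat, coeffs):
    """Decoder side: restore an approximation of S + N from S_hat and (min, max)."""
    lo, hi = coeffs
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    return S_hat.astype(float) * scale + lo

# Hypothetical signed residual with both positive and negative entries.
rng = np.random.default_rng(2)
D = rng.uniform(-40.0, 60.0, size=(192, 20))

S_hat, coeffs = normalize_residual(D)
D_rec = denormalize_residual(S_hat, coeffs)
max_err = np.abs(D_rec - D).max()
```

Even before any video coding, the 8-bit rounding limits the precision of the restored S + N to half a quantization step, (max − min)/510, which is consistent with the observation above that the scheme is less suited to high-fidelity or lossless coding.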
4. EXPERIMENTAL RESULTS
In this section, we conduct experiments to evaluate the performance of the proposed codec. The H.264/AVC reference software¹ used is JM18.2, which is implemented with the fidelity range extensions (FRExt) [6]. Four representative surveillance videos of 200 frames², named “Hall”, “Escalator”, “Campus”, and “Lobby”, are used as test sequences, which are shown in Fig. 5. Note that in the “Escalator” video, besides the stationary background and common moving objects, there are several escalators that cause periodic perturbations. The “Campus” video has several trees that cause irregular perturbations, and the “Lobby” video has a sharp change of brightness caused by switching off the light.

¹ http://iphome.hhi.de/suehring/tml/download/
² http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html

Figure 5: The four surveillance videos used in our experiments. (a) Hall. (b) Escalator. (c) Campus. (d) Lobby.

Considering that V_C, representing the essential background information for all the background frames, is more important than V_S in terms of overall reconstruction quality, we set the quantization parameter (QP) for encoding V_S to be twice that for encoding V_C in the proposed scheme. Fig. 6 shows the PSNR performance of the proposed scheme at different bit rates, obtained by varying the QP of V_S from 8 to 48, as well as the performance of encoding the original videos directly via H.264/AVC. On a common PC with 2 GB of memory and a 2.67 GHz dual-core CPU, the GoDec algorithm takes about 15 seconds to converge, and H.264/AVC takes about 45 seconds to compress 200 frames.

Figure 6: Experimental results of compressing four surveillance videos via the proposed scheme and H.264/AVC. Each panel plots average Y-PSNR (dB) against bit rate (kbps) for H.264/AVC and the proposed scheme. (a) Hall. (b) Escalator. (c) Campus. (d) Lobby.

Fig. 6 shows that the proposed scheme significantly outperforms H.264/AVC for encoding the surveillance videos at bit rates lower than 200 kbps, where the PSNR of the decoded videos is sufficiently high for practical applications. In particular, the “Lobby” video fits the low-rank and sparse structure very well, because the background remains linearly dependent despite the sharp change of brightness and there is only one person moving in the video. Thus, the proposed scheme obtains a significant PSNR gain of up to 3 dB. The “Hall” and “Escalator” videos have multiple moving objects, and thus more bits are required to compress the sparse component. The corresponding PSNR gains are smaller, up to 2 dB. In the “Campus” video, although the sparsity assumption on the foreground is broken by the irregular perturbations of the background trees, our scheme can still achieve a PSNR gain of up to 1 dB. Therefore, the proposed scheme is well suited to the compression of surveillance videos, especially when there are few movements in the background and the bit rate is relatively low.

5. CONCLUSION
Surveillance videos usually have massive temporal redundancy due to their static or gradually changing background. However, the state-of-the-art coding standard H.264/AVC cannot fully exploit such redundancy in a global manner due to its block-based nature. The emerging theory of low-rank and sparse decomposition (LRSD) provides efficient algorithms for the separation of the background and the moving objects in surveillance videos. In this paper, we have proposed a scheme that efficiently encodes the two components and achieves an overall compression efficiency that outperforms H.264/AVC. We believe such an approach sheds light on advancing object-based compression and streaming.

6. ACKNOWLEDGEMENTS
This work is partially supported by MoE AcRF Tier 2 Grant, Singapore, Grant No.: T208B1218, and NSFC Grants No. 61033004, 61070138, and 61072104.

7. REFERENCES
[1] J. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20(4):1956–1982, 2010.
[2] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, June 2011.
[3] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl., 30:844–881, 2008.
[4] Z. Lin, M. Chen, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Technical report, UIUC, Oct. 2010.
[5] K. Rijkse. H.263: video coding for low-bit-rate communication. IEEE Commun. Mag., 34(12):42–45, Dec. 1996.
[6] G. J. Sullivan, P. Topiwala, and A. Luthra. The H.264/AVC advanced video coding standard: Overview and introduction to the fidelity range extensions. In SPIE Conference on Applications of Digital Image Processing XXVII, 2004.
[7] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol., 13(7):560–576, July 2003.
[8] T. Zhou and D. Tao. GoDec: Randomized low-rank & sparse matrix decomposition in noisy case. In International Conference on Machine Learning (ICML), 2011.