A New Concept of Real-Time Security Camera Monitoring With Privacy Protection by Masking Moving Objects

Kenichi Yabuta, Hitoshi Kitazawa, and Toshihisa Tanaka
Department of Electrical and Electronic Engineering, Tokyo University of Agriculture and Technology, 2–24–16 Naka-cho, Koganei-shi, Tokyo 184–8588, Japan
ABSTRACT Recently, the number of security monitoring cameras has been increasing rapidly. However, it is normally difficult to know when and where we are monitored by these cameras and how the recorded images are stored and/or used. Therefore, how to protect privacy in the recorded images is a crucial issue. In this paper, we address this problem and introduce a framework for security monitoring systems that takes privacy protection into account. We state requirements for monitoring systems in this framework and propose a possible implementation that satisfies them. To protect the privacy of recorded objects, they are made invisible by appropriate image processing techniques. Moreover, the original objects are encrypted and watermarked into the image containing the “invisible” objects, which is coded by the JPEG standard. Therefore, the image decoded by a normal JPEG viewer shows the objects as unrecognizable or invisible. We also introduce in this paper a so-called “special viewer” that decrypts and displays the original objects. This special viewer can be used by a limited set of users when necessary, e.g., for crime investigation. The special viewer allows us to choose which objects are decoded and displayed. Moreover, the proposed system can run in real time, since no future frame is needed to generate a bitstream. Keywords: Privacy protection, Security camera, Watermarking, JPEG encoding, Distributed computing, Object tracking
Further author information: (Send correspondence to K.Y.)
K.Y.: E-mail: [email protected]
H.K.: E-mail: [email protected]
T.T.: E-mail: [email protected]
1. INTRODUCTION Security monitoring cameras are becoming more and more important for deterring and investigating crimes. While the number of monitoring cameras is rapidly increasing, it should be noted that these cameras record not only criminals but also ordinary people who are not aware that they are being monitored. It is therefore natural to address how to protect the privacy of recorded objects such as people, cars, and so on. Moreover, the recorded images may be distributed without any permission from the monitored people. Therefore, we should establish a framework for security monitoring systems that takes privacy protection into account. A straightforward way to solve this problem is to degrade the quality of moving objects. This idea has been proposed by Kitahara et al.1 and Wickramasuriya et al.2 In their work, non-reversible methods, which decrease the resolution of objects whose privacy should be protected, are used. These methods can make a moving object unrecognizable while keeping the nature of the original object; that is, one can still tell whether the degraded object is a person, a car, or something else. However, the non-reversible processing loses essential information about the objects, such as human faces and the number plates of cars. This implies that the degradation reduces the reliability of monitoring cameras in terms of security. In other words, if the face image of a criminal, for example, is even slightly destroyed, we may not be able to identify that criminal. A reversible method3 is a possible solution to this drawback. This method can display object images whose privacy is either protected or unprotected. However, this system needs special equipment, and therefore widely used image viewers such as JPEG decoders cannot be used.
Recently, the authors have proposed a framework for security monitoring systems in which the privacy of objects can be protected.4 In this framework, objects whose privacy should be protected are made invisible, and the input image including these objects is encoded with the JPEG5 into a bitstream, in which the original objects are encrypted with a password and watermarked. Therefore, a normal JPEG viewer displays only the masked image, where the privacy of the objects is protected. On the other hand, if we use a special viewer, we can reconstruct the input image in which the original objects are preserved. The special viewer requires the password to decrypt the objects that have been embedded in the bitstream and then reconstructs the input image. This means that even if recorded images taken by monitoring cameras are distributed, the objects are unrecognizable as long as normal JPEG viewers are used. However, this method still has several problems. First, all the masked objects are reconstructed by the special viewer, even if one wants to reconstruct only some of the masked objects and keep the others invisible or unrecognizable. Second, embedding the object data in the bitstream reduces the image quality and increases the data size. Third, heavy computation makes it difficult to run the encoder in real time. In this paper, we address these problems and propose an improved monitoring method. The proposed method enables us to decode only one specific object by exploiting object tracking information. To do this, the identification number of each object is embedded into the encrypted data together with the corresponding password. Therefore, in the decoder, every object can be selectively reconstructed. When embedding the object data into the original image, we use only the luminance (Y) component. This yields improved image quality and reduced data size. Finally, to achieve real-time processing, we introduce distributed computing for the encoder. The rest of this paper is organized as follows. In Section 2, we state requirements for the new security monitoring system. In Section 3, we propose an algorithm that meets these requirements. In Section 3.2, we describe how to individually decode a single object out of multiple objects in a frame. In Section 3.3, we explain a watermarking-based6 method to embed these objects in a JPEG bitstream generated from a masked image frame. In Section 3.4, we illustrate a distributed computing system that yields real-time processing. In Section 4, we show experimental results and discuss them. Section 5 concludes our work and mentions open problems.
2. SYSTEM REQUIREMENTS The requirements for a fixed monitoring camera that protects privacy have been stated by the authors4 as follows.
1. Masked images should be displayable by normal viewers for compressed images, such as JPEG viewers. 2. Moving objects in masked images should be invisible or unrecognizable. 3. Original input images should be reconstructable by a special viewer with a decoding password. A decoded image should be as close to the input image as possible.
4. The encoder should generate only one JPEG bitstream. A normal viewer and the special viewer can decode the masked image and the reconstructed image, respectively, from this single bitstream.
In our previous work,4 a possible realization that satisfies the above requirements was proposed. However, these requirements do not consider the case where an input image includes multiple objects to be masked. The processing time is also quite important in practice: for practical use of a monitoring system, the encoder should code an input image in real time, and batch processing is not desirable. In this paper, therefore, we propose the following additional requirements. 5. The special viewer can selectively reconstruct the original objects. 6. Encoding should be performed in real time. In the following sections, we describe a novel monitoring system that satisfies requirements 1–6.
3. PROPOSED SYSTEM AND MASKING METHOD 3.1. Overview Figure 1 shows the flow of the whole system proposed in this paper. First, moving objects are extracted from the original input image and are given object identification numbers by object tracking. An object that is considered the same as one in the previous and next frames is given the same identification number. If a new object appears in a future frame, a new object number is assigned. In this paper, we apply the background subtraction and heuristic tracking algorithms proposed in the literature.7–11 Then, the input image, together with the object information and the associated object numbers, is sent to the encoder, where the objects are masked by the methods described below so that their privacy is protected. Figure 2 shows the encoder structure. The extracted moving objects are coded by the discrete cosine transform (DCT), the zig-zag scan, and run-length/Huffman coding in the same manner as the JPEG. One bitstream is generated for each object. Each bitstream is encrypted by AES with a password that includes the object number as one of its parts. In the original input image, the moving object regions that have been encrypted are masked by either “scrambling” or “erasing.” “Scrambling” means that all pixels in the object
[Figure 1: system flow. A monitoring camera supplies the original image to moving object extraction and tracking; the encoder produces a bitstream carrying the masked image and the object numbers; after transmission and/or storage, a normal viewer decodes only the masked image, while the special viewer with a password decodes the reconstructed image.]
Figure 1. Flow of the masking process.
region are randomly permuted. “Erasing” replaces the object region with the corresponding background region, which can be obtained from past frames. This masked image is transformed by the DCT. The encrypted object data are watermarked into the DCT coefficients of the masked image. Finally, the encoder produces one JPEG bitstream per frame. In the receiver, this bitstream can be decoded by two types of decoders. One is a normal JPEG viewer, which displays only the masked image, where the moving objects are unrecognizable or invisible. On the other hand, when the “special” viewer with a password decodes this bitstream, it can show the input image in which the moving objects are not masked. In the following subsections, the details of each building block are described.
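As an illustration, the “scrambling” mask can be sketched as follows. This is a hypothetical minimal implementation, not the authors’ code: it assumes the image is a 2-D list of pixel values and the object region is given as a boolean mask of the same shape.

```python
import random

def scramble_region(image, mask, seed=0):
    """Randomly permute all pixels inside the object region given by mask.

    image: 2-D list of pixel values; mask: 2-D list of booleans marking
    the moving-object region. A seeded RNG makes the result repeatable.
    """
    rng = random.Random(seed)
    coords = [(r, c) for r, row in enumerate(mask)
              for c, inside in enumerate(row) if inside]
    values = [image[r][c] for r, c in coords]
    rng.shuffle(values)              # permute the object pixels only
    out = [row[:] for row in image]  # background pixels stay untouched
    for (r, c), v in zip(coords, values):
        out[r][c] = v
    return out
```

Note that scrambling only shuffles pixels within the region, so the object silhouette remains while its content becomes unrecognizable; “erasing” would instead copy the corresponding pixels from a stored background frame.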
[Figure 2: encoder structure. The original image is scrambled or erased into the masked image, transformed by the DCT, watermarked with the encrypted object data, and Huffman coded into one JPEG bitstream. Each moving object is separately DCT transformed and Huffman coded by the object encoder, then AES encrypted with a password involving its object number.]
Figure 2. Encoder structure.
3.2. Moving Object Extraction and Masking In general, an input image can include multiple objects. However, the special viewer developed in the previous paper4 displays the original image in full. In other words, even when some of the masked objects should stay invisible, the original images of all the objects are displayed. In many cases, however, it is unnecessary to reconstruct all the objects in the masked image. To avoid this situation, we propose a modified encoder/decoder structure that allows us to selectively obtain the original images of specific objects when the special viewer is used. 3.2.1. Tracking Objects and Their Identification Numbers Extracted moving objects are tracked in a time series of input images. This tracking is achieved by comparing the current object with those in previous frames. When an object is considered the same as an object in a past frame, it is given the same object number in the successive frames. The similarity is measured by height, width, area, speed, and so on. Details of the tracking method can be found in the literature.9 In the decoder, by specifying an object identification number, we can selectively reconstruct the object with that number.
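A minimal sketch of this identity assignment follows, assuming each object is summarized by a feature tuple (height, width, area) and matched to the nearest previous-frame object within a threshold. The feature set and threshold here are illustrative, not the exact criteria of the cited tracker:

```python
def assign_object_numbers(prev, current, next_id, max_dist=20.0):
    """Give each current-frame object the number of the most similar
    previous-frame object, or a fresh number if none is close enough.

    prev: dict mapping object number -> (height, width, area) features
    current: list of feature tuples for the current frame
    Returns (dict of number -> features, next unused number).
    """
    def dist(a, b):  # Euclidean distance between feature tuples
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    assigned, free = {}, dict(prev)
    for feat in current:
        best = min(free, key=lambda n: dist(free[n], feat), default=None)
        if best is not None and dist(free[best], feat) <= max_dist:
            assigned[best] = feat
            del free[best]            # each number is used once per frame
        else:
            assigned[next_id] = feat  # a new object gets a new number
            next_id += 1
    return assigned, next_id
```

In a real tracker, speed and appearance would also enter the similarity measure, and unmatched previous objects would be kept alive for a few frames before being dropped.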
[Figure 3: encoder side: discrimination codes marking the data head and the data end are added to each object data; each object data is AES encrypted with a password whose part is the object number; the encrypted data are combined into the data to be embedded. Decoder side: with a password containing one object number (e.g., 2), only the corresponding part decrypts correctly; that object data is extracted from between the discrimination codes and displayed, while the remaining parts are meaningless.]
Figure 3. Construction of the embedded data so that only the desired object can be decoded.
3.2.2. Encryption We embed the extracted objects into the image processed by scrambling or erasing, as described earlier, so that the special viewer can display these objects. The object data to be embedded should be encrypted to protect their privacy. In this subsection, a method for encrypting individual objects is presented. First, the input image is divided into so-called minimum coded units (MCUs), which are defined in the JPEG image compression standard. The size of each MCU is 8 × 8 pixels. If an MCU contains at least one pixel of a moving object region, it is encoded by the JPEG into a bitstream, which is stored in an N-byte array when the size of this bitstream is N bytes. It should be noted that the bitstream consists only of Huffman codes; the JPEG header is removed. The set of bitstreams generated from one object, together with the corresponding object identification number, is treated as one bitstream. Figure 3 depicts the flow of the encryption. In this figure, the concatenated bitstream for object number n is denoted by “object data n.” As shown in Fig. 3, each object data has discrimination codes at the head and the end of the data. If the byte length of the object data with these discrimination codes is not a multiple of 16, dummy data, identical to neither the discrimination code of the data head nor that of the data end, is appended after the discrimination code of the data end. Then, this object data is encrypted with a password by AES,12 a block cipher with a block size of 16 bytes. A part of this password is the object identification number. This encryption is executed for every object data. All encrypted object data are combined into one bitstream, as shown in Fig. 3. Finally, this bitstream is watermarked into the scrambled/erased image, as will be explained later. 3.2.3. Decryption In the special viewer, the watermarked bitstream is extracted and sent to the decryption stage. This bitstream consists of all the encrypted objects; however, the decoder cannot know which part of the bitstream corresponds to which object. Therefore, we apply only one password to this bitstream for decryption. If we apply the password involving number n, then we correctly obtain object data n. In other words, only the part of the bitstream associated with the nth object is decrypted; the other parts are meaningless and are discarded. The other objects are thus not correctly decoded and remain unrecognizable. In summary, we can reconstruct only the object that we want to display by supplying the corresponding object number, and the privacy of the other objects remains protected.
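The framing and padding described above can be sketched as follows. Since AES itself is out of scope here, a stand-in stream cipher derived from SHA-256 (XOR with a hashed keystream) is used purely as a placeholder for AES; the discrimination code values are hypothetical, and the sketch assumes the bitstream does not itself contain the code bytes:

```python
import hashlib

HEAD = b"\xff\x01"   # hypothetical discrimination code of the data head
END = b"\xff\x02"    # hypothetical discrimination code of the data end
BLOCK = 16           # AES block size in bytes

def _keystream_xor(data, key):
    """Placeholder for AES: XOR with a SHA-256 counter keystream."""
    out = bytearray()
    for i in range(0, len(data), 32):
        pad = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        out += bytes(a ^ b for a, b in zip(data[i:i + 32], pad))
    return bytes(out)

def encrypt_object(bitstream, password, object_number):
    """Frame one object's Huffman bitstream and encrypt it.

    The object number is mixed into the key, so the decoder can only
    recover the object whose number it supplies.
    """
    framed = HEAD + bitstream + END
    if len(framed) % BLOCK:                    # pad to a multiple of 16
        framed += b"\x00" * (BLOCK - len(framed) % BLOCK)
    key = password + str(object_number).encode()
    return _keystream_xor(framed, key)

def decrypt_object(encrypted, password, object_number):
    """Return the object data if the right number was given, else None."""
    framed = _keystream_xor(encrypted, password + str(object_number).encode())
    head, _, rest = framed.partition(HEAD)
    if head != b"" or END not in rest:         # wrong key: codes not found
        return None
    return rest.partition(END)[0]
```

With the wrong object number, the decrypted bytes are noise, the discrimination codes are not found, and the data is discarded, which mirrors how the special viewer leaves non-selected objects unrecognizable.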
Remarks. If multiple objects overlap, we cannot separate the overlapped region into individual objects. Recognizing the boundary of each object is a difficult problem and is outside the scope of this paper. We therefore use a simple heuristic: if an object suddenly becomes very large from one frame to the next, we assume that other objects overlap the target object. To protect the privacy of the objects other than the target, the decoder stops reconstruction in this case.
[Figure 4: the 64 DCT coefficients of a block in zig-zag order from low frequency (1) to high frequency (64). The low-frequency coefficients are not used; a default area of middle-frequency coefficients is used for watermarking, and extra areas 1 to 5 extend step by step toward the high frequencies.]
Figure 4. Position of DCT coefficients used for watermarking.
3.3. Embedding Objects by Watermarking As mentioned in the previous subsections, we embed the encrypted data into the scrambled/erased image by watermarking. We obtain the DCT coefficients of the scrambled/erased image by using the JPEG technique. We use the DCT coefficients of the middle frequencies for watermarking, since changes to the low-frequency coefficients are very noticeable, and changes to the high-frequency coefficients, which are coarsely quantized, can lead to large changes when the JPEG decoder dequantizes the data. Compared with the low- or high-frequency DCT coefficients, a small perturbation of a middle-frequency DCT coefficient does not visually affect the reconstructed image. Therefore, we propose to embed the moving object data into the middle-frequency coefficients indicated in Fig. 4. The least significant bits (LSBs) of the quantized DCT coefficients are replaced by the encrypted data bit by bit. The positions to be watermarked are decided according to the number of bytes of the data. If the size of the input image is 320 × 240, then the number of MCUs is (320 × 240)/16² × 6 = 1800. The positions where the data are embedded are selected dynamically according to the data size, as shown in Table 1. 3.3.1. Influence on Image Quality Watermarking leads to reduced compression efficiency and lower image quality. These undesired effects can be reduced by exploiting a property of the MCUs. In the case of the 4:2:0 format of JPEG,5 for four blocks of 8 × 8 pixels, we obtain six MCUs, four of which are Y components and two of which are U and V components, as shown
Table 1. Adaptive embedding position selection by data size.

  Size of embedded data [bytes]   Used coefficients
  1 to 2362                       10 to 27 (default only)
  2363 to 3562                    10 to 35 (default and extra 1)
  3563 to 4612                    10 to 42 (default and extra 1 to 2)
  4613 to 5512                    10 to 48 (default and extra 1 to 3)
  5513 to 6262                    10 to 53 (default and extra 1 to 4)
  6263 to 7762                    10 to 64 (default and extra 1 to 5)
[Figure 5: a 16 × 16 region of the input image is divided into four 8 × 8 Y blocks, which are not subsampled and are used for watermarking, and one U and one V block, which are subsampled by 1:4 and are not used for watermarking.]
Figure 5. The input image is transformed into minimum coded units (MCUs).
in Fig. 5. If the LSBs of the U and V components are used for watermarking, the degradation is larger than when only the Y component is used, because the U and V components have been subsampled at a ratio of 1:4.
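The LSB replacement with the adaptive position selection of Table 1 can be sketched as follows. The block representation (a flat list of 64 quantized coefficients per Y block, in zig-zag order with 1-based positions as in Fig. 4) is an assumption for illustration, not the authors’ data layout:

```python
# Hypothetical mapping from payload size to zig-zag positions (Table 1);
# each pair is (max payload bytes, 1-based exclusive end of the range).
RANGES = [(2362, 28), (3562, 36), (4612, 43), (5512, 49), (6262, 54), (7762, 65)]

def select_positions(nbytes):
    """Pick the coefficient range for a payload of nbytes (Table 1)."""
    for limit, end in RANGES:
        if nbytes <= limit:
            return range(9, end - 1)  # 0-based indices for coefficients 10..end-1
    raise ValueError("payload too large for one frame")

def embed(blocks, payload):
    """Replace LSBs of middle-frequency coefficients with payload bits."""
    positions = select_positions(len(payload))
    bits = [(byte >> k) & 1 for byte in payload for k in range(7, -1, -1)]
    it = iter(bits)
    for block in blocks:                    # Y blocks only (Sec. 3.3.1)
        for p in positions:
            b = next(it, None)
            if b is None:
                return positions
            block[p] = (block[p] & ~1) | b  # overwrite the LSB in place
    if next(it, None) is not None:
        raise ValueError("not enough blocks for this payload")
    return positions

def extract(blocks, positions, nbytes):
    """Read back nbytes from the watermarked LSBs."""
    bits = [block[p] & 1 for block in blocks for p in positions]
    return bytes(
        int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, nbytes * 8, 8)
    )
```

The decoder would first recover the payload size (e.g., from a fixed-position header) to know which positions were used; that detail is omitted here.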
3.4. Enhancing Processing Speed 3.4.1. Using Distributed Computing Since the moving object extraction, the tracking, and the masking require heavy computation, we use distributed computing to enhance the processing speed. Figure 6 shows the hardware configuration of the distributed computing system for real-time security monitoring and surveillance. One Windows computer, four Linux computers, and a network camera are connected by Gigabit Ethernet. The main process, including the GUI (graphical user interface), runs on the Windows computer, and the numerically intensive processes such as masking run on the Linux computers. We use MPICH,13 a free PC cluster implementation of MPI (Message Passing Interface), on the four Linux PCs. TCP/IP send/recv functions are used for data transmission between the Windows PC and Linux PC1. The functions of each computer are summarized as follows.
[Figure 6: a network camera and a Windows PC (Pentium 4, 2.6 GHz) are connected over Gigabit Ethernet to four Linux PCs (Pentium 4, 3.0 GHz); the four Linux PCs are clustered with MPI (MPICH).]
Figure 6. The hardware configuration of distributed computing.
Windows: GUI, image input from the network camera, moving object extraction, and tracking
Linux 1: masking
Linux 2: object recognition by clustering and matching (under development)
Linux 3: human action analysis (under development)
Linux 4: reserved
Figure 7 shows the process flow of the distributed computing for the main process and the masking part. To avoid waiting for camera input and masking results, the network camera interface and the masking interface are executed as separate threads. For data synchronization between these threads, we use semaphores.
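The two-thread synchronization can be sketched as a standard producer/consumer pattern with semaphores. This is a generic illustration, not the authors’ implementation; the camera and masking stages are reduced to stubs:

```python
import threading

frame_slot = []                 # shared one-frame buffer between threads
empty = threading.Semaphore(1)  # signalled when the slot is free
full = threading.Semaphore(0)   # signalled when the slot holds a frame

def camera_thread(n_frames):
    """Stands in for the network camera interface thread."""
    for i in range(n_frames):
        empty.acquire()                      # wait until the slot is free
        frame_slot.append(f"frame-{i}")      # stub for a captured image
        full.release()

def main_thread(n_frames, results):
    """Stands in for extraction/tracking/masking on each frame."""
    for _ in range(n_frames):
        full.acquire()                       # wait for a new frame
        frame = frame_slot.pop()
        results.append("masked " + frame)    # stub for the masking pipeline
        empty.release()

results = []
t1 = threading.Thread(target=camera_thread, args=(3,))
t2 = threading.Thread(target=main_thread, args=(3, results))
t1.start(); t2.start(); t1.join(); t2.join()
```

The semaphore pair guarantees that neither thread touches the shared buffer while the other holds it, which is the property the semaphores in Fig. 7 provide between the camera, main, and masking interface threads.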
4. EXPERIMENTAL RESULTS In this section, we demonstrate the validity of the proposed method described in the previous section.
4.1. Object Reconstruction by the Special Viewer In Fig. 8, we demonstrate how the proposed system works and how a normal JPEG viewer and the special viewer display the decoded image. This figure shows examples of an original image and its scrambled, erased, and decoded versions. We can observe that with a normal viewer, the objects are successfully masked by scrambling or erasing. We cannot identify any person from these degraded objects.
[Figure 7: on the Windows PC, a network camera interface thread performs camera input in a timer loop, while the main thread performs moving object extraction and tracking; a masking interface thread sends the moving object, tracking data, masking region, and background image to Linux PC1, which runs the masking process and returns the masked image for encoding and disk storage.]
Figure 7. Flow of distributed computing.
We next show that the special viewer can display any original object in the masked image by indicating its object number. The upper right image illustrates the case where object number 6 is specified when the masked image is generated by scrambling. It can be seen that only the leftmost object, which was assigned object number 6 in the encoder, is reconstructed. In the lower right image, object number 4 is given to the decoder. Then, only the center object, which has object number 4, is reconstructed. Thus, we satisfy the requirement that the special viewer selectively reconstruct the original objects. Moreover, scrambling/erasing protects the privacy of the other objects. Figure 9 demonstrates that the encoder masks objects and the decoder reconstructs the images sequentially. Figures 9(a), 9(b), 9(c) and 9(d) show a time sequence of masked images. Figures 9(e), 9(f), 9(g) and 9(h) show a time sequence of reconstructed images, where object number 2 is given to the decoder. We can observe in Figs. 9(e), 9(f) and 9(g) that only the object assigned object number 2 is reconstructed. In Fig. 9(d), there is
[Figure 8: the original image and its scrambled and erased versions as displayed by a normal viewer, and the corresponding images decoded by the special viewer.]
Figure 8. The masked and decoded images.
no object that is assigned object number 2; therefore, no object is reconstructed.
4.2. Effect of Watermarking Only the Y Component In this section, we present the effect of using only the Y component for watermarking. Figure 10(a) shows a part of the background in the original image. Figures 10(b) and 10(c) show the corresponding part of the watermarked image: the former is the result of watermarking the YUV components, and the latter shows the case where only the Y component is used. To investigate the effect of using only the Y component, we objectively compare these two images by the peak signal-to-noise ratio (PSNR), defined as

PSNR = 10 log10 (255^2 / MSE),

where MSE is the mean square error between the original input and reconstructed images. Table 2 lists the PSNRs of the images shown in Figs. 10(b) and 10(c). The PSNR of the image in Fig. 10(b) is lower than
[Figure 9: (a)–(d) scrambled images of the 35th, 43rd, 49th, and 55th frames; (e)–(h) the corresponding reconstructed images.]
Figure 9. Sequential masked and reconstructed images.
[Figure 10: (a) original; (b) YUV components watermarked; (c) only the Y component watermarked.]
Figure 10. Comparison of the image quality.
that in Fig. 10(c). This result tells us that using the U and V components yields a decrease in image quality and an increase in file size. Thus, we conclude that the U and V components should not be used for watermarking.
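For reference, the PSNR used for Table 2 can be computed directly from the formula above; this sketch assumes 8-bit images stored as lists of pixel rows:

```python
import math

def psnr(original, reconstructed):
    """Peak signal-to-noise ratio in dB between two 8-bit images."""
    diffs = [
        (a - b) ** 2
        for row_o, row_r in zip(original, reconstructed)
        for a, b in zip(row_o, row_r)
    ]
    mse = sum(diffs) / len(diffs)   # mean square error
    if mse == 0:
        return float("inf")         # identical images
    return 10 * math.log10(255 ** 2 / mse)
```

For instance, a uniform error of 16 gray levels gives MSE = 256 and a PSNR of about 24 dB, well below the values in Table 2.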
Table 2. Comparison of PSNRs.

                      PSNR [dB]   File size [KB]
  YUV components      33.02       19.83
  Y component only    36.40       18.56
Table 3. Comparison of the processing speed.

                 Total elapsed time     Elapsed time (masked frames)   Masking frame rate
  Sequential     48.6 sec (449 frames)  38.7 sec (271 frames)          7.0 frames/sec
  Distributed    27.7 sec (449 frames)  19.1 sec (271 frames)          14.2 frames/sec
4.3. Processing Time with Distributed Computing The comparison of the processing time and the frame rate between sequential processing and distributed processing is shown in Table 3. The experimental data consisted of 449 frames, of which 271 frames include moving objects. The frame size was 320 × 240 pixels. The CPUs of the Windows and Linux PCs are a Pentium 4 at 2.6 GHz and Pentium 4s at 3.0 GHz, respectively. New image data is supplied to the moving object extraction immediately after the previous frame is processed. Therefore, the frame rate changes dynamically according to the processing load. When there is no moving object, the frame rate becomes high and is limited by a predefined value. On the other hand, when there are many moving objects, the frame rate becomes low. Nevertheless, real-time processing is maintained.
5. CONCLUSION We have presented a concept for a real-time security monitoring camera that protects the privacy of moving objects in recorded images. We have clarified the requirements for the monitoring camera by extending the previously presented requirements, and have introduced several improvements over our previous work.4 As a result, even if the image includes multiple objects, we can selectively reconstruct only one object and keep the others invisible or unrecognizable. The improved watermarking method decreases the artifacts in the masked image caused by watermarking. Moreover, the proposed distributed computing leads to real-time or nearly real-time processing. We conclude that the proposed method is very effective for security monitoring with privacy protection. The following problems remain open. First, we need more efficient compression to decrease the size of the output bitstream and to improve the subjective quality of the reconstructed image. Second, we should extend our method to MPEG formats, since the current system is based on motion JPEG. Third, we can use multi-core processors to enhance the distributed computing.
ACKNOWLEDGMENTS This research was partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (C), 17560329, 2005.
REFERENCES 1. I. Kitahara, K. Kogure, and N. Hagita, “Stealth vision for protecting privacy,” in Proc. ICPR 2004, 4, pp. 404–407, 2004. 2. J. Wickramasuriya, M. Alhazzazi, M. Datt, S. Mehrotra, and N. Venkatasubramanian, “Privacy-protecting video surveillance,” in Proc. SPIE, 5671, pp. 64–75, (San Jose, CA), Feb. 2005. 3. A. Senior, S. Pankanti, A. Hampapur, L. Brown, Y. Tian, and A. Ekin, “Blinkering surveillance: Enabling video privacy through computer vision,” Tech. Rep. RC22886 (W0308-109), IBM Technical Paper, Aug. 2003. 4. K. Yabuta, H. Kitazawa, and T. Tanaka, “A new concept of security camera monitoring with privacy protection by masking moving objects,” in Proc. ICPR 2004, 4, pp. 404–407, 2005. 5. W. B. Pennebaker and J. L. Mitchell, JPEG: Still Image Data Compression Standard, Van Nostrand Reinhold, 1993. 6. S. Katzenbeisser and F. A. P. Petitcolas, Information Hiding Techniques for Steganography and Digital Watermarking, Artech House Publishers, 2002. 7. C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” in Proc. CVPR ’99, pp. 246–252, (Fort Collins, CO), June 1999. 8. A. Lipton, H. Fujiyoshi, and R. S. Patil, “Moving target detection and classification from real-time video,” in Proc. IEEE WACV ’98, Nov. 1998. 9. W. E. L. Grimson, C. Stauffer, R. Romano, and L. Lee, “Using adaptive tracking to classify and monitor activities in a site,” in Proc. CVPR ’98, pp. 22–31, 1998.
10. R. T. Collins, A. J. Lipton, and T. Kanade, “A system for video surveillance and monitoring,” Tech. Rep. CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, May 2000. 11. C. J. Veenman, M. J. T. Reinders, and E. Backer, “Resolving motion correspondence for densely moving points,” IEEE Trans. PAMI 23(1), pp. 54–72, 2001. 12. “Announcing the Advanced Encryption Standard (AES).” http://csrc.nist.gov/publications/fips/fips197/fips197.pdf. 13. “MPICH home page.” http://www-unix.mcs.anl.gov/mpi/mpich/.