Privacy Protected Surveillance Using Secure Visual Object Coding

Karl Martin¹,² and Konstantinos N. Plataniotis
Multimedia Laboratory
The Edward S. Rogers Sr. Dept. of Electrical and Computer Engineering
University of Toronto

Multimedia Lab Technical Report 2008.01
January 7, 2008

¹ Corresponding Author: Karl Martin, Multimedia Laboratory, Room BA 4157, The Edward S. Rogers Sr. Department of ECE, University of Toronto, 10 King’s College Road, Toronto, Ontario, M5S 3G4, Canada, phone: 1 (416) 978-6845, FAX: 1 (416) 978-4425, e-mail: [email protected].
² Partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) under the Network for Effective Collaboration Technologies through Advanced Research (NECTAR) project.
Abstract

This report presents the Secure Shape and Texture SPIHT (SecST-SPIHT) scheme for secure coding of arbitrarily-shaped visual objects. The scheme can be employed in a privacy protected surveillance system, whereby visual objects are encrypted so that the content is only available to certain entities, such as persons of authority, possessing the correct decryption key. The secure visual object coder employs Shape and Texture Set Partitioning in Hierarchical Trees (ST-SPIHT) along with a novel selective encryption scheme for efficient, secure storage and transmission of visual object shape and textures. The encryption is performed in the compressed domain and does not affect the rate-distortion performance of the coder. A separate parameter for each encrypted object controls the strength of the encryption versus the required processing overhead. Security analyses are provided, demonstrating the confidentiality of both the encrypted and unencrypted portions of the secured output bit-stream, effectively securing the entire object shape and texture content. Experimental results show that no object details are revealed to attackers who do not possess the correct decryption key. Using typical parameter values and output bit-rates, the SecST-SPIHT coder is shown to require encryption on less than 5% of the output bit-stream, a significant reduction in computational overhead compared to “whole content” encryption schemes.
1 Introduction

Video surveillance of both public and private spaces is expanding at an ever-increasing rate. Consequently, individuals are increasingly concerned about the invasiveness of such ubiquitous surveillance and fear that their privacy is at risk. The demands of law enforcement agencies to prevent and prosecute criminal activity, and the need for private organizations to protect against unauthorized activities on their premises, are often seen to be in conflict with the privacy requirements of individuals. To address this conflict, we propose a secure visual object coder, Secure Shape and Texture Set Partitioning in Hierarchical Trees (SecST-SPIHT). The SecST-SPIHT scheme codes the shape and texture of arbitrarily-shaped visual objects in the same fashion as ST-SPIHT [1], employing a shape-adaptive discrete wavelet transform (SA-DWT) variant [2] and a modified SPIHT algorithm [3] offering progressive/embedded bit-rate output. The proposed scheme incorporates a novel selective encryption algorithm, utilizing a stream cipher to encrypt a small portion of the output bit-stream. The activation of the cipher is controlled by intelligent bit-classification instructions received from the coder. The scheme efficiently and effectively secures the entire shape and texture of the object and ensures that the object cannot be reconstructed, and no object details are revealed, without the exact, correct decryption key. At typical output bit-rates and choice of security parameter, the encryption operation is performed on less than 5% of the output code bits; the remaining unencrypted code bits cannot be decoded due to their dependence on the correct interpretation of the encrypted portion of the code. The progressive/embedded nature of the coder allows the output bit-rate to be varied without affecting the total number of encrypted bits or reducing security.

The SecST-SPIHT secure coder can be employed in surveillance systems where the capture of certain visual objects may be considered privacy invasive (e.g., face and body images). The decryption key required to decrypt and decode the visual object shape and texture may be managed such that only the appropriate authorities are able to access the object data. Furthermore, the key may be tied to the subject’s identity (e.g., through RFID-based tokens), thus giving control of the private content to the subject. The proposed selective encryption procedure makes the scheme suitable for real-time applications where significant processing resources are necessarily consumed for coding of the video stream and traditional “whole content” encryption may be computationally infeasible [4].

One class of existing schemes addressing privacy protection in video surveillance employs scrambling, obscuring, or masking techniques to protect the identity of the subjects [5–7]. In these schemes, the visual texture data of the subject’s face or whole body are discarded or irreversibly transformed. Where SecST-SPIHT stores the encrypted, coded object data, these schemes disallow the use of the content for future investigative purposes and ultimately limit the efficacy of the surveillance system in which they are utilized. In [5], the subject’s body image is masked, revealing only a silhouette. However, such a silhouette may still allow identification of the subject via biometric modalities such as gait [8]. Similarly, in [6], the focus is on removing appearance information while retaining structural information about the body in order to assess behavior. The approach in [7] is to “de-identify” face images so that facial recognition software cannot be used to reliably identify the subject, but enough facial features remain so that the image could still be used for detecting behavior. In this so-called k-Same approach, face images are clustered based on a distance metric, and the images are replaced by a representative image generated by averaging of components based on pixels or eigenvectors. This approach, however, does not obscure the whole body image, and again, the original data is discarded and cannot be retrieved by authorized users.
Another class of privacy protection schemes attempts to separate private features from the input signal and secure them in a fashion so that they may still be retrieved for future use [9–12]. The proposed SecST-SPIHT scheme falls under this category as the subject’s image is coded and encrypted as an arbitrarily-shaped object, and can be retrieved with provision of the correct decryption key. In [9], a region of interest (ROI) is defined for face data within a frame, and the corresponding coefficients downshifted in order to be coded and protected in a separate quality layer using Motion JPEG 2000 [13]. However, using a traditional, non-shape-adaptive wavelet transform, the wavelet domain separation of ROI content only allows for rough separation of content in the spatial domain, thus disallowing the precise object vs. background separation possible with object-based coding. The computer vision approach of [10] provides three policy-dependent options for hiding privacy data: summarization; transformation (obscuration); and encryption. In the case of encrypted output, traditional encryption is applied to the entire private data stream, which is computationally infeasible in many digital video surveillance systems. The proposed scheme in [11] embeds the private information of subjects as an encrypted watermark within the surveillance frames. However, the private data is limited to rectangular regions of the image frame and the utilization of traditional encryption and watermarking may be computationally burdensome. In [12], a reversible wavelet-domain scrambling is performed on ROI-defined private data, thus allowing subsequent retrieval of the private data by authorized users. This approach, as in [9], does not allow explicit spatial domain separation of the object of interest and the background, and the region-of-interest shape is not secured. Furthermore, the scrambling is performed before compression, resulting in a modest reduction in coding performance [12].

A variety of image and video content protection schemes exist for entertainment applications [14, 15]. The techniques employed generally place an emphasis on standards compliance to ensure compatibility with the plethora of existing consumer devices and content delivery systems. However, these techniques may not be directly applicable to privacy-protected surveillance applications, where system operators may demand a greater level of confidentiality over the content and the system must support a mechanism for separation of private content while still maintaining the efficacy of the surveillance system. The schemes in [14] use efficient encryption or shuffling of variable-length codeword concatenations to secure MPEG-4 video streams while maintaining format compliance. However, entire frames are secured and hence the schemes cannot be used to secure only private data in surveillance applications. Furthermore, some image details may be reconstructed through error concealment techniques [14]. In [15], MPEG-4 video objects are secured through selective encryption of Object Descriptors (OD). This approach, however, offers very limited security since only meta-data is secured and none of the actual object content is encrypted.

The proposed SecST-SPIHT secure coder offers an efficient solution for protection of private data in surveillance video. The object-based coding approach allows for explicit separation of a subject’s shape and texture from background imagery, offering a finer level of content granularity not present in ROI-based schemes. The selective encryption algorithm is designed to minimize processing overhead by encrypting the minimum amount of output code bits required to decode the original object shape and texture. The analysis provided in this report verifies the security of the proposed scheme and offers insight into the general design of selective encryption approaches for embedded coders.

The remainder of the report is organized as follows. In Section 2, the SecST-SPIHT scheme is described in detail. Security analysis of SecST-SPIHT is provided in Section 3. In Section 4, experimental results are provided and analyzed for various object inputs and parameters. Finally, the report is concluded in Section 5.
2 Secure Shape and Texture SPIHT Coding Scheme
The Secure ST-SPIHT (SecST-SPIHT) coding and decoding system is shown in Fig. 1. It employs the Shape and Texture Set Partitioning in Hierarchical Trees (ST-SPIHT) scheme for coding arbitrarily-shaped visual objects [1], with a novel selective encryption algorithm that utilizes a stream cipher to encrypt specific bits in the output bit-stream. The shape and texture of the input object are coded in parallel, producing a single partially encrypted, embedded bit-stream which can be progressively decoded with provision of the correct decryption key; the resultant bit-stream may be truncated at an arbitrary point to produce a lower bit-rate output. The selective encryption offers an efficient alternative to complete content encryption, which can be computationally burdensome in full color image and video applications. The data-dependent decoding algorithm makes the unencrypted portion of the bit-stream effectively impossible to locate or interpret. Furthermore, the bits chosen for encryption represent the most significant components of the coded object, ensuring complete confidentiality of the visual data from those without the correct decryption key. Since encryption is performed during the output stage, SecST-SPIHT offers the same rate-distortion performance and embedded/progressive output properties as ST-SPIHT [1]. The proposed system describes secure coding of still visual objects but can easily be extended to the frames of a video object sequence in a fashion similar to Motion JPEG 2000 [13], or using 3-D transform domain representations [16].

Figure 1: System level diagram of the SecST-SPIHT coding and decoding scheme. [Block diagram: the object texture $x$ and shape mask $s$ pass through pre-processing and the SA-DWT to give $x_T$; the Secure ST-SPIHT coder (parameters $\lambda$, $K$; secret key $k_E$) produces the bit-stream sent over the channel/storage; the Secure ST-SPIHT decoder (secret key $k_D$), inverse SA-DWT, and post-processing yield the reconstructed texture $\hat{x}$ and shape mask $\hat{s}$.]

The input consists of two components: i) an $M \times N$ full color (texture) image $x : \mathbb{Z}^2 \to \mathbb{Z}^3$ representing a two-dimensional matrix of three-component RGB color samples $x(i,j) = [x(i,j)_1, x(i,j)_2, x(i,j)_3]$, with $i = 0, 1, \ldots, M-1$ and $j = 0, 1, \ldots, N-1$ denoting the spatial position of the pixel, and $x(i,j)_k$ denoting the component in the red ($k = 1$), green ($k = 2$), or blue ($k = 3$) color channel; and ii) an $M \times N$ binary (shape mask) image $s : \mathbb{Z}^2 \to \{0, 1\}$ representing a two-dimensional matrix of binary values where $s(i,j) = 1$ denotes spatial positions ‘inside’ the object, and $s(i,j) = 0$ denotes spatial positions ‘outside’ the object.

The object is preprocessed by first converting the texture to the YCbCr color space. Subsequently, texture positions outside the object are set to zero, such that $x(i,j) = [0, 0, 0]$, $\forall (i,j)$ where $s(i,j) = 0$. Each color channel of the texture is subsequently transformed using an in-place lifting shape-adaptive discrete wavelet transform (SA-DWT) with global subsampling [1, 2], creating the $M \times N$ vectorial field $x_T : \mathbb{Z}^2 \to \mathbb{Z}^3$ of transform coefficients $x_T(i,j) = [x_T(i,j)_1, x_T(i,j)_2, x_T(i,j)_3]$. The in-place SA-DWT allows the spatial domain shape mask $s$ to remain unmodified and to be coded directly [1].

The SecST-SPIHT coder, shown in Fig. 2, employs an ST-SPIHT coder and selectively encrypts the output bit-stream using a stream cipher $f_E(b, k_E)$, applied to individual bits $b$ using the private key $k_E$. The ST-SPIHT algorithm is utilized to code the input shape and texture as well as to provide intelligent bit classification instructions to the stream cipher. The details of the ST-SPIHT coding algorithm are summarized in the Appendix; full details and analysis can be found in [1].
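The pre-processing step described above (color conversion followed by zeroing of positions outside the object) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the report does not state which YCbCr variant is used, so the full-range BT.601 coefficients (as used in JPEG/JFIF) are an assumption, and the function names `rgb_to_ycbcr` and `preprocess` are hypothetical.

```python
def clamp8(v):
    """Round to the nearest integer and clamp to the 8-bit range."""
    return max(0, min(255, round(v)))


def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB sample to YCbCr.

    Assumption: full-range BT.601 coefficients (JPEG/JFIF style);
    the report only says 'YCbCr' without naming a variant.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return [clamp8(y), clamp8(cb), clamp8(cr)]


def preprocess(x, s):
    """Convert the texture to YCbCr, then zero every position outside the object.

    x: M x N nested list of [R, G, B] samples.
    s: M x N nested list of 0/1 shape-mask values (1 = inside the object).
    """
    out = []
    for row_x, row_s in zip(x, s):
        out.append([rgb_to_ycbcr(*px) if inside == 1 else [0, 0, 0]
                    for px, inside in zip(row_x, row_s)])
    return out


# Tiny 2 x 2 example: left column inside the object, right column outside.
x = [[[255, 255, 255], [10, 20, 30]],
     [[255, 0, 0], [0, 0, 0]]]
s = [[1, 0],
     [1, 0]]
x_pre = preprocess(x, s)
```

The output of such a step would then feed the SA-DWT, which is not sketched here.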
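Figure 2: SecST-SPIHT Coder. [Block diagram: the ST-SPIHT coder takes the texture $x_T$, the shape $s$, and the parameters $\lambda$, $K$; it passes its output bits, together with the locations of the bits in $B_{n,\mathrm{LIP}-\alpha}$, $B_{n,\mathrm{LIP-sig}}$, $B_{n,\mathrm{LIS}-\alpha}$, and $B_{n,\mathrm{LIS-sig}}$, to the stream cipher encryption function $f_E(b, k_E)$ with key $k_E$, which outputs the compressed/encrypted bit-stream (combined coding and encryption).]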
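The interaction in Fig. 2 between the coder's bit-classification output and the stream cipher can be sketched as below. This is a simplified illustration under stated assumptions: the coder is assumed to hand over each output bit with a class label and its bit-plane $n$ (the tuple layout and the label strings are hypothetical), and the keystream generator built from SHA-256 in counter mode is only a stand-in for whatever vetted bit-level stream cipher an application would choose. In a real decoder the class of each bit is derived from previously decrypted bits rather than supplied up front.

```python
import hashlib

# Bit classes selected for encryption (label strings are illustrative).
ENCRYPTED_CLASSES = {"LIP-alpha", "LIP-sig", "LIS-alpha", "LIS-sig"}


def keystream(key: bytes):
    """Yield keystream bits from SHA-256 in counter mode (stand-in cipher only)."""
    counter = 0
    while True:
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        for byte in block:
            for shift in range(7, -1, -1):
                yield (byte >> shift) & 1
        counter += 1


def selective_xor(classified_bits, key, n_max, K):
    """XOR only the bits selected for encryption; pass all others through.

    classified_bits: iterable of (bit, bit_class, n) tuples from the coder.
    A bit is encrypted when its class is in ENCRYPTED_CLASSES and it was
    produced in one of the first K iterations, i.e. n_max - K < n <= n_max.
    Keystream bits are consumed only for encrypted positions.
    """
    ks = keystream(key)
    out = []
    for bit, bit_class, n in classified_bits:
        if bit_class in ENCRYPTED_CLASSES and n > n_max - K:
            out.append(bit ^ next(ks))
        else:
            out.append(bit)
    return out


# Toy stream with n_max = 7 and K = 2 (bit-planes 7 and 6 are protected).
stream = [(1, "LIP-alpha", 7), (1, "LIP-sig", 7), (0, "LIP-sgn", 7),
          (0, "LIS-Tsig", 7), (1, "LIS-sig", 6), (1, "LIP-sig", 5)]
cipher = selective_xor(stream, b"demo key", n_max=7, K=2)

# XOR is its own inverse, so the same call recovers the original bits
# when the classification of each position is known.
relabelled = [(c, cls, n) for c, (_, cls, n) in zip(cipher, stream)]
recovered = selective_xor(relabelled, b"demo key", n_max=7, K=2)
```

In this toy stream, the sign bit, the tree-significance bit, and the bit from bit-plane 5 pass through unchanged, while the other three positions are XORed with the keystream.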
2.1 Selective Encryption for ST-SPIHT

The SecST-SPIHT selective encryption algorithm is a novel extension of the scheme proposed in [17] for regular SPIHT. We denote the ST-SPIHT bit-stream as the ordered set of bits $B$. The bit-stream can be divided into the ordered subsets $B = \{B_{n_{max}}, B_{n_{max}-1}, B_{n_{max}-2}, \ldots\}$, where $B_n$ is the set of bits obtained during the coding iteration for bit-plane $n$ (i.e., representing the value $2^n$), and $n_{max}$ is the highest bit-plane at which coding is initiated. Each $B_n$ can be further subdivided into $B_n = \{B_{n,\mathrm{LIP}}, B_{n,\mathrm{LIS}}, B_{n,\mathrm{LSP}}\}$, where $B_{n,\mathrm{LIP}}$ denotes the ordered set of bits obtained during the first phase of the sorting pass, where coefficients in the LIP are tested for significance; $B_{n,\mathrm{LIS}}$ denotes the ordered set of bits obtained during the second phase of the sorting pass, where entire trees are tested for significance; and $B_{n,\mathrm{LSP}}$ denotes the ordered set of bits obtained during the refinement pass.

Each set of bits $B_{n,\mathrm{LIP}}$ is composed of α-test shape bits ($B_{n,\mathrm{LIP}-\alpha}$), significance bits ($B_{n,\mathrm{LIP-sig}}$), and sign bits ($B_{n,\mathrm{LIP-sgn}}$). Similarly, each set of bits $B_{n,\mathrm{LIS}}$ is composed of significance bits ($B_{n,\mathrm{LIS-sig}}$) and sign bits ($B_{n,\mathrm{LIS-sgn}}$) for individual coefficients, significance bits for trees ($B_{n,\mathrm{LIS-Tsig}}$), and α-test shape bits for both individual coefficients and trees ($B_{n,\mathrm{LIS}-\alpha}$). This decomposition of the bit-stream is shown in Fig. 3.

Figure 3: Composition of subset $B_n$ of ST-SPIHT bit-stream for $n > \lambda$.

The SecST-SPIHT encryption scheme uses an encryption function $f_E(b, k_E)$ to encrypt only the bits $b \in B_e = \{B_{n,\mathrm{LIP}-\alpha}, B_{n,\mathrm{LIP-sig}}, B_{n,\mathrm{LIS}-\alpha}, B_{n,\mathrm{LIS-sig}}\}$, for $n = n_{max}, n_{max}-1, \ldots, n_{max}-K+1$, and $K > 0$. The key $k_E$ enforces the confidentiality of the data by preventing entities without the correct matching decryption key, $k_D$, from correctly decrypting the data.¹ The parameter $K$ is controlled by the user at the time of encryption/encoding to determine the number of coding iterations to be encrypted. Increasing $K$ results in more bits being encrypted and greater security, with the trade-off of greater computational overhead. The specific bits are selectively chosen since they represent the object shape information and the significance information of individual coefficients. The coefficient sign bits ($B_{n,\mathrm{LIP-sgn}}$ and $B_{n,\mathrm{LIS-sgn}}$) remain unencrypted since their values do not affect the coder/decoder execution path. Similarly, the significance bits relating to entire trees ($B_{n,\mathrm{LIS-Tsig}}$) remain unencrypted since they do not affect specific coefficient reconstruction values.

¹ In the case where $f_E(b, k_E)$ implements a symmetric key cipher, $k_D = k_E$.

The encryption function $f_E(b, k_E)$ must be implemented using a stream cipher since the decoder (Fig. 4) must decode individual bits and instruct the decryption function $f_D(b, k_D)$ whether each subsequent bit requires decryption or not; the use of a block cipher would prevent the decoder from correctly determining which bits in the output bit-stream are part of the cipher block. However, the system is flexible in that any bit-level stream cipher may be used, employing either symmetric private keys or public-private key pairs.

The coding operation is typically terminated when a specified rate or distortion criterion is met. While SecST-SPIHT allows for coding to be terminated before the shape has been losslessly coded, typical rate criteria and values of $\lambda$ will result in complete lossless coding of the shape. Also, the coder may be instructed not to code the shape in situations where, for example, the shape is implicitly available via the shape of another object which surrounds the object to be coded (e.g., a background object).

The SecST-SPIHT decoder follows exactly the same execution path as the coder and only requires basic initialization information (i.e., $M$, $N$, $|G|$, $n_{max}$, $\lambda$, the number of wavelet transform levels, and $s$ if the shape was not coded) to interpret the output bit-stream. Provided with the correct decryption key, $k_D$, the decoder decodes the bit-stream and instructs the decryption function $f_D(b, k_D)$ as to whether each subsequent bit should be decrypted or passed through unencrypted. Since the first bit is always in $B_{n_{max},\mathrm{LIP}-\alpha}$ (generated from the first iteration of step 2.1.1), it must always be decrypted.

Figure 4: SecST-SPIHT Decoder. [Block diagram: the compressed/encrypted bit-stream passes through the stream cipher decryption function $f_D(b, k_D)$ with key $k_D$; the ST-SPIHT decoder supplies the locations of the bits in $B_{n,\mathrm{LIP}-\alpha}$, $B_{n,\mathrm{LIP-sig}}$, $B_{n,\mathrm{LIS}-\alpha}$, and $B_{n,\mathrm{LIS-sig}}$ and outputs the texture $\hat{x}_T$ and shape $\hat{s}$ (combined decryption and decoding).]

It should be noted that SecST-SPIHT is backward compatible such that when the input shape $s$ fills the entire $M \times N$ rectangular bounding box, the coding operation is identical to traditional SPIHT [3] and the selective encryption algorithm operates the same as in [17]. Also, the selective encryption may be applied “offline” to an object already coded using ST-SPIHT. Using an ST-SPIHT decoder to interpret the bit-stream, the equivalent bit classification instructions can be generated as in the SecST-SPIHT coder, and the appropriate bits replaced with encrypted versions.

3 Security Analysis of SecST-SPIHT

The SecST-SPIHT selective encryption ensures the confidentiality of the coded visual object data in two ways: i) securing the most significant portion of the bit-stream using a secret cryptographic key $k_E$ and a stream cipher; and ii) making the unencrypted portion of the bit-stream impossible to decode since its location and the state of the decoder cannot be determined without correct decryption and decoding of the encrypted portion.

As noted in the previous section, encryption is performed on the output bits $b \in B_e = \{B_{n,\mathrm{LIP}-\alpha}, B_{n,\mathrm{LIP-sig}}, B_{n,\mathrm{LIS}-\alpha}, B_{n,\mathrm{LIS-sig}} \mid n_{max} - K < n \le n_{max}\}$. This represents a partial bit-plane and shape encryption performed on the visual object in the SA-DWT domain, with the choice of $K$ determining how many bit-planes selective encryption is applied to. A coefficient $x_T(i,j)_k$ will have its most significant bit (MSB), at bit-plane $n_{\mathrm{MSB}(i,j)_k} = \lfloor \log_2(|x_T(i,j)_k|) \rfloor$, encrypted if $n_{\mathrm{MSB}(i,j)_k} > n_{max} - K$, i.e., if the coefficient is found significant during the first $K$ coding iterations. Also, if the coefficient is part of the luminance SA-DWT LL subband (i.e., $(i,j)_k \in H$), it is placed in the LIP upon initialization of the coder and hence will also have each bit encrypted in bit-planes $\max(n_{\mathrm{MSB}(i,j)_k}, n_{max} - K + 1) \le n \le n_{max}$. In other words, for luminance LL subband coefficients, the higher order bits are also encrypted, until the bit-plane at which the coefficient is found significant, or $K$ coding iterations have passed.

Alternatively, if $x_T(i,j)_k$ is contained in a spatial orientation tree (i.e., $(i,j)_k \notin H$), it will have one or more bits encrypted if it has been removed from the tree and placed in the LIP during the first $K$ coding iterations. This occurs if the parent of coefficient $x_T(i,j)_k$ has other descendants found significant during the first $K$ coding iterations, before $x_T(i,j)_k$ is found significant. Defining the parent coordinates of coefficient $x_T(i,j)_k$ as $P(i,j)_k$, as per the color spatial orientation tree definition [18], we then define the set of coordinates of “parental descendants” of $x_T(i,j)_k$ as $D_P(i,j)_k = D(P(i,j)_k) \setminus \{(i,j)_k\}$. That is, the parental descendants of $x_T(i,j)_k$ are all the coefficients descendant from its parent, not including itself. Hence, if $\max_{(r,s)_t \in D_P(i,j)_k}(n_{\mathrm{MSB}(r,s)_t}) > n_{\mathrm{MSB}(i,j)_k}$ and $\max_{(r,s)_t \in D_P(i,j)_k}(n_{\mathrm{MSB}(r,s)_t}) > n_{max} - K$, then coefficient $x_T(i,j)_k$ will be placed in the LIP during the first $K$ coding iterations, and will have encrypted bits in the bit-planes $\max(n_{\mathrm{MSB}(i,j)_k}, n_{max} - K + 1) \le n \le \max_{(r,s)_t \in D_P(i,j)_k}(n_{\mathrm{MSB}(r,s)_t})$. The net effect of this is that a non-significant coefficient will still have one or more of its bits encrypted if it is located in a region of significant coefficients; thus the partial encryption can be seen to be applied in general regions of significance.

In addition to the partial bit-plane encryption of the texture coefficients, the output of each α-test is encrypted, effectively encrypting the entire shape code during the first $K$ iterations. If $K > n_{max} - \lambda$, then the complete, lossless shape code is encrypted. The choice of $K$ should be made to ensure that the number of bits finally encrypted is sufficient to make it computationally infeasible to perform a brute-force, exhaustive search attack over all possible sequences.

As with SPIHT and ST-SPIHT, the SecST-SPIHT coder and decoder follow a data-dependent execution path. This means that the correct interpretation of a given bit in the output bit-stream requires complete knowledge of all previous significance test and α-test bits. The result is that an attacker cannot in fact locate the bits in the output bit-stream which are not encrypted. To demonstrate the difficulty encountered by a cryptanalyst attempting to determine which bits are unencrypted, we use $b^{j}_{n,\mathrm{LIP}}$ to denote the $j$th bit in the set $B_{n,\mathrm{LIP}}$, for $j = 0, 1, 2, \ldots, N_{n,\mathrm{LIP}} - 1$, where $N_{n,\mathrm{LIP}}$ is the total number of bits in $B_{n,\mathrm{LIP}}$. According to the SecST-SPIHT coder definition, considering the initial coding iterations in which $n \ge \lambda$ (i.e., the shape is still being coded), it is known a priori that the first bit is an α-test bit:

$$ b^{0}_{n,\mathrm{LIP}} \in B_{n,\mathrm{LIP}-\alpha} \tag{1} $$

However, classification of the second bit depends on the first bit:

$$ b^{1}_{n,\mathrm{LIP}} \in \begin{cases} B_{n,\mathrm{LIP-sig}}, & \text{if } b^{0}_{n,\mathrm{LIP}} = 1 \\ B_{n,\mathrm{LIP}-\alpha}, & \text{otherwise} \end{cases} \tag{2} $$

And, consequently, classification of the third bit depends on the first and second bits:

$$ b^{2}_{n,\mathrm{LIP}} \in \begin{cases} B_{n,\mathrm{LIP-sig}}, & \text{if } b^{0}_{n,\mathrm{LIP}} = 0 \text{ and } b^{1}_{n,\mathrm{LIP}} = 1 \\ B_{n,\mathrm{LIP-sgn}}, & \text{if } b^{0}_{n,\mathrm{LIP}} = 1 \text{ and } b^{1}_{n,\mathrm{LIP}} = 1 \\ B_{n,\mathrm{LIP}-\alpha}, & \text{otherwise} \end{cases} \tag{3} $$

This can be generalized as follows:

$$ b^{j}_{n,\mathrm{LIP}} \in \begin{cases} B_{n,\mathrm{LIP-sig}}, & \text{if } b^{j-1}_{n,\mathrm{LIP}} \in B_{n,\mathrm{LIP}-\alpha} \text{ and } b^{j-1}_{n,\mathrm{LIP}} = 1 \\ B_{n,\mathrm{LIP-sgn}}, & \text{if } b^{j-1}_{n,\mathrm{LIP}} \in B_{n,\mathrm{LIP-sig}} \text{ and } b^{j-1}_{n,\mathrm{LIP}} = 1 \\ B_{n,\mathrm{LIP}-\alpha}, & \text{otherwise} \end{cases}, \quad 1 \le j < N_{n,\mathrm{LIP}} \tag{4} $$

From (4), it is evident that the bits $B_{n,\mathrm{LIP}}$ can in fact be treated as the ordered set of coded transition instructions in a Markov chain. The classification of $b^{j-1}_{n,\mathrm{LIP}}$, indicating the $(j-1)$th state in the chain, must be known along with the value of $b^{j-1}_{n,\mathrm{LIP}}$ (the transition instruction) in order to determine the classification of $b^{j}_{n,\mathrm{LIP}}$ (the $j$th state in the chain). Since the value of a bit indicates only the transition and not the state itself, it is clear that all previous bits $b^{l}_{n,\mathrm{LIP}}$, $0 \le l < j$, must be known in order to classify $b^{j}_{n,\mathrm{LIP}}$ and determine whether it is unencrypted. Similar arguments can be made for $B_{n,\mathrm{LIS}}$. Hence, without the correct decryption key, not only do the encrypted bits remain confidential, but the locations of the unencrypted bits cannot be determined and are thus also confidential.

In attacking the encrypted portion of the bit-stream, the cryptanalyst may attempt to recreate the Markov chain and perform statistical analyses so that the original bits could be correctly predicted with probability $p > 0.5$ from previous bits, thus aiding an exhaustive search attack. While recreating such an attack is beyond the scope of this report, the efficiency of the coding algorithm [1, 3] implies that the entropy of each bit $H(b) \approx 1$ and thus $p \approx 0.5$, regardless of the additional contextual information offered by the previous states in the decoded chain. However, if a more conservative estimate of $H(b) < 1$ is made, then $K$ can simply be increased to increase the number of encrypted bits in order to ensure that an exhaustive search remains computationally infeasible. Also, it should be noted that, as with traditional cryptographic systems, the length of the decryption key, $k_D$, should also be long enough to defend against a brute-force attack over the key space.

Alternatively, an attacker may attempt to locate the unencrypted portion of the bit-stream $B_u = \{B_n \mid n \le n_{max} - K\}$ since it is known that all bits in $B_u$ are unencrypted, and may reveal important image features if correctly decoded. If we denote the total number of bits in the first $K$ coding iterations (both encrypted and unencrypted) as $N_K$, an attack on $B_u$ may be attractive if $H(B_e) > H(N_K)$. In other words, if determining the location of $B_u$ (which starts at bit $N_K + 1$ within the overall bit-stream $B$) is computationally simpler than an exhaustive search over the encrypted bits $B_e$, the attacker may view this approach as offering greater probability of success in revealing image details. However, even with knowledge of $B_u$, the state of the LSP, LIP, and LIS lists and the shape decoding remain unknown without correct decryption and decoding of $B_e$. This means that while the initial bits in $B_u$ may be correctly classified by the attacker, it cannot be determined which coordinates within the SA-DWT representation of the object the coded bits correspond to. Ultimately, the attacker will not be able to determine any image details from $B_u$ without correct decryption and decoding of $B_e$.

In summary, the SecST-SPIHT secure coder achieves confidentiality by encrypting the most significant portion of the bit-stream as well as obfuscating the unencrypted portion. We note that the scheme in [19] applies a similar approach for zero-tree wavelet coded rectangular images, except that an a priori design choice is made to restrict encryption to the lowest two frequency subbands (i.e., the top two levels in the spatial orientation trees). This approach does not allow for the data-dependent distribution of significant coefficients and is inflexible to varying applications which require input images of different sizes with the use of a varying number of wavelet decomposition levels. In contrast, the approach of SecST-SPIHT is for the selective encryption to follow the data-dependent execution path of the coder, ensuring that the most significant coefficients, regardless of location, are partially encrypted, and that the initial portion of the bit-stream is always partially encrypted. Furthermore, SecST-SPIHT offers the user parameter $K$, which provides control over how many coding iterations are considered for encryption. This allows flexibility to meet the security requirements of the application at hand.

4 Experimental Results

The analyses provided in Section 3 demonstrate the security of the SecST-SPIHT coder. However, the efficacy of such a scheme must also be demonstrated via subjective visual evaluation to ensure that the secured object details remain confidential. In this section we input a variety of sample visual objects to the SecST-SPIHT coder and evaluate the output generated when the user does not provide the correct decryption key; the performance of the proposed scheme is judged on its ability to obscure the original visual object features. Additionally, the security level parameter $K$ and the shape code level parameter $\lambda$ are varied to determine the resultant number of encrypted bits as a portion of the whole bit-stream. The rate-distortion performance of SecST-SPIHT is identical to that of ST-SPIHT, which is examined in detail in [1], and will not be covered here.

The chosen input visual test objects are shown in Figs. 5 to 9. The ‘surveillance1’, ‘surveillance2’, and ‘surveillance3’ objects were extracted from actual surveillance video frames using motion-based segmentation, whereas ‘akiyo’ and ‘foreman’ are the standard MPEG test objects. The coder accepts an arbitrary binary segmentation map so that any segmentation algorithm can be employed, depending on the requirements of the application. All frames are in 8-bit-per-channel RGB CIF format (352 × 288); Table 1 shows the percentage of the frame that each object occupies. In all test cases, the SecST-SPIHT coder utilized the CDF 9/7 biorthogonal wavelet filters [20] with a 4-level transform, and an output code bit-rate of 2.4 bits-per-object-pixel (including the shape code, where applicable). Since the progressive/embedded output property of ST-SPIHT is maintained, the output code may be arbitrarily truncated to achieve a lower bit-rate at the cost of greater texture distortion.² If lossless coding of the texture is required, integer-to-integer wavelet filters [21] and colour transforms can be utilized and the coder instructed to code all of the transform domain bit-planes [1].
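As a rough worked example of what the 2.4 bits-per-object-pixel rate means in absolute terms, the sketch below estimates the total output bit budget per object from the CIF frame size and the frame percentages listed in Table 1. The helper name `bit_budget` is hypothetical, and because the table percentages are rounded, the results are approximations rather than the exact bit counts used in the experiments.

```python
FRAME_WIDTH, FRAME_HEIGHT = 352, 288   # CIF frame size stated in the text
BITS_PER_OBJECT_PIXEL = 2.4            # output code bit-rate stated in the text

# Percentage of the frame occupied by each test object (Table 1).
frame_percentage = {
    "surveillance1": 10.9,
    "surveillance2": 7.6,
    "surveillance3": 25.7,
    "akiyo": 37.2,
    "foreman": 29.4,
}


def bit_budget(percent):
    """Approximate total output bits for an object covering `percent` of a frame."""
    object_pixels = FRAME_WIDTH * FRAME_HEIGHT * percent / 100.0
    return round(object_pixels * BITS_PER_OBJECT_PIXEL)


budgets = {name: bit_budget(p) for name, p in frame_percentage.items()}
for name, bits in budgets.items():
    print(f"{name}: about {bits} bits")
```

Under these assumptions the budgets range from roughly 18 kbit for ‘surveillance2’ to roughly 90 kbit for ‘akiyo’.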
For simulation purposes, a Vernam cipher was employed as the stream cipher [22], using a 128-bit randomly generated key. However, any bit-level stream cipher that is sufficiently secure for the application at hand can be utilized.

Figs. 10 to 14 show sample output using the test objects. In all cases, encryption is performed during the first two coding iterations ($K = 2$). In the cases where the shape is coded and encrypted with the object texture, the shape code is completed in the third iteration ($\lambda = n_{max} - 2$). Figs. 10 to 12 show the decrypted/decoded output ‘surveillance’ objects/frames when: (a)/(d) the correct decryption key is provided; (b)/(e) the incorrect decryption key is provided; and (c)/(f) the incorrect decryption key is provided, but the shape is available externally and only the texture is coded. In all cases where the incorrect key is provided, the textural content is completely obscured; no object details can be seen. For the case (b)/(e) where the shape is coded and encrypted with the texture, the shape is also completely obscured. In order to reconstruct the frame without revealing the object shape mask, the background is transmitted as a full frame, with the missing texture information behind the object filled in using prior frames. Similarly, the decrypted/decoded test objects/frames ‘akiyo’ and ‘foreman’ are shown in Figs. 13 and 14, respectively, with: (a)/(d) the correct decryption key provided; (b) the incorrect decryption key provided; and (c)/(e) the incorrect decryption key provided, but the shape available externally and only the texture coded. In the cases when the shape is coded and encrypted with the object and the incorrect decryption key is provided (Figs. 13(b) and 14(b)), the full frame background is not transmitted since the prior frames in the sequence do not offer enough information to in-fill the original object area.

Fig. 15 shows the fraction of the output code bits which are encrypted vs. the number of coding iterations during which encryption is performed ($K$).
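The quantity plotted against $K$ can be computed directly from the coder's bit classifications. The sketch below assumes a log of (class, bit-plane) records is available for every output bit; the record layout, the label strings, and the toy log are all hypothetical and are not taken from the experiments. The toy log is far shorter than a real bit-stream, so its fractions are much larger than the measured values.

```python
# Bit classes selected for encryption (label strings are illustrative).
ENCRYPTED_CLASSES = {"LIP-alpha", "LIP-sig", "LIS-alpha", "LIS-sig"}


def encrypted_fraction(records, n_max, K):
    """Fraction of output bits encrypted when the first K iterations are protected.

    records: list of (bit_class, n) pairs, one per output bit.
    A bit counts as encrypted if its class is selected and n_max - K < n <= n_max.
    """
    encrypted = sum(1 for bit_class, n in records
                    if bit_class in ENCRYPTED_CLASSES and n > n_max - K)
    return encrypted / len(records)


# Hypothetical 20-bit classification log starting at bit-plane n_max = 7.
toy_log = (
    [("LIP-alpha", 7), ("LIP-sig", 7), ("LIP-sgn", 7), ("LIS-Tsig", 7)]
    + [("LIP-alpha", 6), ("LIP-sig", 6), ("LIS-alpha", 6), ("LIS-sig", 6),
       ("LIS-sgn", 6), ("LSP", 6)]
    + [("LIP-sig", 5), ("LIS-Tsig", 5), ("LSP", 5), ("LSP", 5)]
    + [("LIP-sig", 4)] + [("LSP", 4)] * 5
)

fractions = [encrypted_fraction(toy_log, n_max=7, K=K) for K in range(1, 5)]
```

Because the set of protected bit-planes only grows with $K$, the fraction computed this way can never decrease as $K$ increases.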
2 At most bit-rates and choices of λ, the shape will be coded losslessly.

Table 1: Percentage of frame occupied by test objects.

Object             Frame Percentage
‘surveillance1’    10.9%
‘surveillance2’    7.6%
‘surveillance3’    25.7%
‘akiyo’            37.2%
‘foreman’          29.4%

The total number of output code bits corresponds to a bit-rate of 2.4 bits-per-object-pixel (including the shape code for Figs. 15(b) to 15(d)). Fig. 15(a) shows the case where the shape is not coded; Figs. 15(b) to 15(d) show the cases where the shape code is completed during the first, second, and third coding iteration (λ = nmax, nmax − 1, and nmax − 2), respectively. In Fig. 15(a), the effect of varying K can clearly be seen, with the fraction of the output code being encrypted rising with K. The fraction remains small for all considered K = 1, ..., 4, ranging from approximately 0.2% to 1.6%. In Figs. 15(b) to 15(d), a large jump in the portion of the bit-stream that is encrypted is observed once K is set
4 EXPERIMENTAL RESULTS
Figure 5: ‘Surveillance1’ test object: (a) original frame; (b) segmentation map; (c) segmented object.
Figure 6: ‘Surveillance2’ test object: (a) original frame; (b) segmentation map; (c) segmented object.
Figure 7: ‘Surveillance3’ test object: (a) original frame; (b) segmentation map; (c) segmented object.
Figure 8: ‘Akiyo’ test object: (a) original frame; (b) segmentation map; (c) segmented object.
Figure 9: ‘Foreman’ test object: (a) original frame; (b) segmentation map; (c) segmented object.
Figure 10: ‘Surveillance1’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b)/(e) with incorrect key; (c)/(f) with incorrect key and shape provided externally.
Figure 11: ‘Surveillance2’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b)/(e) with incorrect key; (c)/(f) with incorrect key and shape provided externally.
Figure 12: ‘Surveillance3’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b)/(e) with incorrect key; (c)/(f) with incorrect key and shape provided externally.
Figure 13: ‘Akiyo’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b) with incorrect key; (c)/(e) with incorrect key and shape provided externally.
Figure 14: ‘Foreman’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b) with incorrect key; (c)/(e) with incorrect key and shape provided externally.
high enough to ensure that the shape is completely encrypted (K = nmax − λ + 1). When K is raised above this point, the effect is more subtle, since at low output bit-rates the shape code represents a significant portion of the bit-stream. With K > nmax − λ, the actual percentage of the output code that is encrypted is largely controlled by the portion that is shape code (B_{n,LIP−α} and B_{n,LIS−α}). If the user wishes to keep the level of encryption to a minimum for the purpose of computational efficiency, λ should be set low enough to disperse the shape code further into the bit-stream, and K should be set such that K ≤ nmax − λ, so that only the initial portion of the shape code is encrypted. In this case, λ should be chosen so that K can still be set high enough to encrypt the minimum number of bits needed for the desired level of security. For example, Figs. 10 to 14 use K = 2 and λ = nmax − 2 (i.e., the shape code is completed in the third coding iteration). The drawback of this approach is that the shape cannot be completely, losslessly decoded until later in the output bit-stream, possibly resulting in lossy shape reconstruction in very low bit-rate scenarios.

Table 2 shows the number of bits encrypted for λ = nmax − 2 and different K. As in Fig. 15(d), there is a jump at the iteration at which the remaining shape code is generated and encrypted (K = 3). With this choice of λ, K = 2 can be chosen, since the number of bits encrypted is large enough to prevent a brute-force, exhaustive search attack over the encrypted bits while still representing minimal processing overhead, with less than 5% of the output bit-stream encrypted at a bit-rate of 2.4 bits-per-object-pixel.
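The relationship between K and λ discussed above can be made concrete with a small sketch. It assumes, as the text implies, that coding iteration 1 corresponds to bit-plane nmax, so that the shape code is completed in iteration nmax − λ + 1. The function names and the value n_max = 10 used in the demonstration are illustrative only and are not part of the original coder.

```python
def shape_completion_iteration(n_max: int, lam: int) -> int:
    """1-based coding iteration in which the shape code is completed.

    Assumes iteration 1 codes bit-plane n_max, iteration 2 codes n_max - 1, etc.
    """
    if not 0 <= lam <= n_max:
        raise ValueError("lam must satisfy 0 <= lam <= n_max")
    return n_max - lam + 1


def shape_fully_encrypted(n_max: int, lam: int, k: int) -> bool:
    """True when the first k encrypted iterations cover the whole shape code."""
    return k >= shape_completion_iteration(n_max, lam)


if __name__ == "__main__":
    n_max = 10          # hypothetical top bit-plane, for illustration only
    lam = n_max - 2     # shape code completed in the third iteration
    for k in range(1, 5):
        print(k, shape_fully_encrypted(n_max, lam, k))
```

With λ = nmax − 2, the shape code is only partially encrypted for K = 1 and K = 2 and becomes fully encrypted from K = 3 onward, which matches the position of the jump reported for Fig. 15(d).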
Table 2: The number of bits encrypted for the test objects using different values of K and λ = nmax − 2.

Test Object      K = 1    K = 2    K = 3    K = 4
Surveillance1    777      805      4333     4507
Surveillance2    783      819      3239     3428
Surveillance3    734      790      3494     4030
Akiyo            768      901      4086     4934
Foreman          762      874      5381     5763
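To give a sense of scale for the figures in Table 2, the following sketch computes two quantities from the tabulated counts: the base-10 exponent of the number of bit patterns an exhaustive search over n encrypted bits would have to cover (2^n), and the ratio by which the encrypted bit count grows between K = 2 and K = 3. The data are copied from Table 2; the helper names are illustrative, and the calculation is a back-of-the-envelope estimate rather than a security analysis.

```python
import math

# Encrypted-bit counts copied from Table 2 (lambda = n_max - 2), for K = 1..4.
TABLE_2 = {
    "Surveillance1": [777, 805, 4333, 4507],
    "Surveillance2": [783, 819, 3239, 3428],
    "Surveillance3": [734, 790, 3494, 4030],
    "Akiyo": [768, 901, 4086, 4934],
    "Foreman": [762, 874, 5381, 5763],
}


def search_space_log10(n_bits: int) -> float:
    """Base-10 exponent of 2**n_bits, the number of candidate bit patterns."""
    return n_bits * math.log10(2)


def jump_ratio(counts, k_before: int = 2, k_after: int = 3) -> float:
    """Growth in encrypted bits between two values of K (1-based)."""
    return counts[k_after - 1] / counts[k_before - 1]


if __name__ == "__main__":
    for name, counts in TABLE_2.items():
        exponent = search_space_log10(counts[1])  # K = 2
        print(f"{name}: K=2 search space ~ 10^{exponent:.0f}, "
              f"K=2 -> K=3 growth x{jump_ratio(counts):.1f}")
```

For ‘Surveillance1’ at K = 2, for instance, 805 encrypted bits correspond to roughly 10^242 candidate patterns.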
It should be noted that the property of SecST-SPIHT to disperse the shape code within the texture code is inherited from ST-SPIHT. With the execution path of the texture decoding dependent on the shape code, the two portions of the code cannot be separated without correct decryption of all encrypted bits.
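The selective, bit-level stream encryption used in the experiments can be illustrated with a minimal sketch. It is not the SecST-SPIHT implementation: the positions of the bits to be encrypted are passed in as an explicit mask (in the real coder they are determined by the coding pass), and the keystream is derived from the key with SHAKE-256 purely as a convenient stand-in for whichever stream cipher is chosen. All names are hypothetical.

```python
import hashlib
import secrets
from typing import List


def keystream_bits(key: bytes, n_bits: int) -> List[int]:
    """Expand a key into n_bits keystream bits (illustrative stand-in only)."""
    n_bytes = (n_bits + 7) // 8
    stream = hashlib.shake_256(key).digest(n_bytes)
    return [(stream[i // 8] >> (7 - i % 8)) & 1 for i in range(n_bits)]


def selective_xor(bits: List[int], selected: List[bool], key: bytes) -> List[int]:
    """XOR only the selected bit positions with the keystream.

    Unselected bits pass through unchanged. Because XOR is its own inverse,
    the same call with the same key and mask performs decryption.
    """
    if len(bits) != len(selected):
        raise ValueError("bits and selected must have the same length")
    ks = iter(keystream_bits(key, sum(1 for s in selected if s)))
    return [b ^ next(ks) if s else b for b, s in zip(bits, selected)]


if __name__ == "__main__":
    key = secrets.token_bytes(16)  # 128-bit random key, as in the experiments
    code_bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
    selected = [True] * 4 + [False] * 6  # hypothetical: first four bits encrypted
    protected = selective_xor(code_bits, selected, key)
    recovered = selective_xor(protected, selected, key)
    print(recovered == code_bits)
```

In this toy version the unencrypted bits are trivially readable. The confidentiality of the unencrypted portion in SecST-SPIHT comes from the data-dependent decoding path described above, which this sketch does not model.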
5 Conclusion
The SecST-SPIHT secure visual object coder was presented, offering an efficient solution for privacy protection of subjects in digital video surveillance systems. Provided with segmented, arbitrarily-shaped visual objects, SecST-SPIHT securely codes both the shape and texture, ensuring confidentiality through the use of a private decryption key. In contrast to privacy protection systems that simply discard the subject’s visual details via masking or blurring, SecST-SPIHT allows complete recovery of the data if the correct decryption key is provided. This is necessary in applications where the visual data may be required for future investigative purposes. Furthermore, by encrypting the object shape, subject recognition based on silhouette characteristics is prevented. Additionally, the SecST-SPIHT secure coder offers all the features of the ST-SPIHT visual object coder [1], namely efficient and progressive/embedded parallel coding of the object shape and texture.

The parameter K offers the user control over a variable level of application-dependent security. In effect, increasing K increases the portion of the output bit-stream that is encrypted by performing encryption for a greater number of coding iterations. In practice, K can be chosen to ensure that the number of encrypted bits is high enough to protect against a brute-force, exhaustive search attack over the encrypted portion of the bit-stream. The remaining unencrypted portion of the bit-stream cannot be decoded, since the data-dependent execution of the decoder requires complete knowledge of the prior (encrypted) portion of the bit-stream.

The provided secure coding algorithm operates on individual visual object input frames, but may be extended for video sequences using techniques similar to Motion JPEG 2000 [13] or 3-D transform domain representations [16]. Alternatively, motion compensation may be employed to reduce the size of the shape and texture coded for subsequent frames. Consequently, for a given K, the number of encrypted bits for subsequent encrypted object frames would also be very low. However, confidentiality of those object frames would not be compromised, since correct decoding would require decryption of the previous frames, thus extending the data-dependent, partial encryption paradigm into the temporal dimension.

SecST-SPIHT is well suited as a privacy enhancing technology for surveillance-intensive environments. However, the coder can be employed in any number of applications where the confidentiality and efficient coding of arbitrarily-shaped visual objects are required.
Figure 15: The fraction of bits encrypted vs. the security level parameter K (number of encrypted coding iterations) for different λ (shape code levels): (a) shape not coded; (b) shape code completed in first iteration (λ = nmax); (c) shape code completed in second iteration (λ = nmax − 1); (d) shape code completed in third iteration (λ = nmax − 2). Horizontal axis: K (# encrypted coding iterations), from 1 to 4; vertical axis: # encrypted bits/# total code bits; one curve per test object (surveillance1, surveillance2, surveillance3, akiyo, foreman). The total bits in the code corresponds to a bit-rate of 2.4 bits-per-object-pixel.
A Shape and Texture SPIHT Coding
The Shape and Texture Set Partitioning in Hierarchical Trees (ST-SPIHT) algorithm codes the shape and texture of arbitrarily-shaped visual objects in parallel to produce one unified, embedded output bit-stream [1]. The texture coding in ST-SPIHT follows a natural extension of SPIHT with the spatial orientation trees (SOT) defined as in [3], with the modification for color images proposed in [18]. The SOTs are first formed using all coordinates inside the bounding box of size $M \times N$; the binary shape mask $s$ is used to describe which nodes are inside the object and which are outside. The same input object definition and preprocessing steps described in Section 2 are used. We define $G = \{(i,j) \mid s(i,j) = 1\}$ as the set of all coordinates inside the object, and $\bar{G} = \{(i,j) \mid s(i,j) = 0\}$ as the complementary set containing all coordinates outside the object, i.e., $G \cup \bar{G} = \{(i,j) \mid i = 0, 1, \ldots, M-1,\ j = 0, 1, \ldots, N-1\}$ and $|G| + |\bar{G}| = MN$.

All the definitions from the standard SPIHT algorithm described in [3] remain in use with the addition of the color component index $k$. Briefly, the list of insignificant pixels (LIP), list of significant pixels (LSP), and list of insignificant sets (LIS) store different coefficient and tree root coordinates. A “type-A” entry in the LIS refers to $D(i,j)_k$, all the descendants of $(i,j)_k$; a “type-B” entry refers to $L(i,j)_k = D(i,j)_k - O(i,j)_k$, where $O(i,j)_k$ are the direct offspring of location $(i,j)_k$. $H$ denotes the set of all luminance LL subband coefficient coordinates and $S_n(\cdot)$ refers to the significance test at bit-plane $n$, as defined in [3].

Unique to the ST-SPIHT algorithm are a series of three “α-test” functions. The “α pixel test” function, $\alpha_p(\cdot,\cdot)$, identifies whether a coordinate is inside or outside the shape and is defined as follows:

$$\alpha_p(i,j) = \begin{cases} 1, & (i,j) \in G \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

The “α set-discard test” function, $\alpha_{SD}(\cdot)$, identifies sets of coefficients that are entirely outside the object:

$$\alpha_{SD}(T) = \begin{cases} 0, & T \subseteq \bar{G} \\ 1, & \text{otherwise} \end{cases} \tag{6}$$

where $T$ represents a given set of coefficients. Finally, the “α set-retain test” function, $\alpha_{SR}(\cdot)$, identifies sets of coefficients that are entirely inside the object:

$$\alpha_{SR}(T) = \begin{cases} 1, & T \subseteq G \\ 0, & \text{otherwise} \end{cases} \tag{7}$$
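The three α-test functions of Eqs. (5) to (7) translate directly into code. The sketch below assumes the shape mask s is given as a nested sequence of 0/1 values indexed as s[i][j] and that a coefficient set T is any collection of (i, j) coordinate pairs; these representations and the function names are illustrative choices, not taken from the ST-SPIHT implementation.

```python
from typing import Iterable, Sequence, Tuple

Coord = Tuple[int, int]
Mask = Sequence[Sequence[int]]


def alpha_p(s: Mask, i: int, j: int) -> int:
    """Eq. (5): 1 if coordinate (i, j) lies inside the object, else 0."""
    return 1 if s[i][j] == 1 else 0


def alpha_sd(s: Mask, coords: Iterable[Coord]) -> int:
    """Eq. (6): 0 if the whole set lies outside the object (discard), else 1."""
    return 0 if all(s[i][j] == 0 for i, j in coords) else 1


def alpha_sr(s: Mask, coords: Iterable[Coord]) -> int:
    """Eq. (7): 1 if the whole set lies inside the object (retain), else 0."""
    return 1 if all(s[i][j] == 1 for i, j in coords) else 0


if __name__ == "__main__":
    # Toy 2 x 3 shape mask: 1 marks coordinates inside the object.
    s = [
        [0, 0, 1],
        [0, 1, 1],
    ]
    outside = [(0, 0), (0, 1), (1, 0)]
    inside = [(0, 2), (1, 1), (1, 2)]
    mixed = [(0, 0), (1, 1)]
    for name, t in (("outside", outside), ("inside", inside), ("mixed", mixed)):
        print(name, alpha_sd(s, t), alpha_sr(s, t))
```

A set that straddles the object boundary yields α_SD = 1 and α_SR = 0, which is the case in which neither shortcut applies and the set has to be examined further.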
The ST-SPIHT coding routine takes the shape code level parameter, λ, as input. This defines the quantization level at which the routine forces the coding of not-yet-coded shape mask pixels s(i, j), which is done by applying the subroutine “Shape Code Set” (SCS) to the appropriate trees. The complete algorithm codes the shape and texture information in parallel, producing an embedded bit-stream that can be decoded to produce progressive shape and texture reconstruction. By lowering λ, the shape code becomes further dispersed in the output bit-stream, delaying the point at which the shape can be completely, losslessly decoded. At very low output bit-rates, lowering λ allows greater emphasis to be placed on the texture, providing the trade-off of lossy shape reconstruction [1]. The decoder follows the same data-dependent execution path as the coder based on interpretation of the output bit-stream.
References

[1] K. Martin, R. Lukac, and K. N. Plataniotis, “SPIHT-based coding of the shape and texture of arbitrarily shaped visual objects,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 10, pp. 1196–1208, Oct. 2006.
[2] S. Li and W. Li, “Shape-adaptive discrete wavelet transforms for arbitrarily shaped visual object coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 10, pp. 725–743, Aug. 2000.
[3] A. Said and W. A. Pearlman, “A new fast and efficient image codec based on set partitioning in hierarchical trees,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp. 243–250, Jun. 1996.
[4] B. Furht, D. Socek, and A. M. Eskicioglu, Multimedia Security Handbook. CRC Press, 2004, ch. 3: Fundamentals of Multimedia Encryption Techniques.
[5] S. Tansuriyavong and S. Hanaki, “Privacy protection by concealing person in circumstantial video image,” in Proc. Workshop on Perceptive User Interfaces, vol. 4, 2001, pp. 1–4.
[6] D. Chen, Y. Chang, R. Yan, and J. Yang, “Tools for protecting the privacy of specific individuals in video,” EURASIP Jrnl. on Advances in Sig. Proc., vol. 2007, pp. 1–9, 2007.
[7] E. M. Newton, L. Sweeney, and B. Malin, “Preserving privacy by de-identifying face images,” IEEE Trans. Knowl. Data Eng., vol. 17, no. 2, pp. 232–243, Feb. 2005.
[8] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “A full-body layered deformable model for automatic model-based gait recognition,” EURASIP Jrnl. on Advances in Sig. Proc., Spec. Issue on Adv. Sig. Proc. and Patt. Recog. Methods for Biometrics, preprint 2008.
[9] I. Martinez-Ponte, X. Desurmont, J. Meesen, and J.-F. Delaigle, “Robust human face hiding ensuring privacy,” in Proc. Int. Workshop on Image Analysis for Multimedia Interactive Services, 2005.
[10] A. Senior, S. Pankanti, A. Hampapur, L. Brown, Y.-L. Tian, A. Ekin, J. Connell, C. F. Shu, and M. Lu, “Enabling video privacy through computer vision,” IEEE Security Privacy, vol. 3, no. 3, pp. 50–57, May–June 2005.
[11] W. Zhang, S. S. Cheung, and M. Chen, “Hiding privacy information in video surveillance system,” in Proc. IEEE Int. Conf. on Image Proc., vol. 3, 2005, pp. 868–871.
[12] F. Dufaux, M. Ouaret, Y. Abdeljaoued, A. Navarro, F. Vergnenegre, and T. Ebrahimi, “Privacy enabling technology for video surveillance,” in Image Processing for Military and Security Applications, S. S. Agaian and S. A. Jassim, Eds. Proc. SPIE 6250, 2006, pp. 1–12.
[13] JTC 1/SC 29/WG 1, ISO/IEC 15444-3:2007 Information technology – JPEG 2000 image coding system: Motion JPEG 2000, ISO/IEC Std., 2007.
[14] J. Wen, M. Severa, W. Zeng, M. H. Luttrell, and W. Jin, “A format-compliant configurable encryption framework for access control of video,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 6, 2002.
[15] P.-C. Wang and T.-W. Hou, “An AV object oriented encryption algorithm for MPEG-4 streams,” in Proc. Int. Conf. on Multimedia and Expo, Jun. 2004, pp. 971–974.
[16] G. Minami, Z. Xiong, A. Wang, and S. Mehrotra, “3-D wavelet coding of video with arbitrary regions of support,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 1063–1068, Sep. 2001.
[17] K. Martin, R. Lukac, and K. N. Plataniotis, “Efficient encryption of wavelet-based coded color images,” Pattern Recognition, vol. 38, no. 7, pp. 1111–1115, 2005.
[18] A. A. Kassim and W. S. Lee, “Embedded color image coding using SPIHT with partially linked spatial orientation trees,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 2, pp. 203–206, Feb. 2003.
[19] H. Cheng and X. Li, “Partial encryption of compressed images and videos,” IEEE Trans. Signal Process., vol. 48, no. 8, pp. 2439–2451, Aug. 2000.
[20] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, “Image coding using wavelet transform,” IEEE Trans. Image Process., vol. 1, pp. 205–220, Apr. 1992.
[21] R. Calderbank, I. Daubechies, W. Sweldens, and B.-L. Yeo, “Wavelet transforms that map integers to integers,” Appl. Comput. Harmon. Anal., vol. 5, no. 3, pp. 322–369, 1998.
[22] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography. CRC Press, 1996.