
Coding local and global binary visual features extracted from video sequences

arXiv:1502.07939v1 [cs.MM] 26 Feb 2015

Luca Baroffio, Antonio Canclini, Matteo Cesana, Alessandro Redondi, Marco Tagliasacchi, Stefano Tubaro

Abstract—Binary local features represent an effective alternative to real-valued descriptors, leading to comparable results for many visual analysis tasks while being characterized by significantly lower computational complexity and memory requirements. When dealing with large collections, a more compact representation based on global features is often preferred, which can be obtained from local features by means of, e.g., the Bag-of-Visual-Words (BoVW) model. Several applications, including for example visual sensor networks and mobile augmented reality, require visual features to be transmitted over a bandwidth-limited network, thus calling for coding techniques that aim at reducing the required bit budget while attaining a target level of efficiency. In this paper we investigate a coding scheme tailored to both local and global binary features, which aims at exploiting both spatial and temporal redundancy by means of intra- and inter-frame coding. In this respect, the proposed coding scheme can be conveniently adopted to support the "Analyze-Then-Compress" (ATC) paradigm. That is, visual features are extracted from the acquired content, encoded at remote nodes, and finally transmitted to a central controller that performs visual analysis. This is in contrast with the traditional approach, in which visual content is acquired at a node, compressed and then sent to a central unit for further processing, according to the "Compress-Then-Analyze" (CTA) paradigm. In this paper we experimentally compare ATC and CTA by means of rate-efficiency curves in the context of two different visual analysis tasks: homography estimation and content-based retrieval. Our results show that the novel ATC paradigm based on the proposed coding primitives can be competitive with CTA, especially in bandwidth-limited scenarios.

Index Terms—Visual features, binary descriptors, BRISK, Bag-of-Words, video coding.

I. INTRODUCTION

Visual analysis is often performed by extracting a feature-based representation from the raw pixel domain. Indeed, visual features are being successfully exploited in a broad range of visual analysis tasks, ranging from image/video retrieval and classification to object tracking and image registration. They provide a succinct, yet effective, representation of the visual content, while being invariant to many transformations. Several visual analysis applications (e.g., distributed monitoring and surveillance in visual sensor networks, mobile visual search and augmented reality) require visual content to be transmitted over a bandwidth-limited network. The traditional approach, denoted hereinafter as "Compress-Then-Analyze" (CTA), consists in the following steps: the visual content is acquired by a sensor node in the form of still images or video sequences; then, it is encoded and efficiently transmitted to a central unit, where visual feature extraction and analysis take place. The central unit relies on a lossy representation of the acquired content, potentially leading to impaired performance. Furthermore, such a paradigm might lead to an inefficient management of bandwidth and storage resources, since a complete pixel-level representation might be unnecessary. In this respect, "Analyze-Then-Compress" (ATC) represents an alternative approach to visual analysis in a networked scenario. Such a paradigm aims at moving part of the analysis from the central unit directly to the sensing nodes. In particular, nodes process visual content in order to extract relevant information in the form of visual features. Then, such information is compressed and sent to a central unit, where visual analysis takes place. The key tenet is that the rate necessary to encode visual features in ATC might be lower than the rate needed for the original visual content in CTA, when targeting the same level of efficiency in the visual analysis. This is particularly relevant in those applications in which visual analysis requires access to video sequences. Therefore, in order to maximize the rate saving, it is necessary to carefully select suitable visual features and to design efficient coding schemes. In this paper we consider the problem of encoding both local and global binary features extracted from video sequences. The choice of this class of visual features is well motivated from different standpoints [2].

Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy, email: [email protected]. The project GreenEyes acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 296676. The material in this paper has been partially presented in [1]. Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].
First, binary features are significantly faster to compute than real-valued features such as SIFT [3] or SURF [4], thus being suitable whenever energy resources are an issue, such as in the case of low-power devices, where they may constitute the only viable option. Second, binary features have recently been shown to deliver performance close to state-of-the-art real-valued features. Third, they can be compactly represented and coded with just a few bits [5]. Fourth, binary features are faster to match, thus being suitable when dealing with large-scale collections. The processing pipeline for the extraction of local features comprises: i) a keypoint detector, which is responsible for the identification of a set of salient keypoints within an image, and ii) a keypoint descriptor, which assigns a description vector to each identified keypoint, based on the local image content. Within the class of local binary descriptors, BRIEF [6] computes the descriptor elements as the result of pairwise comparisons between (smoothed) pixel intensity values that are randomly sampled from the neighborhood of a keypoint. BRISK [7], FREAK [8] and ORB [9] are inspired by BRIEF and, similarly to their predecessor, are also based on pairwise pixel intensity comparisons. They differ from each other in the way pixel pairs are spatially sampled in the image patch surrounding a given keypoint. In particular, they introduce ad-hoc spatial patterns that define the locations of the pixels to be compared. Furthermore, differently from BRIEF, they are designed so that the generated binary descriptors are scale- and rotation-invariant. More recently, in order to bridge the gap between binary and real-valued descriptors, BAMBOO [10], [11] adopts a richer dictionary of pixel intensity comparisons and selects the most discriminative ones by means of a boosting algorithm. This leads to a matching accuracy similar to that of SIFT, while being 50x faster to compute. A similar idea is also exploited by BinBoost [12], which proposes a boosted binary descriptor based on a set of local gradients. BinBoost is shown to deliver state-of-the-art matching accuracy, at the cost of a computational complexity comparable to that of real-valued descriptors such as SIFT or SURF. On the other hand, global features represent a suitable alternative to local features in scenarios in which very large amounts of data have to be processed, stored and matched. Global features computed by summarizing local features into a fixed-dimensional feature vector have been effectively employed in the context of large-scale image and video retrieval [13]. Global features can be computed based on the Bag-of-Visual-Words (BoVW) model [14], which is inspired by traditional text-based retrieval. VLAD [15] and Fisher Vectors [16] represent more sophisticated approaches that achieve improved compactness and matching performance.
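The pairwise-comparison scheme underlying BRIEF-style binary descriptors can be illustrated with a minimal sketch. This is not the actual BRIEF/BRISK implementation: the function name is ours, pixel pairs are sampled at random as in BRIEF, and patch smoothing is omitted.

```python
import numpy as np

def brief_like_descriptor(patch, n_pairs=256, rng_seed=0):
    """Toy BRIEF-style binary descriptor: each bit is the outcome of a
    pairwise intensity comparison between two pixels drawn from the
    patch surrounding a keypoint."""
    rng = np.random.default_rng(rng_seed)
    h, w = patch.shape
    # Random pixel pairs (as in BRIEF); BRISK/FREAK/ORB instead use
    # hand-crafted spatial sampling patterns.
    pts_a = rng.integers(0, [h, w], size=(n_pairs, 2))
    pts_b = rng.integers(0, [h, w], size=(n_pairs, 2))
    a = patch[pts_a[:, 0], pts_a[:, 1]]
    b = patch[pts_b[:, 0], pts_b[:, 1]]
    return (a < b).astype(np.uint8)  # P-dimensional binary vector

patch = np.arange(1024, dtype=np.float32).reshape(32, 32)
d = brief_like_descriptor(patch)
print(d.shape)  # a 256-bit descriptor
```

Because the sampling pattern is fixed by the seed, the same patch always yields the same descriptor, which is what makes such descriptors matchable by Hamming distance.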
More recently, the problem of building global features starting from sets of binary features was addressed in [17] and [18], extending, respectively, the BoVW and VLAD models to the case of local binary features. Solutions based on global image descriptors offer a good compromise between efficiency and accuracy, especially considering large-scale image retrieval and classification. Nonetheless, local features still play a fundamental role, being usually employed to refine the results of such tasks [19], [14]. Furthermore, approaches based on global features disregard the spatial configuration of the keypoints, preventing the use of spatial verification mechanisms and thus being unsuitable for tracking and structure-from-motion scenarios [20], [21]. This paper proposes a number of novel contributions: 1) We consider the problem of coding local binary features extracted from video sequences, exploiting both intra- and inter-frame coding. In this respect, we adopt the general architecture of our previous work [22], which targeted real-valued features, and propose coding tools specifically devised for binary features. 2) For the first time, we consider the problem of coding global binary features extracted from video sequences, obtained by summarizing local features according to the BoVW model, exploiting both intra- and inter-frame coding. 3) We evaluate the proposed coding scheme in terms of rate-efficiency curves for two different visual analysis tasks:

homography estimation and content-based retrieval. We show the impact of the main configuration parameters, namely, the number of keypoints, descriptor elements and visual words. Unlike our previous work, content-based retrieval is evaluated by means of a complete image retrieval pipeline, in which a video is used to query an image database. 4) We compare the overall performance of ATC vs. CTA for both analysis tasks. In the case of homography estimation, we show that ATC based on local features always outperforms CTA by a large margin. In the case of content-based retrieval, we show that ATC achieves a significantly lower bitrate than CTA when using global features, while it is on a par with CTA when using local features. In the context of local visual features, several past works tackled the problem of compressing both real-valued and binary local features extracted from still images. As for real-valued local features, architectures based on closed-loop predictive coding [23], transform coding [24], [25] and hashing [26] were proposed. In this context, an ad-hoc MPEG group on Compact Descriptors for Visual Search (CDVS) has been working towards the definition of a standard [27] that relies on SIFT features. As for binary local features, predictive coding architectures aimed at exploiting either inter-descriptor correlation [28] or intra-descriptor redundancy [29] were proposed. Furthermore, Monteiro et al. proposed a clustering-based coding architecture tailored to the context of binary descriptors [30]. Moreover, some works aimed at modifying traditional extraction algorithms, so that the output data is more compact or more suitable for compression. In this context, CHoG [31] is a gradient-based descriptor that offers performance comparable to that of SIFT at a much lower bitrate. As an alternative approach, Chao et al. [32] studied how to adjust the JPEG quantization matrix in order to preserve local features extracted from decoded images.
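As a concrete illustration of the BoVW global representation discussed above, the following sketch assigns each binary local descriptor to its nearest visual word by Hamming distance and accumulates a normalized histogram. It is only a toy: the vocabulary here is random, whereas a real system (e.g., the binary BoVW extension of [17]) obtains the visual words through clustering on training data.

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """Toy BoVW for binary features: assign each P-bit descriptor to the
    visual word at minimum Hamming distance, then build a normalized
    histogram of word occurrences (the global feature)."""
    # Hamming distance between every descriptor and every visual word.
    dist = (descriptors[:, None, :] != vocabulary[None, :, :]).sum(axis=2)
    words = dist.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
descs = rng.integers(0, 2, size=(100, 64), dtype=np.uint8)  # 100 local features
vocab = rng.integers(0, 2, size=(16, 64), dtype=np.uint8)   # 16 visual words
h = bovw_histogram(descs, vocab)
print(h.shape)  # one 16-bin global feature per frame
```

Note how the histogram discards keypoint positions entirely, which is precisely why spatial verification is no longer possible with global features alone.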
The problem of encoding visual features extracted from video content has been addressed only very recently. Makar et al. [33], [34] propose to encode and transmit temporally coherent image patches in the pixel domain, for augmented reality and image recognition applications. Thus, the detector is applied at the transmitter side, while the descriptors are extracted from decoded patches at the receiver. The encoding of local features (both keypoint locations and descriptors) extracted from video sequences was addressed for the first time in [35] for the case of real-valued features (SIFT and SURF) and later extended in [22]. To the best of the authors' knowledge, the encoding of streams of binary features has not been addressed in the previous literature. Furthermore, the interest of the scientific community in this kind of problem is witnessed by the creation of a new MPEG ad-hoc group, namely Compact Descriptors for Video Analysis (CDVA), which has recently started its activities [36]. CDVA targets the standardization of the extraction and coding of visual features in application scenarios including video retrieval, automotive, surveillance and industrial monitoring, in which video, rather than still images, plays a key role. The rest of this paper is organized as follows. Section II


states the problem of coding sets of local binary descriptors, defining the properties of the features to be coded, whereas Section III illustrates the coding architecture. Section IV introduces the problem of coding Bags-of-Visual-Words extracted from a video sequence and Section V defines the coding algorithms. Section VI is devoted to defining the experimental setup and reporting the results. Finally, conclusions are drawn in Section VII.

II. CODING LOCAL FEATURES: PROBLEM STATEMENT

Let $I_n$ denote the $n$-th frame of a video sequence, which is processed to extract a set of local features $\mathcal{D}_n$. First, a keypoint detector is applied to identify a set of interest points. Then, a descriptor is applied on the (rotated) patches surrounding each keypoint. Hence, each element $d_{n,i} \in \mathcal{D}_n$ is a visual feature, which consists of two components: i) a 4-dimensional vector $p_{n,i} = [x, y, \sigma, \theta]^T$, indicating the position $(x, y)$ and the scale $\sigma$ of the detected keypoint, and the orientation angle $\theta$ of the image patch; ii) a $P$-dimensional binary vector $d_{n,i} \in \{0, 1\}^P$, which represents the descriptor associated to the keypoint $p_{n,i}$. We propose a coding architecture which aims at efficiently coding the sequence $\{\mathcal{D}_n\}_{n=1}^{N}$ of sets of local features. In particular, we consider both lossless and lossy coding schemes: in the former, the binary description vectors are preserved throughout the coding process, whereas in the latter only a subset of $K < P$ descriptor elements is losslessly coded, thus discarding a part of the original data. Each decoded descriptor can be written as $\tilde{d}_{n,i} = \{\tilde{p}_{n,i}, \tilde{d}_{n,i}\}$. The number of bits necessary to encode the $M_n$ visual features extracted from frame $I_n$ is equal to

$$R_n = \sum_{i=1}^{M_n} \big( R^{p}_{n,i} + R^{d}_{n,i} \big). \qquad (1)$$

That is, we consider the rate used to represent both the location of the keypoint, $R^{p}_{n,i}$, and the descriptor itself, $R^{d}_{n,i}$. For both the lossless and the lossy approach, no distortion is introduced during the coding process in the received descriptor elements. Nonetheless, since in the lossy case part of the descriptor elements is discarded, the accuracy of the visual analysis task might be affected. As for the component $\tilde{p}_{n,i}$, we decided to encode the coordinates of the keypoint, the scale and the local orientation, i.e., $\tilde{p}_{n,i} = [\tilde{x}, \tilde{y}, \tilde{\sigma}, \tilde{\theta}]^T$. Although some visual analysis tasks might not require this information, it could be used to refine the final results. For example, it is necessary when the matching score between image pairs is computed based on the number of matches that pass the spatial verification step using, e.g., RANSAC [19] or weak geometry checking [20]. Most detectors produce floating-point values for keypoint coordinates, scale and orientation, thanks to interpolation mechanisms. Nonetheless, we decided to round such values with a quantization step size equal to 1/4 for the coordinates and the scale, and $\pi/16$ for the orientation, which has been found to be sufficient for typical applications [35], [22].
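The rounding of the keypoint parameters described above can be sketched as follows (a minimal illustration; the function name is ours, while the step sizes are the ones stated in the text):

```python
import math

def quantize_keypoint(x, y, sigma, theta):
    """Round keypoint parameters: step 1/4 for the coordinates and the
    scale, pi/16 for the orientation, as in the paper."""
    q = lambda v, step: round(v / step) * step
    return (q(x, 0.25), q(y, 0.25), q(sigma, 0.25), q(theta, math.pi / 16))

# A detector's sub-pixel keypoint snapped to the quantization grid.
print(quantize_keypoint(10.13, 7.41, 2.6, 0.5))
# → (10.25, 7.5, 2.5, 3*pi/16 ≈ 0.589)
```

After quantization, the location component can be entropy coded directly (intra) or differentially with respect to a reference keypoint (inter).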

III. CODING LOCAL FEATURES: ALGORITHMS

Figure 1 illustrates a block diagram of the proposed coding architecture. The scheme is similar to the one we recently proposed for encoding real-valued visual features [35], [22]. However, we highlighted the functional modules that needed to be revisited due to the binary nature of the source.

A. Intra-frame coding

In the case of intra-frame coding, local features are extracted and encoded separately for each frame. In our previous work we proposed an intra-frame coding approach tailored to binary descriptors extracted from still images [5], which is briefly summarized in the following. In binary descriptors, each element represents the binary outcome of a pairwise comparison. The descriptor elements (dexels) are statistically dependent, and it is possible to model the descriptor as a binary source with memory. Let $\pi_j$, $j \in [1, P]$, represent the $j$-th element of a binary descriptor $d \in \{0, 1\}^P$. The entropy of such a dexel can be computed as

$$H(\pi_j) = -p_j(0) \log_2(p_j(0)) - p_j(1) \log_2(p_j(1)), \qquad (2)$$

where $p_j(0)$ and $p_j(1)$ are the probabilities of $\pi_j = 0$ and $\pi_j = 1$, respectively. Similarly, the conditional entropy of dexel $\pi_{j_1}$ given dexel $\pi_{j_2}$ can be computed as

$$H(\pi_{j_1} | \pi_{j_2}) = \sum_{x \in \{0,1\},\, y \in \{0,1\}} p_{j_1, j_2}(x, y) \log_2 \frac{p_{j_2}(y)}{p_{j_1, j_2}(x, y)}, \qquad (3)$$

with $j_1, j_2 \in [1, P]$. Let $\tilde{\pi}_j$, $j = 1, \ldots, P$, denote a permutation of the dexels, indicating the sequential order used to encode a descriptor. The average code length needed to encode a descriptor is lower bounded by

$$R = \sum_{j=1}^{P} H(\tilde{\pi}_j | \tilde{\pi}_{j-1}, \ldots, \tilde{\pi}_1). \qquad (4)$$

In order to maximize the coding efficiency, we aim at finding the permutation of dexels $\tilde{\pi}_1, \ldots, \tilde{\pi}_P$ that minimizes such a lower bound. For the sake of simplicity, we model the source as a first-order Markov source. That is, we impose $H(\tilde{\pi}_j | \tilde{\pi}_{j-1}, \ldots, \tilde{\pi}_1) = H(\tilde{\pi}_j | \tilde{\pi}_{j-1})$. Then, we adopt the following greedy strategy to reorder the dexels:

$$\tilde{\pi}_j = \begin{cases} \arg\min_{\pi_j} H(\pi_j) & j = 1 \\ \arg\min_{\pi_j} H(\pi_j | \tilde{\pi}_{j-1}) & j \in [2, P] \end{cases} \qquad (5)$$
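The greedy reordering of Eqs. (2), (3) and (5) can be sketched directly from descriptor statistics. The sketch below estimates the (conditional) entropies from a toy set of random binary descriptors; in the paper, the statistics are learned from training sequences, and the function names are ours.

```python
import numpy as np

def entropy(p):
    """Binary entropy of a Bernoulli probability p (Eq. 2)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def cond_entropy(col_a, col_b):
    """H(a | b) for two binary columns, estimated from samples (Eq. 3)."""
    h = 0.0
    for x in (0, 1):
        for y in (0, 1):
            p_xy = np.mean((col_a == x) & (col_b == y))
            p_y = np.mean(col_b == y)
            if p_xy > 0:
                h += p_xy * np.log2(p_y / p_xy)
    return h

def greedy_dexel_order(D):
    """Greedy reordering (Eq. 5): start from the dexel with minimum
    entropy, then repeatedly pick the unused dexel with minimum
    conditional entropy given the previously selected one."""
    P = D.shape[1]
    remaining = set(range(P))
    first = min(remaining, key=lambda j: entropy(D[:, j].mean()))
    order = [first]
    remaining.remove(first)
    while remaining:
        prev = order[-1]
        nxt = min(remaining, key=lambda j: cond_entropy(D[:, j], D[:, prev]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

rng = np.random.default_rng(0)
D = rng.integers(0, 2, size=(500, 8), dtype=np.uint8)  # toy training descriptors
print(greedy_dexel_order(D))  # a permutation of the 8 dexels
```

The first-order Markov assumption is what makes this tractable: only pairwise statistics are needed, rather than the full joint distribution over $P$ dexels.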

The reordering of the dexels is described by means of a permutation matrix $T^{INTRA}$, such that $\tilde{c}_{n,i} = T^{INTRA} d_{n,i}$. Note that such an optimal ordering is computed offline, thanks to a training phase, and shared between the encoder and the decoder. As such, it does not require additional bitrate.

B. Inter-frame coding

As for inter-frame coding, each set of local features $\mathcal{D}_n$ is coded with respect to a reference set of features. In this work we consider as a reference the set of features extracted from the previous frame, i.e., $\mathcal{D}_{n-1}$. Considering a descriptor $d_{n,i}$,

the encoder searches for the best matching reference feature within the candidate set:

$$l^* = \arg\min_{l \in C} D(d_{n,i}, d_{n-1,l}), \qquad (6)$$

where $D(d_{n,i}, d_{n-1,l}) = \|d_{n,i} \oplus d_{n-1,l}\|_0$ is the Hamming distance between the descriptors $d_{n,i}$ and $d_{n-1,l}$, and $l^*$ is the index of the selected reference feature used in the next step. We limit the search to a set $C$ of candidate features, i.e., the ones whose coordinates and scales are in the neighborhood of $d_{n,i}$, in a range of $(\pm \Delta x, \pm \Delta y, \pm \Delta \sigma)$. The prediction residual is computed as $c_{n,i} = d_{n,i} \oplus d_{n-1,l^*}$, that is, the bitwise XOR between $d_{n,i}$ and $d_{n-1,l^*}$.

- Coding mode decision: the cost of inter-frame coding is compared with that of intra-frame coding, which can be expressed as

$$J^{INTRA}(d_{n,i}) = R^{p,INTRA}_{n,i} + R^{d,INTRA}_{n,i}, \qquad (7)$$

$$J^{INTER}(d_{n,i}, d_{n-1,l^*}) = R^{p,INTER}_{n,i}(l^*) + R^{d,INTER}_{n,i}(l^*), \qquad (8)$$

where $R^{p}_{n,i}$ and $R^{d}_{n,i}$ represent the bitrate needed to encode the location component (either the location itself or the location displacement) and the descriptor component (either the descriptor itself or the prediction residual), respectively. If $J^{INTER}(d_{n,i}, d_{n-1,l^*}) < J^{INTRA}(d_{n,i})$, the descriptor is coded in inter mode; otherwise, intra-frame coding is used. The output of this procedure is a set of dexels, ordered according to their discriminability. Hence, given a target descriptor size $K < P$, it is possible to encode only the first $K$ descriptor elements.

4. EXPERIMENTS

For the evaluation process, we extracted BRISK [5] features from a set of six video sequences at CIF resolution (352×288) and 30 fps, namely Foreman, Mobile, Hall, Paris, News and Mother, each with 300 frames [26]. In the training phase, we employed three sequences (Mother, News and Paris), whereas the remaining sequences (Hall, Mobile and Foreman) were employed for testing purposes. The statistics of the symbols to be fed to the entropy coder were learned based on the descriptors extracted from the training sequences. Moreover, the training video sequences were exploited to obtain the optimal coding order of dexels for both intra- and inter-frame coding, as illustrated in Section 3.1. Starting from the original BRISK descriptor consisting of P = 512 dexels, we considered a set of target descriptor sizes K = {512, 256, 128, 64, 32, 16, 8}. For each of such descriptor sizes, we employed the selection algorithm presented in Section 3.3.
