RATE-EFFICIENT VISUAL CORRESPONDENCES USING RANDOM PROJECTIONS

Chuohao Yeo, Parvez Ahammad and Kannan Ramchandran
Dept. of EECS, University of California, Berkeley, California

ABSTRACT

We consider the problem of establishing visual correspondences in a distributed and rate-efficient fashion by broadcasting compact descriptors. Establishing visual correspondences is a critical task before other vision tasks can be performed in a wireless camera network. We propose the use of coarsely quantized random projections of descriptors to build binary hashes, and use the Hamming distance between binary hashes as the matching criterion. In this work, we derive the analytic relationship between the Hamming distance of the binary hashes and the Euclidean distance between the original descriptors. We present experimental verification of our result, and show that for the task of finding visual correspondences, sending binary hashes is more rate-efficient than prior approaches.

1. INTRODUCTION

The availability of cheap wireless sensor motes with imaging capability has inspired research on wireless camera networks that can be cheaply deployed for applications such as environment monitoring, scene reconnaissance and 3DTV. For these and other processing tasks such as camera calibration (shown in Figure 1), establishing visual correspondences is a critical step. This step is usually performed by first locating features in input images, computing descriptors for each of the features, and then comparing descriptors of features across cameras to determine which features are in correspondence. Even in a centralized setting, establishing correspondences in this fashion is highly challenging due to changes in viewing angles, illumination, and occlusions. Advances from the computer vision community in computing features [1] and descriptors [2; 3] have at least made establishing visual correspondences reasonably successful.
However, in a distributed camera network setting, there is no luxury of having images from all of the cameras at a central processor for free; the communication costs of exchanging information must be taken into account. In building a vision graph that indicates which cameras in a network have significant overlap in their field of view, Cheng et al. introduced a rate-efficient feature digest used to determine correspondences [4]. Their feature digest is constructed from features and their descriptors, and rate reduction is accomplished by applying Principal Components Analysis (PCA) at each camera and sending only the top principal

[Footnote: Chuohao Yeo is funded by the Agency for Science, Technology and Research, Singapore (A*STAR).]

978-1-4244-1764-3/08/$25.00 ©2008 IEEE


[Fig. 1 diagram: camera B (i) transmits information to camera A, which (ii) finds correspondences; these feed camera calibration, object recognition, scene understanding, and novel view rendering of the scene.]
Fig. 1. Problem setup. A typical wireless camera network would have many cameras observing the scene. In many vision applications establishing visual correspondences between camera views is a key step. In this paper, we study the problem within the dashed ellipse: cameras A and B observe the same scene, and camera B sends information to camera A such that camera A can determine a list of visual correspondences between cameras A and B. The objective of this work is to find a way to efficiently transmit such information.

components. However, bit allocation is done in an arbitrary fashion by using 4 bytes per component. Furthermore, performing PCA locally is computationally taxing. Yeo et al. exploited the correlation between descriptors of features in correspondence for further rate gains by using distributed source coding (DSC) [5]. Their framework also allows for a principled way of performing bit allocation based on estimated descriptor statistics. However, computing Euclidean distances between all possible pairs of descriptors on a light-weight processing platform still presents a formidable computational load.

To address the above concerns, we propose an alternative approach for determining visual correspondences in a rate-efficient fashion using random projections. Inspired by work from Roy and Sun, we use coarsely quantized random projections to build a descriptor hash [6]; the Hamming distance between hash bits can then be used to determine if two features are in correspondence. In this work, we derive analytically and verify empirically the statistical relationship between the Hamming distance of the binary hashes and the Euclidean distance between the original descriptors. We believe this architectural approach would be useful for vision tasks in a practical distributed camera network. First, we show through our experiments that sending descriptor hashes is more rate-efficient for establishing visual correspondences than sending the descriptors. Second, computing Hamming distances between descriptor hashes is computationally cheaper than computing

ICIP 2008

Euclidean distances between descriptors. Third, each camera sensor only has to compute features and descriptors for its own image, instead of computing features and descriptors for every image it receives from other sensors in the network.

2. PROBLEM SETUP

We focus on the problem of finding visual correspondences between two cameras (shown within the dashed ellipse in Figure 1), denoted camera A and camera B, communicating under rate constraints. Camera B should transmit information to camera A such that camera A can determine a list of point correspondences with camera B. Although we use the two-camera problem to illustrate our approach, the approach presented in this paper can be directly extended to a multiple-camera scenario. While we demonstrate our proposed method on a particular choice of feature detector and descriptor, namely the Hessian-Affine region detector [1] and the Scale-Invariant Feature Transform (SIFT) descriptor [2], the framework is generally applicable to any other combination of feature detector and descriptor.

Let $A_i$ denote the $i$th feature out of $N_A$ features in camera A, with image coordinates $(x_i^A, y_i^A)$ and descriptor $D_i^A \in \mathbb{R}^n$, and $B_j$ denote the $j$th feature out of $N_B$ features in camera B, with image coordinates $(x_j^B, y_j^B)$ and descriptor $D_j^B \in \mathbb{R}^n$. Note that descriptors are $n$-dimensional; for typical SIFT descriptors, $n = 128$ [2]. In this work, camera A will determine that $A_i$ corresponds with $B_j$ if they satisfy the Euclidean matching criterion,

$$\|D_i^A - D_j^B\|_2 < \tau \qquad (1)$$

for some acceptance threshold $\tau$. Empirical evidence in the computer vision literature suggests that this is a reasonable criterion [2; 3]. While picking a suitable threshold is certainly critical to the visual correspondence task, here we focus only on how to determine visual correspondences in a rate-efficient manner given a pre-determined threshold.

3. APPROACH

For a feature point with descriptor $D \in \mathbb{R}^n$, we construct an $m$-bit binary hash, $d \in \{0, 1\}^m$, from $D$ using random projections as follows [6]. First, randomly generate a set of $m$ hyperplanes that pass through the origin, $\mathcal{H} = \{H_1, H_2, \ldots, H_m\}$; denote the normal vector of the $k$th hyperplane, $H_k$, by $h_k \in \mathbb{R}^n$. Next, the $k$th bit of $d$, $d_k \in \{0, 1\}$, is computed based on which side of the hyperplane $D$ lies. In other words,

$$d_k = I\left(h_k \cdot D > 0\right) \qquad (2)$$

The intuition for using such a hash is that if two descriptors are close, then they will lie on the same side of a large number of hyperplanes, and hence have a large number of hash bits in agreement [6]. Therefore, to determine if two descriptors are in correspondence, we can simply threshold their Hamming distance.
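The hash construction in (2) and the Hamming-distance matching rule can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the authors' implementation: all function names are ours, and we draw hyperplane normals with i.i.d. Gaussian components, a standard way of obtaining uniformly random hyperplane directions.

```python
import math
import random

def random_hyperplanes(m, n, rng):
    """Draw m hyperplane normals in R^n; i.i.d. Gaussian components
    give directions uniform over the unit sphere."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hash_descriptor(D, hyperplanes):
    """m-bit binary hash: bit k records which side of hyperplane k
    the descriptor D lies on (Eq. 2)."""
    return [1 if dot(h, D) > 0 else 0 for h in hyperplanes]

def hamming(d1, d2):
    """Hamming distance between two equal-length bit lists."""
    return sum(b1 != b2 for b1, b2 in zip(d1, d2))

def match_by_hash(hashes_A, hashes_B, m, tau):
    """Declare (i, j) a correspondence when the Hamming distance falls
    below the threshold implied by the Euclidean threshold tau
    (Section 3.2): (2m / pi) * asin(tau / 2)."""
    t_H = (2.0 * m / math.pi) * math.asin(tau / 2.0)
    return [(i, j)
            for i, dA in enumerate(hashes_A)
            for j, dB in enumerate(hashes_B)
            if hamming(dA, dB) < t_H]
```

For this to work across cameras, A and B must use the same hyperplane set, e.g. by agreeing on a common pseudo-random seed before deployment.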


3.1. Analysis

To pick a suitable threshold, we need to understand how Hamming distances between descriptor hashes are related to Euclidean distances between descriptors. Since SIFT descriptors are normalized in the last step of descriptor computation [2], they lie on the surface of an $n$-dimensional sphere. For convenience and without loss of generality, we will assume that the normalization is to unit length. With this property, we can first show the following theorem about how a single hash bit relates to the distance between two descriptors, and then use it to establish the relationship between the Hamming distance of the binary hashes and the Euclidean distance between the descriptors.

Theorem 1. Suppose $A_i$ and $B_j$, with $n$-dimensional descriptors $D_i^A$ and $D_j^B$ respectively, are separated by Euclidean distance $\delta$, i.e. $\|D_i^A - D_j^B\|_2 = \delta$. Then, the probability that a randomly (uniformly) generated hyperplane will separate the descriptors is $\frac{2}{\pi} \sin^{-1} \frac{\delta}{2}$.

Corollary 1. Suppose $A_i$ and $B_j$, with $n$-dimensional descriptors $D_i^A$ and $D_j^B$ respectively, are separated by Euclidean distance $\delta$, i.e. $\|D_i^A - D_j^B\|_2 = \delta$. If we generate $m$-bit binary hashes, $d_i^A$ and $d_j^B$, from $D_i^A$ and $D_j^B$ respectively, then their Hamming distance, $d_H(d_i^A, d_j^B)$, has a binomial distribution, $\mathrm{Bi}(m, p_{ij}^{AB})$, where $p_{ij}^{AB} = \frac{2}{\pi} \sin^{-1} \frac{\delta}{2}$. Furthermore, the ML estimate of the Euclidean distance between the descriptors is given by $\hat{\delta} = 2 \sin\left(\frac{d_H(d_i^A, d_j^B)\,\pi}{2m}\right)$.
Proof of Corollary 1. $d_H(d_i^A, d_j^B)$ is just the number of times a randomly generated hyperplane separates the two descriptors; since the hyperplanes are generated independently, the Hamming distance has a binomial distribution with the Bernoulli parameter given by Theorem 1. The ML estimate can then be found in a straightforward fashion. Notice that the ML estimate is independent of the dimensionality of the descriptor.

To prove Theorem 1, we need the following lemma.

Lemma 1. Suppose $A_i$ and $B_j$, with 2-dimensional descriptors $D_i^A$ and $D_j^B$ respectively, are separated by Euclidean distance $\delta$, i.e. $\|D_i^A - D_j^B\|_2 = \delta$. Then, the probability that a randomly (uniformly) generated hyperplane will separate the descriptors is $\frac{2}{\pi} \sin^{-1} \frac{\delta}{2}$.

Proof. In the simple case of 2 dimensions, $D_i^A$ and $D_j^B$ lie on a unit circle with center at the origin, since SIFT descriptors are normalized to have unit norm. A randomly (uniformly) generated hyperplane in this case is just a line passing through the origin, with equal probability of being in any orientation. Observe that the hyperplane (line) separates the descriptors (denote this event $E$) if and only if it intersects the shorter of the arcs connecting $D_i^A$ and $D_j^B$. Hence,

$$P(E) = \frac{\text{Arc length between } D_i^A \text{ and } D_j^B}{\pi} = \frac{2 \sin^{-1} \frac{\delta}{2}}{\pi}$$
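Theorem 1 and the estimator of Corollary 1 lend themselves to a quick numerical sanity check, in the spirit of the Monte-Carlo experiment reported in Section 4.2. The sketch below is ours: helper names and parameters are illustrative, not taken from the paper.

```python
import math
import random

def unit_vector(n, rng):
    """Uniformly random point on the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

def pair_at_distance(n, delta, rng):
    """Two unit vectors in R^n at Euclidean distance delta, built by
    rotating in the plane of two random orthonormal directions."""
    u = unit_vector(n, rng)
    w = unit_vector(n, rng)
    proj = sum(a * b for a, b in zip(u, w))       # Gram-Schmidt step
    w = [b - proj * a for a, b in zip(u, w)]
    s = math.sqrt(sum(x * x for x in w))
    w = [x / s for x in w]
    theta = 2.0 * math.asin(delta / 2.0)          # angle subtending chord delta
    v = [math.cos(theta) * a + math.sin(theta) * b for a, b in zip(u, w)]
    return u, v

def separation_probability(u, v, trials, rng):
    """Fraction of random hyperplanes through the origin separating u, v."""
    n = len(u)
    sep = 0
    for _ in range(trials):
        h = [rng.gauss(0.0, 1.0) for _ in range(n)]
        a = sum(x * y for x, y in zip(h, u))
        b = sum(x * y for x, y in zip(h, v))
        if (a > 0) != (b > 0):
            sep += 1
    return sep / trials

rng = random.Random(1)
delta = 1.0
u, v = pair_at_distance(16, delta, rng)
p_hat = separation_probability(u, v, 50000, rng)
p_theory = (2.0 / math.pi) * math.asin(delta / 2.0)  # exactly 1/3 for delta = 1
delta_hat = 2.0 * math.sin(math.pi * p_hat / 2.0)    # Corollary 1 estimator
```

Because the separation probability does not depend on the descriptor dimension, the check passes for any choice of $n$; the estimate $\hat{\delta}$ converges to $\delta$ as the number of hyperplanes grows.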
Now we can easily prove Theorem 1.

Proof of Theorem 1. We show the result by reducing to the 2-D case of Lemma 1. $D_i^A$, $D_j^B$ and the origin define a plane, $S$. Observe that a hyperplane $H$ passing through the origin separates the descriptors if and only if the line of intersection between $H$ and $S$ also separates the projections of $D_i^A$ and $D_j^B$ on $S$ (almost surely). Since this line has equal probability of being in any orientation, the result follows by applying Lemma 1.

3.2. Determining correspondences

As mentioned in Section 2, we use the Euclidean matching criterion in this work to determine if two features are in correspondence. From Corollary 1, a reasonable approach is to threshold the Hamming distance by $\frac{2m}{\pi} \sin^{-1} \frac{\tau}{2}$.

4. EXPERIMENTS AND RESULTS

4.1. Setup

We evaluate our method on a dataset made publicly available¹ by Mikolajczyk and Schmid [3]. In “Graf”, the images are taken of a planar scene from different viewpoints, while in “Wall”, the images are taken by a camera undergoing pure rotation. Due to geometric constraints in each of these cases, the image views are related by a homography [7]. The dataset also includes a computed ground-truth homography, which allows ground-truth correspondence pairs to be extracted based on overlap error in the regions of detected features [3]. In addition, we have also recorded images of an office from a webcam undergoing pure rotation, processed them to obtain ground-truth correspondence pairs as in [3], and used them in our experiments.

4.2. Verification of Theorem

To demonstrate Theorem 1, we ran the following experiment on descriptors obtained from a separate set of training image pairs. We consider the set of all possible pairs of descriptors, and pick at random an equal number of corresponding and non-corresponding pairs.
We then compute the Euclidean distance between each pair, and estimate the probability that a randomly generated hyperplane separates the two points by performing a Monte-Carlo simulation with $5 \times 10^4$ trials. A scatter plot of the estimated probability vs. Euclidean distance is shown in Figure 2. We also plot the theoretical probabilities as derived in Theorem 1. Figure 2 shows that the simulation results agree with the analysis (as expected). Furthermore, the plot also verifies that good separation between

¹ http://www.robots.ox.ac.uk/~vgg/research/affine


Fig. 2. Simulation results (best viewed in color). We show the scatter plot of Euclidean distance between a pair of descriptors and the estimated probability of a randomly chosen hyperplane separating the pair for a randomly chosen subset of pairs of features. Note the close adherence to the theoretical result, and the good separation between corresponding and noncorresponding pairs.

corresponding and non-corresponding pairs can be obtained with an appropriately chosen Euclidean distance threshold.

4.3. Rate-performance trade-off

We compare the random projections (“RandProj”) approach, in which the binary hashes of descriptors are transmitted, with two other schemes, and show their relative rate-performance trade-offs. The first scheme, which we term “Plain”, simply quantizes the descriptor coefficients after applying a linear de-correlating transform [4]; the quantization chosen thus determines the rate used. The second scheme, which we term “DSC”, not only quantizes the descriptor coefficients, but also uses a DSC framework to exploit the correlation between corresponding descriptors for additional rate savings [5].

For each operating point, we compute the rate used per descriptor. We also compute the number of correspondences retrieved, $C_{\text{retrieve}}$, and the number of correctly retrieved correspondences, $C_{\text{correct}}$. Given the total number of ground-truth correspondences, $C_{\text{total}}$, we can then compute both the recall, $Re = C_{\text{correct}} / C_{\text{total}}$, and the precision, $Pr = C_{\text{correct}} / C_{\text{retrieve}}$. To perform camera calibration, it is important to have high recall of correct correspondences, as well as high precision to reduce the number of outliers, which adversely affect calibration performance. We use a threshold of $\tau = 0.1952$ which, according to Figure 2, would guarantee us high precision. With “RandProj”, we carried out 100 trials, each with a randomly generated set of hyperplanes, and computed the mean and standard deviation of the performance metrics at each operating point.

Figures 3(a) and 3(b) show how $Re$ and $Pr$ vary with rate for each scheme. Notice that for “RandProj”, the number of correct correspondences stays relatively constant as the number of projections increases, while for the other two schemes, the number of correct correspondences decreases drastically as the quantization step size increases. At low rates, “RandProj” clearly outperforms the other two schemes. We see that precision rapidly increases with bitrate for random projections, saturating at a level close to that of the other two schemes. For the plain and DSC schemes, precision is quite stable over the bit range investigated here. Error bars are also plotted for “RandProj” using the computed standard deviations to illustrate the amount of variation due to the different sets of hyperplanes chosen. As one might expect, as the number of projections used increases, the performance variation due to the choice of random hyperplanes decreases.

Fig. 3. Rate-performance tradeoff. (a) Number of correct correspondences vs. rate (recall vs. bits per descriptor); (b) precision vs. rate (precision vs. bits per descriptor). Unlike the “Plain” and “DSC” schemes, “RandProj” is able to return a relatively stable number of correct correspondences across a wide range of rates. For “RandProj”, as the number of hash bits increases, the ML estimate of the Euclidean distance improves, and hence precision also increases, though it saturates rapidly. “Plain” and “DSC” also suffer from a drop in precision at low rates, but to a smaller extent.

5. CONCLUDING REMARKS AND FUTURE WORK

In many computer vision tasks such as camera calibration and novel view rendering, establishing visual correspondences between cameras is a critical step. In this work, we consider the use of hash bits constructed from coarsely quantized random projections for establishing visual correspondences, and show analytically the probabilistic relationship between the Hamming distance of hash bits and the Euclidean distance between descriptors. Furthermore, our experimental results suggest that random projections offer a better performance-rate trade-off than simply quantizing and sending the original real-valued descriptors. Computing Hamming distances between binary hashes is also computationally cheaper than computing Euclidean distances between descriptors. In addition, sending descriptors instead of images makes sense in a distributed camera network, since we want to avoid having each camera sensor compute features and descriptors for each received image. While the actual choice of threshold might depend on natural scene statistics, the proposed framework can be applied to any natural scene, and used with any choice of feature detector and descriptor.

Although we only consider the two-camera case in this paper for simplicity, it is straightforward to extend it to a multi-camera setup. For example, each camera sensor can construct binary hashes of each descriptor and then broadcast the hashes. By repeating the proposed approach for each set of received hashes, each camera sensor can then determine its set of correspondences with every other camera in the network. For future work, with the flip probability given in Theorem 1, we can adapt the DSC framework proposed by Yeo et al. [5] by making use of binary block codes such as BCH or LDPC codes. While we only consider how rate affects correspondence performance in this work, it would be interesting to study its effect on downstream vision tasks such as camera calibration.
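As an illustration of the broadcast-based multi-camera extension described in the concluding remarks, here is a minimal sketch. The naming is ours, and the per-camera matching loop is simulated centrally with a single dict; in a real deployment each camera would run `correspondences` on the hash sets it receives.

```python
import math

def hamming(d1, d2):
    """Hamming distance between two equal-length bit lists."""
    return sum(b1 != b2 for b1, b2 in zip(d1, d2))

def correspondences(own_hashes, peer_hashes, m, tau):
    """Hamming-threshold matching (Section 3.2) against one peer:
    accept (i, j) when d_H < (2m / pi) * asin(tau / 2)."""
    t_H = (2.0 * m / math.pi) * math.asin(tau / 2.0)
    return [(i, j)
            for i, di in enumerate(own_hashes)
            for j, dj in enumerate(peer_hashes)
            if hamming(di, dj) < t_H]

def build_vision_links(all_hashes, m, tau):
    """Every camera broadcasts its hashes; every camera then matches
    its own hashes against each received set. Returns a dict keyed
    by (camera_a, camera_b) -> list of (feature_i, feature_j)."""
    links = {}
    cams = sorted(all_hashes)
    for a in cams:
        for b in cams:
            if a != b:
                links[(a, b)] = correspondences(
                    all_hashes[a], all_hashes[b], m, tau)
    return links
```

All cameras must share the same hyperplane set for their hashes to be comparable, which again can be arranged with a common pseudo-random seed.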

Acknowledgment: We would like to thank Shaowei Lin at University of California, Berkeley for helpful discussions on random projections.

References

[1] K. Mikolajczyk and C. Schmid, “Scale and affine invariant interest point detectors,” International Journal of Computer Vision, vol. 60, no. 1, pp. 63–86, 2004.
[2] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[3] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615–1630, 2005.
[4] Z. Cheng, D. Devarajan, and R. J. Radke, “Determining vision graphs for distributed camera networks using feature digests,” EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 57034, 11 pages, 2007, doi:10.1155/2007/57034.
[5] C. Yeo, P. Ahammad, and K. Ramchandran, “A rate-efficient approach for establishing visual correspondences via distributed source coding,” in Proc. SPIE Visual Communications and Image Processing, Jan. 2008.
[6] S. Roy and Q. Sun, “Robust hash for detecting and localizing image tampering,” in Proc. IEEE International Conference on Image Processing, Sep. 2007.
[7] Y. Ma, S. Soatto, J. Kosecka, and S. S. Sastry, An Invitation to 3-D Vision: From Images to Geometric Models, Springer-Verlag, 2004.
