Person Orientation and Feature Distances Boost Re-Identification

Jorge García*, Niki Martinel†, Gian Luca Foresti†, Alfredo Gardel* and Christian Micheloni†
*Department of Electronics, University of Alcalá, Alcalá de Henares, Madrid, Spain. Email: {jorge.garcia, alfredo}@depeca.uah.es
†Department of Mathematics and Computer Science, University of Udine, Udine, Italy. Email: {niki.martinel, gianluca.foresti, christian.micheloni}@uniud.it

This work is partially supported by UAH through the project "Identificación de Personas a partir de la Reconstrucción de Imágenes Múltiples" (IPRIM) with Ref.: CCG2013/EXP-064.

Abstract—Most of the open challenges in person re-identification arise from the large variations of human appearance and from the different camera views that may be involved, making pure feature matching an unreliable solution. To tackle these challenges, state-of-the-art methods assume that a unique inter-camera transformation of features exists between two cameras. However, the combination of viewpoints, scene illumination, photometric settings, etc., together with the appearance, pose and orientation of a person, makes the inter-camera transformation of features multi-modal. To address these challenges we introduce three main contributions. We propose a method to extract multiple frames of the same person with different orientations. We learn the pairwise feature dissimilarities space (PFDS), formed by the subspace of pairwise feature dissimilarities computed between images of persons with similar orientations and the subspace of pairwise feature dissimilarities computed between images of persons with non-similar orientations. Finally, a classifier is trained to capture the multi-modal inter-camera transformation of pairwise images for each subspace. To validate the proposed approach, we show the superior performance of our approach over state-of-the-art methods using two publicly available benchmark datasets.
I. INTRODUCTION
The person re-identification problem, formally defined as the problem of associating a given person acquired by a camera with the same person previously acquired by any other camera in the network, at any location and at any time instant, is increasingly gaining attention from the community. This challenging task is very important for surveillance applications such as inter-camera tracking [1], [2], multi-camera behavior analysis, etc. Although the problem can be alleviated by deploying a large number of sensors so that all the areas of the monitored environment are covered by camera fields-of-view (FoVs), the costs of system installation, maintenance, etc., make such a solution infeasible. Thus, in a real scenario, we have to deal with partial area coverage, which leads to the re-identification problem.

In the recent past the community has tackled the person re-identification problem using several approaches that differ in the way the person body is modeled, in which features are used (i.e., biometrics or appearance), in how matches
between individuals are computed, etc. While recent works in the field of person re-identification can be grouped on the basis of any of such categories, we group them as follows.

Discriminative signature based methods are the most commonly explored approaches for person re-identification. In [3], particular interest has been focused on finding the best set of features that can be exploited to match persons across cameras. A discriminative signature computed using Mean Riemannian Covariance patches was used in [4]. In [5], the distribution of color features projected into the log-chromaticity space was described using the shape context descriptor. In [6], an unsupervised framework was proposed to extract distinctive features; then a patch matching method was used together with adjacency constraints. These methods rely only on the discriminative power of appearance features to perform a pure feature matching. While no training is required and good results can be achieved when images are similar, this is still an unreliable solution, as such methods generally assume that features are not transformed between cameras.

Transformation learning based methods were explored in [7] to capture the transformation across non-overlapping cameras in a tracking scenario. Similarly, the problem of capturing the non-linear transformation between features was addressed in [8]. In [9], the implicit transformation function of features was learned by concatenating appearance feature vectors of persons viewed by different cameras. In [10], a Weighted Brightness Transfer Function that assigns unequal weights to observations based on how close they are to test observations was proposed. These methods assume that there is a unique transformation between features and that it can be used to project the features from one camera into the feature space of the other camera.

Distance learning based methods learn the best metric between appearance features of the same person across camera pairs. In [11], the Largest Margin Nearest Neighbour and a derivation of it were exploited. In [12], the re-identification problem was addressed in a transfer learning framework: a weighted maximum margin metric was learned in an online fashion and transferred from a generic metric to a candidate-set-specific metric. The problem of finding the optimal similarity measure was addressed in [13]. In [14], the re-identification problem was formulated as a local distance comparison problem, introducing an energy-based loss function that measures the similarity between appearance instances. Such approaches match features in the same feature space and do not consider cross-view
transformations.

Fig. 1. Person images acquired by the same camera. The person appearance looks very different among all the images due to the changes in pose and illumination conditions.

Motivation: Although much effort has been spent, due to the nature of the problem, re-identifying a person that moves across disjoint cameras still remains an open issue. Almost all of the existing works assume that a unique inter-camera transformation of features exists between two camera views, which is used to model the appearance of different perspectives of the person. However, we believe that the deployment and the configuration of the cameras (a combination of viewpoints, illumination changes, photometric settings, etc.), together with the appearance, pose and orientation of a person, give rise to multi-modal inter-camera transformations. In Fig. 1 we show that this is indeed the case: even though the same person is acquired by the same camera, the way he/she appears differs due to pose, light conditions, etc. This justifies the fact that the inter-camera variations cannot be modeled by means of a single transformation or metric.

Contribution: Inspired by this, and by the fact that, as the transformation between appearance features is multi-modal, so is the transformation of the distances between them [15], we build upon the idea that the transformation learned for the same person seen from different viewpoints may be less reliable than the one learned for the same person seen from the same point of view. So, in this work we introduce a novel re-identification approach where person orientation is used to learn different models for the different transformations that exist between pairwise feature dissimilarities. First we form the pairwise feature dissimilarities space (PFDS) and divide it into two main regions: the region for which the dissimilarities are computed between pairs of images with similar orientation, and the region containing all other pairwise orientations. Then, within each region, we train a classifier that captures all the possible multi-modal transformations of feature dissimilarities and uses them to discriminate between pairwise images of the same person (positive pair) and pairwise images of two different persons (negative pair).

The core contributions of this work can be summarized as follows. We introduce: (i) a method to extract images of the same person with different orientations so that we can capture the multi-modal appearance of a person; (ii) a method to form the PFDS and divide it into two subspaces on the basis of pairwise image orientations; (iii) a method to re-identify persons across disjoint camera views by learning the parameters of two decision surfaces, one separating each of the two PFDS subspaces. This allows us to pose the re-identification as a binary classification problem.

The rest of the paper is organized as follows. An overview of our person re-identification approach is given in section II. In section III the details of the modules that compose the system are described. The superior performance of our approach over existing state-of-the-art methods is shown in section IV. Finally, conclusions are drawn in section V.

II. OVERVIEW OF THE PROPOSED APPROACH
An overview of the proposed approach is shown in Fig. 2. The re-identification goal is achieved by exploiting four modules. The recovering-perspectives module extracts different short-term tracklets of a person so that images of the same person with different orientations and poses can be used. Then, in order to model the person appearance, the next module extracts global and local features to capture color, texture and shape information. The high-dimensional features extracted from a pair of images of persons acquired by two cameras are input to the feature dissimilarities module, which computes the pairwise feature dissimilarities vector (PFD). The set of all PFDs will be denoted as PFDS. Considering the orientation of the pairwise images, we split the PFDS into two groups: the former is composed of PFDs computed for images of persons that have similar orientation, while the latter is formed of PFDs computed for images that have different orientations. Finally, within each subspace, a binary classifier is trained to separate positive from negative PFDs.

III. METHODOLOGY
Let us suppose that we have a pair of cameras with non-overlapping FoVs, and let us assume that the people detection and tracking tasks in a single camera have already been solved [16]. Then, each camera provides a set of short-term tracklets of targets to re-identify.

A. Retrieving People Perspectives

Since a camera provides a short-term tracklet of each person crossing the scene, multiple images may be retrieved to obtain different perspectives of a person. The trajectory of a person may contain direction changes due to static objects located in the scene, crossings with other people, or due to his/her own path. These situations are exploited by our approach in order to obtain different perspectives of the person. Given an image trajectory T = (u, v, v_u, v_v), where u and v are image position vectors and v_u and v_v are image velocity vectors, we define the direction changes as the angle between the position vector and the velocity vector. In order to increase the accuracy of such direction changes, the vectors u and v used to estimate the person orientation were estimated by means of a Kalman filter. However, the direction-change value does not provide a perspective-camera relationship that allows comparison with perspectives from other short-term tracklets computed by other cameras. Therefore, it is necessary to find a relationship between different perspectives satisfying this restriction. Assuming people always walk in a forward direction, the angle between the trajectory vector and the camera vector (optical axis) projected on the ground floor provides a common inter-camera relationship that can be used to compare different perspectives. This angle can be defined as the estimated orientation of the person with respect to the camera. Let t = [t_x, t_y, t_z] and r = [σ_x, σ_y, σ_z] be the extrinsic parameters which denote the coordinate system transformation from the scene to the camera. Considering that the pan angle (σ_y) is null for the camera pair, the projection of the camera vector and the y axis of the world coordinate system are parallel.
Fig. 2. Overview of our approach. Given two tracklets of two persons acquired by disjoint cameras, we first recover the perspectives of the persons so that we have multiple frames of the same person viewed from different poses/viewpoints. Then, for each image, shape, color and texture features are extracted and the PFD between image pairs is computed. Finally, the re-identification is performed by sending the PFD to one of the two previously learnt binary classifiers. The classifier is chosen on the basis of the difference in orientation between the person images used to compute the PFD.
Thus, we can determine the orientation

\theta = \arctan\left(\frac{\Delta Y_w}{\Delta X_w}\right)    (1)

where \Delta Y_w and \Delta X_w are the coordinate difference vectors on the ground floor. Assuming that the translation between coordinate systems only depends on the camera height, the translation vector is t = [0, 0, t_z], where t_z is the height of the camera. Consequently, \Delta Y_w and \Delta X_w can be expressed as functions of the coordinate difference vectors in the camera frame (\Delta Y_c and \Delta X_c). Finally, we apply the image projection model using the intrinsic parameters to obtain an expression dependent on the image position vectors (u and v). That is

\Delta Y_w = -\sin\sigma_z\,\Delta X_c + \frac{\cos\sigma_z}{\cos\sigma_x}\,\Delta Y_c    (2)

\Delta X_w = \cos\sigma_z\,\Delta X_c + \frac{\sin\sigma_z}{\cos\sigma_x}\,\Delta Y_c    (3)
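To make the geometry concrete, the following is a minimal sketch (ours, not the authors' code) that evaluates eqs. (1)-(3) for a single direction change. How the camera-frame differences ΔX_c and ΔY_c are recovered from the Kalman-filtered image vectors through the intrinsic projection model is assumed to happen upstream.

```python
import numpy as np

def person_orientation(dXc, dYc, sigma_x, sigma_z):
    """Orientation of a walking person w.r.t. the camera optical axis.

    dXc, dYc : coordinate differences in the camera frame, assumed to be
               already recovered from the image position vectors (u, v)
               via the intrinsic projection model described above
    sigma_x  : camera tilt angle (radians)
    sigma_z  : camera roll angle (radians); the pan angle is assumed null
    """
    # Eq. (2): ground-plane displacement along Y
    dYw = -np.sin(sigma_z) * dXc + (np.cos(sigma_z) / np.cos(sigma_x)) * dYc
    # Eq. (3): ground-plane displacement along X
    dXw = np.cos(sigma_z) * dXc + (np.sin(sigma_z) / np.cos(sigma_x)) * dYc
    # Eq. (1); arctan2 keeps the full [-pi, pi] range
    return np.arctan2(dYw, dXw)
```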
Let S = {s_i | i = 1, ..., N} be a short-term tracklet captured by a camera, where s_i represents a snapshot of the person and N is the number of snapshots contained in S. The number of recovered perspectives depends on the resolution of the direction changes, expressed by Δθ, between the last image added and the image under analysis.

B. Feature extraction

State-of-the-art methods for person re-identification have successfully explored different appearance features to tackle the re-identification challenges [3]. Following them, we consider color, texture and shape features to build the feature representation of a given image I.

Color: We consider that most persons wear different clothes on the upper and lower body parts, so we first detect the two salient body parts (i.e., torso and legs) as in [17]. We discard the head region as it generally contains few and uninformative pixels. Then, we extract the histogram H_ω^c(I) ∈ R^{n_c} for each color component c and body part ω ∈ {T, L}, where T denotes the torso and L denotes the legs.
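As an illustration of the color representation, the sketch below (our own, not the authors' implementation) computes per-part, per-channel histograms in HSV and CIELab. The body-part masks are assumed to come from an external salient-part detector such as [17], and the bin counts follow the settings reported later in Section IV-A.

```python
import cv2
import numpy as np

# Bin counts per channel as reported in the implementation details (Sec. IV-A)
HSV_BINS, HSV_RANGES = [36, 25, 20], [[0, 180], [0, 256], [0, 256]]
LAB_BINS, LAB_RANGES = [20, 32, 32], [[0, 256], [0, 256], [0, 256]]

def part_color_histograms(image_bgr, part_mask):
    """Concatenated per-channel HSV and CIELab histograms of one body part.

    image_bgr : HxWx3 uint8 crop of the person
    part_mask : HxW uint8 mask of the body part (torso or legs), assumed to
                be produced by an external salient-part detector as in [17]
    """
    feats = []
    for code, bins, ranges in ((cv2.COLOR_BGR2HSV, HSV_BINS, HSV_RANGES),
                               (cv2.COLOR_BGR2Lab, LAB_BINS, LAB_RANGES)):
        converted = cv2.cvtColor(image_bgr, code)
        for c in range(3):
            h = cv2.calcHist([converted], [c], part_mask, [bins[c]], ranges[c])
            h = h.flatten()
            feats.append(h / (h.sum() + 1e-12))   # L1-normalised histogram
    return np.concatenate(feats)
```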
Texture: To extract texture features we consider the Gabor, Schmid and Leung-Malik filter banks. After convolving each image with a single Gabor filter, we compute the modulus of the response and quantize it into a histogram with g bins. We denote the set of all such histograms as {G_i(I)}_{i=1}^{I}, where i indicates the i-th Gabor filter. Similarly, we obtain the set of histograms {S_j(I)}_{j=1}^{J}, each of which has s bins, by convolving the image with the 13 standard Schmid filters. Finally, we convolve each given image with the Leung-Malik (LM) filter bank, consisting of first and second derivatives of Gaussians at 6 orientations and 3 scales, 8 Laplacian of Gaussian filters, and 4 Gaussians. The response of each filter is quantized into a histogram with m bins. {L_k(I)}_{k=1}^{K} is the set of all such histograms, where k indicates the k-th LM filter.

Shape: To capture the shape of a given person, the Pyramid Histogram of Oriented Gradients (PHOG) feature is used. Let l = 0, ..., L be the level of the spatial pyramid, and 4^l the number of cells into which the image is divided at each level l. Then, the PHOG feature P(I) is formed by concatenating all the HOG features extracted for each cell of the pyramid. This results in a vector of size b \sum_{l=0}^{L} 4^l, where b is the number of bins used to compute the HOG features.

C. Pairwise Feature Dissimilarities

Once all the features have been extracted from a pair of images acquired by two disjoint cameras, we compute the PFD as suggested in [15]. Using this approach, we take advantage of the feature dissimilarities, which are used to model the transformation across cameras. The PFD computed for a positive pair (i.e., for a pair of images of the same person) is denoted as Δ^+, while the PFD computed for a negative pair (i.e., for a pair of images of different persons) is denoted as Δ^-. The set of all positive and negative PFDs forms the PFDS Δ.
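A minimal sketch of how a PFD vector could be assembled is given below. Following the χ² distance mentioned in Section IV-A, each entry is the χ² distance between corresponding histograms of the two images; the dictionary layout used to hold the per-feature histograms is our own assumption, not the data structure of [15].

```python
import numpy as np

def chi2(p, q, eps=1e-12):
    """Chi-square distance between two L1-normalised histograms."""
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

def pairwise_feature_dissimilarity(feats_a, feats_b):
    """Build the PFD vector for a pair of images from two disjoint cameras.

    feats_a, feats_b : dicts mapping a feature name (e.g. 'hsv_torso',
    'gabor_3', 'phog') to its histogram.  Each PFD entry is the chi-square
    distance between the corresponding histograms of the two images.
    """
    return np.array([chi2(feats_a[k], feats_b[k]) for k in sorted(feats_a)])
```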
D. Dual-Classification

The goal of the dual-classification is shown in Fig. 3. Given the PFDS, we first compute the orientation distance for all PFDs as follows. Let θ(I^A) and θ(I^B) be the orientations of the persons in images I^A and I^B. The orientation distance is defined as

D_\theta\big(\theta(I^A), \theta(I^B)\big) = \arccos\big(\cos\theta(I^A)\cos\theta(I^B) + \sin\theta(I^A)\sin\theta(I^B)\big)    (4)

Then, we split the PFDS into two groups: PFDs with similar orientation, i.e., PFDs for which D_θ(·,·) < Th_θ, where Th_θ is a fixed threshold, and the rest of the PFDs. Then, for each of the two groups, the following actions are performed. We train an SVM classifier to learn the parameters of the decision boundary that best separates the PFDs computed for positive pairs from the ones computed for negative pairs. Given a subset of PFDs denoted as x_i, i = 1, ..., N, where N is the number of training samples, the goal is to minimize the error function

\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i    (5)

subject to the constraints y_i(w^T \phi(x_i) + b) \geq 1 - \xi_i and \xi_i \geq 0, where w is the vector of coefficients, \phi(x_i) is the feature map for x_i, \xi_i is the slack variable used to handle non-separable input data, and C is the regularization parameter. Once the minimization problem is solved, the decision function in the dual form is given by

f(x) = \mathrm{sgn}\left(\sum_{i=1}^{N} y_i w_i K(x_i, x) + b\right)    (6)

where K(x_i, x) = \phi(x_i)^T \phi(x) is the standard RBF kernel function and y_i \in \{1, -1\} is the label space.
Fig. 3. Image representation of the proposed dual-classification. The PFDS is split into two regions on the basis of the orientation of pairwise images. For each region a classifier is trained to discriminate between positive PFDs (green dots) and negative PFDs (red dots). The decision surface is depicted in dashed blue.
Given a tracklet captured by a camera, multiple perspectives are retrieved depending on the direction changes. We thus enrich the PFDS by providing different perspectives of each person. During the classification phase, a subset of perspectives is randomly selected for each camera. The selected perspectives are processed independently, i.e., without time constraints. Given a test PFD x̂, we first compute the orientation distance as in eq. (4); then, we input the PFD to the classifier trained for the PFD group to which the test PFD belongs. The probability given by the classifier is then used to tell whether the images used to compute the PFD belong to the same person, f(x̂) = 1 (positive pair), or to different persons, f(x̂) = -1 (negative pair). As multiple PFDs can be computed for a single person, the final probability is computed as the mean of the probabilities of all PFDs that belong to the same person.
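The dual classification can be sketched as follows (a simplified illustration under our own assumptions, not the authors' implementation): the orientation distance of eq. (4) routes each PFD either to an RBF-SVM trained on similar-orientation pairs or to one trained on all other pairs, and the match probabilities of the multiple PFDs available for a candidate are averaged as described above.

```python
import numpy as np
from sklearn.svm import SVC

TH_THETA = np.pi / 4   # orientation threshold reported in Sec. IV-A

def orientation_distance(theta_a, theta_b):
    """Eq. (4): angular distance between the two person orientations."""
    return np.arccos(np.clip(np.cos(theta_a) * np.cos(theta_b)
                             + np.sin(theta_a) * np.sin(theta_b), -1.0, 1.0))

class DualClassifier:
    """One RBF-SVM per PFDS region (similar vs. different orientations)."""

    def __init__(self, C=1.0, gamma='scale'):
        self.svm_similar = SVC(C=C, gamma=gamma, probability=True)
        self.svm_different = SVC(C=C, gamma=gamma, probability=True)

    def _select(self, d_theta):
        return self.svm_similar if d_theta < TH_THETA else self.svm_different

    def fit(self, pfds, d_thetas, labels):
        """pfds: NxD array of PFDs, d_thetas: eq.(4) distances, labels: +/-1."""
        sim = d_thetas < TH_THETA
        self.svm_similar.fit(pfds[sim], labels[sim])
        self.svm_different.fit(pfds[~sim], labels[~sim])
        return self

    def match_probability(self, pfds, d_thetas):
        """Mean positive-pair probability over all PFDs of one candidate."""
        probs = []
        for x, d in zip(pfds, d_thetas):
            clf = self._select(d)
            pos = list(clf.classes_).index(1)
            probs.append(clf.predict_proba(x.reshape(1, -1))[0, pos])
        return float(np.mean(probs))
```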
IV. EXPERIMENTAL RESULTS

In this section we report the performance of the proposed method on two different benchmark datasets. While many state-of-the-art methods use other datasets such as VIPeR, CAVIAR4REID, ETHZ, WARD, etc., none of them comes with the information required to recover the people orientation from each camera. Despite this, the two datasets we have used contain footage acquired by a very high number of cameras and are representative of real indoor and outdoor surveillance scenarios. They also come with very strong illumination changes, occlusions, viewpoint variations, etc. To compare the performance with state-of-the-art methods, we report the results in terms of the Cumulative Matching Characteristic (CMC) curve and the proportion of uncertainty removed (PUR) [13]. To make a fair comparison, for each experiment we run 10 different trials using different person IDs and different image pairs (samples).
A. Implementation Details

We have used the following settings for all the experiments presented in this section. They have been selected using 4-fold cross-validation and averaged over the two datasets to give a more general re-identification framework. Images from both datasets have been resized to 64 × 128 pixels. Color histograms have been computed for the HSV and CIELab color spaces using n_c = [36, 25, 20] and n_c = [20, 32, 32] bins, respectively. We used Gabor filters at 8 orientations and 5 scales, the standard 13 Schmid filters and the LM filters described in section III. For each filter response, histograms with 16 bins have been computed. PHOG features have been extracted using 4 levels and 9 bins. The χ² distance has been used to compute all the feature distances. The Th_θ parameter has been set to π/4 radians. The SVM parameters have been separately estimated for each dataset using 4-fold cross-validation. Finally, for each experiment we select 1 positive pair and 1 negative pair for each person in the training set, while for testing we randomly select 10 positive and 10 negative pairs per person.
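For reference, the settings listed above can be collected into a single configuration. The dictionary below is only a convenient summary (the field names are ours), mirroring the values reported in this subsection.

```python
# Summary of the experimental settings reported above (field names are ours)
CONFIG = {
    'image_size': (64, 128),                 # width x height in pixels
    'hsv_bins': [36, 25, 20],
    'lab_bins': [20, 32, 32],
    'gabor': {'orientations': 8, 'scales': 5},
    'schmid_filters': 13,
    'filter_hist_bins': 16,
    'phog': {'levels': 4, 'bins': 9},
    'feature_distance': 'chi-square',
    'th_theta_rad': 3.141592653589793 / 4,   # orientation threshold
    'train_pairs_per_person': {'positive': 1, 'negative': 1},
    'test_pairs_per_person': {'positive': 10, 'negative': 10},
}
```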
B. 3DPeS Dataset

The 3DPeS dataset (available at http://www.openvisor.org/3dpes.asp) has been proposed in [18]. It contains different sequences of 200 people taken from a multi-camera distributed surveillance system. There are 8 cameras, each one with different light conditions and calibration parameters, so the persons were detected multiple times with different viewpoints. Moreover, they were captured at different time instants and on different days, in clear light and in shadow areas. This results in a challenging dataset with strong variations of light conditions (see Fig. 4).
Fig. 4. Image samples from 3DPeS dataset. Each column corresponds to a pair of images of the same person captured by two different cameras.
Fig. 5. Comparisons of the proposed approach with state-of-the-art methods on the 3DPeS dataset.
R EPRESENTATIVE R ANKING VALUES AND PUR VALUE FOR EACH METHOD ON 3DP E S. Rank LMNN-R KISSME LF Proposed
1 23.03 22.94 33.43 52.60
10 55.23 62.21 69.98 82.57
25 73.44 80.74 84.80 91.00
50 88.92 93.21 95.07 96.30
PUR 21.11 25.49 34.85 45.58
in shadow areas. This results in a challenging dataset with strong variation of light conditions (see Fig. 4).
80 Recognition Percentage
TABLE I.
100
70 60 50 40
RWACN Texture Model Height Model Colour−Soft Model Culture−Colours Model Fused Model Proposed
30 20
We report the results of our method and compare them with those of LF [13], KISSME [19] and LMNN-R [11]. We follow the same experimental protocol as those works and split the dataset into a training set and a test set, each composed of 95 randomly selected persons. Fig. 5 shows the performance of our method compared to LF, KISSME and LMNN-R. Our method outperforms the others especially at low ranks, which are the most representative. We achieve 52.6% correct recognition at rank 1, while, for the same rank, only a recognition percentage of 33.43%, 22.94% and 23.03% is achieved by LF, KISSME and LMNN-R, respectively. In Table I the recognition percentages for ranks 1, 10, 25 and 50, and the PUR, are reported. Our approach is the only one that achieves a recognition percentage higher than 90% at rank score 25, and it has a PUR value that is more than twice the PUR value of LMNN-R.

C. SAIVT Dataset

The SAIVT dataset (available at https://wiki.qut.edu.au/display/saivt/SAIVT-SoftBio+Database) has been introduced in [20]. The dataset was captured by a real surveillance camera network in an uncontrolled fashion, so it provides a highly unconstrained environment in which to test person re-identification approaches.
Fig. 6. Image samples from the SAIVT dataset. Each column corresponds to a pair of images of the same person captured by two different cameras.

Fig. 7. Comparison of the proposed algorithm with state-of-the-art methods for person re-identification on the SAIVT dataset. In (a) results are reported for camera pair 3-8. In (b) results are reported for camera pair 5-8.
The 150 persons were acquired by 8 indoor cameras with non-overlapping fields of view. Some example images are shown in Fig. 6. We compare the results of our method with those reported in [20] (i.e., the Texture Model, Height Model, Culture-Colours Model, Colour-Soft Model and Fused Model) and with RWACN [17]. We adopt the same protocol used in [20] and report the performance for two camera pairs, denoted as 3-8 and 5-8. Camera pair 3-8 has 99 persons viewed from similar perspectives, while camera pair 5-8 has 103 persons acquired from very different perspectives. In Fig. 7(a) we report the results for camera pair 3-8. For this camera pair, where the views are similar, we achieve similar performance to the Fused Model and outperform all the others, RWACN included. However, when the viewpoints are dissimilar, as in camera pair 5-8, we outperform all the other methods used for comparison for ranks higher than 10 (see Fig. 7(b)). In this case our method achieves a recognition percentage higher than 90% at a rank score of 25. The same recognition percentage is reached at rank score 42 by the Fused Model, the Culture-Colours Model and the Height Model.
Fig. 8. Performance of the proposed method using different features. In (a) the results are computed for the 3DPeS dataset. In (b) and (c) the results are reported for camera pairs 3-8 and 5-8 of the SAIVT dataset, respectively.
Finally, in Fig. 8 we report the performance of our method on both the 3DPeS and the SAIVT datasets. We show the results where only a single feature is used to compute the PFD and compare them to the combination of all the features. It is worth noticing that color features are the most discriminative ones for the 3DPeS dataset. On the other hand, the same features perform worse on camera pair 3-8 of the SAIVT dataset, where shape and texture features are more discriminative. While shape features may not be very discriminative for other methods, that is not the case for our method, as it is highly related to the orientation of a person. For camera pair 5-8 all the features have similar performance. However, for all the considered datasets and camera pairs, the combination of all features is the one that achieves the best performance.
V. CONCLUSION

In this paper we have introduced a novel re-identification approach where different perspectives of a person are used to learn the transformations of pairwise feature dissimilarities across camera pairs. Towards this goal we introduced a person orientation value, recovered from short-term tracklets, that allowed us to obtain an inter-relationship between perspectives computed for different cameras. We exploited the PFDS to model the multi-modal inter-camera transformations. In particular, the PFDS has been divided into two regions: one formed by PFDs computed for images with similar orientations, the other composed of PFDs computed for images with all other possible orientations. Then, for each subspace a binary classifier has been trained to discriminate between images of the same person and images of different persons. We have shown the superior performance of our approach over state-of-the-art methods using two public benchmark datasets.

REFERENCES
[1] C. Micheloni and G. L. Foresti, "Zoom on Target While Tracking," in ICIP, vol. III, Genova, 2005, pp. 117–120.
[2] N. Martinel, C. Micheloni, C. Piciarelli, and G. L. Foresti, "Camera Selection for Adaptive Human-Computer Interface," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 44, no. 5, pp. 653–664, May 2014.
[3] C. Liu, S. Gong, and C. C. Loy, "On-the-fly Feature Importance Mining for Person Re-Identification," Pattern Recognition, 2013.
[4] S. Bąk, E. Corvée, F. Brémond, and M. Thonnat, "Boosted human re-identification using Riemannian manifolds," Image and Vision Computing, vol. 30, no. 6-7, pp. 443–452, 2012.
[5] I. Kviatkovsky, A. Adam, and E. Rivlin, "Color Invariants for Person Re-Identification," IEEE TPAMI, vol. 35, no. 7, pp. 1622–1634, 2013.
[6] R. Zhao, W. Ouyang, and X. Wang, "Unsupervised Salience Learning for Person Re-identification," in CVPR, 2013.
[7] O. Javed, K. Shafique, Z. Rasheed, and M. Shah, "Modeling inter-camera space-time and appearance relationships for tracking across non-overlapping views," CVIU, vol. 109, no. 2, pp. 146–162, 2008.
[8] F. Porikli and M. Hill, "Inter-Camera Color Calibration Using Cross-Correlation Model Function," in ICIP, 2003, pp. 133–136.
[9] T. Avraham, I. Gurvich, M. Lindenbaum, and S. Markovitch, "Learning Implicit Transfer for Person Re-identification," in ECCV, 2012, pp. 381–390.
[10] A. Datta, L. M. Brown, R. Feris, and S. Pankanti, "Appearance Modeling for Person Re-Identification using Weighted Brightness Transfer Functions," in ICPR, 2012.
[11] M. Dikmen, E. Akbas, T. Huang, and N. Ahuja, "Pedestrian recognition with a learned metric," in ACCV, 2011, pp. 501–512.
[12] W. Li, R. Zhao, and X. Wang, "Human Reidentification with Transferred Metric Learning," in ACCV, 2012, pp. 31–44.
[13] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian, "Local fisher discriminant analysis for pedestrian re-identification," in CVPR, 2013, pp. 3318–3325.
[14] G. Zhang, Y. Wang, J. Kato, T. Marutani, and M. Kenji, "Local distance comparison for multiple-shot people re-identification," in ACCV, 2013, pp. 677–690.
[15] N. Martinel, C. Micheloni, and C. Piciarelli, "Learning Pairwise Feature Dissimilarities for Person Re-Identification," in ICDSC, 2013.
[16] J. Garcia, A. Gardel, I. Bravo, J. Lazaro, M. Martinez, and D. Rodriguez, "Directional people counter based on head tracking," IEEE Transactions on Industrial Electronics, vol. 60, no. 9, pp. 3991–4000, 2013.
[17] N. Martinel and C. Micheloni, "Re-identify people in wide area camera network," in CVPRW, 2012, pp. 31–36.
[18] D. Baltieri, R. Vezzani, and R. Cucchiara, "3DPeS: 3D people dataset for surveillance and forensics," in International ACM Workshop on Multimedia Access to 3D Human Objects, 2011, pp. 59–64.
[19] M. Köstinger, M. Hirzer, P. Wohlhart, P. Roth, and H. Bischof, "Large scale metric learning from equivalence constraints," in CVPR, 2012, pp. 2288–2295.
[20] A. Bialkowski, S. Denman, P. Lucey, S. Sridharan, and C. B. Fookes, "A database for person re-identification in multi-camera surveillance networks," in DICTA, 2012, pp. 1–8.