Locality-constrained and Spatially Regularized Coding for Scene Categorization
Aymen Shabou, Hervé Le Borgne
CEA, LIST, Vision & Content Engineering Laboratory, Gif-sur-Yvettes, France
[email protected]
[email protected]
Abstract
Improving coding and spatial pooling for bag-of-words based feature design has gained a lot of attention in recent works addressing object recognition and scene classification. Regarding the coding step in particular, properties such as sparsity, locality and saliency have been investigated. The main contribution of this work consists in taking into account the local spatial context of an image in the usual coding strategies proposed in the state of the art. For this purpose, given an image, dense local features are extracted and structured in a lattice. The latter is endowed with a neighborhood system and pairwise interactions. We propose a new objective function to encode local features, which preserves locality constraints both in the feature space and the spatial domain of the image. In addition, an appropriate efficient optimization algorithm is provided, inspired by the graph-cut framework. In conjunction with the max-pooling operation and spatial pyramid matching, which reflects a global spatial layout, the proposed method improves the performance of several state-of-the-art coding schemes for scene classification on three publicly available benchmarks (UIUC 8-sport, Scene15 and Caltech-101).
1. Introduction
In recent works addressing object recognition and scene classification tasks, the bag-of-words (BoW) model is one of the most popular for feature design. Inspired by the seminal work of [26], different approaches have been proposed to improve both its generative property, to describe images accurately, and its discriminative power for classification. Despite remarkable progress, challenges remain concerning the extraction of local descriptors, codebook design, the coding and pooling of local descriptors, the inclusion of a spatial layout in the final feature, and the final classification. Given a training dataset, the first step of the BoW method consists in extracting local features, such as SIFT [21], HOG [8] and SURF [1], from images. Then a codebook (or
[Figure 1 panels: (1) Locality-constrained assignment; (2) Locality-constrained and spatially regularized assignment. Each panel shows the local features Xp, Xq, Xr in the spatial domain of dense local features and a codebook in which visual words and selected visual words are marked. Similarity between Xp and Xq: 0.9; similarity between Xp and Xr: 0.7.]
Figure 1. Schematic comparison of basis selection methods to code dense descriptors. The first configuration is the one adopted by some recent coding approaches [30, 12, 20]. The second configuration corresponds to the proposed LCSR method.
a dictionary), which is a set of visual words, is built to represent them. Initial methods are based on clustering techniques, such as K-means [26]. Despite their efficiency, the obtained codebooks suffer from several drawbacks such as distortion errors and low discriminative ability [14, 28]. A more appropriate unsupervised dictionary learning method is sparse coding which aims to learn an over-complete codebook ensuring sparse representation of local descriptors [23, 22]. However, this approach is computationally expensive even if progress was made toward accelerating the process [16]. Other approaches have rather attempted to improve the discriminative power of the codebook while compacting it relying on supervised methods [14, 22, 2]. However, recent works of [6, 24] show that, for the recognition task, codebook design is less critical than the next stages (coding, pooling and spatial layout).
Coding consists in decomposing local features over a codebook in order to satisfy some desirable properties. Various strategies have been proposed in the literature. The earliest one is hard coding [26], a voting scheme that is simple yet highly sensitive to reconstruction errors induced by the codebook. A more robust voting approach is soft coding [28], which assigns a descriptor to all the visual words according to their distances. Sparse coding [32] is an alternative, but it is time consuming and, moreover, inconsistent when encoding similar descriptors [30, 11]. The authors of [33] introduced another coding property, called locality, that ensures sparsity while remaining efficient. Several implementations have been proposed by [30, 20], where each descriptor is coded on locally selected bases. Note also that the authors of [12] give another explanation for the success of locality coding, namely saliency: for a given descriptor and its corresponding local bases, the closer the nearest visual word is to the descriptor in comparison with the remaining local bases, the stronger its coding response should be.

The next step of BoW design is pooling the obtained codes to obtain a compact signature. Usually, the max-pooling operation is used, leading to signatures that are appropriate for linear classifiers [32, 3, 2, 20]. Finally, the Spatial Pyramid Matching (SPM) step, proposed in [15], is usually exploited to include some spatial layout information in the BoW. Such vectors of fixed size can then feed a machine learning algorithm such as SVM [7] or Boosting [25].

In the current work, the local feature coding step is investigated. While several techniques have outperformed the classic hard assignment by introducing either the locality or the similarity constraints in the feature space [30, 20, 12, 11], we propose a new formalism that implicitly preserves these properties while adding the local contextual information from the spatial domain of the image.
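As an illustration of the existing coding and pooling steps reviewed above (not of the proposed method), the following minimal sketch contrasts hard coding, soft coding, a locality-restricted variant and max-pooling. It uses only the Python standard library so that it stays self-contained. The function names, the Gaussian-like kernel exp(-beta * d^2), the parameters `beta` and `k`, and the toy codebook are assumptions made for this example; the cited works may use different weighting functions and normalizations.

```python
import math

def sq_dist(x, b):
    """Squared Euclidean distance between a descriptor x and a visual word b."""
    return sum((xi - bi) ** 2 for xi, bi in zip(x, b))

def hard_code(x, codebook):
    """Hard coding: a single vote for the nearest visual word."""
    dists = [sq_dist(x, b) for b in codebook]
    nearest = dists.index(min(dists))
    return [1.0 if i == nearest else 0.0 for i in range(len(codebook))]

def soft_code(x, codebook, beta=1.0, k=None):
    """Soft coding with an assumed exp(-beta * d^2) kernel, normalized to sum to 1.

    If k is given, only the k nearest visual words receive a non-zero
    response (a simple locality restriction); otherwise all words are used.
    """
    dists = [sq_dist(x, b) for b in codebook]
    order = sorted(range(len(codebook)), key=lambda i: dists[i])
    kept = order if k is None else order[:k]
    weights = [0.0] * len(codebook)
    for i in kept:
        weights[i] = math.exp(-beta * dists[i])
    total = sum(weights)
    return [w / total for w in weights]

def max_pool(codes):
    """Max-pooling: for each visual word, keep the largest response over all descriptors."""
    return [max(column) for column in zip(*codes)]

# Toy data (hypothetical): a 3-word codebook and two 2-D descriptors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 0.2]]

codes = [soft_code(x, codebook, beta=2.0, k=2) for x in descriptors]
signature = max_pool(codes)  # one value per visual word
```

In this sketch, the locality restriction is what makes the codes sparse: each descriptor responds to at most `k` visual words, and max-pooling then yields a signature whose length equals the codebook size.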
Figure 1 shows a schematic comparison. The proposed coding approach is divided into two steps.
1. The first step is an optimal basis selection for each local feature, formulated as a labeling problem. For this purpose, we introduce a novel objective function that includes locality and similarity (or coherency) constraints in both the feature space and the spatial domain of the image. Furthermore, we provide an appropriate efficient optimization algorithm, called αknn expansion, which is inspired by the fast optimization tools dedicated to Markov Random Field (MRF) based energy minimization [4].
2. The second step consists in assigning responses (or values) to the selected optimal bases.
This new approach enriches the BoW signature, leading to more accurate features for classification than the state-of-the-art methods. Furthermore, it is generic and can thus be added to several recent coding strategies.

The remainder of this paper is organized as follows. In section 2, related work on the coding step within the BoW feature generation framework is discussed. The new coding strategy is introduced in section 3. Section 4 presents experimental studies and results on the following benchmarks: UIUC 8-sport [17] and Scene15 [15] for event and scene classification respectively, and Caltech-101 [9] for object recognition.
2. Related work
Let us consider a codebook denoted by B = {bi ; bi ∈