Locality-constrained and Spatially Regularized Coding for Scene ...

Locality-constrained and Spatially Regularized Coding for Scene Categorization

Aymen Shabou, Hervé Le Borgne
CEA, LIST, Vision & Content Engineering Laboratory, Gif-sur-Yvettes, France

Abstract

Improving coding and spatial pooling for bag-of-words based feature design has gained a lot of attention in recent works addressing object recognition and scene classification. Regarding the coding step in particular, properties such as sparsity, locality and saliency have been investigated. The main contribution of this work consists in taking into account the local spatial context of an image in the usual coding strategies proposed in the state of the art. For this purpose, given an image, dense local features are extracted and structured in a lattice. The latter is endowed with a neighborhood system and pairwise interactions. We propose a new objective function to encode local features, which preserves locality constraints both in the feature space and in the spatial domain of the image. In addition, an appropriate efficient optimization algorithm is provided, inspired by the graph-cut framework. In conjunction with the max-pooling operation and spatial pyramid matching, which reflects a global spatial layout, the proposed method improves the performance of several state-of-the-art coding schemes for scene classification on three publicly available benchmarks (UIUC 8-sport, Scene15 and Caltech-101).

1. Introduction

In recent works addressing object recognition and scene classification tasks, the bag-of-words (BoW) is one of the most popular models for feature design. Inspired by the seminal work of [26], different approaches have been proposed to improve both its generative ability to describe images accurately and its discriminative power for classification. Despite remarkable progress, challenges remain concerning the extraction of local descriptors, codebook design, local descriptor coding and pooling, the inclusion of a spatial layout in the final feature, and the final classification. Given a training dataset, the first step of the BoW method consists in extracting local features, such as SIFT [21], HOG [8] and SURF [1], from images. Then a codebook (or

[Figure 1: two-panel schematic. Legend: Xp, Xq and Xr are local features in the spatial domain of dense local features; 0.9 is the similarity between Xp and Xq, and 0.7 is the similarity between Xp and Xr; markers distinguish the visual words of the codebook from the selected visual words. Panel 1: locality-constrained assignment. Panel 2: locality-constrained and spatially regularized assignment.]

Figure 1. Schematic comparison of basis selection methods to code dense descriptors. The first configuration is the one adopted by some recent coding approaches [30, 12, 20]. The second configuration corresponds to the proposed LCSR method.

a dictionary), which is a set of visual words, is built to represent them. Early methods are based on clustering techniques such as K-means [26]. Although efficient, the resulting codebooks suffer from several drawbacks, such as distortion errors and low discriminative ability [14, 28]. A more appropriate unsupervised dictionary learning method is sparse coding, which aims to learn an over-complete codebook ensuring a sparse representation of local descriptors [23, 22]. However, this approach is computationally expensive, even though progress has been made toward accelerating the process [16]. Other approaches rely on supervised methods to improve the discriminative power of the codebook while making it more compact [14, 22, 2]. However, the recent works of [6, 24] show that, for the recognition task, codebook design is less critical than the subsequent stages (coding, pooling and spatial layout).
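As a rough illustration of the clustering-based codebook construction mentioned above, the following sketch runs plain Lloyd-style K-means on toy descriptors. It is a minimal example under stated assumptions, not the procedure used in this paper: the function names (`kmeans_codebook`, `nearest_word`, `sq_dist`), the random initialization, and the 2-D toy data are all hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, codebook):
    """Index of the visual word closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda i: sq_dist(descriptor, codebook[i]))


def kmeans_codebook(descriptors, n_words, n_iters=20, seed=0):
    """Plain Lloyd iterations (assumed setup); returns n_words centroids."""
    rng = random.Random(seed)
    # Assumption: initialize the visual words with randomly drawn descriptors.
    codebook = [list(d) for d in rng.sample(descriptors, n_words)]
    for _ in range(n_iters):
        clusters = [[] for _ in range(n_words)]
        for d in descriptors:
            clusters[nearest_word(d, codebook)].append(d)
        for i, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                dim = len(members[0])
                codebook[i] = [sum(m[j] for m in members) / len(members)
                               for j in range(dim)]
    return codebook


# Toy 2-D "descriptors" forming two well-separated groups.
toy = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
words = kmeans_codebook(toy, n_words=2)
```

Real local descriptors such as SIFT are high-dimensional and far more numerous, so a practical implementation would rely on an optimized library rather than this pure-Python loop.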

Coding consists in decomposing local features over a codebook in order to satisfy some desirable properties. Various strategies have been proposed in the literature. The earliest one is hard coding [26], a voting scheme that is simple yet highly sensitive to reconstruction errors induced by the codebook. A more robust voting approach is soft coding [28], which assigns a descriptor to all the visual words according to their distances. Sparse coding is an alternative [32], but it is time consuming and, moreover, inconsistent when encoding similar descriptors [30, 11]. The authors of [33] introduced another coding property, called locality, that ensures sparsity while remaining efficient. Several implementations have been proposed by [30, 20], where each descriptor is coded on locally selected bases. Note also that in [12], the authors give another explanation for the success of locality coding, namely saliency. Indeed, for a given descriptor and its corresponding local bases, the closer the nearest visual word is to the descriptor in comparison to the remaining local bases, the stronger its coding response should be.

The next step of BoW design is pooling the obtained codes to obtain a compact signature. Usually, the max-pooling operation is used, leading to signatures that are appropriate for linear classifiers [32, 3, 2, 20]. Finally, the Spatial Pyramid Matching (SPM) step, proposed in [15], is usually exploited to include some spatial layout information in the BoW. Such vectors of fixed size can then feed a machine learning algorithm such as SVM [7] or Boosting [25].

In the current work, the local feature coding step is investigated. While several techniques have outperformed the classic hard assignment by introducing either locality or similarity constraints in the feature space [30, 20, 12, 11], we propose a new formalism that implicitly preserves these properties while adding local contextual information from the spatial domain of the image.
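To make the baseline coding and pooling schemes reviewed above more concrete, here is a minimal sketch of hard coding, soft coding, a locality-restricted variant of soft coding, and max-pooling. Several details are assumptions made for illustration only: the Gaussian kernel with parameter `beta`, the normalization of the soft codes, the neighborhood size `k`, and all function names. The cited works may define these quantities differently.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_code(x, codebook):
    """Hard coding: a single vote for the nearest visual word."""
    nearest = min(range(len(codebook)), key=lambda i: sq_dist(x, codebook[i]))
    return [1.0 if i == nearest else 0.0 for i in range(len(codebook))]


def soft_code(x, codebook, beta=1.0):
    """Soft coding: distance-based weights over ALL visual words.
    Assumption: Gaussian kernel weights, normalized to sum to 1."""
    weights = [math.exp(-beta * sq_dist(x, b)) for b in codebook]
    total = sum(weights)
    return [w / total for w in weights]


def local_soft_code(x, codebook, k=2, beta=1.0):
    """Locality: soft coding restricted to the k nearest visual words."""
    order = sorted(range(len(codebook)), key=lambda i: sq_dist(x, codebook[i]))
    keep = set(order[:k])
    weights = [math.exp(-beta * sq_dist(x, b)) if i in keep else 0.0
               for i, b in enumerate(codebook)]
    total = sum(weights)
    return [w / total for w in weights]


def max_pool(codes):
    """Max-pooling: component-wise maximum over the codes of a region."""
    return [max(column) for column in zip(*codes)]


# Toy codebook of three visual words and two descriptors (hypothetical data).
codebook = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1)]
signature = max_pool([local_soft_code(x, codebook) for x in descriptors])
```

With spatial pyramid matching, `max_pool` would be applied separately to the codes falling in each pyramid cell and the resulting vectors concatenated.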
Figure 1 shows a schematic comparison. The proposed coding approach is divided into two steps.

1. The first step is an optimal basis selection for each local feature, formulated as a labeling problem. For this purpose, we introduce a novel objective function that includes locality and similarity (or coherency) constraints in both the feature space and the spatial domain of the image. Furthermore, we provide an appropriate efficient optimization algorithm, called αknn expansion, which is inspired by the fast optimization tools dedicated to Markov Random Field (MRF) based energy minimization [4].

2. The second step consists in assigning responses (or values) to the selected optimal bases.

This new approach enriches the BoW signature, leading to more accurate features for classification than the state-of-the-art methods. Furthermore, it is generic and can thus be added to several recent coding strategies.

The remainder of this paper is organized as follows. Section 2 discusses related work on the coding step within the BoW feature generation framework. The new coding strategy is introduced in Section 3. Section 4 presents experimental studies and results on the following benchmarks: UIUC 8-sport [17] and Scene15 [15] for event and scene classification respectively, and Caltech-101 [9] for object recognition.
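The idea behind the basis selection step (step 1 above) can be sketched on a toy example. The objective function is not given in this section, so the energy below is a hypothetical stand-in: a data term (squared distance to a visual word chosen among the `k` nearest ones) plus a weighted Potts-style penalty when spatial neighbors select different words. The optimizer is simple iterated conditional modes (ICM), swapped in for the αknn expansion, which is not reproduced here. All names and parameters (`select_bases`, `lam`, `edges`) are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_words(x, codebook, k):
    """Indices of the k visual words nearest to descriptor x."""
    order = sorted(range(len(codebook)), key=lambda i: sq_dist(x, codebook[i]))
    return order[:k]


def select_bases(features, codebook, edges, k=2, lam=1.0, n_sweeps=10):
    """Pick one visual word per feature with ICM on a hypothetical energy.

    features: descriptors located on the image lattice.
    edges: {(p, q): similarity weight} for pairs of spatial neighbors.
    """
    candidates = [knn_words(x, codebook, k) for x in features]
    labels = [c[0] for c in candidates]  # start from the nearest word
    neighbors = {p: [] for p in range(len(features))}
    for (p, q), w in edges.items():
        neighbors[p].append((q, w))
        neighbors[q].append((p, w))

    def local_energy(p, label):
        data = sq_dist(features[p], codebook[label])
        # Weighted penalty for disagreeing with spatial neighbors.
        smooth = sum(w for q, w in neighbors[p] if labels[q] != label)
        return data + lam * smooth

    for _ in range(n_sweeps):
        changed = False
        for p in range(len(features)):
            best = min(candidates[p], key=lambda l: local_energy(p, l))
            if best != labels[p]:
                labels[p], changed = best, True
        if not changed:
            break
    return labels


# Toy setup loosely mirroring Figure 1: Xp, Xq, Xr with similarities 0.9, 0.7.
codebook = [(0.0, 0.0), (1.0, 0.0)]
features = [(0.45, 0.0), (0.55, 0.0), (0.6, 0.0)]  # Xp, Xq, Xr
edges = {(0, 1): 0.9, (0, 2): 0.7}
independent = select_bases(features, codebook, edges, lam=0.0)
regularized = select_bases(features, codebook, edges, lam=1.0)
```

In this toy case, Xp sits slightly closer to the first word while its neighbors Xq and Xr prefer the second. Without the pairwise term Xp keeps its own nearest word; with it, Xp is pulled toward the word selected by its similar neighbors. ICM only reaches a local minimum, which is one reason graph-cut style moves are generally preferred for this kind of energy.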

2. Related work

Let us consider a codebook denoted by B = {bi ; bi ∈
