Research Paper

A novel method for content-based image retrieval to improve the effectiveness of the bag-of-words model using a support vector machine

Journal of Information Science 1–19 © The Author(s) 2018 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/0165551518782825 journals.sagepub.com/home/jis

Amna Sarwar Department of Software Engineering, University of Engineering and Technology – Taxila, Pakistan

Zahid Mehmood Department of Software Engineering, University of Engineering and Technology – Taxila, Pakistan

Tanzila Saba College of Computer & Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia

Khurram Ashfaq Qazi Department of Software Engineering, University of Engineering and Technology – Taxila, Pakistan

Ahmed Adnan Department of Computer Science, University of Engineering and Technology – Taxila, Pakistan

Habibullah Jamal Faculty of Engineering Sciences, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, Pakistan

Abstract
Advancements in multimedia technologies have resulted in rapid growth of image databases. Retrieving images from such databases using the visual attributes of the images is a challenging task due to the close visual appearance among these attributes, which also introduces the issue of the semantic gap. In this article, we recommend a novel method based on the bag-of-words (BoW) model, which performs visual words integration of the local intensity order pattern (LIOP) feature and the local binary pattern variance (LBPV) feature to reduce the semantic gap and enhance the performance of content-based image retrieval (CBIR). The recommended method uses the LIOP and LBPV features to build two smaller visual vocabularies (one from each feature), which are integrated to build a larger visual vocabulary that also contains the complementary features of both descriptors. This design is motivated by the observation that, for efficient CBIR, a smaller visual vocabulary improves recall, while a larger visual vocabulary improves precision. The comparative analysis of the recommended method is performed on three image databases, namely, WANG-1K, WANG-1.5K and Holidays. The experimental analysis on these image databases demonstrates its robust performance as compared with recent CBIR methods.

Keywords Object classification; scene retrieval; semantic gap; sparse features; visual vocabulary

Corresponding author: Zahid Mehmood, Department of Software Engineering, University of Engineering and Technology – Taxila, Taxila 47050, Punjab, Pakistan. Email: [email protected]


1. Introduction

Over the past decade, the storage capacity of image databases has been increasing due to the accessibility of low-cost image acquisition devices, for example, digital cameras and mobile phones. Therefore, saving, searching and organising digital databases have become significant and essential for efficient content-based image retrieval (CBIR). Effective searching and retrieval utilities are essential for end users from different domains, such as medicine, education, weather forecasting, criminal investigation, advertising, social media, the web, art design and entertainment, to retrieve images efficiently from these types of image databases. Different methods for image retrieval have been developed for this purpose [1]. They fall into two categories: text-based image retrieval (TBIR) and CBIR.

TBIR was introduced in the 1970s for searching and retrieving images from image databases. In this approach, the images are described by hand using text or annotated descriptors [2]. These manually annotated descriptors are then used by a database management system (DBMS) to perform image retrieval. TBIR-based methods have some drawbacks: first, they rely on manual annotation of the images, which is time-consuming; second, the accuracy of manual annotation-based methods is affected by the differing levels of perception of individuals, and these methods are also language dependent [3,4].

CBIR was introduced in the early 1990s to overcome the complications of the TBIR methods [5]. In CBIR, the visual attributes of the images are normally described using shape, colour and texture-based local features [6]. These features are utilised along with machine learning methods to retrieve images from image databases [7]. In the past era, different systems for efficient CBIR were introduced, such as Netra [8], Virage [9] and Photobook [10].

Researchers are currently concentrating on challenging problems in CBIR in different domains such as machine learning and computer vision. A review of the most challenging issues in CBIR is presented in the work by Smeulders et al. [7]. According to Smeulders et al. [7], one of the challenges in CBIR is searching for objects from a large number of classes in the absence of an explicit training phase to select features and to tune the classifier. The semantic gap between visual contents and human semantics, the exponential growth of multimedia archives, and the variations in illumination and spatial layout are some of the main reasons that make CBIR a challenging research problem [11,12].

The focus of computer vision-based methods is to discover, characterise, understand and improve features to detect and extract local characteristics in images for efficient image classification. Many features have been designed, either learned automatically as in deep learning (DL) or handcrafted and used with classifiers such as the support vector machine (SVM), to resolve different issues, such as occlusions and variations due to scale and illumination. To make sound decisions, DL needs extensive learning data. DL applies a hierarchical artificial neural network (ANN)-based approach to automatically learn the most significant characteristics from the data. It requires high-performance hardware, such as a graphics processing unit (GPU) or a tensor processing unit (TPU), and a lot of time for training.
In DL, problems are solved on an end-to-end basis, with the network building its functionality by itself from the data available at the current time, whereas handcrafted approaches utilise small amounts of data presented by users and need features to be appropriately determined by the users. Handcrafted approaches split tasks into small pieces and then merge the received results into one conclusion, providing adequate transparency for their decisions [13]. Figure 1 shows example images whose close visual appearance among their visual attributes may decrease the performance of image retrieval.

Figure 1. Semantic gap: WANG database images of two different classes (mountains and beach) with closely similar visual attributes [14].


According to the visual attributes of the images, CBIR methods are grouped into two categories: global attributes and local attributes. Global attributes are used to retrieve images that are visually similar [5]. They capture the overall features of the image and represent an abstract level of semantic similarity between images [5,15], but they are unable to capture significant visual characteristics of the images because they are oversensitive to the location of salient objects. Local attributes, on the other hand, represent the details of the image. They describe parts or keypoints of the images, such as edges or corners, acquired via a gradient or segmentation process. During local attribute extraction, the attributes of each pixel of the image are computed by considering the attributes of its neighbourhood. The image may also be divided into small non-overlapping blocks, with attributes computed for each block to reduce computation [16].

In contrast with early years, research attention has shifted from representing the visual attributes of images using global attributes to using local attributes. As local attribute-based descriptors are invariant to scale and rotation and provide robust matching in various situations, they have many advantages compared with global attributes [12]. The recent CBIR literature shows that local attributes improve image retrieval performance as compared with global attributes [15,17,18]. Local attribute-based features such as maximally stable extremal regions (MSER) [19], speeded-up robust features (SURF) [20], histogram of oriented gradients (HOG) [21] and binary robust invariant scalable keypoints (BRISK) [22] have been used to achieve robust CBIR performance. Numerous studies have been conducted on local attributes utilised in different applications [15,17,23].

In CBIR, image retrieval based on a single feature has not yet proved effective in delivering accurate retrieval results, so different attributes of the images are integrated to increase the effectiveness of image retrieval [24,25]. This article presents a novel method that extracts local attributes of the images using the local intensity order pattern (LIOP) and local binary pattern variance (LBPV) descriptors to achieve complementary features, which also improves CBIR performance. Local attributes represent the image in the form of local patches, so they are more robust and perform better for object recognition even under substantial clutter and occlusion. The performance of the proposed method is also compared with the feature integration of the LIOP-LBPV descriptors and with single feature-based LIOP and LBPV descriptor methods, all built on the BoW model. The proposed method, which employs visual words integration, performs better than its competitor CBIR methods as well as recent CBIR methods due to the robust representation of the visual attributes of the images.

The subsequent sections of this article are organised as follows. Section 2 presents the literature review of CBIR methods. Section 3 describes the proposed method. Section 4 describes the performance evaluation parameters, the experimental results and analysis on three standard image databases (i.e. WANG-1K, WANG-1.5K and Holidays) and the running cost of the proposed method. Section 5 concludes the article and gives future directions.

2. Literature review

The first CBIR system, known as QBIC, was introduced by IBM. Later on, several feature extraction methods based on spatial layout, shape, texture and colour attributes were introduced for image retrieval. One of the challenging issues in enhancing CBIR performance is the problem of the semantic gap, which occurs during the machine learning process and the extraction of local features of the image [5]. This gap is reduced by introducing complementary feature integration-based methods as well as by using the spatial information of an image [26–28].

Yuan et al. [24] present a local descriptor that integrates SIFT and LBP to obtain a high-dimensional feature vector for every image. Two fusion models, that is, patch-level and image-level, are employed for feature fusion. For a compact representation of the high-dimensional feature vector, a clustering technique based on k-means is applied to construct a dictionary. The appropriate images are retrieved according to the semantic category of the query image and ranked based on the similarity measure. Yu et al. [25] propose a framework of feature integration using HOG and SIFT with LBP. The clustering method based on k-means is applied for clustering the data. The new features do not depend on an image segmentation process and automatically detect interest points in an image. Experimental results show that image retrieval results are improved using the feature integration of SIFT and LBP.

The properties of the image can be effectively described using the Gabor filter and a three-dimensional (3D) colour histogram, but the integration of multiple features may result in the curse of dimensionality, which in turn increases the computational cost and time of the image retrieval process. Therefore, ElAlami [29] addresses this problem by proposing a feature selection technique to extract only the relevant features. This technique uses the Gabor filter and the 3D colour histogram for extracting texture and colour features, respectively. The optimal segmentation of features is obtained by applying a genetic algorithm (GA). The most relevant features are extracted by a feature selection algorithm using preliminary and deep reduction. The main contribution of this method is to make image retrieval precise in a short time. As LBP lacks the spatial information of texture features, Xia et al. [30] present an improved version of LBP and a novel texture feature descriptor called multi-scale local spatial binary patterns (MLSBP) for CBIR.


The feature vector is constructed by computing LBP at different scales and in different directions for image retrieval. The experimental results show that MLSBP performs better than traditional CBIR techniques. Walia and Pal [31] propose a hybrid method for image retrieval in which all the local attributes of the image are combined. The shape, texture and colour attributes of the image are extracted by applying the angular radial transform (ART) and colour difference histogram (CDH) methods. The proposed system is made effective by suggesting a modification of hybrid features in the original CDH method. Tian et al. [32] propose a feature representation descriptor based on the edge orientation difference histogram (EODH). This descriptor is rotation-invariant and scale-invariant. A steerable filter and vector sum are used to acquire the main orientation of each edge pixel. The EODH and colour-SIFT descriptors are then integrated to obtain a weighted word distribution. Alkhawlani et al. [33] propose a CBIR technique based on a BoW layout, which integrates the local features SIFT and SURF. For compact feature representation, a clustering technique based on k-means is applied to construct a codebook, and the SVM is employed for the classification of semantic categories. Kaur and Verma [34] propose an image retrieval technique that relies on an enhanced SURF descriptor and also uses the SVM classifier and neural networks. The SURF descriptor is applied for image feature extraction and, for classification, the SVM is employed along with the neural networks to achieve better retrieval results. Dubey et al. [35] propose a hybrid image descriptor that is rotation and scale-invariant to make image retrieval efficient. Quantisation based on the RGB colour space is used to extract the colour features, whereas the patterns generated from a locally structuring element are assembled to extract the texture features. The colour and texture features are integrated to build the hybrid feature descriptor known as the rotation and scale-invariant hybrid descriptor (RSHD), which is robust to rotation and scaling. Feng et al. [36] propose an innovative image descriptor, known as the global correlation descriptor (GCD), to extract the colour and texture features of the image. The colour feature is characterised using a global correlation vector (GCV), whereas the texture feature is characterised using a directional global correlation vector (DGCV). Experimental results show that the GCD outperforms recent CBIR techniques. Karakasis et al. [26] propose a framework that uses affine image moment invariants as a feature descriptor for image retrieval. The moments are fed into the BoW model to produce feature vectors. The novelty of this work is that the affine moment invariants are used as local features and are extracted using the SURF feature descriptor. Zeng et al. [37] propose an image representation technique in which Gaussian mixture models (GMMs) are employed to characterise an image as a spatiogram (i.e. a histogram of colours generalised with spatial information). The expectation-maximisation (EM) technique is employed to determine the quantised colour space from the training data set. The Gaussian components, that is, the number of colour bins, are determined according to the Bayesian information criterion (BIC). The spatiogram is modified and incorporated into a quantised Gaussian mixture colour model.
Finally, a comparative analysis of the two spatiograms is performed, and the Jensen–Shannon divergence (JSD) is adopted as a new similarity measure to make image retrieval more efficient. Zhao et al. [38] propose a descriptor based on local and multi-trend structure for feature representation, called the multi-trend structure descriptor (MTSD). This descriptor integrates features such as colour, edge orientation, shape and intensity information for robust image representation. It also represents the local spatial structure information to extract image features. The experimental results demonstrate that MTSD produces discriminative results for effective CBIR. Douik et al. [39] present a hybrid method by integrating local and global features. The local features are extracted by applying the SIFT descriptor, while the global features are extracted by applying the upper-lower LBP (UL-LBP) descriptor, which is based on LBP. The features are then quantised into the BoW model to improve image retrieval performance. LBP is not suitable for colour images, as it cannot capture the similarity between colour images; it captures only textural information. Liu et al. [40] overcome this problem by proposing a descriptor with an additional colour feature, called the colour information feature (CIF), incorporated with LBP for image retrieval. Experimental results show that combining these two features yields good performance in a retrieval system. Srivastava and Khare [41] propose a CBIR method that combines LBP and the wavelet transform. In this method, the texture features of the image are extracted by computing LBP codes from the coefficients of the discrete wavelet transform (DWT). Shape features are then extracted from the texture features by computing Legendre moments of these LBP codes to construct a feature vector. The experimental results show improved performance on small image databases as compared with large image databases. Bala and Kaur [42] propose a novel descriptor for CBIR called local texton XOR patterns (LTxXORP). The LTxXORP feature vector is integrated with the feature vector of an HSV colour space-based colour histogram to improve CBIR performance. A novel method using a low-level feature called the composite moment (CoMo) for image retrieval is proposed by Vassou et al. [43]. This method combines moment invariants with the colour unit of the colour and edge directivity descriptor (CEDD), and it improves CBIR performance by overcoming rotation, translation and scaling issues.


Figure 2. The complete layout of the proposed method that employs visual words integration of the LIOP and LBPV descriptors.

3. Proposed methodology

3.1. Methodology of the traditional BoW model for image retrieval

The BoW model is adapted from the bag-of-features (BoF) model, which is used for document retrieval. In a traditional BoW model, the features of the image are extracted using feature descriptors [44]. A single visual vocabulary (a collection of the salient objects or visual attributes) is constructed by applying a quantisation algorithm such as k-means to the extracted features, which transforms the high-dimensional feature space into a low-dimensional feature space. The visual attributes of each image are then used to build an order-less histogram. Due to the formation of the order-less histogram for each image, the spatial relationship between the salient objects of the image is lost, which affects the performance of the CBIR [45,46]. The learning of the classifier is performed using these order-less histograms of the training images. Images are retrieved by selecting a sample image from the test group and evaluating the closeness between the sample image and the images stored in the image database by applying a similarity measure.

The complete layout of the proposed method, which employs visual words integration of the LIOP and LBPV descriptors, is shown in Figure 2. The detailed methodology of the proposed method is described in the subsequent sections. The proposed method improves image retrieval performance as compared with recent CBIR methods. Its first step is splitting each reported image database into training and test groups and then extracting the LIOP and LBPV features (i.e. complementary features) from each image, whose details are given in the following sub-sections.
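For illustration, the following is a minimal Python sketch of this traditional BoW pipeline using scikit-learn, not the authors' implementation. It assumes the local descriptors of each image have already been extracted into one array per image (for example, from the LIOP and LBPV descriptors described below); the function names build_vocabulary, bow_histogram, train_bow_svm and integrate_histograms are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_vocabulary(descriptor_sets, vocab_size=200, seed=0):
    # Stack the local descriptors of all training images and quantise
    # them into `vocab_size` visual words with k-means.
    all_descriptors = np.vstack(descriptor_sets)
    return KMeans(n_clusters=vocab_size, n_init=4,
                  random_state=seed).fit(all_descriptors)

def bow_histogram(descriptors, vocabulary):
    # Order-less histogram: count how often each visual word occurs in
    # one image, then L1-normalise. Spatial layout is discarded here.
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_bow_svm(descriptor_sets, labels, vocab_size=200):
    # descriptor_sets: one (n_points, dim) array per training image;
    # labels: the semantic category of each training image.
    vocabulary = build_vocabulary(descriptor_sets, vocab_size)
    X = np.array([bow_histogram(d, vocabulary) for d in descriptor_sets])
    classifier = LinearSVC().fit(X, labels)
    return vocabulary, classifier

def integrate_histograms(hist_liop, hist_lbpv):
    # Visual words integration idea from the abstract: concatenate the
    # histograms built from the two smaller (LIOP and LBPV) vocabularies
    # into one larger, complementary representation.
    return np.concatenate([hist_liop, hist_lbpv])
```

At retrieval time, the query image's histogram would be compared with the stored database histograms (for example, with a cosine or chi-squared measure) to rank the results.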

3.2. Extraction of the LIOP features

The LIOP descriptor [47] is selected for feature extraction in the proposed method because the relative order of pixel intensities remains unchanged even when monotonic intensity changes occur within an image, which would otherwise affect the performance of the CBIR. It considers the intensities of all sample points to describe the local intensity relationships of an image. The overall intensity order is used to divide the local patch into sub-regions, called ordinal bins. A LIOP is then derived for each point as a binary vector based on the relationship between the intensities of its adjacent sample points. The binary LIOPs of the points are accumulated in each ordinal bin and concatenated together to form the LIOP descriptor. This descriptor is highly discriminative because it encodes both the local and the global intensity order information of each local patch. The computation of the LIOP on an image pixel x is defined as follows:

\mathrm{LIOP}(x) = \varphi(\gamma(P(x))) = V_{N!}^{\mathrm{Ind}(\gamma(P(x)))} = (0, \ldots, 0, 1, 0, \ldots, 0) \quad (1)

where the single 1 stands at position \mathrm{Ind}(\gamma(P(x))) and P(x) = (I(x_1), I(x_2), \ldots, I(x_N)) \in P^N. Here, P^N is the set of N-dimensional vectors and I(x_i) denotes the intensity of the i-th neighbouring sampling point x_i. An index table of all possible permutations is used to encode the partitions of P^N in \Pi_N, as there is a one-to-one correspondence between the partitions and the permutations. The feature mapping function \varphi maps a permutation \pi to an N!-dimensional feature vector V_{N!}^{\mathrm{Ind}(\pi)}, which is defined as follows:

\varphi(\pi) = V_{N!}^{\mathrm{Ind}(\pi)}, \quad \pi \in \Pi_N \quad (2)

All the elements of this vector are 0 except the \mathrm{Ind}(\pi)-th element, which is 1, as shown in equation (1). \mathrm{Ind}(\pi) denotes the index of \pi in the index table. The LIOPs of the points in each ordinal bin are accumulated, respectively, and then concatenated together to build the LIOP descriptor, which is mathematically defined as follows:

\mathrm{LIOP\ descriptor} = (des_1, des_2, \ldots, des_B) \quad (3)

des_i = \sum_{x \in bin_i} \mathrm{LIOP}(x) \quad (4)

where des_i represents the descriptor computed on the i-th sub-region, bin_i represents the i-th sub-region and B represents the total number of ordinal bins.
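To make equations (1)–(4) concrete, here is a simplified, illustrative Python sketch of the intensity-order pooling idea; it is not the authors' or the original LIOP implementation. It skips patch normalisation, tie handling and bilinear sampling, uses quantile-based ordinal bins as an assumption, and the function name liop_descriptor is hypothetical. With N = 4 neighbours and B = 6 bins it yields the standard 144-dimensional descriptor.

```python
import numpy as np
from itertools import permutations

def liop_descriptor(patch, n_neighbours=4, n_bins=6):
    # Index table of all N! permutations: the one-to-one mapping behind
    # Ind(.) in equations (1) and (2).
    index = {p: i for i, p in enumerate(permutations(range(n_neighbours)))}
    h, w = patch.shape
    radius = 1.5
    angles = 2.0 * np.pi * np.arange(n_neighbours) / n_neighbours
    # Ordinal bins: partition the patch pixels into B sub-regions by
    # their overall intensity order (quantiles used here for simplicity).
    inner = patch[2:h - 2, 2:w - 2].ravel()
    edges = np.quantile(inner, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    descriptor = np.zeros((n_bins, len(index)))
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            # Nearest-neighbour sampling of the circular neighbours
            # (the real descriptor uses bilinear interpolation).
            ys = np.rint(y + radius * np.sin(angles)).astype(int)
            xs = np.rint(x + radius * np.cos(angles)).astype(int)
            perm = tuple(np.argsort(patch[ys, xs]))       # gamma(P(x))
            b = int(np.searchsorted(edges, patch[y, x]))  # ordinal bin of x
            descriptor[b, index[perm]] += 1.0             # one-hot, eq. (4)
    descriptor = descriptor.ravel()                       # eq. (3)
    return descriptor / max(np.linalg.norm(descriptor), 1e-12)
```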

3.3. Extraction of the LBPV features

A drawback of local invariant features such as LBP is that they do not preserve global spatial information, while global features do not capture local texture information. The LBPV descriptor is proposed to address this issue, as it characterises the local contrast information [48]. Usually, the variance is high in regions with higher frequency texture, and the contribution of such regions is significant for the discrimination of texture images. The LBPV feature vector is generated by embedding the variance of the local region of the image; that is, the variance is added as a weight when the LBP histogram is accumulated. Although the feature dimensionality of the LBPV is the same as that of the LBP, additional contrast information is embedded into the LBPV feature vector. The mathematical representation of the LBPV descriptor is as follows:

\mathrm{LBPV}_{I,J}(q) = \sum_{s=1}^{N} \sum_{t=1}^{M} w(\mathrm{LBP}_{I,J}(s,t), q), \quad q \in [0, Q] \quad (5)

and

w(\mathrm{LBP}_{I,J}(s,t), q) = \begin{cases} \mathrm{VAR}_{I,J}(s,t) & \mathrm{LBP}_{I,J}(s,t) = q \\ 0 & \text{otherwise} \end{cases} \quad (6)

and

\mathrm{VAR}_{I,J} = \frac{1}{I} \sum_{r=0}^{I-1} (f_r - v)^2 \quad (7)

where v = \frac{1}{I} \sum_{r=0}^{I-1} f_r, N \times M is the image size, I is the number of neighbouring pixels, J is the neighbourhood radius and Q is the maximal pattern value. For a given central pixel f_c in the image, its value is compared with the pixels f_r in its neighbourhood and a pattern number is computed by applying the following equation:

\mathrm{LBP}_{I,J} = \sum_{r=0}^{I-1} m(f_r - f_c)\, 2^r, \quad m(a) = \begin{cases} 1 & a \geq 0 \\ 0 & a < 0 \end{cases} \quad (8)
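As an illustration of equations (5)–(8), here is a minimal NumPy sketch of a variance-weighted LBP histogram. It is a simplification under stated assumptions rather than the implementation used in the article: it samples the circular neighbourhood with nearest-neighbour offsets instead of interpolation and bins the raw 2^I pattern codes instead of the rotation-invariant uniform mapping normally used by LBPV [48]; the function name lbpv_histogram is hypothetical.

```python
import numpy as np

def lbpv_histogram(image, I=8, J=1):
    # Equations (5)-(8) for a grayscale image.
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    angles = 2.0 * np.pi * np.arange(I) / I
    dy = np.rint(J * np.sin(angles)).astype(int)
    dx = np.rint(J * np.cos(angles)).astype(int)
    centre = img[J:h - J, J:w - J]                    # f_c for every pixel
    neighbours = np.stack([img[J + y:h - J + y, J + x:w - J + x]
                           for y, x in zip(dy, dx)])  # f_0 .. f_{I-1}
    # Equation (8): threshold the neighbours against the centre and
    # weight by powers of two to obtain each pixel's LBP pattern number.
    codes = ((neighbours >= centre).astype(np.int64)
             * (2 ** np.arange(I, dtype=np.int64))[:, None, None]).sum(axis=0)
    # Equation (7): local variance of the I neighbouring intensities.
    variance = neighbours.var(axis=0)
    # Equations (5)-(6): accumulate the variance (not a unit count) of
    # each pixel into the histogram bin of its LBP code.
    hist = np.bincount(codes.ravel(), weights=variance.ravel(),
                       minlength=2 ** I)
    return hist / max(hist.sum(), 1e-12)
```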