
Object Detection with DoG Scale-Space: A Multiple Kernel Learning Approach

Sharmin Nilufar, Nilanjan Ray, and Hong Zhang

Abstract—Difference of Gaussians (DoG) scale-space for an image is a significant way to generate features for object detection and classification. When applying DoG scale-space features to object detection/classification, we face two inevitable issues: dealing with high dimensional data and selecting/weighting the proper scales. The scale selection process is mostly ad hoc to date. In this paper, we propose a multiple kernel learning (MKL) method for both DoG scale selection/weighting and dealing with high dimensional scale-space data. We design a novel shift invariant kernel function for the DoG scale-space. To select only the useful scales in the DoG scale-space, a novel MKL framework is also proposed. We utilize a 1-norm support vector machine (SVM) in the MKL optimization problem for sparse weighting of scales from the DoG scale-space. The optimized data-dependent kernel accommodates only a few scales that are most discriminatory according to the large margin principle. With a 2-norm SVM, this learned kernel is applied to a challenging detection problem in oil sand mining: detecting large lumps in oil sand videos. We tested our method on several challenging oil sand data sets. Our method yields encouraging results on these difficult-to-process images and compares favorably against other popular multiple kernel methods.

Index Terms—DoG scale-space, circular convolution, multiple kernel learning, 1-norm support vector machine.

I. INTRODUCTION

THE concept of scale-space was introduced to the image analysis community by Witkin in 1983 [3]. The main idea in Witkin's work is that important signal features persist through relatively coarse scales. In particular, regions that appear to stand out from their surroundings in the original image tend to be further enhanced within the scale-space. Scale-space analysis typically consists of applying filters with different scaling parameters to an image to obtain a number of output images. The size of the object we are looking for, as well as its texture content, is related to these multi-scale representations. In this paper, we investigate the difference of Gaussians (DoG) scale-space as the source of features for object detection in videos. The DoG filter provides a close approximation to the scale-normalized Laplacian of Gaussian (LoG) filter, and it is computationally more efficient than the LoG filter. DoG has been applied widely in computer vision and image analysis to obtain regions of interest [1], [2]. These regions could represent the presence of objects or parts of objects in the image domain, with application to object recognition and/or object tracking. For specific classification tasks, such as object detection, it is neither practical nor profitable to use the entire DoG scale-space: first, DoG scale-space features have enormously large dimensions, and second, the scale-space features are discriminatory, i.e., useful in classification, only for a handful of scales. On the other hand, the DoG response at a single scale is usually not powerful enough, and may in fact lead to severe noise sensitivity in classification. Thus, it is important to rely on a few discriminating scales for object detection. In this paper, we map the DoG scale selection and weighting problem to multiple kernel learning (MKL). In addition to scale selection/weighting, the MKL framework deals naturally with the high dimensionality of scale-space features via a kernel function. Towards this end, we have designed a shift-invariant circular convolution kernel function for the DoG scale-space. To select the appropriate scales, we have also proposed a novel MKL technique for sparse weighting of scales. The proposed MKL utilizes a 1-norm linear support vector machine (SVM), and accordingly we refer to our method as linear multiple kernel learning (LMKL). Once the kernel function is learned with LMKL, it can be used with a 2-norm SVM classifier to detect objects.

We apply the proposed MKL-based technique for the detection of frozen, large lumps of oil sand in mining videos [4]. One of the major difficulties in the oil sand mining process is the presence of large lumps, viz., huge chunks of rock material or collections of smaller particles frozen together in cold winter. These large lumps can block the crushers, resulting in significant production downtime, and as the jamming needs to be cleared, the downtime translates directly into economic losses. The traditional approach of visual inspection by a human operator for large lump detection is tedious and expensive. Thus, it is important to design an effective and reliable automatic detection system. Oil sand images pose considerable challenges for automated processing due to poor lighting, varying weather, and harsh outdoor imaging conditions [4]. Moreover, the shape, size, and texture of oil sands show significant variation. Figure 1 shows some typical images of large lumps in oil sand and Figure 2 shows images with no significant large lumps. As a novel and significant application of the proposed LMKL method, we consider the binary classification problem of large lump vs. no large lump using the DoG scale-space.

S. Nilufar, N. Ray and H. Zhang are with the Department of Computing Science, University of Alberta, Alberta, Canada T6G 2E8. Emails: (sharmin, nray1, hzhang)@ualberta.ca
Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].


Fig. 1. Examples of oil sand images containing large lumps.

Fig. 2. Examples of oil sand images with no large lump.

Fig. 3. Gaussians and their difference in one dimension, and the corresponding DoG in two dimensions.

II. BACKGROUND

In this section, we provide background information on two indispensable components of the proposed method, viz., scale-space features and multiple kernel function learning. In addition, we provide a brief literature survey on oil sand mining image analysis.

A. DoG Scale-Space

The scale-space framework has been developed to handle the multi-scale nature of image data. There are at least two important advantages of the scale-space approach. First, a scale-space representation allows multiple interpretations of the data, from fine details to high-level descriptions of the overall structure of the image content. Second, the scale-space approach provides the flexibility to select a scale, or a set of scales, by examining how the interpretation of the structure/object captured in the scale-space changes with scale. Different principles can be employed to obtain scale-spaces that achieve a description of image structures through scales. A classical approach for choosing a scale-space representation for a particular application is to establish a set of scale-space axioms [5] describing basic properties of the desired representation.

DoG is an edge enhancement algorithm for grayscale images obtained by convolving an image with two different Gaussian kernels and subtracting one response from the other. Convolving an image with a Gaussian kernel has the excellent property of suppressing noise. Most edge enhancement algorithms used in digital image processing produce the undesirable side effect of increasing random noise in the image, while the DoG algorithm successfully suppresses high frequency noise components. Conceptually, the DoG acts as a band-pass filter on a grayscale image, and it is very similar to the architecture of the retina's visual receptive field; the retina in fact implements DoG band-pass filters at several spatial frequencies. The cross sections of two Gaussian curves with different standard deviations, their difference, and the corresponding 2D difference of Gaussians are plotted in Figure 3. Figure 4 illustrates the DoG scale-space for an image. DoG scale-space has been successfully applied by Lowe in designing SIFT, one of the most popular keypoint detection techniques used to detect and describe local features in images [1]. In this method, the DoG scale-space is applied to find key points that are invariant to scale.

Fig. 4. Multiscale representation of an image.

B. Multiple Kernel Learning

Selection of discriminating features is important in machine learning-based classification. In the present case, the DoG output image at every scale provides a set of extremely high dimensional features: the number of pixels in a DoG output image is the feature dimensionality at that scale. Moreover, considering all the DoG output images at various scales, the dimensionality of the feature set is staggeringly high, and selecting features directly from this high dimensional space is daunting. Fortunately, this issue can be dealt with by designing suitable kernel functions on the DoG output images. A kernel function computes the similarity or dissimilarity between two sets of (usually high dimensional) features. If the two sets of features come from the same class, the kernel function should measure strong similarity between the two sets; if the sets of features belong to two different classes, the kernel function should reveal very little similarity, or equivalently, strong dissimilarity between them. Construction of our proposed kernel function for large lump detection is discussed in detail in Section III.B. Here, kernel functions provide an abstraction


for DoG scale selection via MKL, which we discuss in detail in Section III.C. In the present section, we provide a survey of the literature relevant to MKL. MKL problems can sometimes be tackled via exhaustive cross validation (CV). However, if the number of parameters is large, CV is not computationally feasible. Recently, several researchers have therefore focused on finding efficient methods for multiple kernel learning. The MKL framework proposed by Lanckriet et al. [6] uses conic combinations of kernel matrices, which results in a convex quadratically-constrained quadratic program (QCQP). However, the semidefinite programming based method proposed in [6] can solve this problem only for a small number of kernels and a small number of data points; it becomes rapidly intractable as the number of learning examples or kernels increases. To overcome this limitation, Bach et al. [7] reformulated the problem so that sequential minimal optimization techniques can be applied to handle medium-scale problems. Their method is known as the support kernel machine (SKM). Sonnenburg et al. [8] also reformulated the binary classification MKL problem as a semi-infinite linear program, which can be solved by recycling standard SVM implementations. Sonnenburg's large-scale multiple kernel learning (LSMKL) method makes the MKL approach tractable for large problems, employing existing support vector machine code iteratively. All the aforementioned MKL techniques use linear combinations of kernels. Varma et al. [9] show how existing MKL formulations can be extended to learn general kernel combinations subject to regularization. This generalized multiple kernel learning (GMKL) method provides richer representations of feature spaces by combining kernels in ways other than linear combination. As a competitive alternative to the aforementioned MKL techniques, in this paper we propose to utilize the large margin principle of the 1-norm support vector machine to solve the MKL problem. Because we utilize a linear classifier for the MKL, we refer to our method as linear MKL, or LMKL. We tested LMKL on the aforementioned large lump detection problem and found that it compares favorably against three popular MKL methods, namely the support kernel machine (SKM) [7], large-scale multiple kernel learning (LSMKL) [8], and generalized multiple kernel learning (GMKL) [9].

C. Oil Sand Image Analysis

Oil sand images are relatively novel images; to date, very little research toward their automated analysis has been done. Zhang first investigated the application of different image processing algorithms to improve efficiency, reduce cost, and minimize the environmental impact in various stages of the oil sand mining process [4]. The majority of work on oil sand image analysis has concentrated on image segmentation for size analysis [10], [11], [12], [13]. These methods apply various machine learning techniques combined with active contours or snakes [10], watersheds [11], [12], and grayscale image thresholding [13]. Ray et al. show that connected operator-based pre-filtering helps snake methods to a significant degree in oil sand mining image segmentation [14].


Although several research works address oil sand image segmentation, only a few works target the large lump detection problem. Recently, Wang et al. [15] described a particle filter based solution for detecting large lumps in oil sand images. They proposed an observation model for a Bayesian tracker for joint detection and tracking of large lumps in an image sequence. To define the observation model, a feature detector is proposed. First, local Otsu thresholding is applied to the original image to obtain a thresholded image. Then, the white pixel density within a local window at each location is calculated to produce a density image. This step is motivated by the observation that, in the thresholded image, white pixels in a real lump region tend to be more compact and smooth than white pixels in clutter regions, and consequently real lump pixels have larger densities in the density image than clutter pixels. Finally, global thresholding of the density image yields the final observation. However, this method is heavily dependent on the thresholding parameters. In addition, it does not perform well on images with very low contrast or images captured in more difficult outdoor conditions, for example at night or in snow.

III. PROPOSED METHOD

In this work, we propose a method to detect the presence of blob-like objects in a video stream utilizing the DoG scale-space. To efficiently utilize information from multiple frames, we propose a temporally aligned kernel function in which temporal matching is performed in a multi-scale fashion over the frames belonging to a video. Although such a constraint may seem overly restrictive, it explicitly utilizes the temporal information and is experimentally demonstrated to be effective for detecting large lump events. Our proposed solution to object detection classifies every image frame of a video stream with a moving window in the time direction. In the learning stage, training video clips of positive and negative examples are randomly chosen. The number of image frames belonging to each of the training video clips is equal to the size of the moving window. DoG scale-spaces are constructed for every frame of the training video clips. A DoG scale-space is essentially a set of DoG response images, where each response image is associated with a scale parameter value. So, between any two video clips, we have two sets of time-aligned and scale-aligned DoG response images. For every pair of video clips in the training set, kernel functions are computed on the corresponding time- and scale-aligned DoG responses. In this regard, we have proposed a novel kernel function based on the circular convolution between two DoG responses.

Learning proceeds in two cascaded stages. In the first stage, LMKL learns a mixture kernel function, which is a convex combination of the circular convolution kernel functions for different time points and scales. A convex combination is a linear combination with non-negative coefficients. Thus, by construction, the mixture kernel function reflects the similarity between two video clips observed at various time points and DoG scale values. In the second stage of the cascade, the aforementioned mixture kernel function is used for learning an SVM to classify


the training video clips into two categories: large lump and non-large lump. In the first cascading stage, we utilize a 1-norm linear SVM to learn a sparse convex combination of the circular convolution kernels. Besides the selection of discriminatory features, sparsity is important because the classifier then needs to compute DoG response images only for a few scales during the moving window classification. Thus, sparsity also helps in time-efficient classification. The following three subsections describe the DoG scale-space construction, the circular convolution kernel function, and LMKL learning.

A. DoG Scale-Space Construction

DoG is the difference of two Gaussians with nearby scales separated by a constant multiplicative factor c [1]:

$$D^I_\sigma(i,j) = \left(G(i,j;c\sigma) - G(i,j;\sigma)\right) * I(i,j), \qquad (1)$$

where $G(i,j;\sigma) = \frac{1}{2\pi\sigma^2} e^{-(i^2+j^2)/2\sigma^2}$ is the Gaussian kernel and $*$ is the convolution operation between the kernel $G(i,j;\sigma)$ and the original image $I(i,j)$. The parameter $\sigma$ is referred to as scale. We follow the efficient approach for the construction of $D^I_\sigma(i,j)$ proposed in [1]. The input image is successively convolved with Gaussian functions to produce a scale-space stack called an octave, where the scales of the smoothed images are separated by a constant factor c. Each octave of scale-space is divided into an integer number S of sub-levels, so that $c = 2^{1/S}$. Adjacent image scales are subtracted to produce the DoG images. Once a complete octave has been processed, the Gaussian-smoothed image that has twice the initial value of $\sigma$ is re-sampled by taking every second pixel in each row and column; this technique greatly reduces the computation [1]. The octave index o and sub-level index s are mapped to the corresponding scale $\sigma$ using the following equation:

$$\sigma(o,s) = \sigma_0\, 2^{o+s/S}, \quad o \in [O_{min}, \cdots, O_{max}], \quad s \in [0, \cdots, S-1], \qquad (2)$$

where $O_{min}$ and $O_{max}$ are the minimum and maximum octave numbers, S is the number of sub-levels, and $\sigma_0$ is the base scale level.

Fig. 5. Image frame containing (a) large lump and (b) no large lump.

Fig. 6. DoG responses for the image in Figure 5(a).

Fig. 7. DoG responses for the image in Figure 5(b).

DoG plays an important role in blob-like object detection [1]. For the application at hand, Figure 6 and Figure 7 show the DoG scale-spaces at different scales for the images in Figure 5(a) and Figure 5(b), respectively. Note that the image containing a large lump and the one containing no lump have different DoG responses, especially at the coarser scales. In the presence of a large lump there is a blob-like structure, corresponding to the large lump, at some of the scale values.
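For concreteness, a minimal sketch of this construction in Python is given below (the function name, the parameter defaults, and the simplified octave handling are our own illustration under the scheme of Eqs. (1)-(2), not the authors' code):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_scale_space(image, sigma0=1.6, n_sublevels=3, n_octaves=4):
    """Build a DoG scale-space stack per Eqs. (1)-(2): within each octave,
    successive Gaussian blurs separated by the factor c = 2**(1/S) are
    subtracted; between octaves, the base image is blurred to twice the
    base scale and downsampled by taking every second pixel."""
    c = 2.0 ** (1.0 / n_sublevels)
    responses, scales = [], []
    base = image.astype(np.float64)
    for o in range(n_octaves):
        sigma = sigma0
        prev = gaussian_filter(base, sigma)
        for s in range(n_sublevels):
            nxt = gaussian_filter(base, c * sigma)
            responses.append(nxt - prev)                 # DoG response D_sigma
            scales.append(sigma0 * 2 ** (o + s / n_sublevels))  # Eq. (2)
            prev, sigma = nxt, c * sigma
        base = gaussian_filter(base, 2 * sigma0)[::2, ::2]      # next octave
    return responses, scales
```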

From these DoG responses we can see that, in order to extract image structures stably, the concept of a blob at a single scale is not sufficient. Each of the DoG response images can be considered a set of features. However, for a specific classification task, it is important to find the relevant scales instead of using the entire scale-space. Selecting a few relevant scales would not only enhance the classification performance, it would also allow classification in real or near-real time.

B. Basis Kernel Function Construction

The choice of kernel function plays an important role in kernel-based methods. In designing such a kernel function, the inclusion of prior knowledge about possible variations of the patterns can contribute significantly to an effective support vector machine (SVM) classifier. Shift invariance is


important for our application, because large lumps can appear anywhere in the image frame and yet the kernel function needs to capture the similarity among all these cases. Standard kernel functions, such as the linear, Gaussian, or polynomial kernels, are not always appropriate for use with scale-space features for object detection; one principal reason is that these kernels are not shift invariant. We propose a base kernel function defined with circular convolution, which is shift invariant. The circular convolution between two scale-space images $D^I_\sigma$ and $D^J_\sigma$ of size $M \times N$, obtained from images I and J at scale $\sigma$, can be computed as follows:

$$(D^I_\sigma \otimes_c D^J_\sigma)(i,j) = \sum_{u=1}^{M} \sum_{v=1}^{N} D^I_\sigma(i-u, j-v)\, D^J_\sigma(u,v). \qquad (3)$$

Using (3) we can define a kernel function between two images I and J for the scale $\sigma$ as [16]:

$$k_\sigma(I,J) = \sum_{i=1}^{M} \sum_{j=1}^{N} \left((D^I_\sigma \otimes_c D^J_\sigma)(i,j)\right)^2, \qquad (4)$$

which we call the circular convolution kernel function. Note that this is equivalent to point-wise multiplication of the Fourier transform coefficients of $D^I_\sigma$ and $D^J_\sigma$ followed by the sum of their squares. Thus, this kernel not only takes shift invariance into account, it also compares the frequency bands produced in the DoG scale-space. Keeping in mind that DoG is a band-pass filter, the circular convolution kernel function compares frequency responses. It can be shown that the circular convolution kernel function is indeed a kernel function, i.e., symmetric and positive semi-definite (see Appendix A). Notice that the basis kernel function (4) is defined between corresponding DoG responses for two images I and J. To extend this kernel to two video clips of the same length, consider $U = (I_1, I_2, \ldots, I_T)$ and $V = (J_1, J_2, \ldots, J_T)$ as two such clips of length T, where $I_t$ and $J_t$ are the t-th frames in the clips. The circular convolution kernel function can be extended between U and V as

$$k_{t,\sigma}(U,V) = k_\sigma(I_t, J_t) \qquad (5)$$

for $t = 1, 2, \ldots, T$. Thus, (5) is the time-aligned and scale-aligned circular convolution kernel function between two video clips.
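Since circular convolution corresponds to point-wise multiplication in the Fourier domain, the base kernel (4) can be evaluated without the costly spatial sum. A minimal sketch, assuming NumPy and using Parseval's identity (the function name is our own, not from the paper):

```python
import numpy as np

def circular_convolution_kernel(d_i, d_j):
    """k_sigma(I, J) of Eq. (4): sum over pixels of (D_I circ-conv D_J)^2.
    By the circular convolution theorem and Parseval's identity this
    equals (1/MN) * sum |F(D_I) * F(D_J)|^2, avoiding the O((MN)^2) sum."""
    Fi = np.fft.fft2(d_i)
    Fj = np.fft.fft2(d_j)
    return np.sum(np.abs(Fi * Fj) ** 2) / d_i.size
```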

C. Linear Multiple Kernel Learning

Instead of choosing any particular single time- and scale-based kernel function, we propose to use a data-dependent mixture kernel function, which is a convex combination of the basis kernel functions (5):

$$k(U,V) = w_{0,0} + \sum_t \sum_\sigma w_{t,\sigma}\, k_{t,\sigma}(U,V), \qquad (6)$$

where each kernel $k_{t,\sigma}$ is the basis kernel function for video frame t and scale $\sigma$, and $w_{t,\sigma}$ is the weight for the same frame and scale. For the large lump detection problem, considering multiple basis kernel functions is justified by the fact that a large lump can span more than a single video frame and a single scale in the DoG scale-space. The goal of MKL is to obtain an optimized combination of the non-negative coefficients of the linear combination in (6). There is a straightforward interpretation of (6): if the basis kernel functions $k_{t,\sigma}$ are considered features, then k in (6) is a prediction function with a linear combination of non-negative weights that need to be learned from data. A positive sign for k(U,V) would indicate that U and V belong to the same class, while a negative sign would indicate otherwise. The standard machinery of the large margin hyperplane classifier, viz., a linear SVM, can be applied to (6) with only slight modifications to yield non-negative weights (see Appendix B). Alternatively, one can use Mangasarian's unconstrained minimization [17]. The 1-norm SVM has the advantage of a sparse solution [18]. In our case, the sparse solution ensures that only a handful of the coefficients $w_{t,\sigma}$ will be non-zero, and only the few convolutions corresponding to these scales need to be evaluated at run time. Once the basis kernel weights $w_{t,\sigma}$ are learned, the next step is to classify a test video clip V. We apply a non-linear SVM [19]:

$$f(V) = \alpha_0 + \sum_i \alpha_i y_i\, k(U_i, V), \qquad (7)$$

where $\{U_1, U_2, \cdots\}$ are the support vectors (training video clips) with weights $\{\alpha_1, \alpha_2, \cdots\}$ and corresponding labels $\{y_1, y_2, \cdots\}$. We assume $y_i = 1$ implies a large lump and $y_i = -1$ implies a non-large lump; $\alpha_0$ is the bias. The prediction label for an unknown video sequence V is obtained as the sign of f(V). In our experiments, each training video clip consists of five image frames. For a test video stream $(J_1, J_2, \ldots)$, if the current time is denoted by t, then the test video clip V for the current time t consists of five frames: $V = (J_{t-2}, J_{t-1}, J_t, J_{t+1}, J_{t+2})$. We learn the two classifiers (6) and (7) in a cascade. Notice that, because $\sum_i \alpha_i y_i = 0$ in a standard 2-norm SVM [19], the bias term $w_{0,0}$ of (6) is not required in (7). Below we provide the proposed LMKL algorithm.

LMKL Algorithm
Inputs: Training set $\{U_1, U_2, U_3, \cdots, U_l\}$ of l video clips and hold-out set $\{V_1, V_2, V_3, \cdots, V_p\}$ of video clips.
Outputs:
1) Kernel weights $w_{t,\sigma}$ for $t = 1, 2, \cdots, T$ and $\sigma = \sigma_1, \sigma_2, \cdots, \sigma_N$;
2) SVM weights $\alpha_0, \alpha_1, \alpha_2, \cdots, \alpha_l$.
Perform the following three steps (a) through (c) in sequence:
(a) From the training set, obtain all possible pairings of video clips of the form $(U_i, U_j)$, $i > j$. With these paired videos, compute the non-negative weights $w_{t,\sigma}$ of (6) using the linear 1-norm SVM.
(b) Using the training set, compute the support vector weights $\alpha_0, \alpha_1, \alpha_2, \cdots, \alpha_l$ of (7) with a standard 2-norm non-linear support vector machine using the learned mixture kernel function k from (6).
(c) Obtain the classification accuracy on the hold-out test set using (7).
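A sketch of how the paired training data for step (a) might be assembled (a hypothetical helper; we assume the basis kernel values $k_{t,\sigma}(U_i, U_j)$ have already been computed and stored in an array):

```python
import numpy as np
from itertools import combinations

def build_pairwise_lp_data(base_kernels, labels):
    """Stage (a) of the LMKL algorithm: for every unordered pair (U_i, U_j),
    stack the basis kernel values k_{t,sigma}(U_i, U_j) into one row of the
    feature matrix and use y_i * y_j (the ideal kernel) as the row's label.
    `base_kernels` has shape (l, l, T*N) with one entry per pair and
    per (frame, scale) basis kernel."""
    l = len(labels)
    rows, pair_labels = [], []
    for i, j in combinations(range(l), 2):
        rows.append(base_kernels[i, j])            # features of the pair
        pair_labels.append(labels[i] * labels[j])  # ideal-kernel target
    return np.asarray(rows), np.asarray(pair_labels)
```

The returned matrix and labels play the roles of A and D in the 1-norm SVM linear program of Appendix B.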


The 1-norm SVM has one regularization parameter ($\lambda$ of (21), as shown in Appendix B). The non-linear SVM has another tuning parameter $\nu$ that bounds the weights: $\alpha_i \le 1/(\nu l)$, $i = 1, \cdots, l$ [19]. The LMKL algorithm is run on all possible combinations of these two tuning parameters; the combination that yields the highest classification accuracy is selected, and the associated learned parameters $\alpha_0, \alpha_1, \cdots, \alpha_l$ and $w_{t,\sigma}$ are retained. Our algorithm requires only linear programming, or alternatively the method proposed in [17], making the LMKL implementation simple, fast, and easily accessible.

Our proposed classification margin function for the linear classifier (6) takes the form $y_U y_V k(U, V)$, where $y_U$ and $y_V$ are the classification labels (+1 or -1) of clips U and V, respectively. Note that the factor $y_U y_V$ is the ideal kernel function [19], taking the value +1 when U and V come from the same class and -1 when they belong to different classes. With this setting, an apparent concern is the violation of independence among sample points, because the LMKL algorithm pairs up i.i.d. (independent and identically distributed) sample points and inflates the sample from size l to l(l-1)/2. However, we have established that our margin function $y_U y_V k(U, V)$ has a sharp concentration bound, even when it is estimated from paired-up sample points. If $|k(U,V)| \le R$, i.e., the kernel function is bounded between -R and +R, the following concentration inequality holds with probability $1-\delta$ (see Appendix C for a derivation):

$$E[y_U y_V k(U,V)] - \hat{E}[y_U y_V k(U,V)] \le \sqrt{\frac{8R^2}{l} \ln\!\left(\frac{2}{\delta}\right)}, \qquad (8)$$

where E denotes the expectation of the margin function, $\hat{E}$ denotes the empirical estimate of the expectation from paired sample points, and l is the sample size, i.e., the size of the training set as defined in the LMKL algorithm. Thus, as expected, the concentration becomes sharper with increasing sample size l. The learned kernel function can easily achieve a bound such as $|k(U,V)| \le R$ if every base kernel function $k_{t,\sigma}$ is bounded and, in addition, the weights $w_{t,\sigma}$ are normalized. The concentration bound (8) shows that the empirical estimate $\hat{E}$ of our proposed margin function, embedded in a loss (e.g., the hinge loss of the SVM), is a reasonable optimization target that a binary classifier can achieve with paired sample points constructed from an i.i.d. set. As additional theoretical motivation for binary classifiers using paired sample points, we cite the work of Usunier et al. [20], which derives a generalization error bound for a classifier that deterministically combines training patterns from an i.i.d. set; bipartite ranking is one such problem where paired sample points are used [20].

IV. RESULTS

This section is divided into four subsections. First, we describe the data sets used in our experiments. Next, we illustrate the efficacy of the proposed kernel function compared to other standard kernel functions, such as the Gaussian and polynomial kernels. The next subsection reports the performance of some standard object


detection features, such as HoG and bag of visual words, on the large lump dataset. Finally, we compare our proposed LMKL with three other popular MKL techniques from the recent literature.

A. Description of Data

Fig. 8. Region of interest for large lump detection.

Experiments have been performed on three oil sand data sets captured in three different outdoor conditions: normal daylight, night light, and a snowy day. For all experiments, a region of interest is defined as shown in Figure 8. The image sequence is cropped to the region of interest and perspective-corrected to make it rectangular. The daylight video set contains 768 image frames, of which 183 frames contain a large lump and 585 contain no large lump. The main challenge with the daylight dataset is the changing lighting conditions. The night light video set has 585 frames, with 316 large lump frames and 269 no large lump frames. Although the lighting is fixed in this dataset, the main problem is the presence of shadows. The snowy day video contains 378 frames, with 144 positive cases and 234 negative cases. Most of the image frames in this dataset contain snowflakes, captured during heavy snow. Example large lump images from the three data sets are shown in Figure 9.

Fig. 9. Example large-lump images from (a) Daylight, (b) Nightlight, and (c) Snow datasets.

A large lump event usually spans several image frames. For example, Figure 10 shows a large lump event spanning five consecutive image frames. Given this multi-frame structure, it makes sense to extend the frame-based classification to video clip-based classification, as discussed above. For each data set, we first construct a training set of 50 video clips, where 25 clips come from large lump events and 25 clips come from non-large lump events. Each training video clip contains T = 5 consecutive frames. The number of DoG scales is N = 40. Thus, the MKL algorithms sparsely choose weights from 5 × 40 = 200 base kernel functions.

Fig. 10. Large lump event sequence from Daylight dataset.

The performance of our classification method is evaluated with precision, recall, accuracy, and MCC (Matthews correlation coefficient), defined as follows:

$$Precision = \frac{tp}{tp + fp}, \quad Recall = \frac{tp}{tp + fn}, \quad Accuracy = \frac{tp + tn}{tp + tn + fp + fn},$$

$$MCC = \frac{(tp)(tn) - (fp)(fn)}{\sqrt{(tp + fn)(tp + fp)(tn + fp)(tn + fn)}},$$

where tp, fp, tn, and fn are the numbers of true positives, false positives, true negatives, and false negatives, respectively. MCC is generally regarded as a balanced measure that is very effective when the classes are of quite different sizes.
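These metrics follow directly from the confusion-matrix counts; the sketch below is a direct transcription of the definitions above:

```python
import math

def prediction_metrics(tp, fp, tn, fn):
    # Precision, recall, accuracy, and Matthews correlation coefficient
    # computed from the confusion-matrix counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return precision, recall, accuracy, mcc
```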

B. Convolution Kernel Function vs. Others

For each data set, we have generated its DoG scale-space, which contains 40 DoG responses corresponding to 40 consecutive scales. For each scale, we have constructed the base kernel $k_\sigma$. To compare our base convolution kernel function (4) with traditional kernel functions, we created kernel matrices from our training data using the linear, polynomial, Gaussian, and sigmoid kernel functions and computed the kernel alignment score [21] for each scale to evaluate the compliance of a kernel with the data. The range of the alignment score is [0, 1]; the larger its value, the better the kernel function. The graph in Figure 11 shows the alignment score on the daylight training set for different scales using the different kernel functions. The proposed circular convolution kernel function obtains a better alignment score than the other traditional kernel functions, especially at the coarser scales. This score plot indicates that the circular convolution kernel is appropriate for our application.
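For reference, the alignment score of [21] between a kernel matrix K and the ideal kernel $yy^T$ can be computed as below (a minimal sketch; the function name is ours):

```python
import numpy as np

def kernel_alignment(K, y):
    """Empirical alignment between kernel matrix K and the ideal kernel
    y y^T [21]: <K, yy^T>_F / (||K||_F * ||yy^T||_F)."""
    yyT = np.outer(y, y)
    return np.sum(K * yyT) / (np.linalg.norm(K) * np.linalg.norm(yyT))
```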

Fig. 11. Alignment scores for different kernels at different scales.

C. DoG vs. Other Features

In this section we compare DoG with several popular features for object detection, namely histogram of oriented gradients (HoG), bag of visual words, and dense word features.

1) Histogram of Oriented Gradients: The histogram of oriented gradients (HoG) descriptor proposed in [22] is one of the popular feature descriptors widely used for object detection. HoG calculates occurrences of gradient orientation in

localized portions of an image. The key concept behind the HoG descriptor is that local object appearance and shape can be represented by the distribution of intensity gradients or edge directions. To calculate the descriptor, images are divided into cells, which are small connected regions, and for each cell a histogram of gradient directions or edge orientations of the pixels is calculated. The combination of these histograms then represents the descriptor. For improved accuracy, the local histograms can be contrast-normalized by calculating a measure of the intensity across a larger region of the image, called a block, and then using this value to normalize all cells within the block. This normalization results in better invariance to changes in illumination and shadowing. The HoG descriptor has several important advantages over other descriptors; in particular, it is resilient to geometric and photometric transformations because it operates on localized cells. Table I shows the performance for different block sizes on the daylight dataset.

TABLE I
PREDICTION PERFORMANCE OF HOG FEATURES FOR DIFFERENT BLOCK SIZES ON THE DAYLIGHT DATASET

Block Size   Recall   Precision   MCC      Accuracy
5 × 5        0.4481   0.5694      0.3738   0.7883
9 × 9        0.4426   0.6136      0.4018   0.8013
13 × 13      0.6230   0.7451      0.5936   0.8597
17 × 17      0.6011   0.7639      0.5929   0.8610
19 × 19      0.6230   0.7451      0.5936   0.8597

TABLE II
PREDICTION PERFORMANCE OF DIFFERENT FEATURES FOR THE DAYLIGHT DATASET

Feature               Recall   Precision   MCC      Accuracy
HoG                   0.6011   0.7639      0.5929   0.8610
Bag of visual words   0.6612   0.7035      0.5869   0.8532
Dense word            0.6503   0.7391      0.6058   0.8623
DoG                   0.7760   0.8023      0.7247   0.9013


2) Bag of Visual Words: This feature was proposed in [23]. Bag of visual words is one of the most popular techniques in object categorization because of its simplicity and relatively good performance compared with other prevalent methods. In this method, SIFT descriptors are extracted at Hessian-Laplace points and quantized into a vocabulary of 3000 words, trained on features from several object instances. Using the agglomerative information bottleneck (AIB) [23], the vocabulary is discriminatively compressed down to 64 words for each class. In our experiment, the same training images as before are used for each class.

3) Dense Word: Dense word features, applied in [24] for object detection, are used here for the large lump detection problem. This feature is also based on the SIFT descriptor. Rotationally invariant SIFT descriptors are extracted on a regular grid of five pixels at four scales with radii r = 10, 15, 20, and 25 pixels, zeroing the low-contrast ones. Finally, the descriptors are quantized into 300 visual words.

Tables II, III and IV show the performance of the different features on the three data sets. Notice that the performance of the LMKL-based DoG feature appears in the last row of these tables and compares very well with its competitors.

TABLE III
PREDICTION PERFORMANCE OF DIFFERENT FEATURES FOR THE NIGHTLIGHT DATASET

Feature               Recall   Precision   MCC      Accuracy
HoG                   0.7532   0.8655      0.6147   0.8034
Bag of visual words   0.8418   0.8693      0.6916   0.8462
Dense word            0.8323   0.8709      0.6854   0.8427
DoG                   0.9076   0.8333      0.7043   0.8525

TABLE IV
PREDICTION PERFORMANCE OF DIFFERENT FEATURES FOR THE SNOW DATASET

Feature               Recall   Precision   MCC      Accuracy
HoG                   0.6736   0.8291      0.6178   0.8228
Bag of visual words   0.6528   0.9400      0.6904   0.8519
Dense word            0.8333   0.7453      0.6463   0.8280
DoG                   0.8056   0.9431      0.8039   0.9074

D. LMKL vs. Others

After constructing the basis kernels on each DoG scale, the MKL algorithms are applied for sparse selection and weighting of kernels. For performance comparisons, three popular existing MKL techniques discussed in Section II-B, i.e., SKM [7], LSMKL [8], and GMKL [9], are chosen. For each MKL method, five-fold cross-validation is performed to determine the values of the tuning parameters.

First, a comparison of the computational time of the proposed method, SKM, LSMKL, and GMKL on the three data sets is presented. All experiments were run on an Intel Core 2 Duo 2.43 GHz 64-bit machine with 3 GB RAM. The proposed 1-norm LMKL is implemented in Matlab 2009. The Matlab code of SKM and LSMKL was downloaded from http://homes.esat.kuleuven.be/sistawww/bioi/syu/l2lssvm.html and the Matlab code of GMKL from http://research.microsoft.com/en-us/um/people/manik/code/GMKL/download.html. For LSMKL, the MOSEK optimization software, which combines the convenience of MATLAB with the speed of C code, is used to solve the optimization problems. Table V shows the CPU time needed by each multiple kernel learning method. It can be seen that the computational efficiency of the proposed method is comparable to the other popular MKL techniques.

TABLE V
THE CPU TIME (IN SECONDS) NEEDED FOR EACH METHOD

Dataset      LMKL     SKM      LSMKL    GMKL
Daylight     0.6473   0.6138   0.3105   2.4056
Nightlight   0.6510   0.6867   0.3495   2.3910
Snow         0.6033   0.3411   0.3126   3.7630

Table VI shows the number of base kernels selected by each method. Although the average number of scales selected by LMKL is relatively high compared to the other MKL methods, the selection is still sparse, representing only 12% of the total kernels on average. As a result, LMKL remains reasonable for real or near-real time applications. Figure 12 shows the selected weights for five consecutive frames using our method. From these plots we can see that most of the weights are selected in the middle frames. This is because we designed the training set such that, for a positive video clip, the middle frame always contains a large lump, and for a negative video clip the middle frame contains no large lump. The beginning and trailing frames may or may not contain a large lump depending on the time point of the middle frame. For example, if the middle frame is the starting point of a large lump event, then the beginning frames will not contain any large lump; on the other hand, if the middle frame is the end point of a large lump event, then the beginning frames will certainly contain the large lump. So, the middle frame is always more important than the other frames in our moving window classifier.

TABLE VI
NUMBER OF KERNELS NEEDED FOR EACH METHOD

Dataset      LMKL   GMKL   SKM   LSMKL
Daylight     21     15     9     9
Nightlight   21     40     11    12
Snow         32     18     7     10

TABLE VII
PREDICTION RESULTS OF DIFFERENT MKL METHODS FOR THE DAYLIGHT DATASET

Method   Recall   Precision   MCC      Accuracy
LMKL     0.7760   0.8023      0.7247   0.9013
GMKL     0.7049   0.7914      0.6742   0.8857
SKM      0.6995   0.7901      0.6700   0.8844
LSMKL    0.6831   0.6720      0.5759   0.8455


Fig. 12. Kernel weights selected by LMKL for different data sets: (a) Daylight, (b) Nightlight, (c) Snow.

TABLE VIII
PREDICTION RESULTS OF DIFFERENT MKL METHODS FOR THE NIGHT LIGHT DATA SET

Method   Recall   Precision   MCC      Accuracy
LMKL     0.9076   0.8333      0.7043   0.8525
GMKL     0.8766   0.8318      0.6727   0.8376
SKM      0.6234   0.8955      0.5534   0.7573
LSMKL    0.7500   0.8525      0.5964   0.7949

TABLE IX
PREDICTION RESULTS OF DIFFERENT MKL METHODS FOR THE SNOW DATA SET

Method   Recall   Precision   MCC      Accuracy
LMKL     0.8056   0.9431      0.8039   0.9074
GMKL     0.8125   0.9213      0.7914   0.9021
SKM      0.6875   0.9612      0.7312   0.8704
LSMKL    0.7986   0.9350      0.7923   0.9021

TABLE X
CLASSIFICATION PERFORMANCE FOR DIFFERENT TRAINING SET SIZES

Training Size     Method   Recall   Precision   MCC      ACC
10 pos + 10 neg   LMKL     0.5847   0.8359      0.6276   0.8740
                  GMKL     0.5574   0.7500      0.5575   0.8506
15 pos + 15 neg   LMKL     0.7213   0.7674      0.6675   0.8818
                  GMKL     0.5902   0.7714      0.5911   0.8610
20 pos + 20 neg   LMKL     0.7268   0.7778      0.6780   0.8857
                  GMKL     0.6011   0.7746      0.5999   0.8636
25 pos + 25 neg   LMKL     0.7760   0.8023      0.7247   0.9013
                  GMKL     0.7049   0.7914      0.6742   0.8857

From Tables VII, VIII and IX we can see that our sparse LMKL method outperforms the three other popular multiple kernel learning techniques with respect to most of the performance metrics on the daylight and night light data sets. On the snow data set our method achieves virtually the same performance as LSMKL, and both perform better than the other two methods.

To show the effect of varying the training set size on the classification performance, we performed experiments with different training set sizes on the daylight data. As the competing method we chose GMKL, since GMKL has the performance closest to the proposed method on all the data sets considered here. The results are provided in Table X. To explain the first column: the entry "10 pos + 10 neg" refers to a training set with 10 positive and 10 negative examples. Comparing the performance metrics in Table X, the proposed method performs better than GMKL, especially when the training set is small. Note that, as expected, the performance of both methods improves as the training set grows.

Being a two-stage learning method, our algorithm can easily tune precision or recall according to user requirements. Here we have combined the scheme of [25] with our proposed system to adjust precision and recall in the support vector machine, and we report the performance using recall-precision curves. The balance between recall and precision is controlled with the following technique: the diagonal elements of the optimized kernel matrix are supplemented by fixed positive contributions ε+ and ε−, for the positive and negative classes respectively. By controlling these two parameters one can vary precision and recall. This method corresponds to an asymmetric margin; i.e., the class with the smaller ε will be kept further away from the decision boundary. Figures 13, 14 and 15 show the recall-precision graphs of the proposed classifier for the three datasets, with ε− = 0 and varying ε+, and with ε+ = 0 and varying ε−, respectively. To compare with other MKL methods more precisely, we interpolated the recall-precision curves to obtain the recall values of our method at the precision levels attained by the other methods. This arrangement allows us to compare recall for the different methods at a fixed precision. Figures 16(a), (b) and (c) show these comparisons for GMKL, SKM, and LSMKL, respectively; the data sets are indicated in the diagrams.
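A minimal sketch of this per-class diagonal supplementation (function name and default values are illustrative only):

```python
import numpy as np

def asymmetric_margin_kernel(K, y, eps_pos=0.01, eps_neg=0.0):
    """Add fixed positive contributions eps_pos / eps_neg to the diagonal
    of the learned kernel matrix, per class, to trade off precision and
    recall as in [25]; the class with the smaller epsilon ends up further
    from the decision boundary."""
    Kb = K.copy()
    Kb[np.diag_indices_from(Kb)] += np.where(y > 0, eps_pos, eps_neg)
    return Kb
```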

Fig. 13. Recall-precision graph for the daylight data set: the value of recall and precision versus ε+ and ε− (x-axis), respectively.

Fig. 14. Recall-precision graph for the night light data set: the value of recall and precision versus ε+ and ε− (x-axis), respectively.

Fig. 15. Recall-precision graph for the snow data set: the value of recall and precision versus ε+ and ε− (x-axis), respectively.

Fig. 16. Comparison of the proposed LMKL with different MKL methods at fixed precision values for different data sets: (a) LMKL vs. GMKL, (b) LMKL vs. SKM, (c) LMKL vs. LSMKL.

V. CONCLUSIONS AND FUTURE WORK

This paper provides a novel automated object detection framework. Our solution technique uses the DoG scale-space as features, and feature selection is performed via a multiple kernel function learning framework. We applied the proposed detection system to the challenging large lump detection problem. Our proposed shift-invariant convolution kernel function outperforms other traditional kernel functions. Further, our experiments revealed the efficacy of DoG-based features over other popular features for object detection. Our LMKL algorithm uses the maximum margin principle of the 1-norm SVM classifier and compares well with other competitive MKL techniques. In the future, we will investigate spatio-temporal features. One of the interesting features we are currently working on is 3D DoG responses obtained from video clips; we believe that 3D DoG, which is able to capture spatial and temporal information, will increase the precision of the classification system. Testing the proposed classification method on other diverse applications, such as human detection, is also left for future work.

APPENDIX A
CIRCULAR CONVOLUTION KERNEL FUNCTION IS SYMMETRIC AND POSITIVE SEMI-DEFINITE

In this section we prove that the function given by (4) is a kernel function. To this end, we define the kernel matrix K for a set of images $I_1, I_2, \ldots$ and a kernel function k as:

$$K_{i,j} = k(I_i, I_j), \quad \forall\, i, j = 1, 2, \cdots. \qquad (9)$$

To prove that k is a kernel function, we need to show that the associated kernel matrix K is symmetric and positive semi-definite. Toward this goal, we first prove the following lemma.

Lemma 1: If A and B are two doubly block circulant matrices, then $A^T B = B A^T$.

Proof: We first prove the result for circulant matrices. Let X and Y be two circulant matrices:

$$X = \begin{bmatrix} x_0 & x_{N-1} & \cdots & x_1 \\ x_1 & x_0 & \cdots & x_2 \\ \vdots & \vdots & \ddots & \vdots \\ x_{N-1} & x_{N-2} & \cdots & x_0 \end{bmatrix} \qquad (10)$$

and

$$Y = \begin{bmatrix} y_0 & y_{N-1} & \cdots & y_1 \\ y_1 & y_0 & \cdots & y_2 \\ \vdots & \vdots & \ddots & \vdots \\ y_{N-1} & y_{N-2} & \cdots & y_0 \end{bmatrix}. \qquad (11)$$

Then, the (k,m)th element of the matrix $X^T Y$ is

$$(X^T Y)_{k,m} = \sum_{l=0}^{N-1} (X^T)_{k,l} (Y)_{l,m} = \sum_{l=0}^{N-1} x_{l-k}\, y_{l-m} = \sum_{l=0}^{N-1} x_{l-m+m-k}\, y_{l-m} = \sum_{p=0}^{N-1} x_{p+m-k}\, y_p. \qquad (12)$$

On the other hand,

$$(Y X^T)_{k,m} = \sum_{l=0}^{N-1} (Y)_{k,l} (X^T)_{l,m} = \sum_{l=0}^{N-1} y_{k-l}\, x_{m-l} = \sum_{l=0}^{N-1} y_{k-l}\, x_{m-k+k-l} = \sum_{p=0}^{N-1} y_p\, x_{p+m-k}. \qquad (13)$$

So $X^T Y = Y X^T$. We now extend the result to doubly block circulant matrices. Note that a doubly block circulant matrix has a circulant block structure and each block is in turn a circulant matrix. For two such matrices A and B, consider the (k,m)th block of $A^T B$:

$$[A^T B]_{k,m} = \sum_l [A^T]_{k,l} [B]_{l,m} = \sum_l [A]^T_{l-k} [B]_{l-m} = \sum_p [A]^T_{p+m-k} [B]_p. \qquad (14)$$

On the other hand,

$$[B A^T]_{k,m} = \sum_l [B]_{k,l} [A^T]_{l,m} = \sum_l [B]_{k-l} [A]^T_{m-l} = \sum_p [A]^T_{p+m-k} [B]_p. \qquad (15)$$

Therefore, $A^T B = B A^T$.

Proposition 1: The function given in (4) is a kernel function.

Proof: For a function to be a kernel function, it needs to be (a) symmetric and (b) positive semi-definite. Symmetry is satisfied for function (4) because circular convolution is commutative. It remains to show that function (4) is positive semi-definite. To show this, let F and G be two matrices of size N × M. The circular convolution between them is defined as:

$$(F \otimes_c G)_{i,j} = \sum_{u=0}^{N-1} \sum_{v=0}^{M-1} F_{i-u,j-v}\, G_{u,v} = [F_{i,j}\; F_{i,j-1} \cdots F_{i,j-M+1} \cdots F_{i-N+1,j} \cdots F_{i-N+1,j-M+1}]\; [G_{0,0}\; G_{0,1} \cdots G_{0,M-1} \cdots G_{N-1,0} \cdots G_{N-1,M-1}]^T. \qquad (16)$$

In a similar way,

$$(F \otimes_c G)_{i,j} = \sum_{u=0}^{N-1} \sum_{v=0}^{M-1} G_{i-u,j-v}\, F_{u,v} = [G_{i,j}\; G_{i,j-1} \cdots G_{i,j-M+1} \cdots G_{i-N+1,j} \cdots G_{i-N+1,j-M+1}]\; [F_{0,0}\; F_{0,1} \cdots F_{N-1,M-1}]^T. \qquad (17)$$

Let $F_1, F_2, \cdots$ be the 2D matrices for which function (4) is computed pairwise, and let K denote the kernel matrix. Using (16) and (17) to write the convolution in its two equivalent vector forms, the (p,q)th element of K is given by

$$K_{p,q} = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} (F_p \otimes_c F_q)^2_{i,j} = (\tilde{F}_p \bar{F}_q)^T (\tilde{F}_q \bar{F}_p), \qquad (18)$$

where the notation $\bar{F}$ stands for the following column vector of size MN × 1,

$$\bar{F} = [F_{0,0}\; F_{0,1} \cdots F_{0,M-1} \cdots F_{N-1,0}\; F_{N-1,1} \cdots F_{N-1,M-1}]^T,$$

and $\tilde{F}$ is a doubly block circulant matrix of size MN × MN:

$$\tilde{F} = \begin{bmatrix} F_{0,0} & F_{0,M-1} & \cdots & F_{0,1} & \cdots & F_{N-1,M-1} \\ F_{0,1} & F_{0,0} & \cdots & F_{0,2} & \cdots & F_{N-1,M-2} \\ \vdots & \vdots & & \vdots & & \vdots \\ F_{1,0} & F_{1,M-1} & \cdots & F_{1,1} & \cdots & F_{0,M-1} \\ F_{1,1} & F_{1,0} & \cdots & F_{1,2} & \cdots & F_{0,M-2} \\ \vdots & \vdots & & \vdots & & \vdots \\ F_{N-1,M-1} & F_{N-1,M-2} & \cdots & F_{N-1,0} & \cdots & F_{0,0} \end{bmatrix}. \qquad (19)$$

Thus, using Lemma 1,

$$K_{p,q} = \bar{F}_q^T \tilde{F}_p^T \tilde{F}_q \bar{F}_p = \bar{F}_q^T \tilde{F}_q \tilde{F}_p^T \bar{F}_p = (\tilde{F}_q^T \bar{F}_q)^T (\tilde{F}_p^T \bar{F}_p). \qquad (20)$$

Thus K can be written as $K = L L^T$ for some matrix L. Hence K is a positive semi-definite symmetric matrix.
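Proposition 1 can also be checked numerically by computing the kernel matrix of (4) in the Fourier domain (as noted in Section III-B) for a few random responses and testing symmetry and positive semi-definiteness. A small sketch, with illustrative sizes:

```python
import numpy as np

def circ_conv_kernel_matrix(responses):
    """Kernel matrix K with K[p, q] = k(F_p, F_q) from Eq. (4), computed
    via the FFT; a quick numerical check of Proposition 1."""
    n = len(responses)
    F = [np.fft.fft2(r) for r in responses]
    K = np.empty((n, n))
    for p in range(n):
        for q in range(n):
            K[p, q] = np.sum(np.abs(F[p] * F[q]) ** 2) / responses[p].size
    return K

rng = np.random.default_rng(0)
K = circ_conv_kernel_matrix([rng.standard_normal((8, 8)) for _ in range(5)])
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-9)  # symmetric, PSD
```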

APPENDIX B
MODIFICATION OF 1-NORM SVM FOR NON-NEGATIVE WEIGHTS

Following [17], the 1-norm linear SVM for the binary classification problem can be written as:

$$\min_{z,w,w_0}\; \lambda \|z\|_1 + \|w\|_1 \quad \text{s.t.} \quad D(Aw - e w_0) + z \ge e, \quad z \ge 0, \qquad (21)$$

where the m × n matrix A represents m points in $\mathbb{R}^n$ to be separated by a maximal margin with a separating plane $x^T w = w_0$. D is an m × m diagonal matrix with elements $D_{ii} = +1$ or $-1$ according to the class of each row of A, and e is a vector of all ones. The objective function in (21) is a trade-off between a large margin ($1/\|w\|_1$) and an error penalty on the slack variables z; the user tuning parameter $\lambda$ controls this trade-off. The terms $\|w\|_1$ and $\|z\|_1$ denote the 1-norms of the vectors w and z, respectively. In our case the weight vector w is non-negative, so one can rewrite (21) as:

$$\min_{z,w,w_0}\; \lambda e^T z + e^T w \quad \text{s.t.} \quad D(Aw - e w_0) + z \ge e, \quad z, w \ge 0. \qquad (22)$$

The linear program in (22) has a solution because it is feasible and its objective function is bounded below by zero. For a fairly large-scale problem, a standard package, such as CPLEX, is able to solve the linear program (22). Alternatively, one can apply the unconstrained Newton optimization method [17] defined for large-scale linear programs of the form (22).
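As a sketch of how (22) could be posed to an off-the-shelf LP solver (here SciPy's linprog; the function name and the variable packing are our own illustration, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import linprog

def one_norm_svm(A, y, lam=1.0):
    """Solve the linear program (22): minimize lam*sum(z) + sum(w)
    subject to D(Aw - e*w0) + z >= e, with w, z >= 0 and w0 free.
    A is m x n (one row per paired training example), y in {+1, -1}."""
    m, n = A.shape
    D = np.diag(y.astype(float))
    # Decision vector x = [w (n entries), w0 (1 entry), z (m entries)].
    c = np.concatenate([np.ones(n), [0.0], lam * np.ones(m)])
    # D(Aw - e*w0) + z >= e  <=>  -D A w + y w0 - z <= -e  (since D e = y)
    A_ub = np.hstack([-D @ A, y.reshape(-1, 1), -np.eye(m)])
    b_ub = -np.ones(m)
    bounds = [(0, None)] * n + [(None, None)] + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    w = res.x[:n]   # sparse non-negative kernel weights w_{t,sigma}
    w0 = res.x[n]
    return w, w0
```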

APPENDIX C
PROOF OF CONCENTRATION BOUND

It is natural to ask whether our proposed margin function $y y' k(\mathbf{x}, \mathbf{x}')$ concentrates around an empirical estimate computed from paired sample points belonging to an independent and identically distributed (i.i.d.) set of patterns and their responses $S = \{(\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_l, y_l)\}$. To derive a concentration bound, note that

$$y y' k(\mathbf{x}, \mathbf{x}') = y y'\, \phi(\mathbf{x})^T \phi(\mathbf{x}') = \rho(\mathbf{z})^T \rho(\mathbf{z}'), \qquad (23)$$

where the kernel function k can be expressed as an inner product of some feature map $\phi$, we denote by $\mathbf{z} = (\mathbf{x}, y)$ a pattern $\mathbf{x}$ together with its corresponding response y, and further

$$\rho(\mathbf{z}) = y \phi(\mathbf{x}) \quad \text{and} \quad \rho(\mathbf{z}') = y' \phi(\mathbf{x}'). \qquad (24)$$

With this setup, the expectation of our proposed margin function is as follows:

$$E[y y' k(\mathbf{x}, \mathbf{x}')] = E[\rho(\mathbf{z})^T \rho(\mathbf{z}')] = E[\rho(\mathbf{z})]^T E[\rho(\mathbf{z}')] = E[\rho(\mathbf{z})]^T E[\rho(\mathbf{z})] = \|E[\rho(\mathbf{z})]\|^2. \qquad (25)$$

The second equality holds because $\mathbf{z}$ and $\mathbf{z}'$ are independent; the third holds because $\mathbf{z}$ and $\mathbf{z}'$ are identically distributed. The empirical expectation estimated from the set $S = \{\mathbf{z}_1, \cdots, \mathbf{z}_l\}$ with paired sample points $(\mathbf{z}_i, \mathbf{z}_j)$ is as follows:

$$\hat{E}[y y' k(\mathbf{x}, \mathbf{x}')] = \frac{2}{l(l-1)} \sum_{i>j} y_i y_j k(\mathbf{x}_i, \mathbf{x}_j) = \frac{2}{l(l-1)} \sum_{i>j} \rho(\mathbf{z}_i)^T \rho(\mathbf{z}_j). \qquad (26)$$

Let us now define a function g(S) on the i.i.d. sample S as follows:

$$g(S) = \frac{2}{l(l-1)} \sum_{i>j} \rho(\mathbf{z}_i)^T \rho(\mathbf{z}_j) - \|E[\rho(\mathbf{z})]\|^2. \qquad (27)$$

Our aim now is to find a bound on |g(S)|. To do so, consider another i.i.d. set of sample points $\tilde{S} = \{\mathbf{z}_1, \mathbf{z}_2, \cdots, \mathbf{z}_{i-1}, \mathbf{z}'_i, \mathbf{z}_{i+1}, \cdots, \mathbf{z}_l\}$, where only the ith sample point differs between the sets S and $\tilde{S}$. We bound the following:

$$g(S) - g(\tilde{S}) = \frac{2}{l(l-1)} \sum_{k \ne i} (\rho(\mathbf{z}_i) - \rho(\mathbf{z}'_i))^T \rho(\mathbf{z}_k) \le \frac{4R}{l}, \qquad (28)$$

assuming the kernel function has a bound $|k(\mathbf{x}, \mathbf{x}')| \le R$. Because S is an i.i.d. sample and the function g(S) has a bounded variation for each sample point $\mathbf{z}_i$, we can apply McDiarmid's inequality [19] and obtain:

$$P\{g(S) - E[g(S)] \ge \epsilon\} \le \exp\!\left(-\frac{l\epsilon^2}{8R^2}\right), \qquad (29)$$

where $\epsilon$ is any given positive number. Further, we note that

$$E[g(S)] = E\!\left[\frac{2}{l(l-1)} \sum_{i>j} \rho(\mathbf{z}_i)^T \rho(\mathbf{z}_j)\right] - \|E[\rho(\mathbf{z})]\|^2 = \frac{2}{l(l-1)} \sum_{i>j} E[\rho(\mathbf{z}_i)]^T E[\rho(\mathbf{z}_j)] - \|E[\rho(\mathbf{z})]\|^2 = 0. \qquad (30)$$

The significance of E[g(S)] being zero is that our empirical estimate using pairwise sample points is unbiased. Thus, we obtain:

$$P\{g(S) \ge \epsilon\} \le \exp\!\left(-\frac{l\epsilon^2}{8R^2}\right). \qquad (31)$$

Reversing the sign of g(S) and following practically the same derivation as above, we also obtain:

$$P\{-g(S) \ge \epsilon\} \le \exp\!\left(-\frac{l\epsilon^2}{8R^2}\right). \qquad (32)$$

Adding these two inequalities, we obtain:

$$P\{|g(S)| \ge \epsilon\} \le 2\exp\!\left(-\frac{l\epsilon^2}{8R^2}\right). \qquad (33)$$

Setting the right hand side to $\delta$ and solving for $\epsilon = \sqrt{\frac{8R^2}{l} \ln(\frac{2}{\delta})}$, we can assert that with probability $1-\delta$ the following bound holds:

$$|g(S)| \le \sqrt{\frac{8R^2}{l} \ln\!\left(\frac{2}{\delta}\right)}, \qquad (34)$$

from which we write (8).

REFERENCES

[1] D. G. Lowe, "Object recognition from local scale-invariant features," in Proceedings of the Seventh IEEE International Conference on Computer Vision (ICCV), vol. 2, 1999, pp. 1150-1157.
[2] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, pp. 91-110, 2004.
[3] A. Witkin, "Scale-space filtering: A new approach to multi-scale description," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 9, 1984, pp. 150-153.
[4] H. Zhang, "Image processing for the oil sands mining industry," IEEE Signal Processing Magazine, vol. 25, no. 6, pp. 198-200, 2008.
[5] T. Lindeberg, "Scale-space theory: A basic tool for analysing structures at different scales," Journal of Applied Statistics, vol. 21, no. 2, pp. 224-270, 1994.
[6] G. Lanckriet, N. Cristianini, P. Bartlett, and L. E. Ghaoui, "Learning the kernel matrix with semi-definite programming," Journal of Machine Learning Research, pp. 27-72, 2004.
[7] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, "Multiple kernel learning, conic duality, and the SMO algorithm," in International Conference on Machine Learning (ICML), 2004, p. 6.
[8] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, "Large scale multiple kernel learning," Journal of Machine Learning Research, vol. 7, pp. 1531-1565, 2006.
[9] M. Varma and B. R. Babu, "More generality in efficient multiple kernel learning," in Proceedings of the International Conference on Machine Learning, June 2009, pp. 1065-1072.
[10] B. Saha, N. Ray, and H. Zhang, "Computing oil sand particle size distribution by snake-PCA algorithm," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 977-980.
[11] I. Levner and H. Zhang, "Classification-driven watershed segmentation," IEEE Transactions on Image Processing, vol. 16, no. 5, pp. 1437-1445, 2007.
[12] D. P. Mukherjee, Y. Potapovich, I. Levner, and H. Zhang, "Ore image segmentation by learning image and shape features," Pattern Recognition Letters, vol. 30, no. 6, pp. 615-622, 2009.
[13] J. Shi, H. Zhang, and N. Ray, "Solidity based local threshold for oil sand image segmentation," in IEEE International Conference on Image Processing (ICIP), 2009, pp. 2385-2388.
[14] N. Ray, B. Saha, and S. Acton, "Oil sand image segmentation using the inclusion filter," in IEEE International Conference on Image Processing (ICIP), 2008, pp. 2188-2191.
[15] Z. Wang and H. Zhang, "Object detection with multiple motion models," in Asian Conference on Computer Vision (ACCV), 2010, pp. 183-192.
[16] S. Nilufar, N. Ray, and H. Zhang, "Optimum kernel function design from scale space features for object detection," in IEEE International Conference on Image Processing (ICIP), 2009, pp. 861-864.
[17] O. L. Mangasarian, "Exact 1-norm support vector machines via unconstrained convex differentiable minimization," Journal of Machine Learning Research, pp. 1517-1530, 2006.
[18] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," in Neural Information Processing Systems (NIPS). MIT Press, 2003.
[19] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, June 2004.
[20] N. Usunier, M.-R. Amini, and P. Gallinari, "Generalization error bounds for classifiers trained with interdependent data," in Advances in Neural Information Processing Systems (NIPS). MIT Press, 2006, pp. 1369-1376.
[21] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola, "On kernel-target alignment," in Advances in Neural Information Processing Systems 14. MIT Press, 2001, pp. 367-373.
[22] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886-893.
[23] B. Fulkerson, A. Vedaldi, and S. Soatto, "Localizing objects with smart dictionaries," in Proceedings of the European Conference on Computer Vision (ECCV), 2008, pp. 179-192.
[24] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in International Conference on Computer Vision (ICCV), September 2009, pp. 606-613.
[25] K. Veropoulos, C. Campbell, and N. Cristianini, "Controlling the sensitivity of support vector machines," in International Joint Conference on Artificial Intelligence, 1999, pp. 55-60.

Sharmin Nilufar received her BSc in Computer Science and Technology from Rajshahi University, Bangladesh, in 1996, her MSc in Computer Science from the University of Northern British Columbia, Canada, in 2005, and her PhD in Computing Science from the University of Alberta in 2011. She is currently working as a postdoctoral research fellow at the Ottawa Hospital Research Institute. Her research interests include image processing, pattern recognition, and advanced machine learning techniques and their application to biomedical and oil sand images. She has published more than 15 research papers in international journals, conferences, and workshops. She is a recipient of several postgraduate research awards, including an NSERC IPS (2004-2005), an NSERC PGS (2006-2009), an Alberta Ingenuity Scholarship (2009-2011), and most recently a MITACS Elevate strategic fund for her postdoctoral research on microscopy image analysis.

Nilanjan Ray received his bachelor's degree in mechanical engineering from Jadavpur University, Calcutta, India, in 1995, his master's degree in computer science from the Indian Statistical Institute, Calcutta, in 1997, and his Ph.D. in electrical engineering from the University of Virginia, Charlottesville, in May 2003. After two years of postdoctoral research and a year of industrial work experience, he joined the Department of Computing Science, University of Alberta, in July 2006 as an assistant professor. He is a recipient of the CIMPA-UNESCO fellowship for image processing in 1999, a graduate student fellowship at the Indian Statistical Institute from 1995 to 1997, and the best student paper award from the IBM Picture Processing Society presented at the IEEE International Conference on Image Processing, Rochester, NY, 2002. Nilanjan's research area is image and video analysis: segmentation, object detection, image classification, and object tracking.


Hong Zhang received his BSc degree from Northeastern University, Boston, USA, in 1982 and his PhD degree from Purdue University, West Lafayette, IN, USA, in 1986, both in Electrical and Computer Engineering. He is currently a Professor in the Department of Computing Science, University of Alberta, and the Director of the Centre for Intelligent Mining Systems. He is the holder of a senior NSERC Industrial Research Chair supported by Syncrude Research Canada. His current research interests include robotics, computer vision, and image processing.
