This paper has been accepted by PR.
Semi-Supervised Manifold-Embedded Hashing with Joint Feature Representation and Classifier Learning

Tiecheng Song^a,*, Jianfei Cai^b, Tianqi Zhang^a, Chenqiang Gao^a, Fanman Meng^c, Qingbo Wu^c

^a Chongqing Key Laboratory of Signal and Information Processing (CQKLS&IP), Chongqing University of Posts and Telecommunications (CQUPT), Chongqing 400065, China
^b School of Computer Engineering, Nanyang Technological University, 639798, Singapore
^c School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

Abstract

Recently, learning-based hashing methods, which are designed to preserve semantic information, have shown promising results for approximate nearest neighbor (ANN) search problems. However, most of these methods require a large number of labeled data, which are difficult to access in many real applications. With very limited labeled data available, in this paper we propose a semi-supervised hashing method by integrating manifold embedding, feature representation and classifier learning into a joint framework. Specifically, a semi-supervised manifold embedding is explored to simultaneously optimize feature representation and classifier learning, so as to make the learned binary codes optimal for classification. A two-stage hashing strategy is proposed to effectively address the corresponding optimization problem. At the first stage, an iterative algorithm is designed to obtain a relaxed solution. At the second stage, the hashing function is refined by introducing an orthogonal transformation to reduce the quantization error. Extensive experiments on three benchmark databases demonstrate the effectiveness of the proposed method in comparison with several state-of-the-art hashing methods.

Keywords: Hashing, manifold embedding, locality sensitive hashing (LSH), nearest neighbor search, image retrieval

* Corresponding author. Tel.: +86-18202397227. Email address: [email protected] (Tiecheng Song)

Preprint submitted to Pattern Recognition, December 25, 2016

1. Introduction

Constructing effective feature representations of data is an important task in many computer vision and pattern recognition applications. During the past decades, a variety of methods have been proposed for feature representation, which can be broadly divided into two categories, i.e., local and global ones. For local feature representations, real-valued descriptors such as SIFT [1], GLOH [2], DAISY [3], MROGH/MRRID [4], LPDF [5] and LSD [6], and binary descriptors such as BRIEF [7], FREAK [8], LDB [9], USB [10], CT/HOT-BFR [11], etc., were developed based on interest points and have been widely used in key-point detection, image matching, motion tracking and object recognition. For global feature representations, GIST [12], LBP [13][14][15], HOG [16][17], the Bag-of-Words model [18][19], the Texton model [20][21], the Fisher Vector [22][23][24] and WLD [25][26] were proposed based on holistic images or regions, and have been widely used in scene parsing, image retrieval, texture classification, pedestrian detection and face recognition. Recently, with the rapidly-growing image data on the Internet and the emerging applications on mobile terminal devices, fast similarity search is of particular interest in the fields of information retrieval, data mining and computer vision. In this respect, similarity-preserving hashing, which maps high-dimensional data points (e.g., holistic image features) to a low-dimensional Hamming space (compact binary features), has received considerable attention due to its computational and storage efficiency [27][28][29][30].

Great research efforts have been devoted to approximate nearest neighbor (ANN) search via hashing methods. These methods can be classified into two categories: data-independent methods and data-dependent methods. One of the best-known data-independent hashing methods is Locality Sensitive Hashing (LSH) [31], which requires no training data and employs random projections to generate binary codes. LSH provides a high collision probability for similar sample points in the Hamming space. Later, it was extended to Kernelized LSH (KLSH) [32] and Shift Invariant Kernel based Hashing (SIKH) [33] using the kernel similarity. These data-independent methods do not consider the intrinsic data structure and usually require long codes and multiple hash tables to achieve satisfactory hashing performance. As a result, they incur high storage costs and tend to be less efficient at query time in real applications. To overcome these drawbacks, many data-dependent hashing methods were developed to learn more compact binary codes from the training data. For example, Spectral Hashing (SH) [34] casts hashing learning as a graph partitioning problem which is solved using a spectral relaxation. SH was recently extended to Multidimensional Spectral Hashing (MDSH) [35], Spectral Embedded Hashing (SEH) [36] and SH with a semantically consistent graph [37]. Principal Component Analysis based Hashing (PCAH) [38][39] explores PCA to learn hash functions that preserve the data variance. By first performing PCA on the input data, Iterative Quantization (ITQ) [38] learns an orthogonal rotation matrix to minimize the quantization error during data binarization. Spherical Hashing (SpH) [40] employs hyper-sphere based hashing with a tailored distance function to compute binary codes. Anchor Graph Hashing (AGH) [41] utilizes an anchor graph to capture the underlying manifold structure of the data for hashing code generation. Shared Structure Hashing (SSH) [42] formulates each hashing projection as a combination of two parts, contributed from the entire feature space and a shared subspace, respectively. Density Sensitive Hashing (DSH) [43] and Robust Hashing with Local Models (RHLM) [44] exploit the density of the data and local structural information, respectively, to construct hash functions. Circulant Binary Embedding (CBE) [45] generates binary codes by projecting the data with a circulant matrix. More recently, Special Structure-based Hashing (SSBH) [46] and Sparse Embedding and Least Variance Encoding (SELVE) [47] take advantage of sparse coding for hashing

and yield state-of-the-art results on several databases for image retrieval.

While the above methods are promising for preserving the neighbor similarity under certain distance metrics, they cannot well guarantee the semantic similarity (i.e., the semantic gap). Therefore, many recent works focus on supervised methods that improve the hashing performance by utilizing available class label, ranking or tag information. Some of the representative methods use deep networks [48][49] to learn compact binary codes in a supervised manner [50][51]. For instance, a deep network with Restricted Boltzmann Machines (RBMs) was used in [27] for hashing such that semantically consistent points have close binary codes. Convolutional neural networks were adopted in [52] to learn deep face features and improve the predictability of binary codes for face indexing. Based on similar/dissimilar pairs of points, Binary Reconstructive Embedding (BRE) [53] provides a supervised way to minimize the reconstruction error between the input distances and the Hamming distances. In [54][39], semi-supervised hashing methods were developed to learn hash functions by minimizing the empirical error over labeled data and maximizing the information from each bit over all data. In [55], Kernel-based Supervised Hashing (KSH) was proposed by incorporating the supervised information of pair-wise similarity constraints into binary code learning. In [38], Iterative Quantization (ITQ) was combined with Canonical Correlation Analysis (CCA) to improve the hashing performance for image retrieval. In [56], Graph Cut Coding (GCC) formulates supervised binary coding as a discrete optimization problem which is solved using the graph cuts algorithm. By treating hashing learning as a multi-class classification problem, Order Preserving Hashing (OPH) [57] learns the hash function by maximizing the alignment of category orders, and Supervised Discrete Hashing (SDH) [58] learns the optimal binary codes by developing a discrete cyclic coordinate descent algorithm. Recently, the ranking information in the form of triplet losses [59][60][61][62] and local neighborhood topology [63] was utilized to learn

the desirable hash functions. In addition, the semantic tag information was incorporated into the binary code learning for fast image tagging [64].

Supervised hashing methods often require a large number of labeled data to learn good hash codes, and collecting such labeled data is time-consuming and labor-intensive. In contrast, it is much easier to obtain a large number of unlabeled data in many practical applications. To leverage a large number of unlabeled data and very limited labeled data, in this paper we propose a Semi-Supervised Manifold-Embedded Hashing (SMH) method. Specifically, we formulate SMH in the recent manifold embedding framework [65][66] and jointly optimize feature representation and classifier learning. The proposed hashing method consists of two stages. At the first stage, we relax the optimization problem and provide an iterative algorithm to obtain the optimal solution. At the second stage, we refine the hashing function by introducing an orthogonal transformation to minimize the quantization error. Extensive experiments on three benchmark databases demonstrate the effectiveness of the proposed hashing method. The main contributions of this paper are summarized as follows.

1. We propose a semi-supervised hashing method by integrating manifold embedding, feature representation and classifier learning into a joint framework. In particular, the l2,1-norm [67] is adopted in our formulation to obtain a robust loss function. By utilizing both labeled and unlabeled data, we simultaneously optimize feature representation and classifier learning to make the learned binary codes optimal for classification.
2. We design an iterative algorithm to solve the corresponding optimization problem and obtain a relaxed solution. Furthermore, we prove the convergence of the iterative algorithm.
3. After the relaxation, we further refine the hashing function by minimizing the quantization error through an orthogonal transformation.
4. Our method boosts the retrieval results on three different types of benchmark databases in comparison with state-of-the-art hashing methods.

The rest of this paper is organized as follows. The proposed hashing method, i.e., Semi-Supervised Manifold-Embedded Hashing (SMH), is presented in Section 2. The experimental results on three benchmark databases are reported in Section 3, and the conclusions are drawn in Section 4.

2. The Proposed Method

In this section, we elaborate the proposed Semi-Supervised Manifold-Embedded Hashing (SMH) method. Specifically, we first present the problem formulation, which is followed by an iterative algorithm to solve the relaxed optimization problem. After the relaxation, we introduce a refinement step to improve the hashing performance. Finally, we discuss the implementation issue and analyze the computational complexity of SMH.

2.1. Problem Formulation

To begin with, let us define a set of training data as X = [x_1, x_2, ..., x_m, x_{m+1}, ..., x_n]^T \in R^{n \times d}, where x_i|_{i=1}^{m} and x_i|_{i=m+1}^{n} are labeled and unlabeled data, respectively, d is the feature dimension of the data points, and n is the total number of training data. We also define a label matrix Y = [y_1, ..., y_n]^T \in \{0,1\}^{n \times c}, where y_i|_{i=1}^{n} \in \{0,1\}^c is the label vector of x_i and c is the total number of data classes. Let Y_{ij} be the j-th element of y_i: Y_{ij} = 1 if x_i belongs to the j-th class and Y_{ij} = 0 otherwise. Note that if the label information of x_i is not available, we set Y_{ij} = 0 (\forall j = 1, ..., c). Supposing the data X have been zero-centered, the goal of SMH is to learn a hash function f to map the input data X to a binary code matrix B = f(X) = \mathrm{sgn}(XQ) \in \{-1,+1\}^{n \times r}, where \mathrm{sgn}(\cdot) is the sign function, Q \in R^{d \times r} is a projection matrix, and r is the code length^1. To achieve this goal, our preliminary objective function is formulated as a multi-class classification problem, i.e.,

\min_{W,Q:\,Q^T Q = I} \; \|Y - \mathrm{sgn}(XQ)W\|_F^2 + \beta \|W\|_F^2 \qquad (1)

where W \in R^{r \times c} is the weight matrix, \|\cdot\|_F is the Frobenius norm of a matrix, and I is an identity matrix. It is worth pointing out that our formulation is different from that used in the traditional multi-class classification problem because we have introduced a projection matrix Q in (1). This projection matrix Q not only serves to reduce the dimension [68] but also to learn the optimal feature representation for classification [69]. In this way, the constructed binary codes are expected to be both compact and discriminative. However, the above objective function has the following two issues. First, as mentioned before, there are often limited labeled data and a large number of unlabeled data available in real applications. When the labeled data are inadequate, hashing methods based on (1) are prone to over-fitting. Second, the least square loss used for classifier learning in (1) is sensitive to outliers [67][70][44].

To handle the first issue, we formulate our modified objective function in a recent manifold embedding framework [65][66]. Specifically, we construct a weighted graph G over the input data and then introduce a label prediction matrix F = [F_1, ..., F_n]^T \in R^{n \times c} to satisfy the label fitness and manifold smoothness. That is, F should be consistent with both the ground-truth labels of the labeled data and the whole graph G over all data for label propagation [65][66]. This is achieved by optimizing the following

^1 In our formulation, we have used the values of \{-1, +1\}, which can be readily converted into the corresponding hash bits \{0, 1\}.

cost function in a semi-supervised manner:

\min_{F} \; \sum_{p=1}^{c} \left[ \frac{1}{2} \sum_{i,j=1}^{n} (F_{ip} - F_{jp})^2 S_{ij} + \sum_{i=1}^{n} U_{ii} (F_{ip} - Y_{ip})^2 \right] \qquad (2)

where F_{ip} is the p-th element of F_i that is used to predict the label of x_i (i = 1, ..., n). U \in R^{n \times n} is a diagonal matrix whose diagonal element U_{ii} = \infty if x_i is labeled^2 and U_{ii} = 0 otherwise. S \in R^{n \times n} is the affinity matrix of the weighted graph G, and its element S_{ij} reflects the visual similarity of two data points x_i and x_j:

S_{ij} = \begin{cases} \exp(-\|x_i - x_j\|^2 / \sigma^2), & \text{if } x_i \text{ and } x_j \text{ are } k\text{-NN} \\ 0, & \text{otherwise} \end{cases} \qquad (3)

where \sigma is the width parameter and the Euclidean distance is typically used to compute the distance between two data points in k-NN (k-Nearest Neighbors). As shown in [65][66], the cost function in (2) is equivalent to

\min_{F} \; \mathrm{Tr}(F^T L F) + \mathrm{Tr}\left[ (F - Y)^T U (F - Y) \right] \qquad (4)

where \mathrm{Tr}(\cdot) denotes the trace operator, L is a Laplacian matrix computed by L = D - S, and D is a diagonal matrix with diagonal entries D_{ii} = \sum_{j=1}^{n} S_{ij}. By leveraging both labeled and unlabeled data, we now integrate (1) and (4) into a joint framework to simultaneously optimize feature representation and classifier learning:

\min_{F,W,Q:\,Q^T Q = I} \; \mathrm{Tr}(F^T L F) + \mathrm{Tr}\left[ (F - Y)^T U (F - Y) \right] + \alpha \left[ \|F - \mathrm{sgn}(XQ)W\|_F^2 + \beta \|W\|_F^2 \right] \qquad (5)

where \alpha > 0 is a regularization parameter.

^2 In practice, we set U_{ii} to a sufficiently large value for labeled x_i.
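The equivalence of the pairwise form (2) and the trace form (4) is easy to check numerically. Below is a small sketch with synthetic data (all values are illustrative; a modest finite U_ii stands in for the large labeled-data weight):

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 6, 3

# Synthetic symmetric affinity S, prediction matrix F, and label matrix Y.
S = rng.random((n, n)); S = (S + S.T) / 2.0; np.fill_diagonal(S, 0.0)
F = rng.standard_normal((n, c))
Y = np.eye(c)[rng.integers(0, c, n)]
U = np.diag([10.0, 10.0, 0.0, 0.0, 0.0, 0.0])  # nonzero U_ii only for "labeled" points

# Pairwise cost, Eq. (2).
pairwise = sum(
    0.5 * (F[i, p] - F[j, p]) ** 2 * S[i, j]
    for p in range(c) for i in range(n) for j in range(n)
) + sum(U[i, i] * (F[i, p] - Y[i, p]) ** 2 for p in range(c) for i in range(n))

# Trace cost, Eq. (4), with L = D - S and D_ii = sum_j S_ij.
L = np.diag(S.sum(axis=1)) - S
trace = np.trace(F.T @ L @ F) + np.trace((F - Y).T @ U @ (F - Y))

print(np.isclose(pairwise, trace))  # True: the two forms agree
```

The identity holds because S is symmetric, so the pairwise smoothness term collapses to the quadratic form F^T (D - S) F column by column.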

To handle the second issue, we apply the l2,1-norm^3 to our loss function for classifier learning. As indicated in [67][70][44], the l2,1-norm is robust to outliers. Our final objective function is formulated as

\min_{F,W,Q:\,Q^T Q = I} \; \mathrm{Tr}(F^T L F) + \mathrm{Tr}\left[ (F - Y)^T U (F - Y) \right] + \alpha \left[ \|F - \mathrm{sgn}(XQ)W\|_{2,1} + \beta \|W\|_F^2 \right] \qquad (6)

The characteristics of SMH are summarised as follows:

• It employs manifold embedding to propagate the labels from partially labeled data while simultaneously learning the hash function. This semi-supervised strategy is crucial to mitigate the semantic gap, especially when only very limited labeled data are available in real applications.
• It adaptively learns an optimal feature representation from the training data for hashing.
• It adopts the l2,1-norm to obtain a robust model.
• It integrates manifold embedding, feature representation and classifier learning into a joint optimization framework. The learned binary codes are not only compact but also discriminative for classification.

Note that the objective function in (6) is not differentiable due to the presence of \mathrm{sgn}(\cdot). Therefore, we propose a two-stage hashing strategy to address this problem. First, we relax the objective function

^3 According to [67], the l2,1-norm of an arbitrary matrix A \in R^{n \times c} is defined as \|A\|_{2,1} = \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{c} A_{ij}^2}.
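This definition transcribes directly into code as the sum of the Euclidean norms of the rows (`l21_norm` is an illustrative name, not from the authors' code):

```python
import numpy as np

def l21_norm(A: np.ndarray) -> float:
    """l2,1-norm: sum over rows of the l2-norm of each row."""
    return float(np.sqrt((A ** 2).sum(axis=1)).sum())

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
print(l21_norm(A))  # 5 + 0 + 13 = 18.0
```

Unlike the squared Frobenius norm, each row (i.e., each data point's residual) contributes only linearly, which is why an outlier row cannot dominate the loss.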

by using the signed magnitude to obtain the optimal solution. Then, we refine the hash function by minimizing the quantization error. The details of these two stages are presented in Sections 2.2 and 2.3, respectively.

2.2. Relaxation and Optimization

To make the problem in (6) computationally tractable, we relax the objective function by replacing the sign function with its signed magnitude [71]. The relaxed objective function becomes

\min_{F,W,Q:\,Q^T Q = I} \; \mathrm{Tr}(F^T L F) + \mathrm{Tr}\left[ (F - Y)^T U (F - Y) \right] + \alpha \left[ \|F - XQW\|_{2,1} + \beta \|W\|_F^2 \right] \qquad (7)

However, it is still difficult to optimize because i) the l2,1-norm involved in (7) is non-smooth, and ii) the relaxed objective function itself is non-convex with respect to F, Q and W jointly. Following [67][70][72], we solve the relaxed problem in (7) via alternating optimization. By denoting F - XQW = Z = [z_1, ..., z_n]^T, the solutions to (7) can be obtained by optimizing

\min_{F,W,Q:\,Q^T Q = I} \; \mathrm{Tr}(F^T L F) + \mathrm{Tr}\left[ (F - Y)^T U (F - Y) \right] + \alpha \left[ \mathrm{Tr}(Z^T D Z) + \beta \mathrm{Tr}(W^T W) \right] \qquad (8)

where D is a diagonal matrix with diagonal entries D_{ii} = \frac{1}{2\|z_i\|_2}. Here, we have used the property that \|A\|_F^2 = \mathrm{Tr}(A^T A) for any matrix A.

By taking the derivative of (8) with respect to W and setting it to zero, we obtain the optimal W*:

-2 Q^T X^T D F + 2 Q^T X^T D X Q W + 2\beta W = 0 \;\Rightarrow\; W^* = R^{-1} Q^T X^T D F \qquad (9)

where R = Q^T N Q and N = X^T D X + \beta I. Note that D is treated as a constant during the alternating optimization, and it will be updated at the next iteration as described in the following steps. Substituting W* into (8), the objective function becomes

\min_{F,Q:\,Q^T Q = I} \; \mathrm{Tr}(F^T L F) + \mathrm{Tr}\left[ (F - Y)^T U (F - Y) \right] + \alpha \left[ \mathrm{Tr}(F^T D F) - \mathrm{Tr}(F^T D X Q R^{-1} Q^T X^T D F) \right] \qquad (10)

Then, by setting the derivative of (10) with respect to F to zero, we obtain the optimal F*:

L F + U (F - Y) + \alpha (D F - D X Q R^{-1} Q^T X^T D F) = 0 \;\Rightarrow\; F^* = (M - \alpha D X Q R^{-1} Q^T X^T D)^{-1} U Y \qquad (11)

where M = L + U + \alpha D. Substituting F* into (10), the objective function is equivalent to the following:

\max_{Q^T Q = I} \; \mathrm{Tr}\left[ Y^T U (M - \alpha D X Q R^{-1} Q^T X^T D)^{-1} U Y \right] \qquad (12)

According to the Sherman-Morrison-Woodbury formula [73], we have (M - \alpha D X Q R^{-1} Q^T X^T D)^{-1} = M^{-1} + \alpha M^{-1} D X Q (Q^T N Q - \alpha Q^T X^T D M^{-1} D X Q)^{-1} Q^T X^T D M^{-1} (note that M is invertible). Since

Algorithm 1 Algorithm for solving the relaxed objective function in (7)
Input: The training data X \in R^{n \times d}; the label matrix Y \in \{0,1\}^{n \times c}; parameters \alpha and \beta.
Output: The label prediction matrix F \in R^{n \times c}; the weight matrix W \in R^{r \times c}; the projection matrix Q \in R^{d \times r}.
1: Compute the Laplacian matrix L \in R^{n \times n};
2: Compute the diagonal matrix U \in R^{n \times n};
3: Initialize F \in R^{n \times c}, W \in R^{r \times c} and Q \in R^{d \times r} randomly;
4: repeat
5:   Compute Z = [z_1, ..., z_n]^T = F - XQW;
6:   Compute the diagonal matrix D = \mathrm{diag}\left(\frac{1}{2\|z_1\|_2}, ..., \frac{1}{2\|z_n\|_2}\right);
7:   Compute N = X^T D X + \beta I;
8:   Compute M = L + U + \alpha D;
9:   Compute S_1 = N - \alpha X^T D M^{-1} D X;
10:  Compute S_2 = X^T D M^{-1} U Y Y^T U M^{-1} D X;
11:  Update Q via the eigen-decomposition of S_1^{-1} S_2;
12:  Update F according to (11);
13:  Update W according to (9);
14: until convergence
15: Return F, W and Q.

the term M^{-1} is independent of Q, we re-write (12) as

\max_{Q^T Q = I} \; \mathrm{Tr}\left[ Y^T U M^{-1} D X Q J^{-1} Q^T X^T D M^{-1} U Y \right] \qquad (13)

where J = Q^T N Q - \alpha Q^T X^T D M^{-1} D X Q = Q^T (N - \alpha X^T D M^{-1} D X) Q. By using the property that \mathrm{Tr}(AB) = \mathrm{Tr}(BA) for any two matrices A and B, the objective function in (13) becomes

\max_{Q^T Q = I} \; \mathrm{Tr}\left[ (Q^T S_1 Q)^{-1} Q^T S_2 Q \right] \qquad (14)

where S_1 = N - \alpha X^T D M^{-1} D X and S_2 = X^T D M^{-1} U Y Y^T U M^{-1} D X. The optimal Q* in (14) can be obtained via the eigen-decomposition of S_1^{-1} S_2. However, solving Q requires the input of D, which depends on F, Q and W. Inspired by [67][70][72], we propose an iterative algorithm to solve this optimization problem. As described in Algorithm 1, in each iteration we first compute D and then optimize Q, F and W alternately. The iteration procedure is repeated until the algorithm converges. The proof of the convergence is provided in APPENDIX A. Fig. 1 plots the convergence curves of Algorithm 1 on the three benchmark databases (see Section 3 for details). One can see that the objective function value in (7) rapidly converges within a small number of iterations, demonstrating the efficiency of our iterative algorithm.

[Figure: three panels plotting objective function value versus iteration number for CIFAR10, ISOLET and USPS.]
Fig. 1. Convergence curves of Algorithm 1 with 24-bit hash codes on CIFAR10, ISOLET and USPS databases (α=1, β=1).
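One pass of Algorithm 1 can be sketched in NumPy as follows. This is a simplified sketch, not the authors' implementation: the labeled-data weight U_ii is a large finite constant rather than infinity, rows of Y for unlabeled points are assumed zero, dense linear algebra is used throughout, and the Q update takes the leading eigenvectors of S_1^{-1} S_2 followed by a QR re-orthogonalization to enforce Q^T Q = I:

```python
import numpy as np

def smh_relaxed(X, Y, L, labeled_mask, r, alpha=1.0, beta=1.0, n_iter=20, seed=0):
    """Sketch of Algorithm 1: alternately update D, Q, F and W for problem (7)."""
    n, d = X.shape
    c = Y.shape[1]
    rng = np.random.default_rng(seed)
    U = np.diag(np.where(labeled_mask, 1e6, 0.0))     # finite stand-in for U_ii = "infinity"
    F = rng.standard_normal((n, c))
    Q = np.linalg.qr(rng.standard_normal((d, r)))[0]  # random init with orthonormal columns
    W = rng.standard_normal((r, c))
    for _ in range(n_iter):
        Z = F - X @ Q @ W
        D = np.diag(1.0 / (2.0 * np.linalg.norm(Z, axis=1) + 1e-12))
        N = X.T @ D @ X + beta * np.eye(d)
        M = L + U + alpha * D
        Minv = np.linalg.inv(M)
        S1 = N - alpha * X.T @ D @ Minv @ D @ X
        S2 = X.T @ D @ Minv @ U @ Y @ Y.T @ U @ Minv @ D @ X
        # Update Q from the top-r eigenvectors of S1^{-1} S2, cf. (14).
        vals, vecs = np.linalg.eig(np.linalg.solve(S1, S2))
        top = np.argsort(-vals.real)[:r]
        Q = np.linalg.qr(vecs[:, top].real)[0]        # re-orthogonalize
        R = Q.T @ N @ Q
        # Update F via (11), then W via (9).
        F = np.linalg.solve(M - alpha * D @ X @ Q @ np.linalg.solve(R, Q.T @ X.T @ D), U @ Y)
        W = np.linalg.solve(R, Q.T @ X.T @ D @ F)
    return F, W, Q
```

The sketch assumes r ≤ d and that S_1 and M are well-conditioned, which holds for reasonable β and nonempty labeled sets.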

2.3. Hashing Refinement

After obtaining the optimized Q from the relaxed objective function in (7), the binary codes can be directly generated by B = \mathrm{sgn}(XQ), as is done in many traditional methods. However, this leads to a quantization error, measured by \|B - XQ\|_F^2. To reduce this quantization error, we introduce an orthogonal transformation [38] to improve the hashing performance. This is achieved by optimizing

\min_{B,R} \; \|B - XQR\|_F^2 \quad \text{s.t.} \;\; B \in \{-1,+1\}^{n \times r}, \; R^T R = I \qquad (15)

where R \in R^{r \times r} is an orthogonal transformation matrix which is used to rotate the data to align with the hypercube \{-1,+1\}^{n \times r}.

The problem in (15) can be efficiently solved using an alternating optimization algorithm. Specifically, we first initialize R with a random orthogonal matrix and then minimize (15) with respect to B and R alternately. This alternating optimization involves the following two steps in each iteration:

• Fix R and update B. The objective function has a closed-form solution B = \mathrm{sgn}(XQR).
• Fix B and update R. The objective function reduces to the Orthogonal Procrustes problem [38, 74], which can be efficiently solved by singular value decomposition (SVD). Denoting the SVD of B^T XQ as U S V^T, the solution to (15) is obtained by R = V U^T.
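These two steps transcribe almost directly into code, in the spirit of ITQ [38] (`learn_rotation` is an illustrative name, not from the authors' code):

```python
import numpy as np

def learn_rotation(V, n_iter=50, seed=0):
    """Alternating minimization of ||B - V R||_F^2 over binary B and orthogonal R.

    V is the projected data XQ (n x r); returns the binary codes B and rotation R.
    """
    rng = np.random.default_rng(seed)
    r = V.shape[1]
    R = np.linalg.qr(rng.standard_normal((r, r)))[0]  # random orthogonal start
    for _ in range(n_iter):
        B = np.sign(V @ R)                 # fix R, update B (closed form)
        B[B == 0] = 1
        U, _, Vt = np.linalg.svd(B.T @ V)  # fix B: Orthogonal Procrustes via SVD
        R = (U @ Vt).T                     # equals V_svd U^T for B^T (XQ) = U S V_svd^T
    return B, R
```

After refinement, a new query x would be encoded as sgn((Q @ R).T @ x), matching (16) below.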

It was empirically found that this optimization can converge within about 50 iterations [38, 64]. The proposed hashing method, called Semi-Supervised Manifold-Embedded Hashing (SMH), is summarised in Algorithm 2. As can be seen, SMH consists of the following two stages. At the first stage, the optimized Q is obtained from the relaxed problem in (7) by Algorithm 1. At the second stage, the optimized R is obtained by minimizing the quantization error in (15). Once Q and R are learned, the binary code b of a new query x is generated using the projection matrix \widetilde{W} = QR:

b = \mathrm{sgn}(\widetilde{W}^T x) \qquad (16)

where x \in R^d and b \in \{-1,+1\}^r.

Algorithm 2 The proposed SMH algorithm
Input: The training data X \in R^{n \times d}; the label matrix Y \in \{0,1\}^{n \times c}; parameters \alpha and \beta.
Output: The binary code matrix B \in \{-1,+1\}^{n \times r}; the projection matrix \widetilde{W} \in R^{d \times r}.
1: Initialize R \in R^{r \times r} randomly;
2: Obtain Q by running Algorithm 1;
3: repeat
4:   Update B according to B = \mathrm{sgn}(XQR);
5:   Compute U and V via the SVD of B^T XQ;
6:   Update R according to R = V U^T;
7: until convergence
8: Compute \widetilde{W} = QR;
9: Return B and \widetilde{W}.

2.4. Implementation Issue

The computation of the Laplacian matrix L via the affinity matrix S in (3) is not feasible^4 for large-scale problems. For efficiency, in this paper we use the recently proposed anchor graph [75] to approximate the affinity matrix S and then obtain L. Specifically, we randomly sample q data points x_i|_{i=1}^{q} (anchors) from the training set \{x_i\}_{i=1}^{n} and then build an affinity matrix A = Z \Lambda^{-1} Z^T \in R^{n \times n}, where \Lambda = \mathrm{diag}(Z^T 1) \in R^{q \times q} and Z \in R^{n \times q} is the similarity matrix between the n data points and the q anchors. In practice, Z is highly sparse, keeping nonzero Z_{ij} only for the s (s < q) closest anchors to x_i (we empirically set q = 400 and s = 3 in our implementation). The resulting Laplacian matrix is derived by L = I - A = I - Z \Lambda^{-1} Z^T.
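A dense sketch of this anchor-graph approximation follows (names are illustrative; the Gaussian anchor weights and row normalization are one common choice, and a real implementation would store Z as a sparse matrix):

```python
import numpy as np

def anchor_graph_laplacian(X, q=400, s=3, sigma=1.0, seed=0):
    """Approximate L = I - Z Lambda^{-1} Z^T using q randomly sampled anchors."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    q = min(q, n)
    anchors = X[rng.choice(n, size=q, replace=False)]
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # squared distances
    Z = np.zeros((n, q))
    idx = np.argsort(d2, axis=1)[:, :s]            # s closest anchors per point
    for i in range(n):
        w = np.exp(-d2[i, idx[i]] / sigma ** 2)    # Gaussian weights, cf. (3)
        Z[i, idx[i]] = w / w.sum()                 # row-normalize
    lam = Z.sum(axis=0)                            # Lambda = diag(Z^T 1)
    A = Z @ np.diag(1.0 / np.maximum(lam, 1e-12)) @ Z.T
    return np.eye(n) - A
```

Because each row of Z sums to one, the rows of A also sum to one, so the resulting Laplacian annihilates the constant vector, as a graph Laplacian should.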

2.5. Computational Complexity

The computational complexity of SMH for both the training and the test (query) stage is briefly analyzed as follows. In the training stage, the time complexity for computing the Laplacian matrix L is O(nqs). To obtain the relaxed solution to the objective function (7), we need to compute the inverses of a few matrices (e.g., M^{-1} and R^{-1}) and conduct the eigen-decomposition of S_1^{-1} S_2. These operations have a computational complexity of O(n^3 + d^3 + r^3). In addition, we need to compute B and R in (15) using an alternating algorithm, which is O(ndr + nr^2 + r^3) in complexity. Note that we typically have n ≫ q ≫ s, n ≫ d and n ≫ r in real applications. Thus, the total computational complexity for learning Q and R is approximately O(n^3). Once Q and R are learned as a "one-time" cost, the hashing time for a new query is O(dr).

^4 The time cost of constructing a k-NN graph is O(kn^2), which is intractable when n becomes very large.

3. Experiments

In this section, we perform extensive experiments to evaluate the proposed SMH on three benchmark databases. We compare SMH with several state-of-the-art hashing approaches: Locality Sensitive Hashing (LSH) [31], Spectral Hashing (SH) [34], Spherical Hashing (SpH) [40], data-dependent CBE (CBE) [45], PCA with random orthogonal rotation (PCA-RR) [38], Sparse Embedding and Least Variance Encoding (SELVE) [47], Kernel-based Supervised Hashing (KSH) [55], Iterative Quantization with CCA (CCA-ITQ) [38], and Sequential Projection Learning for Hashing (SPLH) [39]. Among them, KSH and CCA-ITQ are supervised hashing approaches, SPLH is semi-supervised, and the others are unsupervised.

3.1. Databases and Evaluation Protocols

We use three benchmark databases, i.e., CIFAR10^5, ISOLET^6, and USPS^7, to evaluate the hashing methods.

^5 http://www.cs.toronto.edu/∼kriz/cifar.html
^6 http://archive.ics.uci.edu/ml/datasets/ISOLET
^7 http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/multiclass.html#usps

3.1.1. CIFAR10

CIFAR10 is a subset of the 80M tiny images [76], which consists of 60,000 32×32 color images from 10 classes (e.g., airplane, automobile, bird, ship, etc.). There are 6,000 images per class, and each image is represented as a 384-dimensional GIST [12] feature vector. We randomly partition the whole database into two parts: a training set of 6,000 samples for learning hash functions and a test set of 54,000 samples for queries. For the supervised and semi-supervised methods, 50 samples randomly chosen from the training set are used as labeled data while the rest are treated as unlabeled ones. For the unsupervised methods, the whole training set is treated as unlabeled data.
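This protocol (a random train/test partition plus a small labeled subset of the training set) can be scripted as, for example (`make_split` is an illustrative helper, not from the authors' code):

```python
import numpy as np

def make_split(n_total, n_train, n_labeled, seed=0):
    """Random train/test partition with a small labeled subset of the training set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    train, test = perm[:n_train], perm[n_train:]
    labeled = rng.choice(train, size=n_labeled, replace=False)
    return train, test, labeled

# CIFAR10 protocol: 6,000 training samples, 54,000 queries, 50 labeled.
train, test, labeled = make_split(60_000, 6_000, 50)
print(len(train), len(test), len(labeled))  # 6000 54000 50
```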

3.1.2. ISOLET

ISOLET is a spoken letter recognition database collected from 150 subjects, each of whom spoke the name of every letter of the alphabet twice. There are therefore 52 training examples from each speaker, resulting in 7,797 samples in total from 26 categories (3 examples are missing). The speakers are grouped into sets of 30 speakers each, referred to as isolet1, isolet2, isolet3, isolet4 and isolet5. Each sample is represented as a 617-dimensional feature vector including spectral coefficients, contour features, sonorant features, etc. Following the popular setup, we choose isolet1+2+3+4 as the training set and isolet5 as queries, which results in a training set of 6,238 samples and a query set of 1,559 samples. Similarly, for supervised and semi-supervised methods, 300 samples randomly chosen from the training set are used as labeled data while the rest are treated as unlabeled; for unsupervised methods, the whole training set is treated as unlabeled data.


3.1.3. USPS

USPS contains 11K handwritten digit images of "0" to "9". We select a popular subset of 9,298 16×16 images for our experiments, where each image is represented as a 256-dimensional feature vector. For this database, 8,298 images are randomly sampled as the training set and the remaining 1K images are used as queries. For supervised and semi-supervised methods, 100 samples randomly chosen from the training set are used as labeled data while the rest are treated as unlabeled; for unsupervised methods, the whole training set is treated as unlabeled data.

To evaluate the hashing performance, the following two criteria [34, 38, 47] are adopted in our experiments:

• Hash Lookup. It builds a lookup table using the hash codes of all data points and returns the points within a small Hamming radius r of the query point. In our experiments, we adopt HAM2 (i.e., r = 2) to compute the hash lookup precision under different numbers of hash bits. Specifically, for each query point, HAM2 is computed as the percentage of true neighbors (i.e., points sharing the same label as the query point) among the returned points within a Hamming radius of 2. The mean of HAM2 over all query points is used to evaluate the hash lookup performance.

• Hamming Ranking. It ranks all data points according to their Hamming distances to the query point and returns the top ones. In our experiments, the mean average precision (MAP) over all query points is used to evaluate the ranking performance under different numbers of hash bits. In addition, we report the precision-recall curves as in [38, 47]. The precision and recall are respectively defined as

    Precision = (# of retrieved relevant pairs) / (# of all retrieved pairs),    (17)

    Recall = (# of retrieved relevant pairs) / (# of all relevant pairs).    (18)
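The two criteria above can be sketched as follows for a single query (an illustrative NumPy sketch, not the authors' MATLAB implementation; the toy 4-bit codes are hypothetical). The reported HAM2 and MAP values are the means of these per-query quantities over all queries:

```python
import numpy as np

def ham2_precision(query_code, query_label, db_codes, db_labels, radius=2):
    """Hash-lookup precision: fraction of true neighbors (same class label)
    among the database points within the given Hamming radius of the query."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)  # Hamming distances
    returned = db_labels[dists <= radius]
    return float(np.mean(returned == query_label)) if returned.size else 0.0

def average_precision(query_label, db_labels, dists):
    """Hamming-ranking AP: rank the database by Hamming distance and average
    the precision at every rank where a true neighbor occurs."""
    relevant = (db_labels[np.argsort(dists, kind="stable")] == query_label)
    if not relevant.any():
        return 0.0
    prec_at_k = np.cumsum(relevant) / np.arange(1, len(relevant) + 1)
    return float(prec_at_k[relevant].mean())

# Toy database of 4-bit codes.
db_codes = np.array([[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1], [1, 1, 1, 1]])
db_labels = np.array([0, 0, 1, 1])
q = np.array([0, 0, 0, 0])
dists = np.count_nonzero(db_codes != q, axis=1)       # [0, 1, 2, 4]
p_ham2 = ham2_precision(q, 0, db_codes, db_labels)    # labels within r=2: [0, 0, 1]
ap = average_precision(0, db_labels, dists)           # hits at ranks 1 and 2
```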

In our experiments, the search results are evaluated by using class labels as the ground truth. For the proposed SMH, we set α = 1 and β = 1 for all databases unless otherwise stated; the influence of these two parameters on the hashing performance is discussed in Section 3.3. For the compared methods, we use the publicly available codes under the best settings suggested in their papers. All methods are implemented in MATLAB and run on a workstation with a 2.1GHz Intel Xeon CPU and 128GB of RAM.

3.2. Results

Fig. 2 plots the HAM2 curves of all methods by varying the code length from 8 to 128 on the CIFAR10, ISOLET and USPS databases. As can be seen from Fig. 2, the HAM2 performance of almost all methods first increases and then drops as the code length increases. This is because the Hamming code space becomes very sparse as the code length grows, so many queries cannot retrieve true neighbors within a Hamming radius of 2. When the code length is longer than 32, the semi-supervised SPLH performs best on CIFAR10, while the proposed SMH consistently performs best on both ISOLET and USPS. In addition, one can observe that SELVE shows promising results on the ISOLET and USPS databases with short binary codes. However, its performance rapidly deteriorates with long codes and falls behind SMH by a large margin. Moreover, the supervised KSH performs fairly well on CIFAR10 but rather poorly on both ISOLET and USPS.

1018 1019

Fig. 3 shows the MAP curves of all methods with the code length ranging from 8 to 128. It can

1020 1021

be seen that SMH achieves better or comparative results than other methods on all databases at almost

1022 1023 1024 1025

all code lengths. When the code length is shorter than 48, SPLH works slightly better than SMH on

[Fig. 2 appears here.]

Fig. 2. HAM2 of all methods by varying the code length from 8 to 128 on the CIFAR10, ISOLET and USPS databases. The legends of (b) and (c) are omitted; they are the same as in (a).

the CIFAR10 database. However, its performance drops quickly as the number of hash bits increases and is surpassed by the other methods. With longer hash codes, PCA-RR generally performs second best on CIFAR10, while CCA-ITQ performs second best on the ISOLET and USPS databases. It is surprising that the supervised KSH performs the worst on both the ISOLET and USPS databases.

1053 1054

Fig. 4 presents the precision-recall curves of all methods with 24-bit hash codes. One can see that the

1055 1056

precision of all methods drops significantly when the recall increases and vice versa—this is consistent

1057 1058

with the definitions in (17) and (18) [38, 47]. From Fig. 4 (a), we observe that all compared methods

1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069

are highly competitive on CIFAR10. From Fig. 4 (b) and Fig. 4 (c), we observe that SMH and CCAITQ perform better than all other methods on ISOLET and USPS databases. Particularly, when the recall is high, SMH significantly outperforms CCA-ITQ in terms of the precision. For the supervised KSH, it again performs the worst on these two databases. By comparing the semi-supervised methods,

1070 1071

the proposed SMH outperforms SPLH which learns to hashing via supervised empirical fitness and

1072 1073

unsupervised information theoretic regularization. These results demonstrate the effectiveness of SMH

1074 1075

by exploring the manifold embedding in a semi-supervised manner to propagate the labels from partially

1076 1077 1078 1079

labeled data while simultaneously learning good feature representations for binary code generation.

[Fig. 3 appears here.]

Fig. 3. MAP of all methods by varying the code length from 8 to 128 on the CIFAR10, ISOLET and USPS databases. The legends of (b) and (c) are omitted; they are the same as in (a).

[Fig. 4 appears here.]

Fig. 4. Precision-recall curves of all methods with 24-bit hash codes on the CIFAR10, ISOLET and USPS databases. The legends of (b) and (c) are omitted; they are the same as in (a).

3.3. Influences of α and β

1115 1116 1117

The proposed SMH involves two parameters α and β. We now evaluate their impacts on HAM2 and

1118 1119

MAP performance by varying them in the candidate range of [10−3 , 10−2 , 10−1 , 1, 101 , 102 , 103 ]. Fig. 5

1120 1121 1122 1123

illustrates the performance variation of SMH with respect to α and β using 24-bit hash codes on three

1124 1125

databases. We can see that the best performance can be obtained by choosing different combinations of

1126 1127

α and β for different databases and there is a wide range to choose these best combinations. Without

1128 1129

specifically tuning these parameters, we choose them from some intermediate values in the candidate

1130 1131

range and set α = 1 and β = 1 for all databases by default.
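The parameter sweep behind Fig. 5 amounts to evaluating every (α, β) pair from the candidate range; a sketch is given below, where `evaluate` is a hypothetical stand-in for training SMH with the given parameters and measuring its MAP:

```python
import itertools

candidates = [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3]

def evaluate(alpha, beta):
    # Placeholder for: train SMH with (alpha, beta), return MAP on the queries.
    return -abs(alpha - 1) - abs(beta - 1)  # dummy score peaking at (1, 1)

# 7 x 7 = 49 (alpha, beta) combinations are evaluated per database.
best = max(itertools.product(candidates, candidates),
           key=lambda ab: evaluate(*ab))
```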

[Fig. 5 appears here.]

Fig. 5. Performance variation of the proposed SMH with respect to α and β using 24-bit hash codes. (a)(d) CIFAR10. (b)(e) ISOLET. (c)(f) USPS.

[Fig. 6 appears here.]

Fig. 6. Performance variation of the proposed SMH with respect to the number of labeled data using 24-bit hash codes. Note that the labeled data are randomly sampled from the corresponding training set on each database.

3.4. Effect of the Size of Labeled Data

In the above experiments, we used a fixed number of samples from each database as labeled data. We now vary the number of labeled data and examine how the performance changes on each database.

[Fig. 7 appears here.]

Fig. 7. Performance comparison of different components with different lengths of hash codes on the CIFAR10, ISOLET and USPS databases.


Fig. 6 shows the performance variation of SMH in terms of HAM2 and MAP using 24-bit hash codes. Overall, one can observe improved performance on all three databases as the number of labeled data increases. Interestingly, the performance varies only slightly once the number of labeled data becomes large enough. This is advantageous for our semi-supervised hashing method, because it means that we can leverage a limited amount of labeled data to obtain most of the performance gain. In fact, we have used only a small portion of the training data as labeled data (50 out of 6,000 for CIFAR10, 300 out of 6,238 for ISOLET, and 100 out of 8,298 for USPS) for label propagation and have achieved the promising results demonstrated in Figs. 2-4. However, how to determine the optimal number of labeled data for different databases remains an open problem.


1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256

3.5. Component Analysis In the propose SMH, we have applied the l2,1 -norm to the objective function in (6) and introduced a refinement step in Section 2.3 to improve the hashing performance. In the following, we will evaluate these two components. For experimental comparison, we replace the l2,1 -norm with the F -norm based represatation (i.e., the objective function in (5)) and remove the hashing refinement step. The resulting algorithms are called SMH F and SMH NR, respectively. Fig. 7 shows the comparison results in terms

1257 1258

of HAM2 and MAP with different code lengths. We can see that SMH and SMH F yield competitive

1259 1260

results on ISOLET and USPS databases while SMH leads to better performance than SMH F on CI-

1261 1262

FAR10. The results on the three databases indicate the robustness of the l2,1 -norm to the loss function

1263 1264 1265 1266

in SMH, which are basically consistent with the conclusions in [67][70][44]. From Fig. 7, we see that

1267 1268

the refinement step can boost the hashing performance by a large margin. This demonstrates the neces-

1269 1270

sity and effectiveness of hashing refinement to minimize the quantization error after solving the relaxed

1271 1272

optimization problem.
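For reference, the two norms being compared can be computed as follows (a NumPy sketch; `E` is a hypothetical residual matrix). Because the l2,1-norm sums the unsquared l2-norms of the rows, large per-sample residuals are penalized less severely than under the squared F-norm, which is the usual argument for its robustness:

```python
import numpy as np

def l21_norm(E):
    """l2,1-norm: sum of the l2-norms of the rows of E."""
    return float(np.sqrt((E ** 2).sum(axis=1)).sum())

def fro_norm_sq(E):
    """Squared Frobenius norm: sum of all squared entries of E."""
    return float((E ** 2).sum())

E = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
# Row l2-norms are 5, 0 and 13, so the l2,1-norm is 18;
# the squared F-norm is 9 + 16 + 25 + 144 = 194.
```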

1273 1274 1275 1276 1277 1278 1279

3.6. Running Time In this part, we empirically compare the running time, including training time and test time, for differ-

1280 1281

ent methods. The training time measures the cost of learning the hash function from the training set and

1282 1283

the test time measures the cost of transforming the raw test data to compact binary codes. Table 1 lists the

1284 1285

training and test time for different methods using 24-bit hash codes on three benchmark databases. We

1286 1287 1288 1289

observe that LSH takes negligible training time because the projections are randomly generated without

1290 1291

the learning process. The learning based CBE is relatively slower due to the computation of Fast Fourier

1292 1293

Transformation (FFT) in both the training and test stages. For the proposed SMH, the training time is

1294 1295

much longer than other methods. This is mainly because SMH involves the computation of the inverse 24

1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330

Table 1. Training and test time (in seconds) for different methods using 24-bit hash codes on three benchmark databases. For the proposed SMH, we run Algorithm 1 until it converges and run the alternating optimization algorithm in the hashing refinement step 50 iterations.

CIFAR10 ISOLET USPS training test training test training test LSH[31] 0.0133 0.0272 0.0071 0.0016 0.0063 0.0011 SH[34] 0.2155 0.2009 0.2382 0.0081 0.1954 0.0071 PCA-RR[38] 0.0654 0.0266 0.1081 0.0013 0.0577 0.0008 CCA-ITQ[38] 0.3033 0.0550 0.2878 0.0014 0.3576 0.0015 CBE[45] 13.2650 30.6177 20.9785 0.1305 9.1522 0.0326 SpH[40] 0.4468 0.1620 0.7492 0.0067 0.7403 0.0028 KSH[55] 4.6390 0.8795 5.9804 0.0403 4.7848 0.0210 SELVE[47] 1.3500 0.8755 2.0539 0.0313 1.4045 0.0209 SPLH[39] 1.4330 0.0202 3.3826 0.0012 0.9319 0.0008 SMH 169.7365 0.0217 103.4123 0.0014 85.8736 0.0007 Methods

of some big matrices which is very expensive, as discussed in Section 2.5. In comparison with offline training time, the test time is more crucial in real-time applications. As listed in Table 1, the test time for SMH is very fast since it only requires the linear projection and binarization to obtain the binary codes.
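The low test cost of SMH reflects the fact that encoding a query is just a linear projection followed by binarization; a minimal sketch is given below (the projection matrix `W` stands in for the one learned at training time, and the 0/1 thresholding convention here is illustrative):

```python
import numpy as np

def encode(X, W):
    """Map raw features to binary codes: a linear projection of each
    feature vector followed by sign thresholding at zero."""
    return (X @ W > 0).astype(np.uint8)

rng = np.random.default_rng(0)
W = rng.standard_normal((384, 24))   # e.g., 384-d GIST features to 24 bits
X = rng.standard_normal((5, 384))    # five hypothetical query vectors
codes = encode(X, W)                 # shape (5, 24), entries in {0, 1}
```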

3.7. Discussion

Recently, some supervised methods have used deep networks to learn compact hash codes and have achieved state-of-the-art results. For instance, Erin Liong et al. [51] leveraged a deep neural network to seek multiple hierarchical non-linear transformations for hashing, and Lai et al. [61] developed a deep architecture to learn image representations and hash functions in a one-stage manner. As reported on CIFAR10 using 32-bit hash codes, [51] achieves 0.31 and 0.21 in terms of HAM2 and MAP, respectively, while [61] achieves 0.60 and 0.56. These results demonstrate that learning deep features (instead of using hand-crafted features) coupled with hash coding is a promising way to improve the hashing performance.

1346 1347

The major training time of our method is consumed by computing the inverse of some big matrices

1348 1349

which can take the time complexity O(n3 ). Fig. 8 shows the training time of SMH with respect to the 25

1350 1351

4

2.5

Training time (seconds)

1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366

x 10

2

1.5

1

0.5

0

6000

12000 18000 24000 30000 36000 42000 48000

Numbers of training data

Fig. 8. Training time of SMH with respect to the number of training data on CIFAR-10 using 24-bit hash codes.

1367 1368

number of training data on CIFAR-10 using 24-bit hash codes. For larger-scale problems, we need to find ways to reduce the time complexity of SMH. Firstly, we may seek fast algorithms [77] to approximate or avoid the inversion of a matrix. For instance, an iterative algorithm, as used in [78], may be adopted to solve for F in (10), thus avoiding computing the inverse of big matrices. Secondly, we may condense the training data or reduce the number of variables to be optimized in the matrix inversion. For instance, we may remove U in objective function (6) by introducing the constraint Fi = Yi (1
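As an illustration of the first idea above (not the authors' implementation), a symmetric positive-definite linear system such as the one behind F in (10) can be solved with conjugate gradients, which only needs matrix-vector products and never forms an explicit inverse:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A without inverting A."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Hypothetical well-conditioned SPD system as a stand-in for (10).
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = conjugate_gradient(A, b)
```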
