This paper has been accepted by Pattern Recognition (PR).
Semi-Supervised Manifold-Embedded Hashing with Joint Feature Representation and Classifier Learning
Tiecheng Song a,∗, Jianfei Cai b, Tianqi Zhang a, Chenqiang Gao a, Fanman Meng c, Qingbo Wu c

a Chongqing Key Laboratory of Signal and Information Processing (CQKLS&IP), Chongqing University of Posts and Telecommunications (CQUPT), Chongqing 400065, China
b School of Computer Engineering, Nanyang Technological University, 639798, Singapore
c School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
Abstract

Recently, learning-based hashing methods, which are designed to preserve semantic information, have shown promising results for approximate nearest neighbor (ANN) search problems. However, most of these methods require a large number of labeled data, which are difficult to obtain in many real applications. With very limited labeled data available, in this paper we propose a semi-supervised hashing method that integrates manifold embedding, feature representation and classifier learning into a joint framework. Specifically, a semi-supervised manifold embedding is explored to simultaneously optimize feature representation and classifier learning so that the learned binary codes are optimal for classification. A two-stage hashing strategy is proposed to effectively address the corresponding optimization problem. At the first stage, an iterative algorithm is designed to obtain a relaxed solution. At the second stage, the hashing function is refined by introducing an orthogonal transformation to reduce the quantization error. Extensive experiments on three benchmark databases demonstrate the effectiveness of the proposed method in comparison with several state-of-the-art hashing methods.
Keywords: Hashing, manifold embedding, locality sensitive hashing (LSH), nearest neighbor search, image retrieval

∗ Corresponding author. Tel.: +86-18202397227. Email address: [email protected] (Tiecheng Song)
Preprint submitted to Pattern Recognition
December 25, 2016
1. Introduction
Constructing effective feature representations of data is an important task in many computer vision and pattern recognition applications. During the past decades, a variety of methods have been proposed for feature representation, which can be broadly divided into two categories, i.e., local and global ones. For local feature representations, real-valued descriptors such as SIFT [1], GLOH [2], DAISY [3], MROGH/MRRID [4], LPDF [5], LSD [6] and binary descriptors such as BRIEF [7], FREAK [8], LDB [9], USB [10], CT/HOT-BFR [11], etc. were developed based on interest points; they have been widely used in key-point detection, image matching, motion tracking and object recognition. For global feature representations, GIST [12], LBP [13][14][15], HOG [16][17], the Bag-of-Words model [18][19], the Texton model [20][21], the Fisher Vector [22][23][24], and WLD [25][26] were proposed based on holistic images or regions; they have been widely used in scene parsing, image retrieval, texture classification, pedestrian detection, and face recognition.

Recently, with the rapidly-growing image data on the Internet and the emerging applications on mobile terminal devices, fast similarity search is of particular interest in the fields of information retrieval, data mining and computer vision. In this respect, similarity-preserving hashing, which maps high-dimensional data points (e.g., holistic image features) to a low-dimensional Hamming space (compact binary features), has received considerable attention due to its computational and storage efficiency [27][28][29][30].

Great research efforts have been devoted to approximate nearest neighbor (ANN) search via hashing methods. These methods can be classified into two categories: data-independent methods and data-dependent methods. One of the well-known data-independent hashing methods is Locality Sensitive Hashing (LSH) [31], which requires no training data and employs random projections to generate binary codes.
LSH provides a high collision probability for similar sample points in the Hamming space. Later, it was extended to Kernelized LSH (KLSH) [32] and Shift Invariant Kernel based Hashing (SIKH) [33] using the kernel similarity. The above data-independent methods do not consider the intrinsic data structure and usually require long codes and multiple hash tables to achieve satisfactory hashing performance. As a result, these methods lead to high storage cost and tend to be less efficient in querying for real applications. To overcome the aforementioned drawbacks, many data-dependent hashing methods were developed to learn more compact binary codes from the training data. For example, Spectral Hashing (SH) [34] casts hash learning as a graph partitioning problem which is solved using a spectral relaxation. SH was recently extended to Multidimensional Spectral Hashing (MDSH) [35], Spectral Embedded Hashing (SEH) [36] and SH with a semantically consistent graph [37]. Principal Component Analysis based Hashing (PCAH) [38][39] explores PCA to learn hash functions that preserve the data variance. By first performing PCA on the input data, Iterative Quantization (ITQ) [38] learns an orthogonal rotation matrix to minimize the quantization error during data binarization. Spherical Hashing (SpH) [40] employs a hyper-sphere based hashing with a tailored distance function to compute binary codes. Anchor Graph Hashing (AGH) [41] utilizes an anchor graph to capture the underlying manifold structure of the data for hash code generation. Shared Structure Hashing (SSH) [42] formulates each hashing projection as a combination of two parts which are contributed from the entire feature space and a shared subspace, respectively. Density Sensitive Hashing (DSH) [43] and Robust Hashing with Local Models (RHLM) [44] exploit the density of the data and local structural information, respectively, to construct hash functions. Circulant Binary Embedding (CBE) [45] generates binary codes by projecting the data with a circulant matrix. More recently, Special Structure-based Hashing (SSBH) [46] and Sparse Embedding and Least Variance Encoding (SELVE) [47] take advantage of sparse coding for hashing and yield state-of-the-art results on several databases for image retrieval.

While the above methods are promising to preserve the neighbor similarity under certain distance metrics, they cannot well guarantee the semantic similarity (i.e., there is a semantic gap).
Therefore, many recent works focus on supervised methods to improve the hashing performance by utilizing available class label, ranking or tag information. Some of the representative methods use deep networks [48][49] to learn compact binary codes in a supervised manner [50][51]. For instance, a deep network with Restricted Boltzmann Machines (RBMs) was used in [27] for hashing such that semantically consistent points have close binary codes. Convolutional neural networks were adopted in [52] to learn deep face features to improve the predictability of binary codes for face indexing. Based on similar/dissimilar pairs of points, Binary Reconstructive Embedding (BRE) [53] provides a supervised manner to minimize the reconstruction error between the input distances and the Hamming distances. In [54][39], semi-supervised hashing methods were developed to learn hash functions by minimizing the empirical error over labeled data and maximizing the information from each bit over all data. In [55], Kernel-based Supervised Hashing (KSH) was proposed by incorporating the supervised information of pair-wise similarity constraints into binary code learning. In [38], Iterative Quantization (ITQ) was combined with Canonical Correlation Analysis (CCA) to improve the hashing performance for image retrieval. In [56], Graph Cut Coding (GCC) formulates supervised binary coding as a discrete optimization problem which is solved using the graph cuts algorithm. By treating hash learning as a multi-class classification problem, Order Preserving Hashing (OPH) [57] learns the hash function by maximizing the alignment of category orders, and Supervised Discrete Hashing (SDH) [58] learns the optimal binary codes by developing a discrete cyclic coordinate descent algorithm. Recently, ranking information in the form of triplet losses [59][60][61][62] and local neighborhood topology [63] was utilized to learn the desirable hash functions. In addition, semantic tag information was incorporated into binary code learning for fast image tagging [64].

Supervised hashing methods often require a large number of labeled data to learn good hash codes,
and collecting these labeled data is time-consuming and labor-intensive. In contrast, it is much easier to obtain a large number of unlabeled data in many practical applications. To leverage a large number of unlabeled data and very limited labeled data, in this paper we propose a Semi-Supervised Manifold-Embedded Hashing (SMH) method. Specifically, we formulate SMH in the recent manifold embedding framework [65][66] and jointly optimize feature representation and classifier learning. The proposed hashing method consists of two stages. At the first stage, we relax the optimization problem and provide an iterative algorithm to obtain the optimal solution. At the second stage, we refine the hashing function by introducing an orthogonal transformation to minimize the quantization error. Extensive experiments on three benchmark databases demonstrate the effectiveness of the proposed hashing method. The main contributions of this paper are summarized as follows.

1. We propose a semi-supervised hashing method by integrating manifold embedding, feature representation and classifier learning into a joint framework. In particular, the l2,1-norm [67] is adopted in our formulation to obtain a robust loss function. By utilizing both labeled and unlabeled data, we simultaneously optimize feature representation and classifier learning to make the learned binary codes optimal for classification.

2. We design an iterative algorithm to solve the corresponding optimization problem and obtain a relaxed solution. Furthermore, we prove the convergence of this iterative algorithm.

3. After the relaxation, we further refine the hashing function by minimizing the quantization error through an orthogonal transformation.
4. Our method boosts the retrieval results on three different types of benchmark databases in comparison with state-of-the-art hashing methods.

The rest of this paper is organized as follows. The proposed hashing method, i.e., Semi-Supervised Manifold-Embedded Hashing (SMH), is presented in Section 2. The experimental results on three benchmark databases are reported in Section 3, and the conclusions are drawn in Section 4.
2. The Proposed Method
In this section, we elaborate on the proposed Semi-Supervised Manifold-Embedded Hashing (SMH) method. Specifically, we first present the problem formulation, which is followed by an iterative algorithm to solve the relaxed optimization problem. After the relaxation, we introduce a refinement step to improve the hashing performance. Finally, we discuss the implementation issue and analyze the computational complexity of SMH.
2.1. Problem Formulation

To begin with, let us define a set of training data as X = [x_1, x_2, ..., x_m, x_{m+1}, ..., x_n]^T ∈ R^{n×d}, where x_i|_{i=1}^{m} and x_i|_{i=m+1}^{n} are labeled and unlabeled data, respectively, d is the feature dimension of the data points, and n is the total number of training data. We also define a label matrix Y = [y_1, ..., y_n]^T ∈ {0, 1}^{n×c}, where y_i|_{i=1}^{n} ∈ {0, 1}^c is the label vector of x_i and c is the total number of data classes. Let Y_{ij} be the j-th element of y_i: Y_{ij} = 1 if x_i belongs to the j-th class and Y_{ij} = 0 otherwise. Note that if the label information of x_i is not available, we set Y_{ij} = 0 (∀j = 1, ..., c). Supposing the data X have been zero-centered, the goal of SMH is to learn a hash function f to map the input data X to a binary code matrix B = f(X) = sgn(XQ) ∈ {−1, +1}^{n×r}, where sgn(·) is the sign function, Q ∈ R^{d×r} is a projection matrix, and r is the code length¹. To achieve this goal, our preliminary objective function is formulated as a multi-class classification problem, i.e.,

$$\min_{W,Q:\,Q^TQ=I} \|Y - \mathrm{sgn}(XQ)W\|_F^2 + \beta\|W\|_F^2 \qquad (1)$$

where W ∈ R^{r×c} is the weight matrix, ||·||_F is the Frobenius norm of a matrix, and I is an identity matrix.

¹ In our formulation, we use the values {−1, +1}, which can be readily converted into the corresponding hash bits {0, 1}.
It is worth pointing out that our formulation is different from that used in the traditional multi-class classification problem because we have introduced a projection matrix Q in (1). This projection matrix Q serves not only to reduce the dimension [68] but also to learn the optimal feature representation for classification [69]. In this way, the constructed binary codes are expected to be both compact and discriminative. However, the above objective function has the following two issues. First, as mentioned before, there are often limited labeled data and a large number of unlabeled data available in real applications. When the labeled data are inadequate, the hashing methods based on (1) are prone to over-fitting. Second, the least square loss used for classifier learning in (1) is sensitive to outliers [67][70][44].

To handle the first issue, we formulate our modified objective function in a recent manifold embedding framework [65][66]. Specifically, we construct a weighted graph G over the input data and then introduce a label prediction matrix F = [F_1, ..., F_n]^T ∈ R^{n×c} to satisfy the label fitness and manifold smoothness. That is, F should be consistent with both the ground-truth labels of the labeled data and the whole graph G over all data for label propagation [65][66]. This is achieved by optimizing the following cost function in a semi-supervised manner:
$$\min_{F} \sum_{p=1}^{c}\left[\frac{1}{2}\sum_{i,j=1}^{n}(F_{ip}-F_{jp})^2 S_{ij} + \sum_{i=1}^{n}U_{ii}(F_{ip}-Y_{ip})^2\right] \qquad (2)$$

where F_{ip} is the p-th element of F_i that is used to predict the label of x_i (i = 1, ..., n). U ∈ R^{n×n} is a diagonal matrix whose diagonal element U_{ii} = ∞ if x_i is labeled² and U_{ii} = 0 otherwise. S ∈ R^{n×n} is the affinity matrix of the weighted graph G and its element S_{ij} reflects the visual similarity of two data points x_i and x_j:

$$S_{ij} = \begin{cases} \exp\left(-\frac{\|x_i-x_j\|^2}{\sigma^2}\right), & \text{if } x_i \text{ and } x_j \text{ are } k\text{-NN} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

where σ is the width parameter and the Euclidean distance is typically used to compute the distance between two data points in k-NN (Nearest Neighbors). As shown in [65][66], the cost function in (2) is equivalent to

$$\min_{F}\; Tr(F^TLF) + Tr\left[(F-Y)^TU(F-Y)\right] \qquad (4)$$

where Tr(·) denotes the trace operator, L is a Laplacian matrix computed by L = D − S, and D is a diagonal matrix with diagonal entries D_{ii} = \sum_{j=1}^{n} S_{ij}.
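For concreteness, the following NumPy sketch builds the k-NN affinity matrix S of Eq. (3), the Laplacian L = D − S of Eq. (4), and the diagonal matrix U (with a large constant standing in for the ∞ entries on labeled points). It is a small dense illustration rather than the implementation used in this paper; the function name build_graph_laplacian and the default parameter values are our own, and Section 2.4 replaces this construction with an anchor graph for large n.

```python
import numpy as np

def build_graph_laplacian(X, labeled_mask, k=5, sigma=1.0, big=1e8):
    """Dense illustration of Eqs. (2)-(4): k-NN affinity S, Laplacian L = D - S, and U."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between all training points.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(sq[i])[1:k + 1]              # k nearest neighbors, skipping the point itself
        S[i, nn] = np.exp(-sq[i, nn] / sigma ** 2)   # Gaussian affinity of Eq. (3)
    S = np.maximum(S, S.T)                           # symmetrize the k-NN graph
    D = np.diag(S.sum(axis=1))                       # degree matrix, D_ii = sum_j S_ij
    L = D - S                                        # graph Laplacian used in Eq. (4)
    U = np.diag(np.where(labeled_mask, big, 0.0))    # large constant stands in for "infinity"
    return S, L, U
```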
By leveraging both labeled and unlabeled data, we now integrate (1) and (4) into a joint framework to simultaneously optimize feature representation and classifier learning:

$$\min_{F,W,Q:\,Q^TQ=I} Tr(F^TLF) + Tr\left[(F-Y)^TU(F-Y)\right] + \alpha\left[\|F-\mathrm{sgn}(XQ)W\|_F^2 + \beta\|W\|_F^2\right] \qquad (5)$$

where α > 0 is a regularization parameter.

² In practice, we set U_{ii} to a sufficiently large constant (rather than ∞) if x_i is labeled.
To handle the second issue, we apply the l2,1-norm³ to our loss function for classifier learning. As indicated in [67][70][44], the l2,1-norm is robust to outliers. Our final objective function is formulated as

$$\min_{F,W,Q:\,Q^TQ=I} Tr(F^TLF) + Tr\left[(F-Y)^TU(F-Y)\right] + \alpha\left[\|F-\mathrm{sgn}(XQ)W\|_{2,1} + \beta\|W\|_F^2\right] \qquad (6)$$

The characteristics of SMH are summarized as follows:

• It employs the manifold embedding to propagate the labels from partially labeled data while simultaneously learning the hash function. This semi-supervised hashing strategy is crucial to mitigate the semantic gap, especially when there are only very limited labeled data available in real applications.

• It adaptively learns an optimal feature representation from the training data for hashing.

• It adopts the l2,1-norm to obtain a robust model.

• It integrates manifold embedding, feature representation and classifier learning into a joint optimization framework. The learned binary codes are not only compact but also discriminative for classification.

Note that the objective function in (6) is not differentiable due to the presence of sgn(·). Therefore, we propose a two-stage hashing strategy to address this problem. First, we relax the objective function by using the signed magnitude to obtain the optimal solution. Then, we refine the hash function by minimizing the quantization error. The details of these two stages will be presented in Sections 2.2 and 2.3, respectively.

³ According to [67], the l2,1-norm of an arbitrary matrix A ∈ R^{n×c} is defined as $\|A\|_{2,1} = \sum_{i=1}^{n}\sqrt{\sum_{j=1}^{c}A_{ij}^2}$.
2.2. Relaxation and Optimization

To make the problem in (6) computationally tractable, we relax the objective function by replacing the sign function with its signed magnitude [71]. The relaxed objective function becomes

$$\min_{F,W,Q:\,Q^TQ=I} Tr(F^TLF) + Tr\left[(F-Y)^TU(F-Y)\right] + \alpha\left[\|F-XQW\|_{2,1} + \beta\|W\|_F^2\right] \qquad (7)$$

However, it is still difficult to optimize because i) the l2,1-norm involved in (7) is non-smooth, and ii) the relaxed objective function itself is non-convex with respect to F, Q and W jointly. Following [67][70][72], we solve the relaxed problem in (7) via alternating optimization as follows. Denoting F − XQW = Z = [z_1, ..., z_n]^T, the solution to (7) can be obtained by optimizing

$$\min_{F,W,Q:\,Q^TQ=I} Tr(F^TLF) + Tr\left[(F-Y)^TU(F-Y)\right] + \alpha\left[Tr(Z^TDZ) + \beta Tr(W^TW)\right] \qquad (8)$$

where D is a diagonal matrix with its diagonal entries being D_{ii} = 1/(2||z_i||_2). Here, we have used the property that ||A||_F^2 = Tr(A^TA) for any matrix A.
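As a small numerical check of footnote 3 and of the reweighting used in Eq. (8), the sketch below computes ||Z||_{2,1} and the diagonal matrix D with D_ii = 1/(2||z_i||_2); for the Z that defines D, Tr(Z^T D Z) equals (1/2)||Z||_{2,1}, which is the surrogate exploited in the iterative algorithm. The helper names and the small eps safeguard are our own.

```python
import numpy as np

def l21_norm(Z):
    # ||Z||_{2,1} = sum_i sqrt(sum_j Z_ij^2): sum of the l2 norms of the rows (footnote 3).
    return np.linalg.norm(Z, axis=1).sum()

def reweight_D(Z, eps=1e-12):
    # D_ii = 1 / (2 ||z_i||_2), as used in Eq. (8); eps guards against zero rows.
    row_norms = np.linalg.norm(Z, axis=1)
    return np.diag(1.0 / (2.0 * (row_norms + eps)))

Z = np.random.randn(6, 3)
D = reweight_D(Z)
# Tr(Z^T D Z) equals 0.5 * ||Z||_{2,1} for the Z that defined D.
assert np.isclose(np.trace(Z.T @ D @ Z), 0.5 * l21_norm(Z))
```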
By taking the derivative of (8) with respect to W and setting it to zero, we obtain the optimal W*:

$$-2Q^TX^TDF + 2Q^TX^TDXQW + 2\beta W = 0 \;\Rightarrow\; W^* = R^{-1}Q^TX^TDF \qquad (9)$$

where R = Q^TNQ and N = X^TDX + βI. Note that D is treated as a constant during the alternating optimization, and it will be updated at the next iteration as described in the following steps. Substituting W* into (8), the objective function becomes

$$\min_{F,Q:\,Q^TQ=I} Tr(F^TLF) + Tr\left[(F-Y)^TU(F-Y)\right] + \alpha\left[Tr(F^TDF) - Tr(F^TDXQR^{-1}Q^TX^TDF)\right] \qquad (10)$$

Then, by setting the derivative of (10) with respect to F to zero, we obtain the optimal F*:

$$LF + U(F-Y) + \alpha(DF - DXQR^{-1}Q^TX^TDF) = 0 \;\Rightarrow\; F^* = (M - \alpha DXQR^{-1}Q^TX^TD)^{-1}UY \qquad (11)$$

where M = L + U + αD. Substituting F* into (10), the objective function is equivalent to the following:

$$\max_{Q^TQ=I} Tr\left[Y^TU(M - \alpha DXQR^{-1}Q^TX^TD)^{-1}UY\right] \qquad (12)$$

According to the Sherman-Woodbury-Morrison formula [73], we have (M − αDXQR^{-1}Q^TX^TD)^{-1} = M^{-1} + αM^{-1}DXQ(Q^TNQ − αQ^TX^TDM^{-1}DXQ)^{-1}Q^TX^TDM^{-1} (note that M is invertible).
Since the term M^{-1} is independent of Q, we re-write (12) as

$$\max_{Q^TQ=I} Tr\left[Y^TUM^{-1}DXQJ^{-1}Q^TX^TDM^{-1}UY\right] \qquad (13)$$

where J = Q^TNQ − αQ^TX^TDM^{-1}DXQ = Q^T(N − αX^TDM^{-1}DX)Q. By using the property that Tr(AB) = Tr(BA) for any two matrices A and B, the objective function in (13) becomes

$$\max_{Q^TQ=I} Tr\left[(Q^TS_1Q)^{-1}Q^TS_2Q\right] \qquad (14)$$

where S_1 = N − αX^TDM^{-1}DX and S_2 = X^TDM^{-1}UYY^TUM^{-1}DX. The optimal Q* in (14) can be obtained via the eigen-decomposition of S_1^{-1}S_2.

Algorithm 1 Algorithm for solving the relaxed objective function in (7)
Input: The training data X ∈ R^{n×d}; the label matrix Y ∈ {0,1}^{n×c}; parameters α and β.
Output: The label prediction matrix F ∈ R^{n×c}; the weight matrix W ∈ R^{r×c}; the projection matrix Q ∈ R^{d×r}.
1: Compute the Laplacian matrix L ∈ R^{n×n};
2: Compute the diagonal matrix U ∈ R^{n×n};
3: Initialize F ∈ R^{n×c}, W ∈ R^{r×c} and Q ∈ R^{d×r} randomly;
4: repeat
5:    Compute Z = [z_1, ..., z_n]^T = F − XQW;
6:    Compute the diagonal matrix D = diag(1/(2||z_1||_2), ..., 1/(2||z_n||_2));
7:    Compute N = X^TDX + βI;
8:    Compute M = L + U + αD;
9:    Compute S_1 = N − αX^TDM^{-1}DX;
10:   Compute S_2 = X^TDM^{-1}UYY^TUM^{-1}DX;
11:   Update Q via the eigen-decomposition of S_1^{-1}S_2;
12:   Update F according to (11);
13:   Update W according to (9);
14: until convergence
15: Return F, W and Q.
Fig. 1. Convergence curves of Algorithm 1 with 24-bit hash codes on CIFAR10, ISOLET and USPS databases (α=1, β=1).

However, solving Q requires the input of D, which is dependent on F, Q and W. Inspired by [67][70][72], we propose an iterative algorithm to solve this optimization problem. As described in Algorithm 1, in each iteration we first compute D and then optimize Q, F and W alternately. The iteration procedure is repeated until the algorithm converges. The proof of the convergence is provided in APPENDIX A. Fig. 1 plots the convergence curves of Algorithm 1 on three benchmark databases (see Section 3 for details). One can see that the objective function value in (7) rapidly converges within a small number of iterations, demonstrating the efficiency of our iterative algorithm.
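Below is a minimal NumPy sketch of the alternating updates in Algorithm 1, assuming L and U are precomputed (e.g., as in the earlier sketch), d ≥ r, and all matrices are small enough to be handled densely. It is a simplified illustration rather than the MATLAB code used in the experiments: a fixed iteration budget replaces the convergence test, and the function name, random initialization and seed are our own choices.

```python
import numpy as np

def solve_relaxed(X, Y, L, U, r, alpha=1.0, beta=1.0, n_iter=20, eps=1e-12):
    """Alternating updates of Algorithm 1 for the relaxed objective in Eq. (7)."""
    n, d = X.shape
    c = Y.shape[1]
    rng = np.random.default_rng(0)
    F = rng.standard_normal((n, c))
    W = rng.standard_normal((r, c))
    Q = np.linalg.qr(rng.standard_normal((d, r)))[0]    # random initialization (assumes d >= r)
    I_d = np.eye(d)
    for _ in range(n_iter):
        Z = F - X @ Q @ W                               # residual used in Eq. (8)
        Dg = np.diag(1.0 / (2.0 * (np.linalg.norm(Z, axis=1) + eps)))
        N = X.T @ Dg @ X + beta * I_d
        M = L + U + alpha * Dg
        Minv = np.linalg.inv(M)
        S1 = N - alpha * X.T @ Dg @ Minv @ Dg @ X
        S2 = X.T @ Dg @ Minv @ U @ Y @ Y.T @ U @ Minv @ Dg @ X
        # Update Q: top-r eigenvectors of S1^{-1} S2, cf. Eq. (14).
        vals, vecs = np.linalg.eig(np.linalg.solve(S1, S2))
        Q = np.real(vecs[:, np.argsort(-np.real(vals))[:r]])
        # Update F by Eq. (11) and W by Eq. (9).
        R = Q.T @ N @ Q
        A = M - alpha * Dg @ X @ Q @ np.linalg.solve(R, Q.T @ X.T @ Dg)
        F = np.linalg.solve(A, U @ Y)
        W = np.linalg.solve(R, Q.T @ X.T @ Dg @ F)
    return F, W, Q
```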
2.3. Hashing Refinement

After obtaining the optimized Q from the relaxed objective function in (7), the binary codes can be directly generated by B = sgn(XQ), as is done in many traditional methods. However, this leads to a quantization error, which is measured by ||B − XQ||_F^2. To reduce this quantization error, we introduce an orthogonal transformation [38] to improve the hashing performance. This is achieved by optimizing

$$\min_{B,R} \|B - XQR\|_F^2 \quad \text{s.t.}\;\; B \in \{-1,+1\}^{n\times r},\; R^TR = I \qquad (15)$$

where R ∈ R^{r×r} is an orthogonal transformation matrix which is used to rotate the data to align with the hypercube {−1, +1}^{n×r}.

The problem in (15) can be efficiently solved using an alternating optimization algorithm. Specifically, we first initialize R with a random orthogonal matrix and then minimize (15) with respect to B and R alternately. This alternating optimization involves the following two steps in each iteration (a code sketch of this procedure is given after Algorithm 2):

• Fix R and update B. The objective function has a closed-form solution B = sgn(XQR).

• Fix B and update R. The objective function reduces to the Orthogonal Procrustes problem [38, 74], which can be efficiently solved by the singular value decomposition (SVD). Denoting the SVD of B^TXQ as USV^T, the solution to (15) is obtained by R = VU^T.

It was empirically found that the optimization algorithm can converge within about 50 iterations [38, 64]. The proposed hashing method, called Semi-Supervised Manifold-Embedded Hashing (SMH), is summarized in Algorithm 2. As can be seen, SMH consists of the following two stages: at the first stage, the optimized Q is obtained from the relaxed problem in (7) by Algorithm 1; at the second stage, the optimized R is obtained by minimizing the quantization error in (15). Once Q and R are learned, the binary code b of a new query x is generated using the projection matrix W̃ = QR:

$$b = \mathrm{sgn}(\widetilde{W}^T x) \qquad (16)$$

where x ∈ R^d and b ∈ {−1, +1}^r.
Algorithm 2 The proposed SMH algorithm
Input: The training data X ∈ R^{n×d}; the label matrix Y ∈ {0,1}^{n×c}; parameters α and β.
Output: The binary code matrix B ∈ {−1,+1}^{n×r}; the projection matrix W̃ ∈ R^{d×r}.
1: Initialize R ∈ R^{r×r} randomly;
2: Obtain Q by running Algorithm 1;
3: repeat
4:    Update B according to B = sgn(XQR);
5:    Compute U and V via the SVD of B^TXQ;
6:    Update R according to R = VU^T;
7: until convergence
8: Compute W̃ = QR;
9: Return B and W̃.
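The sketch below mirrors the second stage of Algorithm 2, i.e., the alternating minimization of Eq. (15), followed by query encoding with W̃ = QR as in Eq. (16). It assumes Q comes from Algorithm 1 and that queries are centered with the training mean (the paper assumes zero-centered data); the 50-iteration budget follows the text, while the function names are our own.

```python
import numpy as np

def refine_rotation(X, Q, n_iter=50, seed=0):
    """Minimize ||B - XQR||_F^2 over B in {-1,+1}^{n x r} and orthogonal R (Eq. 15)."""
    rng = np.random.default_rng(seed)
    r = Q.shape[1]
    R = np.linalg.qr(rng.standard_normal((r, r)))[0]    # random orthogonal initialization
    V = X @ Q
    for _ in range(n_iter):
        B = np.sign(V @ R)                              # fix R, update B
        B[B == 0] = 1
        U_, _, Vt = np.linalg.svd(B.T @ V)              # SVD of B^T X Q = U S V^T
        R = Vt.T @ U_.T                                 # Orthogonal Procrustes solution R = V U^T
    return B, R

def encode(x, Q, R, mean):
    """Binary code of a query via Eq. (16): b = sgn(W_tilde^T x), with W_tilde = Q R."""
    W_tilde = Q @ R
    b = np.sign(W_tilde.T @ (x - mean))                 # centering the query is our assumption
    b[b == 0] = 1
    return b.astype(np.int8)
```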
2.4. Implementation Issue

The computation of the Laplacian matrix L via the affinity matrix S in (3) is not feasible⁴ for large-scale problems. For efficiency, in this paper we use the recently proposed anchor graph [75] to approximate the affinity matrix S and then obtain L. Specifically, we randomly sample q data points x_i|_{i=1}^{q} (anchors) from the training set {x_i}_{i=1}^{n} and then build an affinity matrix A = ZΛ^{-1}Z^T ∈ R^{n×n}, where Λ = diag(Z^T 1) ∈ R^{q×q} and Z ∈ R^{n×q} is the similarity matrix between the n data points and the q anchors. In practice, Z is highly sparse, keeping nonzero Z_{ij} only for the s (s < q) closest anchors to x_i (we empirically set q = 400 and s = 3 in our implementation). The resulting Laplacian matrix is derived by L = I − A = I − ZΛ^{-1}Z^T.

⁴ The time cost of constructing a k-NN graph is O(kn²), which is intractable when n becomes very large.
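A sketch of the anchor-graph approximation described above is given below. The paper does not spell out how Z is computed from the s closest anchors, so the Gaussian weighting and row normalization here follow the standard anchor-graph construction of [75] and should be read as an assumption; q = 400 and s = 3 match the values used in the implementation, and n is assumed to be larger than q.

```python
import numpy as np

def anchor_graph_laplacian(X, q=400, s=3, sigma=1.0, seed=0):
    """Approximate L = I - Z Lambda^{-1} Z^T with q random anchors and s nearest anchors per point."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    anchors = X[rng.choice(n, size=q, replace=False)]    # assumes n > q
    # Squared distances between the n points and the q anchors.
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    Z = np.zeros((n, q))
    for i in range(n):
        nn = np.argsort(d2[i])[:s]                       # s closest anchors to x_i
        w = np.exp(-d2[i, nn] / sigma ** 2)              # Gaussian weights (our assumption)
        Z[i, nn] = w / w.sum()                           # row-normalized, so Z is highly sparse
    Lam_inv = np.diag(1.0 / np.maximum(Z.sum(axis=0), 1e-12))   # Lambda = diag(Z^T 1)
    A = Z @ Lam_inv @ Z.T                                # low-rank affinity A = Z Lambda^{-1} Z^T
    L = np.eye(n) - A                                    # L = I - A
    return L, Z

# For truly large n, Z is kept sparse and the dense n x n matrix A is never formed explicitly.
```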
2.5. Computational Complexity

The computational complexity of SMH for both the training and the test (query) stages is briefly analyzed as follows. In the training stage, the time complexity of computing the Laplacian matrix L is O(nqs). To obtain the relaxed solution for the objective function (7), we need to compute the inverses of a few matrices (e.g., M^{-1} and R^{-1}) and conduct the eigen-decomposition of S_1^{-1}S_2. These operations have a computational complexity of O(n³ + d³ + r³). In addition, we need to compute B and R in (15) using an alternating algorithm, which is O(ndr + nr² + r³) in complexity. Note that we typically have n ≫ q ≫ s, n ≫ d and n ≫ r in real applications. Thus the total computational complexity for learning Q and R is approximately O(n³). Once Q and R are learned as a "one-time" cost, the hashing time for a new query is O(dr).
3. Experiments

In this section, we perform extensive experiments to evaluate the proposed SMH on three benchmark databases. We compare SMH with several state-of-the-art hashing approaches: Locality Sensitive Hashing (LSH) [31], Spectral Hashing (SH) [34], Spherical Hashing (SpH) [40], data-dependent CBE (CBE) [45], PCA with random orthogonal rotation (PCA-RR) [38], Sparse Embedding and Least Variance Encoding (SELVE) [47], Kernel-based Supervised Hashing (KSH) [55], Iterative Quantization with CCA (CCA-ITQ) [38], and Sequential Projection Learning for Hashing (SPLH) [39]. Among them, KSH and CCA-ITQ are supervised hashing approaches, SPLH is semi-supervised, and the others are unsupervised.
3.1. Databases and Evaluation Protocols

We use three benchmark databases, i.e., CIFAR10⁵, ISOLET⁶, and USPS⁷, to evaluate the hashing methods.

⁵ http://www.cs.toronto.edu/∼kriz/cifar.html
⁶ http://archive.ics.uci.edu/ml/datasets/ISOLET
⁷ http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/multiclass.html#usps
3.1.1. CIFAR10

CIFAR10 is a subset of the 80M Tiny Images [76], which consists of 60,000 32×32 color images from 10 classes (e.g., airplane, automobile, bird, ship, etc.). There are 6,000 images for each class and each image is represented as a 384-dimensional GIST [12] feature vector. We randomly partition the whole database into two parts: a training set of 6,000 samples for learning hash functions and a test set of 54,000 samples for queries. For supervised and semi-supervised methods, 50 samples randomly chosen from the training set are used as labeled data while the rest are treated as unlabeled ones. For unsupervised methods, the whole training set is treated as unlabeled data.
3.1.2. ISOLET

ISOLET is a spoken letter recognition database, which contains 150 subjects who spoke the name of each letter of the alphabet twice. Therefore, there are 52 training examples from each speaker, resulting in 7,797 samples in total from 26 categories (3 examples are missing). All the speakers are grouped into sets of 30 speakers each, referred to as isolet1, isolet2, isolet3, isolet4 and isolet5. The features of each sample are represented as a 617-dimensional vector including spectral coefficients, contour features, sonorant features, etc. Following the popular setup, we choose isolet1+2+3+4 as the training set and isolet5 as queries. This results in a training set of 6,238 samples and a query set of 1,559 samples. Similarly, for supervised and semi-supervised methods, 300 samples randomly chosen from the training set are used as labeled data while the rest are treated as unlabeled ones. For unsupervised methods, the whole training set is treated as unlabeled data.
3.1.3. USPS

USPS contains 11K handwritten digit images from "0" to "9". We select a popular subset which contains 9,298 16×16 images for our experiments. Each image is represented as a 256-dimensional feature vector. For this database, 8,298 images are randomly sampled as the training set and the remaining 1K images are used as queries. For supervised and semi-supervised methods, 100 samples randomly chosen from the training set are used as labeled data while the rest are treated as unlabeled ones. For unsupervised methods, the whole training set is treated as unlabeled data.

To evaluate the hashing performance, the following two criteria [34, 38, 47] are adopted in our experiments (a code sketch of both criteria is given after Eq. (18)):

• Hash Lookup. It builds a lookup table using the hash codes of all data points and returns the points that lie within a small Hamming radius r of the query point. In our experiments, we adopt HAM2 (i.e., r = 2) to compute the hash lookup precision under different hash bits. Specifically, for each query point, HAM2 is computed as the percentage of true neighbors (e.g., sharing the same label as the query point) among the returned points within a Hamming radius of 2. The mean of HAM2 over all query points is used to evaluate the hash lookup performance.

• Hamming Ranking. It ranks all data points according to their Hamming distances to the query point and returns the top ones. In our experiments, the mean average precision (MAP) of all query points is used to evaluate the ranking performance under different hash bits. In addition, we report the precision-recall curves as in [38, 47]. The precision and recall are respectively defined as

$$\mathrm{Precision} = \frac{\#\text{ of retrieved relevant pairs}}{\#\text{ of all retrieved pairs}} \qquad (17)$$

$$\mathrm{Recall} = \frac{\#\text{ of retrieved relevant pairs}}{\#\text{ of all relevant pairs}} \qquad (18)$$
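The two criteria can be summarized by the short sketch below: ham2_precision implements the HAM2 measure (precision of the points returned within Hamming radius 2, averaged over queries), and precision_recall evaluates Eqs. (17)-(18) for one query at a given radius, where a retrieved pair is counted as relevant when the query and database labels match. This is our own paraphrase of the protocol, not the evaluation code used in the experiments.

```python
import numpy as np

def hamming_dist(Bq, Bd):
    # Codes are in {-1,+1}; each differing bit lowers the dot product by 2.
    r = Bq.shape[1]
    return (r - Bq @ Bd.T) // 2                          # (num_queries, num_database)

def ham2_precision(Bq, Bd, yq, yd, radius=2):
    """Mean precision of points returned within a Hamming radius (HAM2 when radius=2)."""
    dist = hamming_dist(Bq, Bd)
    precisions = []
    for i in range(Bq.shape[0]):
        returned = dist[i] <= radius
        if returned.any():                               # queries with empty returns are skipped here
            precisions.append(np.mean(yd[returned] == yq[i]))
    return float(np.mean(precisions)) if precisions else 0.0

def precision_recall(dist_row, relevant, radius):
    """Eqs. (17)-(18) for one query at one Hamming radius."""
    retrieved = dist_row <= radius
    hits = np.logical_and(retrieved, relevant).sum()
    precision = hits / max(retrieved.sum(), 1)
    recall = hits / max(relevant.sum(), 1)
    return precision, recall
```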
In our experiments, the search results are evaluated by using class labels as the ground truths. For the proposed SMH, we set α = 1 and β = 1 for all databases unless otherwise stated. The influences of these two parameters on the hashing performance will be discussed in Section 3.3. For the compared methods, we use the publicly available codes under the best settings suggested in their papers. All methods are implemented in MATLAB and run on a workstation with an Intel Xeon CPU of 2.1GHz and 128GB RAM.
3.2. Results

Fig. 2 plots the HAM2 curves of all methods by varying the code length from 8 to 128 on the CIFAR10, ISOLET and USPS databases. As can be seen from Fig. 2, the HAM2 performance of almost all methods first increases and then drops as the code length increases. This is because the Hamming code space becomes very sparse with the increased code length and many queries cannot retrieve true neighbors within the Hamming radius of 2. When the code length is longer than 32, the semi-supervised SPLH performs best on CIFAR10 while the proposed SMH consistently performs best on both ISOLET and USPS. In addition, one can observe that SELVE shows promising results on the ISOLET and USPS databases with short binary codes. However, its performance rapidly deteriorates with long codes and is inferior to SMH by a large margin. In addition, the supervised KSH performs fairly well on CIFAR10 but rather poorly on both ISOLET and USPS.

Fig. 2. HAM2 of all methods by varying the code length from 8 to 128 on CIFAR10, ISOLET and USPS databases. The legends of (b) and (c) are omitted and they are the same as the one of (a).

Fig. 3 shows the MAP curves of all methods with the code length ranging from 8 to 128. It can be seen that SMH achieves results better than or comparable to the other methods on all databases at almost all code lengths. When the code length is shorter than 48, SPLH works slightly better than SMH on the CIFAR10 database. However, its performance drops quickly with the increase of hash bits and is surpassed by other methods. With longer hash codes, PCA-RR generally performs the second best on CIFAR10 while CCA-ITQ performs the second best on the ISOLET and USPS databases. It is surprising that the supervised KSH performs the worst on both the ISOLET and USPS databases.

Fig. 4 presents the precision-recall curves of all methods with 24-bit hash codes. One can see that the precision of all methods drops significantly when the recall increases and vice versa, which is consistent with the definitions in (17) and (18) [38, 47]. From Fig. 4 (a), we observe that all compared methods are highly competitive on CIFAR10. From Fig. 4 (b) and Fig. 4 (c), we observe that SMH and CCA-ITQ perform better than all other methods on the ISOLET and USPS databases. In particular, when the recall is high, SMH significantly outperforms CCA-ITQ in terms of precision. The supervised KSH again performs the worst on these two databases. Comparing the semi-supervised methods, the proposed SMH outperforms SPLH, which learns to hash via supervised empirical fitness and unsupervised information-theoretic regularization. These results demonstrate the effectiveness of SMH, which explores the manifold embedding in a semi-supervised manner to propagate the labels from partially labeled data while simultaneously learning good feature representations for binary code generation.
Fig. 3. MAP of all methods by varying the code length from 8 to 128 on CIFAR10, ISOLET and USPS databases. The legends of (b) and (c) are omitted and they are the same as the one of (a).

Fig. 4. Precision-recall curves of all methods with 24-bit hash codes on CIFAR10, ISOLET and USPS databases. The legends of (b) and (c) are omitted and they are the same as the one of (a).
3.3. Influences of α and β

The proposed SMH involves two parameters, α and β. We now evaluate their impacts on the HAM2 and MAP performance by varying them in the candidate range of {10^{-3}, 10^{-2}, 10^{-1}, 1, 10^{1}, 10^{2}, 10^{3}}. Fig. 5 illustrates the performance variation of SMH with respect to α and β using 24-bit hash codes on the three databases. We can see that the best performance can be obtained by choosing different combinations of α and β for different databases, and that there is a wide range from which to choose these combinations. Without specifically tuning these parameters, we choose them from the intermediate values of the candidate range and set α = 1 and β = 1 for all databases by default.
Fig. 5. Performance variation of the proposed SMH with respect to α and β using 24-bit hash codes. (a)(d) CIFAR10. (b)(e) ISOLET. (c)(f) USPS.
Fig. 6. Performance variation of the proposed SMH with respect to the number of labeled data using 24-bit hash codes. Note that the labeled data are randomly sampled from the corresponding training set on each database.
3.4. Effects of the Size of Labeled Data

In the above experiments, we have used a fixed number of samples from each database as labeled data. We now vary the number of labeled data and see how the performance changes on each database.
Fig. 7. Performance comparison of different components with different lengths of hash codes on CIFAR10, ISOLET and USPS databases.
Fig. 6 shows the performance variation of SMH in terms of HAM2 and MAP using 24-bit hash codes. Overall, one can observe improved performance on all three databases as the number of labeled data increases. Interestingly, the performance varies only slightly once the number of labeled data becomes large enough. This is advantageous for our semi-supervised hashing method because it means that we can leverage the limited labeled data to obtain the largest performance gain. In fact, we have used a small portion of the training data as the labeled data (50 out of 6,000 for CIFAR10, 300 out of 6,238 for ISOLET, and 100 out of 8,298 for USPS) for label propagation and have achieved the promising results demonstrated in Figs. 2-4. However, how to determine the optimal number of labeled data for different databases remains an open problem.
3.5. Component Analysis

In the proposed SMH, we have applied the l2,1-norm to the objective function in (6) and introduced a refinement step in Section 2.3 to improve the hashing performance. In the following, we evaluate these two components. For experimental comparison, we replace the l2,1-norm with the F-norm based representation (i.e., the objective function in (5)) and remove the hashing refinement step. The resulting algorithms are called SMH_F and SMH_NR, respectively. Fig. 7 shows the comparison results in terms of HAM2 and MAP with different code lengths. We can see that SMH and SMH_F yield competitive results on the ISOLET and USPS databases while SMH leads to better performance than SMH_F on CIFAR10. The results on the three databases indicate the robustness of the l2,1-norm based loss function in SMH, which is basically consistent with the conclusions in [67][70][44]. From Fig. 7, we also see that the refinement step can boost the hashing performance by a large margin. This demonstrates the necessity and effectiveness of the hashing refinement to minimize the quantization error after solving the relaxed optimization problem.
3.6. Running Time

In this part, we empirically compare the running time, including training time and test time, for different methods. The training time measures the cost of learning the hash function from the training set and the test time measures the cost of transforming the raw test data into compact binary codes. Table 1 lists the training and test time for different methods using 24-bit hash codes on the three benchmark databases. We observe that LSH takes negligible training time because the projections are randomly generated without a learning process. The learning-based CBE is relatively slower due to the computation of the Fast Fourier Transform (FFT) in both the training and test stages. For the proposed SMH, the training time is much longer than that of the other methods. This is mainly because SMH involves the computation of the inverses of some big matrices, which is very expensive, as discussed in Section 2.5. In comparison with the offline training time, the test time is more crucial in real-time applications. As listed in Table 1, the test time for SMH is very short since it only requires a linear projection and binarization to obtain the binary codes.

Table 1. Training and test time (in seconds) for different methods using 24-bit hash codes on three benchmark databases. For the proposed SMH, we run Algorithm 1 until it converges and run the alternating optimization algorithm in the hashing refinement step for 50 iterations.

Methods        CIFAR10              ISOLET               USPS
               training   test      training   test      training   test
LSH [31]       0.0133     0.0272    0.0071     0.0016    0.0063     0.0011
SH [34]        0.2155     0.2009    0.2382     0.0081    0.1954     0.0071
PCA-RR [38]    0.0654     0.0266    0.1081     0.0013    0.0577     0.0008
CCA-ITQ [38]   0.3033     0.0550    0.2878     0.0014    0.3576     0.0015
CBE [45]       13.2650    30.6177   20.9785    0.1305    9.1522     0.0326
SpH [40]       0.4468     0.1620    0.7492     0.0067    0.7403     0.0028
KSH [55]       4.6390     0.8795    5.9804     0.0403    4.7848     0.0210
SELVE [47]     1.3500     0.8755    2.0539     0.0313    1.4045     0.0209
SPLH [39]      1.4330     0.0202    3.3826     0.0012    0.9319     0.0008
SMH            169.7365   0.0217    103.4123   0.0014    85.8736    0.0007
3.7. Discussion

Recently, some supervised methods have used deep networks to learn compact hash codes and achieved state-of-the-art results. For instance, Erin Liong et al. [51] leveraged a deep neural network to seek multiple hierarchical non-linear transformations for hashing, and Lai et al. [61] developed a deep architecture to learn image representations and hash functions in a one-stage manner. As reported on CIFAR10 using 32-bit hash codes, [51] achieves 0.31 and 0.21 in terms of HAM2 and MAP, respectively, while [61] achieves 0.60 and 0.56 in terms of HAM2 and MAP, respectively. These results demonstrate that learning deep features (instead of using hand-crafted features) coupled with hash coding is promising for improving the hashing performance.

The major part of the training time of our method is consumed by computing the inverses of some big matrices, which can take O(n³) time.
1350 1351
4
2.5
Training time (seconds)
1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366
x 10
2
1.5
1
0.5
0
6000
12000 18000 24000 30000 36000 42000 48000
Numbers of training data
Fig. 8. Training time of SMH with respect to the number of training data on CIFAR-10 using 24-bit hash codes.
1367 1368
number of training data on CIFAR-10 using 24-bit hash codes. For larger-scale problems, we need to
1369 1370
find ways to reduce the time complexity of SMH. Firstly, we may seek some fast algorithms [77] to
1371 1372
approximate or avoid the inversion of a matrix. For instance, an iteration algorithm, as used in [78],
1373 1374 1375 1376
may be adopted to solve F in (10), thus avoiding computing the inverse of big matrices. Secondly, we
1377 1378
may condense the training data or reduce the number of variables to be optimized in matrix inversion.
1379 1380
For instance, we may remove U in objective function (6) by introducing the constraint Fi = Yi (1