Unsupervised Feature Learning for Content-based Histopathology Image Retrieval

Jorge A. Vanegas, John Arevalo, Fabio A. González
MindLAB Research Group, Universidad Nacional de Colombia
{javanegasr, jearevaloo, fagonzalezo}@unal.edu.co
Abstract—This paper proposes a strategy for content-based image retrieval (CBIR) which combines unsupervised feature learning (UFL) with the classical bag-of-features (BOF) representation. In BOF, patches are usually represented using standard descriptors (e.g., SIFT, SURF, or DCT). We propose instead to use UFL to learn the patch representation itself. This is achieved by applying a topographic UFL method, which automatically learns visual invariance properties of color, scale and rotation from an image collection. The learned image representation is used as input for a multimodal latent semantic indexing system, which enriches the visual representation with semantics from image annotations. The overall strategy is evaluated on a histopathology image retrieval task, showing that the learned representation has a positive impact on retrieval performance for this task.

Index Terms—Unsupervised Feature Learning, Content-Based Image Retrieval, Multimodal Semantic Indexing
I. INTRODUCTION

In this work, we consider the problem of retrieving histopathology images using an example image as the query. Under this setup, the system relies mainly on processing the visual contents to find relevant images. Part-based image representation schemes, such as the bag-of-features (BOF), have been successfully used to address this task [5]. In short, the BOF approach works as follows [7]: a set of patches is extracted from the image collection; the patches are represented using a conventional image descriptor (popular choices are SIFT, wavelet coefficients and the discrete cosine transform); a clustering algorithm, such as k-means, is applied to the patches; the cluster centroids are used to build a codebook; and, finally, each image in the collection is represented by extracting its patches, finding the closest visual word in the codebook for each patch, and accumulating the frequencies in a histogram, which constitutes the image representation. Different works [17] have shown the suitability of the BOF representation for content-based image retrieval tasks in both general domains (e.g., natural-scene images) and specific ones (e.g., histopathology images). The type of image descriptor used to represent patches depends on the type of image. For instance, for natural-scene images the SIFT and HOG descriptors have shown the best performance, whereas the DCT descriptor has shown better performance for histopathology images [5]. Thus, it makes sense to look for the patch representation that produces the best results in a particular CBIR domain. Traditionally, this
has been addressed by systematically testing different well-known representations. An alternative is to use data-driven methods that learn an optimal representation directly from the image collection. This approach is known as unsupervised feature learning (UFL) and, along with deep learning, is driving a new wave in computer vision that has shown outstanding performance on several challenging tasks [1]. Even so, there is an important issue in CBIR: the well-known semantic gap [19], which implies that matching visual similarities does not necessarily lead to semantically valid results. To overcome this problem, many multimodal semantic indexing strategies have been proposed in recent years [3], [2]. These strategies exploit additional information resources, such as text or other related data, to build a common semantic representation. Hence, in this paper we also evaluate the effect of UFL on the quality of the multimodal semantic space defined by BOF descriptors in conjunction with related textual information.

The main goal of the present work is to evaluate the impact of using UFL to represent BOF patches in a domain-specific CBIR task. The reason to concentrate on a domain-specific task, in this case basal cell carcinoma histopathology images, instead of a more general CBIR task is twofold: first, the specificity of the problem, along with the size of the test collection, makes it a perfect testbed for exploratory experimentation; second, there is previous work suggesting that UFL is able to capture meaningful visual patterns in this type of images [4].

This work presents three main contributions: first, it proposes a strategy for CBIR that combines unsupervised feature learning methods with the classical bag-of-features strategy, using a learned representation for patches instead of standard descriptors; second, it evaluates the use of UFL-based features as input for a multimodal latent semantic indexing system; and, finally, it presents a performance evaluation of the proposed strategies in a content-based biomedical image retrieval task, comparing two well-known standard descriptors and three different strategies for learning the representation. The experimental results show that a learned representation does improve the retrieval performance in this task in two evaluation scenarios: pure visual indexing, using a BOF representation, and multimodal latent indexing, using a joint textual-visual embedding into a semantic latent space.
The rest of this paper is organized as follows: Section II discusses related work; Section III introduces the proposed general representation strategy; Section IV details the unsupervised feature learning methods; Section V describes the multimodal latent semantic indexing strategy; Section VI presents the experimental results; and, finally, Section VII presents some concluding remarks.

II. RELATED WORK

In the last decades researchers have proposed many visual feature representations, ranging from global descriptors (e.g., color, shape, and texture) to local features used in a bag-of-features representation (SIFT, HOG, DCT, among others). Some recent studies on classification tasks have found promising empirical results by applying unsupervised feature learning (UFL) for image representation [12], [4]. In the context of content-based image retrieval, unsupervised learning has been studied mainly on two fronts: learning similarity-based distance measures and learning to hash to obtain a compact representation. For instance, in the context of similarity learning, Wu et al. [21] proposed a multimodal learning method that integrates multiple deep neural networks to learn a nonlinear similarity function from images, showing promising results for image search. On the learning-to-hash front, Krizhevsky et al. [14] used autoencoders to map image representations to short binary codes, making the retrieval process very efficient and allowing more semantic information to be extracted from the images. Also, Kang et al. [13] proposed a deep network model to learn a multi-view hashing that incorporates multiple visual descriptors into a low-dimensional Hamming space. In this work, we propose a new representation model that combines UFL methods with the classical BOF representation to improve CBIR performance. We also propose to use this learned representation, enriched with semantic information from image annotations, as input for a final multimodal latent semantic representation.

III. BAG-OF-FEATURES REPRESENTATION AND UNSUPERVISED FEATURE LEARNING

One of the most important aspects of any content-based image retrieval (CBIR) system is the definition of an effective image representation, one that provides information of interest beyond the raw pixel values. An image can be represented by a visual content descriptor that captures properties such as color, texture, shape and spatial relationships. The bag-of-features (or bag of visual words) representation is an adaptation of the classic bag-of-words model used in text categorization and retrieval. The main idea is to construct a codebook or visual vocabulary, in which the most representative visual patterns are encoded as visual words. In this way, image representations are generated through a simple frequency analysis of each visual word within the image. The construction of a BOF representation comprises three main stages: (i) patch extraction and description, (ii) dictionary construction and, finally, (iii) histogram image representation.
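To make the pipeline concrete, the following minimal NumPy/scikit-learn sketch (illustrative only; function names and parameter values are our own, not part of the original method) outlines stages (ii) and (iii) for an arbitrary patch descriptor:

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(patch_descriptors, k=400):
        # Stage (ii): cluster the patch descriptors; the centroids
        # become the visual words of the codebook.
        return KMeans(n_clusters=k, n_init=4).fit(patch_descriptors)

    def bof_histogram(image_patch_descriptors, kmeans):
        # Stage (iii): assign each patch to its closest visual word
        # and accumulate a normalized frequency histogram.
        words = kmeans.predict(image_patch_descriptors)
        hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)

The descriptor applied in stage (i) is precisely the component that UFL replaces in this work.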
In this work, we propose to replace the patch representation, commonly based on standard features, by a representation based on features learned through a UFL method that automatically discovers visual patterns and relations from the image dataset. We expect this representation to be more meaningful and to improve image retrieval performance in CBIR tasks.

IV. UNSUPERVISED LOCAL FEATURE LEARNING

The patch description stage represents the content of a given patch through a set of transformations applied to its raw pixels. Traditional approaches select such transformations by hand, using a priori information about the problem. For example, in the histology domain it is common to use texture-related features [10], such as the DCT or Haar-based transformations, because textural characterization is a good descriptor for pattern recognition tasks on this particular kind of images [5]. Recent approaches in computer vision have tackled the representation stage by applying Unsupervised Feature Learning (UFL) methods [1]. While descriptors such as DCT or SIFT use predefined transformation functions, UFL learns them directly from the dataset in a totally unsupervised way. This work explores autoencoders, a family of UFL methods that learn an encoding function and a decoding function such that their composition reconstructs the original input. In particular, Sparse Autoencoders and Reconstruction Independent Component Analysis were studied; they are detailed below.

A. Sparse Autoencoders

Sparsity is a desired property in UFL methods because it promotes compact representations by finding latent factors that better explain the content of the collection. The Sparse Autoencoder (sAE) is a popular UFL method which learns features using a reconstruction penalty and a sparsity regularization [1]. An sAE may be seen as a two-stage neural network: the first stage encodes the input data into an internal representation and the second stage decodes it back to the original representation. The output of the network is expected to match the input. The network is trained by solving an optimization problem with the following objective function:

    J_{sparse}(W) = J(W) + \sum_{j=1}^{k} KL(\rho \,\|\, \hat{\rho}_j)    (1)

where $J(W) = \|g(f(x)) - x\|_2^2$ is the reconstruction cost, given by the distance between the original input $x$ and its reconstruction, with $f(\cdot)$ as encoding function and $g(\cdot)$ as decoding function. Here, we chose $f(x) = \mathrm{sigm}(W_f x + b_1)$ and $g(s) = W_g s + b_2$. The second term measures how much the input data activates the learned features, i.e., how sparse the new representation of the data is. In this case, sparsity is estimated by summing the Kullback-Leibler divergence, $KL(\cdot)$, between the desired sparsity parameter $\rho$ and the average activation $\hat{\rho}_j$ of each feature $j$. Setting $\rho$ to low values (close to zero) induces sparse representations in the activations. When the algorithm seeks sparse representations, it implicitly builds compact features with more expressive power.
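As an illustration, objective (1) can be evaluated as follows; this is a minimal NumPy sketch in which the cost is averaged over a batch of patches and a sparsity weight beta is included (a common implementation detail that is not written explicitly in Eq. (1)):

    import numpy as np

    def sigm(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sae_cost(Wf, b1, Wg, b2, X, rho=0.05, beta=1.0):
        # X holds one patch per column (d x m).
        S = sigm(Wf @ X + b1[:, None])              # encoding f(x)
        R = Wg @ S + b2[:, None]                    # linear decoding g(s)
        J = np.mean(np.sum((R - X) ** 2, axis=0))   # reconstruction cost J(W)
        rho_hat = S.mean(axis=1)                    # average activation per feature
        kl = (rho * np.log(rho / rho_hat)
              + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
        return J + beta * kl.sum()                  # Eq. (1)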
B. Reconstruction Independent Component Analysis (RICA)

Independent Component Analysis (ICA) assumes that patches can be represented by a linear combination of a set of statistically independent features. ICA learns those features by solving the following optimization problem:

    \min_W \sum_{i=1}^{m} \sum_{j=1}^{n} g(W_j x^{(i)})   s.t.   W W^T = I    (2)
where each $x^{(i)}$ is a patch in a training set of size $m$, and $W$ is the set of $n$ features to learn. $g(\cdot)$ is a smoothed L1-norm function that promotes sparse representations. This work used $g(s) = \sqrt{s^2 + \epsilon}$, with $s = W_j x^{(i)}$ as the activation of sample $i$ for feature $j$ and $\epsilon$ as a smoothing parameter. ICA has been applied successfully in object recognition tasks; however, it has two main drawbacks that come from its orthogonality constraint: it cannot learn more features than the original input dimension, and its training procedure with classical optimization techniques requires solving an eigenvalue problem at each iteration, making it computationally expensive. Le et al. [15] proposed a soft-reconstruction version of ICA. Reconstruction ICA (RICA) replaces the orthogonality constraint with a reconstruction penalty. The RICA objective function is defined by:

    J(W) = \frac{\lambda}{m} \sum_{i=1}^{m} \|W^T W x^{(i)} - x^{(i)}\|_2^2 + \sum_{i=1}^{m} \sum_{j=1}^{n} g(W_j x^{(i)})    (3)

where the first term corresponds to the reconstruction cost, measured by the L2-norm, and the second term adds a sparsity constraint. This model finds a set of features $W$ that reconstructs the original data using nearly orthogonal bases, while keeping the data representation $Wx$ sparse. A key advantage of this formulation is that, being an unconstrained problem, efficient gradient-based optimization solvers can be used.
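Because objective (3) is unconstrained, its value (and gradient, e.g. via automatic differentiation) can be fed directly to a gradient-based solver. A minimal NumPy sketch of the cost, with illustrative values for lambda and epsilon:

    import numpy as np

    def rica_cost(W, X, lam=0.5, eps=1e-2):
        # W: (n, d) feature matrix; X: (d, m) whitened patches, one per column.
        WX = W @ X                                  # activations W_j x^(i)
        rec = W.T @ WX - X                          # soft reconstruction W^T W x - x
        J_rec = (lam / X.shape[1]) * np.sum(rec ** 2)
        J_sparse = np.sum(np.sqrt(WX ** 2 + eps))   # smoothed L1 penalty g(.)
        return J_rec + J_sparse                     # Eq. (3)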
where the first term corresponds to the reconstruction cost, measured by L2-norm, and the second term adds a sparsity constraint. This model finds a set of features W that reconstructs the original data using nearly orthogonal bases, while keeping the data representation (Wx) sparse. A key advantage of this formulation is that being an unconstrained problem, efficient implementations of gradient-based optimization solvers can be used. C. Topographic RICA Topographic models seek to arrange learned features such that similar ones are close together, while different ones are set apart. This approach is biologically inspired by the visual cortex model where neurons have a specific spatial organization and their response change in a systematic way [11]. Particularly, Topographic RICA (TICA) builds a squared matrix to organize features in groups such that adjacent features get similar outputs (i.e. proportional magnitudes) with respect to the same input. TICA cost function is given by: J(W) =
m
Pm
+
i=1 kW
T Wx(i)
Pm Pl i=1
k=1
q
x(i) k22
✏ + Hk Wx(i)
(4)
2
n
where l is the number of groups and Hk 2 {1, 0} is a binary vector representing the membership of features to group k. This model sets H (Topographic organization) fixed and learns W. Similarly to RICA, TICA is unconstrained and it can be treated with efficient gradient-based optimization solvers.
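A sketch of cost (4) under the same conventions as the RICA snippet above, where H stacks the fixed binary group-membership vectors; the encoding function anticipates the pooled representation described next:

    import numpy as np

    def tica_cost(W, H, X, lam=0.5, eps=1e-2):
        # W: (n, d) features; H: (l, n) fixed binary group memberships; X: (d, m).
        WX = W @ X
        rec = W.T @ WX - X
        J_rec = (lam / X.shape[1]) * np.sum(rec ** 2)
        J_topo = np.sum(np.sqrt(eps + H @ (WX ** 2)))   # group-sparse topographic term
        return J_rec + J_topo                           # Eq. (4)

    def tica_encode(W, H, x):
        # Pooled representation s_k = sqrt(H_k (W x)^2): invariant within each group.
        return np.sqrt(H @ ((W @ x) ** 2))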
TICA represents a new patch $\hat{x}$ by calculating the vector $\hat{s} \in R^l$ with $\hat{s}_k = \sqrt{H_k (W \hat{x})^2}$. Note that a particular group outputs high values regardless of which of its features $W_j$ the patch $\hat{x}$ activates. This behavior yields the invariances defined by each group in the topography.

V. MULTIMODAL LATENT SEMANTIC REPRESENTATION

Content-based image retrieval using a BOF representation has been addressed with different strategies. The simplest one is to use the BOF histograms directly to represent the images in the collection. An alternative is to use the BOF representation as input to a more complex indexing mechanism. For instance, in [9] BOF is jointly used with a bag-of-words text representation to learn a latent semantic embedding that represents the images in the collection. In this work, we evaluate the influence of a learned representation in these two indexing strategies. The next paragraphs briefly describe the multimodal latent semantic indexing strategy used in the present work.

Latent semantic indexing is a successful approach for information retrieval which transforms the original feature representation into a lower-rank approximation, allowing the extraction of the underlying semantic structure of the collection and building an effective index for image search. Non-negative Matrix Factorization (NMF) is a latent semantic indexing method which finds a compact, non-negative representation of the data [18]. The general problem of NMF is to find an approximation of the matrix $X$, which represents the image collection, in terms of two smaller matrices:

    X = WH    (5)

where $X \in R^{p \times l}$ is the original data matrix, and $W \in R^{p \times r}$ and $H \in R^{r \times l}$ are the factors in which $X$ is decomposed; $p$ is the number of available features, $l$ is the total number of images in the collection, and $r$ is the rank of the decomposition. The matrix $W$ is known as the basis matrix and $H$ as the encoding matrix. This factorization is found by solving the associated optimization problem of minimizing the Kullback-Leibler divergence between the original matrix and its reconstruction. The problem is solved using the recursive multiplicative update rules proposed by Lee and Seung [16].
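A compact NumPy rendering of the Lee-Seung multiplicative updates for the KL-divergence objective (a sketch; the small constants only guard against division by zero, and the initialization is arbitrary):

    import numpy as np

    def nmf_kl(X, r, iters=200, seed=0):
        # Lee & Seung multiplicative updates for KL-divergence NMF: X ~ W H.
        rng = np.random.default_rng(seed)
        p, l = X.shape
        W = rng.random((p, r)) + 1e-3
        H = rng.random((r, l)) + 1e-3
        for _ in range(iters):
            WH = W @ H + 1e-9
            H *= (W.T @ (X / WH)) / (W.sum(axis=0)[:, None] + 1e-9)
            WH = W @ H + 1e-9
            W *= ((X / WH) @ H.T) / (H.sum(axis=1)[None, :] + 1e-9)
        return W, H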
A. Multimodal indexing via NMF

Assume a BOF representation for the visual content and, similarly, a bag-of-words representation for the text annotations. Both modalities can then be described by matrices recording the occurrence of visual and textual features in the image collection. Let $X_v \in R^{n \times l}$ be the matrix of visual features, where $n$ is the number of visual features, and let $X_t \in R^{m \times l}$ be the matrix of text term frequencies, where $m$ is the number of terms or keywords. In this work we evaluate the mixed multimodal indexing strategy (NMF-Mixed) proposed by González et al. [9], which builds a multimodal matrix $X = [(1-\alpha) X_v^T \;\; \alpha X_t^T]^T$, with $\alpha \in [0, 1]$, containing the visual and text data. This matrix is then decomposed using NMF to model a set of latent factors (the columns of matrix $W$), with the corresponding combination coefficients codified in the columns of $H$:

    X_{(n+m) \times l} = W_{(n+m) \times r} H_{r \times l}    (6)

We can find a semantic representation for new images without text annotations, either to be included in the collection or to be used as queries. To embed a new image in the semantic space, the following equation needs to be solved:

    y = W^v h    (7)

where $W^v \in R^{n \times r}$ is the trimmed version of $W$ (learned in the training phase) containing only its visual rows, and $h$ is the semantic representation of the new image. The image $y$ is embedded into the semantic space by finding the vector $h \geq 0$ that satisfies the equation. This is done using the multiplicative update rule proposed by Lee and Seung, but updating only $h$ while matrix $W$ is kept fixed with the values found in the training phase. Once the images in the collection and the query images are represented in the same space, the problem of finding relevant results reduces to matching images with similar representations using some similarity measure.
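The following sketch (hypothetical helper names; alpha denotes the modality weight) builds the multimodal matrix and projects a new image by iterating the multiplicative rule for h only, keeping the trained basis fixed:

    import numpy as np

    def multimodal_matrix(Xv, Xt, alpha=0.9):
        # Stack the weighted modalities: X = [(1-alpha) Xv ; alpha Xt].
        return np.vstack([(1 - alpha) * Xv, alpha * Xt])

    def project_query(Wv, y, iters=200):
        # Solve y ~ Wv h for h >= 0 (Eq. (7)): only h is updated with the
        # KL multiplicative rule; Wv (the visual rows of W) stays fixed.
        h = np.full(Wv.shape[1], 1.0 / Wv.shape[1])
        for _ in range(iters):
            Wh = Wv @ h + 1e-9
            h *= (Wv.T @ (y / Wh)) / (Wv.sum(axis=0) + 1e-9)
        return h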
VI. EXPERIMENTS AND RESULTS

In order to evaluate the performance of the proposed feature representations, we conducted retrieval experiments under the query-by-example paradigm. The evaluation was performed in the context of histopathology image retrieval using a basal-cell carcinoma dataset (BCC dataset).

A. Histopathology basal-cell carcinoma dataset

The BCC dataset comprises 1407 image patches of 300×300 pixels extracted from 308 images of 1024×768 pixels, each of which corresponds to an independent ROI on a slide biopsy. Each image is in RGB format and corresponds to a field of view at 10X magnification, stained with H&E [8]. These images were manually annotated by a pathologist, indicating the presence (or absence) of basal-cell carcinoma and other architectural and morphological features (collagen, epidermis, sebaceous glands, eccrine glands, hair follicles and inflammatory infiltration). Figure 1 shows examples of these images. Notably, this kind of image exhibits high visual variability coming from several sources: cut orientation, staining and luminance, magnification, and digitization, among others.

B. Experimental setup

We performed automatic experiments by sending a query image to the retrieval system and evaluating the relevance of the results. 20% of the 1407 images in the BCC dataset were randomly selected as queries; the remaining 80% were used as the target collection in which to find relevant images. A ranked image in the results list is considered relevant if it shares at least two keywords with the query. The evaluation was done using traditional image retrieval performance measures, including Mean Average Precision (MAP) and Precision at 10 (P@10).
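For reference, both measures can be computed from the binary relevance of a ranked result list as follows (a minimal NumPy sketch; MAP is then the mean of the average precision over all query images):

    import numpy as np

    def precision_at_k(relevant, k=10):
        # relevant: boolean array over the ranked result list, best match first.
        return float(np.mean(relevant[:k]))

    def average_precision(relevant):
        relevant = np.asarray(relevant, dtype=float)
        if relevant.sum() == 0:
            return 0.0
        prec = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
        return float((prec * relevant).sum() / relevant.sum())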
1) Image features: Bag of features was the image representation strategy used in these experiments, and different descriptors were evaluated to represent the patches. We compare our proposed UFL-based representations against two traditional descriptors: DCT (discrete cosine transform) and Haar (Haar-based wavelet transform). DCT and Haar have been shown to be good features for representing histopathology images in classification tasks [6]. The size of the visual dictionary was fixed to k = 400 words. 100,000 patches were randomly sampled from the training set to perform feature learning, with n = 400 feature detectors for each UFL method. Figure 2 shows the sets of features learned with sAE (a) and TICA (b). Notice that patches were preprocessed using ZCA whitening in order to remove correlations between pixels; this preprocessing has shown improvements in other pattern recognition tasks. In all cases, function optimization was carried out using L-BFGS, particularly Mark Schmidt's implementation (http://www.di.ens.fr/~mschmidt/Software/minFunc.html), which took around 20 minutes for 400 iterations.

2) Text annotations: Images in this dataset have been annotated by a pathologist, indicating the presence (or absence) of basal-cell carcinoma and other architectural and morphological features, giving a total of 7 different concepts. For each image, we build semantic vectors following a boolean approach, assigning 1 to the terms attached to the image and 0 otherwise. This leads to 7-dimensional binary vectors, which serve to build the text representation.

C. Retrieval performance

1) Visual search: As a first experiment, we evaluated retrieval performance using only the visual representation based on the bag-of-features strategy, with patches represented by the standard descriptors (DCT and Haar) and by the learned representations based on Topographic RICA and sAE. The final representation of an image, for all strategies, is a normalized histogram of the occurrences of the codebook's visual features, and direct visual matching is done by measuring the similarity between images with the histogram intersection measure [20]:

    K_{HI}(x, y) = \sum_{i=1}^{n} \min\{x_i, y_i\}    (8)

where $x$ and $y$ are images and $x_i$, $y_i$ indicate the frequency of the $i$-th visual feature in each image, respectively.
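In vectorized form, ranking the whole collection by Eq. (8) can be sketched as follows (illustrative helper names):

    import numpy as np

    def histogram_intersection(x, Y):
        # Similarity between a query histogram x (k,) and collection histograms Y (N, k).
        return np.minimum(x[None, :], Y).sum(axis=1)

    def rank_collection(x, Y):
        # Indices of the collection sorted by decreasing similarity to the query.
        return np.argsort(-histogram_intersection(x, Y))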
Table I shows a comparison of retrieval performance between the proposed UFL-based strategies and the standard descriptors. As shown there, the two UFL-based BOF representations, sAE and TICA, improve over the two canonical features (DCT and Haar), and the best performance is obtained with TICA, achieving an improvement of about 9.4% over the DCT descriptor and 25.9% over the Haar descriptor in terms of MAP. To evaluate the significance of the obtained results in terms of MAP and precision at 10, we applied a hypothesis test using Student's t-test for each learned feature compared with each canonical feature. According to the results of this analysis (see Table III), there is a statistically significant difference between the UFL-based representations and the Haar descriptor in both MAP and precision at 10. Finally, one of the reasons that may explain the good performance of the features learned with TICA is that the TICA model learns complex invariances, such as scale and rotation variations, generating a richer representation.
Figure 1: Example images from the histopathology basal-cell carcinoma dataset, shown along with their associated terms (e.g., collagen; eccrine glands; sebaceous glands; inflammatory infiltration; basal-cell carcinoma; epidermis).
Figure 2: Sets of features learned with 100,000 patches randomly sampled from the BCC dataset: (a) Sparse Autoencoders, (b) Topographic RICA.

Table I: Performance measures in direct visual matching for all evaluated representations: DCT (discrete cosine transform), Haar (Haar-based wavelet transform), TICA (Topographic RICA) and sAE (Sparse Autoencoders).

                            Representation   MAP      P@10     P@20
    Canonical descriptors   DCT              0.2892   0.5737   0.5453
                            Haar             0.2513   0.4888   0.4567
    Learned features        TICA             0.3164   0.6196   0.5712
                            sAE              0.2926   0.5855   0.5486
Table II: Performance measures for all evaluated representations using NMF-Mixed with 10 latent factors and α = 0.9.

                            Representation   MAP      P@10     P@20
    Canonical descriptors   NMF-M DCT        0.4725   0.5358   0.5617
                            NMF-M Haar       0.4145   0.4318   0.4394
    Learned features        NMF-M TICA       0.4896   0.6089   0.6109
                            NMF-M sAE        0.5273   0.6894   0.6897
2) Multimodal semantic search: As a second experiment, we evaluated the quality of the multimodal semantic representation space generated from the learned visual features. For this evaluation, we employed the NMF-Mixed algorithm to find a common semantic representation for the visual and textual descriptors. Initially, we performed retrieval experiments using 10-fold cross-validation for all strategies on the 80% subset of images (the target collection), in order to experimentally determine the most appropriate number of latent factors ($r$) and the modality weight ($\alpha$). Once the best configuration was found for each strategy (for all strategies, the best configuration is achieved with $r = 10$ and $\alpha = 0.9$), we performed the retrieval experiments by projecting the query images into the semantic space and computing, in that space, the similarity between images using histogram intersection (experimentally, histogram intersection was found to be the best similarity measure for the learned semantic representation).
Table III: Results of Student's t-test for each learned feature (TICA and sAE) compared with each canonical descriptor (Haar and DCT). P-values for MAP and P@10 are reported for visual search (BOF) and semantic search (NMF-mixed). P-values marked with * indicate that the difference in the corresponding performance measure between the pair of features is significant at a significance level of α = 0.05.

                                 P-values
                        BOF                  NMF-mixed
                        MAP       P@10       MAP        P@10
    TICA / Haar         6.6E-05*  0.0001*    0.0044*    0.0072*
    TICA / DCT          0.1218    0.1938     0.8322     0.7452
    sAE / Haar          0.0049*   0.0057*    0.4941     5.2E-05*
    sAE / DCT           0.0449*   0.3185     1.6E-07*   0.0003*
Table II summarizes the results for multimodal semantic search. In this experiment we can again see that the representations based on UFL outperform the canonical features. In this case the best performance is obtained by NMF-M based on sAE, achieving an improvement of about 11.6% over the DCT and 27.2% over the Haar descriptors in terms of MAP. Results of the Student's t-test (see Table III) confirm that this difference is significant for NMF-M based on TICA compared with each canonical representation. Notice that these results show that NMF considerably profits from the sAE representation; a possible explanation is that sAE finds better transformations to map the data onto a new space in which NMF-Mixed is able to find good correlations between both modalities. Figure 3 shows the interpolated Precision-Recall graphs for the BOF representations (Figure 3a) and the semantic representations via NMF-Mixed (Figure 3b), in which we can observe the behavior of the precision along the retrieval process, i.e., the precision of the system as the user requests more relevant images. The results show that the representations based on UFL present the best performance in all scenarios, achieving the best early precision and maintaining good performance as more relevant images are required.
Figure 3: Precision-Recall graphs for representations based on learned and canonical (classical) features: DCT (discrete cosine transform), Haar (Haar-based wavelet transform), TICA (Topographic RICA) and sAE (Sparse Autoencoders). (a) BOF representation; (b) multimodal semantic representation via NMF-Mixed (NMF-M).

VII. CONCLUSION

We presented a novel representation strategy for histopathology image retrieval. The strategy consists of using UFL methods to learn the representation of the patches instead of using standard canonical descriptors. The experimental evaluation demonstrates that the use of learned features increases retrieval performance when compared to current state-of-the-art canonical patch descriptors, such as DCT and the Haar wavelet, commonly used in content-based histopathology image retrieval.

ACKNOWLEDGEMENTS

This work was partially funded by the project "Multimodal Image Retrieval to Support Medical Case-Based Scientific Literature Search", ID R1212LAC006, by Microsoft Research LACCIR.
REFERENCES

[1] Y. Bengio, A. Courville, P. Vincent. Representation learning: A review and new perspectives. arXiv preprint arXiv:1206.5538, 2012.
[2] J. C. Caicedo, J. BenAbdallah, F. A. González, O. Nasraoui. Multimodal representation, indexing, automated annotation and retrieval of image collections via non-negative matrix factorization. Neurocomputing, 76(1):50–60, Jan. 2012.
[3] P. Chandrika, C. V. Jawahar. Multi modal semantic indexing for image retrieval. pages 342–349, New York, NY, USA, 2010. ACM.
[4] A. Cruz-Roa, J. Arevalo, A. Madabhushi, F. González. A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. 8150:403–410, 2013.
[5] A. Cruz-Roa, J. C. Caicedo, F. A. González. Visual pattern mining in histology image collections using bag of features. Artificial Intelligence in Medicine, 52(2):91–106, 2011.
[6] A. Cruz-Roa, G. Díaz, E. Romero, F. A. González. Automatic annotation of histopathological images using a latent topic model based on non-negative matrix factorization. Journal of Pathology Informatics, 2, 2011.
[7] G. Csurka, C. Bray, C. Dance, L. Fan. Visual categorization with bags of keypoints. Workshop on Statistical Learning in Computer Vision, ECCV, pages 1–22, 2004.
[8] G. Díaz, E. Romero. Micro-structural tissue analysis for automatic histopathological image annotation. Microscopy Research and Technique, 75(3):343–358, 2012.
[9] F. A. González, J. C. Caicedo, O. Nasraoui, J. Ben-Abdallah. NMF-based multimodal image indexing for querying by visual example. ACM International Conference on Image and Video Retrieval, pages 366–373. ACM Press, 2010.
[10] L. He, L. R. Long, S. Antani, G. R. Thoma. Histology image analysis for carcinoma detection and grading. Computer Methods and Programs in Biomedicine, 107(3):538–556, 2012.
[11] A. Hyvärinen, J. Hurri, P. O. Hoyer. Natural Image Statistics, volume 39, chapter "Energy correlations and topographic organization", page 249. Springer, 2009.
[12] A. R. Jamieson, K. Drukker, M. L. Giger. Breast image feature learning with adaptive deconvolutional networks. SPIE Medical Imaging, page 831506, 2012.
[13] Y. Kang, S. Kim, S. Choi. Deep learning to hash with multiple representations. ICDM, pages 930–935. IEEE Computer Society, 2012.
[14] A. Krizhevsky, G. E. Hinton. Using very deep autoencoders for content-based image retrieval. ESANN, 2011.
[15] Q. V. Le, A. Karpenko, J. Ngiam, A. Y. Ng. ICA with reconstruction cost for efficient overcomplete feature learning. Advances in Neural Information Processing Systems, 24:1017–1025, 2011.
[16] D. D. Lee, H. S. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13:556–562, 2001.
[17] J. Liu. Image retrieval based on bag-of-words model. CoRR, abs/1304.5168, 2013.
[18] W. Liu, N. Zheng, X. Lu. Non-negative matrix factorization for visual coding. ICASSP 2003, volume 3, pages III-293–III-296. IEEE, 2003.
[19] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain. Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell., 22(12):1349–1380, Dec. 2000.
[20] M. J. Swain, D. H. Ballard. Color indexing. International Journal of Computer Vision, 7:11–32, 1991. doi:10.1007/BF00130487.
[21] P. Wu, S. C. H. Hoi, H. Xia, P. Zhao, D. Wang, C. Miao. Online multimodal deep similarity learning with application to image retrieval. ACM Multimedia, pages 153–162. ACM, 2013.