Automatic Image Cropping Using Sparse Coding

Jieying She, Duo Wang¹, Mingli Song
College of Computer Science, Zhejiang University, Hangzhou 310027, China
Email: [email protected]
Abstract—Image cropping is a technique that helps people improve the quality of their photos by discarding unnecessary parts of a photo. In this paper, we propose a new approach that crops a photo for better composition by learning composition structure. First, we classify photos into different categories. Then we extract the graph-based visual saliency map of these photos, based on which we build a dictionary for each category. Finally, by solving a sparse coding problem for each input photo over the dictionary, we find the cropped region that can be best decoded by this dictionary. The experimental results demonstrate that our technique is applicable to a wide range of photos and produces more agreeable resulting photos.
I. INTRODUCTION

Nowadays, people can use several techniques to improve the quality of the photos they take. Cropping is one such tool: people may want to discard the blurred or noisy parts of a photo, or emphasize the central objects by cropping out the most important part of a photo. However, with the growing number of cameras and thus of photos taken, people may find it time-consuming and dull to crop their snapshots manually. Automatic photo cropping can free people from such onerous work.

Various automatic cropping techniques have been proposed, which we briefly review in Section II. Previous research mainly focuses on cropping out the important objects in a photo regardless of aesthetic quality, e.g., [1], [2], and [3]. That is, these techniques mainly determine the cropped region by including those important objects. More recent work begins to emphasize the agreeability of the resulting photos by introducing certain photographic rules, e.g., [4], [5], [6], and [7]. One of them is the Rule of Thirds: a photo is divided into nine equal-sized areas by two horizontal and two vertical lines, and the subjects of the photo should be centered at one of the four intersections of these lines.

Our work, photo cropping based on composition, focuses on the aesthetic composition of photographs. However, unlike other work that applies specific photographic rules, which may not adapt to a wide range of photos, we propose to mine the composition data of a large number of high-quality photos and learn to crop automatically.

The paper is organized as follows. We first review previous automatic cropping techniques in Section II. In Section III, we explain our proposed algorithm in detail. Finally, we show the results of our experiments in Section IV and conclude with further discussion in Section V.

¹ She and Wang contributed equally.
II. RELATED WORK

Among previous cropping techniques, some do not pay sufficient attention to the aesthetic value of a photo and merely crop out the major objects based on a saliency map. Ciocca et al. (2007) [1] used a CART classifier to classify a set of photos into three categories (landscape, close-up, and other); different modifications are applied to the different categories, with the main idea that the cropping result should include the focused elements. Stentiford (2007) [2] also cropped the photo mainly based on a saliency map. Santella et al. (2006) [3] employed eye tracking to help determine the content area for cropping. Such work may be efficient for object-oriented cropping, but it ignores the aesthetic value of photos and may not be applicable to professional photographic cropping.

In some other work, aesthetic evaluation is emphasized. The most frequently applied standards include color, lighting, and composition. To improve photo composition, Luo et al. (2008) [4] and Bhattacharya et al. (2010) [5] applied the Rule of Thirds. In addition to the Rule of Thirds, Liu et al. (2010) [6] applied diagonal dominance, visual balance, and the sizes of salient regions in their evaluation. Other features, including the spatial distribution of edges, color distribution, hue count, blur, contrast, and brightness (Ke et al. (2006) [7]), were also used. Taking these aesthetic evaluations into account, several techniques were proposed. Nishiyama et al. (2009) [8] trained an SVM to label subject regions of a photo as of high or low quality; fitting the quality values to the sigmoid function, they obtained a final quality score by combining the posterior probabilities, and cropped the region with the highest score. Zhang et al. (2005) [9] proposed three models, a composition sub-model, a conservative sub-model, and a penalty sub-model, and combined them linearly into an objective function to determine the cropped region. Although these techniques improve the quality of some photos, they may fail to adapt to a wider range of images, because not all photos rigidly conform to those specific photographic rules.

Cheng et al. (2010) [10] proposed to learn to photograph by mining hundreds of thousands of photos. They segmented photos into patches, and used Gaussian mixture models to discover the positional relationships between pairs of patches and the position patterns of single patches. The motivation of their work is similar to ours, but our method addresses the problem in a different way.
Fig. 1. Training stage: first download photos, then classify scenes and train a multi-class SVM for 13 scene categories, and finally extract a saliency map for each photo and build a dictionary.
III. COMPOSITION BASED AUTOMATIC PHOTO CROPPING

A. System Overview

We give a brief overview of our technique in this subsection. Our method includes a training stage and a testing stage. In the training stage, as shown in Fig. 1, we mine the composition information of over 6000 photos. We first train the multi-class SVM scene classifier proposed by Oliva et al. (2001) [11] for 13 scene categories [12], and then classify each training photo into one of the 13 categories. Subsequent steps are performed in the context of each category. Within each category, we extract a graph-based visual saliency map for each photo using the algorithm proposed by Harel et al. (2007) [13]. We reshape the two-dimensional map matrix into a one-dimensional feature vector for each photo, and learn a dictionary for these feature vectors. Some of the words in the dictionary may be more important in the sense of composition, so we add weights to them. In the testing stage, we also first classify the photo and calculate its saliency map. We then search for the cropped region that can be best decoded from the dictionary by applying sparse coding. Detailed explanations are given in the following subsections.

B. Dataset

To ensure that our technique adapts to a wide range of photos, we crawled over 6000 photos from www.photosig.com. All photos are well appreciated by users.

C. Scene Classification

For our large training dataset, it is neither effective nor practical to mix up all the photos and try to learn their composition in the same context, because photos with different semantic context or structural information can vary greatly in composition. Therefore, we first apply a rough scene classification to group photos into categories.
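A minimal sketch of this classification step is shown below (the underlying scene representation is described next). The gist_descriptor helper is hypothetical, standing in for the global feature extractor of [11], and scikit-learn's SVC stands in for the multi-class SVM we train; neither detail is taken from our implementation.

```python
# Rough sketch of the scene-classification step (Section III-C).
# `gist_descriptor` is a hypothetical helper returning the global scene feature
# of [11] for an image; scikit-learn's SVC stands in for the multi-class SVM.
import numpy as np
from sklearn.svm import SVC

def train_scene_classifier(images, labels, gist_descriptor):
    """Train a 13-way scene classifier on global scene features."""
    X = np.vstack([gist_descriptor(img) for img in images])  # one row per photo
    clf = SVC(kernel="rbf")  # multi-class handled internally (one-vs-one)
    clf.fit(X, np.asarray(labels))  # labels are category indices in 0..12
    return clf

def classify_scene(clf, image, gist_descriptor):
    """Assign a photo to one of the 13 scene categories."""
    feat = gist_descriptor(image).reshape(1, -1)
    return int(clf.predict(feat)[0])
```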
Oliva et al. [11] proposed the Spatial Envelope, a representation of the shape of a scene for classification. It is a set of perceptual dimensions (naturalness, openness, roughness, ruggedness, and expansion) that are related to the shape of the space. We apply their algorithm in our rough classification step. In our work, we use the scene dataset from [12]: we extract global features for the 13 scene categories with [11]'s algorithm and train a multi-class SVM, which we then use for classification in both the training and testing stages. In the training stage, we classify the roughly 6000 photos of our dataset into the 13 categories. Although the photos may not be perfectly classified, our goal is simply to group photos that are structurally related, so that we can better discover the composition or structure information of photos within a group. This is why we choose [11]'s algorithm for the classification step: it represents the shape of a scene without taking local object information into account.

D. Composition Dictionary Learning

In photography, composition refers to the placement or arrangement of elements in a photo. In our technique, we take the elements to be the regions that users are interested in, because these interesting regions capture people's eyes and convey the main message of a photo. We adjust the composition of these regions to improve a photo's overall aesthetic value. To discern the interesting elements of a photo, we adopt the graph-based visual saliency model of Harel et al. (2007) [13]. According to [13], this model powerfully predicts human fixations on various images, achieving 98% of the ROC area of a human-based control. Using [13]'s model, we extract a saliency map of size 22 × 32 for each of our training photos.

Next, within each category, we reshape each two-dimensional saliency map into a one-dimensional feature vector, i.e., a 704 × 1 vector. The value of each element of the vector represents the degree of importance or interest of the corresponding pixel in the photo. The position of each element represents the structural or compositional information of the interesting regions, because a different location of an interesting region in the photo corresponds to different rows of the feature vector.
We next use sparse coding to model the feature vectors as sparse linear combinations of basic elements, i.e., elements from a dictionary. The dictionary can be viewed as a set of basic composition elements for the photos, and each photo can be reconstructed from these basic composition elements. Sparse coding is efficient for solving this problem, and it is also meaningful for our work: we prefer that a photo be reconstructable from a few basic elements, so that it can be viewed as conforming to a few composition standards instead of mixing up too many composition structures, which is unwelcome in photography.

We learn our dictionary in the context of each category. We combine the feature vectors into a feature matrix, i.e., a 704 × n matrix X, where n is the number of photos in the category. Then, for each category, we learn a dictionary D for the feature matrix X. We use Julien Mairal's code [14], which implements the dictionary learning algorithm of [15]. More specifically, we train a dictionary D such that

$$\min_{D}\ \frac{1}{n}\sum_{i=1}^{n}\left\{\frac{1}{2}\,\|x_i - D\alpha_i\|_2^2 + \lambda\,\|\alpha_i\|_1\right\}$$

where xi is each feature vector in X, αi is the corresponding sparse code, and λ constrains the sparsity of α. In our experiment, we set λ to 0.1.
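As an illustration only, the sketch below builds the 704-dimensional feature vectors from precomputed saliency maps and learns a per-category dictionary. It uses scikit-learn's MiniBatchDictionaryLearning, which follows the same online dictionary-learning formulation as [15], rather than the SPAMS toolbox [14] used in our experiments; the dictionary size (256 atoms) is an illustrative choice, since it is not a value stated above.

```python
# Illustrative sketch of per-category dictionary learning (Section III-D).
# `saliency_maps` is assumed to be an (n_photos, 22, 32) array of graph-based
# saliency maps computed beforehand with [13]'s model. scikit-learn's
# MiniBatchDictionaryLearning replaces the SPAMS call used in our experiments;
# the number of atoms (256) is an illustrative choice, not a value from the paper.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def learn_composition_dictionary(saliency_maps, n_atoms=256, lam=0.1):
    n = saliency_maps.shape[0]
    X = saliency_maps.reshape(n, -1)  # each 22x32 map -> a 704-d row vector
    learner = MiniBatchDictionaryLearning(
        n_components=n_atoms,
        alpha=lam,                      # plays the role of lambda above
        transform_algorithm="lasso_lars",
        transform_alpha=lam,
    )
    # Minimizes (1/2)||x - D a||_2^2 + lam * ||a||_1 per sample, online.
    learner.fit(X)
    D = learner.components_.T           # 704 x n_atoms; columns are dictionary atoms
    return D, learner
```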
In a certain dictionary, different elements may have different contributions, i.e., a subset of elements from the dictionary can reconstruct most of the feature vectors. More specifically, when reconstructing a set of photos from the dictionary, let ti be the total number of times that αi, the sparse code corresponding to the ith element, is non-zero, and let tj denote the total number of times that αj is non-zero. ti can be regarded as the contribution of the ith element in the reconstruction process: if ti > tj, we say the ith element contributes more.

Taking this definition of contribution into account, we add weights to the dictionary words to approximate their contributions after building the initial dictionary. The weighting process is as follows. We define a weight vector w, of which each element wi represents the weight of the ith element in the dictionary. Each wi is initially 1; thus, if every element in the dictionary contributed equally, this weighting vector would not affect our original dictionary at all. Then, within each category, we first solve the sparse code α for each training photo, using the code of [14] that implements the LARS algorithm of [16]. Then, for each photo, we find the element in the dictionary that attains

$$\min_{j}\ \|x_i - d_j\,\alpha_j\|_2$$

where xi is the feature vector of the photo, dj is the jth element (a column vector) of dictionary D, and αj is the jth element of α, the sparse code for the photo. Finding such an element means that this element alone can reconstruct the training photo well enough, and therefore contributes more to the composition dictionary. An alternative way to find such an element is to solve a sparse coding problem for each element in the dictionary, but this is time-consuming; since α is sparse, the method above approximates the process efficiently and well enough. After targeting such an element j in dictionary D, we increase its weight wj by 1. Therefore, after decoding all the photos in a category, we obtain an accumulative weight for each element in the dictionary. We normalize wj by

$$w_j' = \frac{w_j}{\sum_{i=1}^{n} w_i}$$

where the sum runs over all elements of the dictionary.

E. Photo Cropping

In the testing stage, we first classify the testing photo into one of the 13 categories and calculate its saliency map M as in the training stage. We then search over the size and position of the cropped rectangle to determine the resulting photo. We denote the cropped rectangle as [regx, regy, cx, cy], where regx and regy are the height and width of the rectangle, and cx and cy give the position of its upper-left vertex with respect to the original photo. We then extract the corresponding sub-saliency map M′ from the saliency map M for that cropped region. An alternative is to re-calculate the precise saliency map for each candidate rectangle, but that is time-consuming and impractical for real-time application; therefore, we calculate the saliency map only once for the whole photo and extract the corresponding saliency region for each cropped rectangle, which reduces the time of cropping a photo from more than 10 minutes to around 10 seconds in our experiment.

Then, for each cropped rectangle, we resize the saliency map M′ to 22 × 32 and reshape it into a 704 × 1 feature vector x′. We solve the weighted version of sparse coding for the cropped region:

$$\min_{\alpha'}\ \left\{\frac{1}{2}\,\|x' - D\alpha'\|_2^2 + \lambda\,\Big\|\mathrm{diag}\!\Big(\frac{1}{w'}\Big)\alpha'\Big\|_1\right\}$$

where diag(1/w′) is the diagonal matrix of reciprocals of the normalized weighting vector of the dictionary, and λ again constrains the sparsity of α′, which we set to 0.1 in our experiment. We again use the code of [14] to solve this problem. Finally, we determine the cropped region by

$$\min_{regx,\,regy,\,cx,\,cy}\ \|x' - D\alpha'\|_2$$

where x′ is the feature vector for the region and α′ is the corresponding sparse code. This means that we find the region that can be best reconstructed by the dictionary; in other words, the resulting photo should best conform to the composition standards reflected in the dictionary.
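A compact sketch of this search is given below. It scores each candidate rectangle by the weighted sparse-coding residual; the weighted L1 penalty is handled by rescaling the dictionary columns by w′ (substituting β = diag(1/w′)α′), which turns the problem into a standard Lasso. The grid of candidate sizes and positions, the Lasso solver (scikit-learn), and the use of OpenCV's resize are illustrative assumptions, not details taken from our implementation.

```python
# Sketch of the testing-stage crop search (Section III-E): score candidate
# rectangles by how well their (resized) sub-saliency map is reconstructed by
# the weighted dictionary. The candidate grid and solver are assumptions.
import numpy as np
import cv2
from sklearn.linear_model import Lasso

def crop_score(sub_map, D, w, lam=0.1):
    """Weighted sparse-coding residual of one candidate crop's saliency region."""
    x = cv2.resize(sub_map.astype(np.float32), (32, 22)).reshape(-1)  # -> 704-d vector
    D_scaled = D * w[np.newaxis, :]          # fold weights into the atoms: D diag(w')
    # sklearn's Lasso scales the quadratic term by 1/n_samples, hence alpha = lam/len(x).
    lasso = Lasso(alpha=lam / len(x), fit_intercept=False, max_iter=5000)
    beta = lasso.fit(D_scaled, x).coef_
    alpha = w * beta                          # recover the weighted code alpha'
    return np.linalg.norm(x - D @ alpha)      # ||x' - D alpha'||_2

def best_crop(saliency_map, D, w, step=0.1, min_scale=0.5):
    """Search [regx, regy, cx, cy] over a coarse grid of sizes and positions."""
    H, W = saliency_map.shape
    best, best_rect = np.inf, None
    for sh in np.arange(min_scale, 1.0 + 1e-9, step):       # candidate heights
        for sw in np.arange(min_scale, 1.0 + 1e-9, step):   # candidate widths
            regx, regy = int(sh * H), int(sw * W)
            for cy in range(0, H - regx + 1, max(1, H // 10)):
                for cx in range(0, W - regy + 1, max(1, W // 10)):
                    sub = saliency_map[cy:cy + regx, cx:cx + regy]
                    score = crop_score(sub, D, w)
                    if score < best:
                        best, best_rect = score, (regx, regy, cx, cy)
    return best_rect, best
```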
Fig. 2. User evaluation: (a) original image; (b) Santella's [3] result; (c) our result; (d) percentage of subjects who chose (b) or (c) as the better result (ours 86.81%, Santella's 13.19%).

Fig. 3. User evaluation: (a) original image; (b) Santella's [3] result; (c) our result; (d) percentage of subjects who chose (b) or (c) as the better result (ours 79.12%, Santella's 20.88%).

Fig. 4. User evaluation: (a) original image; (b) Santella's [3] result; (c) our result; (d) percentage of subjects who chose (b) or (c) as the better result (ours 57.14%, Santella's 42.86%).

Fig. 5. User evaluation: (a) original image; (b) Santella's [3] result; (c) our result; (d) percentage of subjects who chose (b) or (c) as the better result (ours 39.56%, Santella's 60.44%).
IV. EXPERIMENT

Fig. 6. User evaluation results.
A. User Studies

In the experiment, we use the dataset of [6], with a total of 93 photos, for testing; it takes about 10 seconds to process each photo. Some of the results are shown in Figs. 2-5. We randomly chose 11 photos and compared our results with those of Santella et al. [3] in a user study. We designed a questionnaire in which each question includes the original photo as a reference and lists our resulting photo and [3]'s resulting photo as options; users are asked to choose the one with better composition. The display order of the two options is random, and users do not know which photo is ours. We posted the questionnaire online and received 91 replies.

We show some of the survey results in detail in Figs. 2-5, displaying the original testing photos in column (a), [3]'s results in column (b), our results in column (c), and the survey results in column (d). As Fig. 2 and Fig. 3 reflect, our results are much more favored than [3]'s, because our method focuses more on the composition of photos as a whole. Our result is slightly favored in Fig. 4, while users rated [3]'s result higher for Fig. 5. The overall survey results are displayed in Fig. 6: the blue area represents the percentage of users who voted our result as having better composition for each of the 11 photos, and the red area reflects the votes for [3]'s results. Our results are more favored by users for 8 of the 11 photos, and some of our photos are much more appreciated.
B. Comparison with Related Work

We compare our results with some other work in this subsection. As Fig. 7 shows, our cropped results focus on aesthetic value: unrelated objects are discarded. If the original photo is already good enough, we leave it largely untouched, as Fig. 8 shows. Our method can also generate effects similar to those of a more complex technique that re-arranges the relative positions of objects, as shown in Fig. 9. Also, as our results reflect, our technique adapts to a wide range of photos, including natural scenes, animals, human portraits, etc.

Fig. 7. (a) original image; (b) Santella's [3] result; (c) our result.

Fig. 8. (a) original image; (b) Nishiyama's [8] result; (c) our result.

Fig. 9. (a) original image; (b) aesthetically improved photo using the optimal object placement technique of Bhattacharya et al. [5]; (c) our result.

V. CONCLUSION

Our cropping work focuses on learning the composition of a large number of photos, and it is quite efficient. We classify the photos into 13 categories and extract the graph-based visual saliency map of each photo to construct a dictionary for each category. Based on the dictionary, we find a cropped region by sparse coding. We show that our technique generates resulting photos with better composition and that it can be applied to different types of photos. In future work, we aim to improve our scene classification, mine composition from more photos, and accelerate our cropped-region search.

REFERENCES
[1] G. Ciocca, C. Cusano, F. Gasparini, and R. Schettini, "Self-Adaptive Image Cropping for Small Displays," IEEE Transactions on Consumer Electronics, vol. 53, no. 4, pp. 1622–1627, 2007.
[2] F. Stentiford, "Attention Based Auto Image Cropping," in ICVS Workshop on Computational Attention & Applications, 2007.
[3] A. Santella, M. Agrawala, D. DeCarlo, D. Salesin, and M. F. Cohen, "Gaze-based interaction for semi-automatic photo cropping," in CHI '06, pp. 771–780, 2006.
[4] Y. Luo and X. Tang, "Photo and video quality evaluation: Focusing on the subject," in Proceedings of the 10th European Conference on Computer Vision: Part III (ECCV '08), pp. 386–399, 2008.
[5] S. Bhattacharya, R. Sukthankar, and M. Shah, "A framework for photo-quality assessment and enhancement based on visual aesthetics," in Proceedings of the International Conference on Multimedia (MM '10), ACM, pp. 271–280, 2010.
[6] L. Liu, R. Chen, L. Wolf, and D. Cohen-Or, "Optimizing photo composition," Computer Graphics Forum (Proceedings of Eurographics), vol. 29, no. 2, pp. 469–478, 2010.
[7] Y. Ke, X. Tang, and F. Jing, "The design of high-level features for photo quality assessment," in Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 419–426, 2006.
[8] M. Nishiyama, T. Okabe, Y. Sato, and I. Sato, "Sensation-based photo cropping," in Proceedings of the 17th ACM International Conference on Multimedia, Vancouver, British Columbia, Canada, pp. 669–672, 2009.
[9] M. Zhang, L. Zhang, Y. Sun, L. Feng, and W. Ma, "Auto Cropping for Digital Photographs," in IEEE International Conference on Multimedia and Expo, 2005.
[10] B. Cheng, B. Ni, S. Yan, and Q. Tian, "Learning to photograph," in Proceedings of the International Conference on Multimedia (MM '10), ACM, pp. 291–300, 2010.
[11] A. Oliva and A. Torralba, "Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, May 2001.
[12] F.-F. Li and P. Perona, "A Bayesian Hierarchical Model for Learning Natural Scene Categories," in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 524–531, Jun. 2005.
[13] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Advances in Neural Information Processing Systems 19, pp. 545–552, 2007.
[14] J. Mairal, "SPAMS (SPArse Modeling Software), version 2.0," http://www.di.ens.fr/willow/SPAMS/index.html, February 2010.
[15] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online Learning for Matrix Factorization and Sparse Coding," Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.
[16] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Annals of Statistics, vol. 32, pp. 407–499, 2004.