Using the Social Networking Graph for Image Organization

Brandeis Marshall and Siddharth Pandey
Purdue University, Computer and Information Technology
401 North Grant Street, West Lafayette, Indiana 47906
{brandeis, spandey}@purdue.edu
Abstract—The process of record-keeping and maintenance of digital photographs is an emerging concern. In this paper, we present a tag organization framework that clusters photographs with similar content. Using a social network graph model, we relate objects and the corresponding relationships amongst them. Each photograph is represented in our framework with five attributes: objects, place, occasion, time and associations. Through experimentation on the MIRFLICKR-25000 image collection, we identify and examine the object popularity and object closeness properties.

Keywords-image retrieval, photo properties, MIRFLICKR collection, algorithms, experimentation
I. INTRODUCTION

Digital cameras allow for instant photographs; however, they typically auto-label photos according to a sequential numerical scheme, while users rename photos based on context such as date, place and/or event. Kirk et al. [10] identify users' struggle to organize their photos, since a download session usually consists of multiple images taken over a span of time. Traditionally, image retrieval approaches follow one of two paths: content-based image retrieval (CBIR) [2], [18], [20] or photo annotation [1]. In CBIR, a photo is represented as a mixture of color, edge, shape and texture features. Each feature has a varying number of dimensions and is denoted as a sequence of numerical values. These low-level numerical features are then applied to capture high-level semantics for each photo. In photo annotation, humans provide the semantic content, typically through single-word descriptions. A popular trend in sharing consumer photo collections is online photo sharing applications such as Flickr [5] and Picasa [7]. These applications allow for single and group photo tagging. However, the modular platform of these applications limits their portability and extensibility: (1) users must maintain a username and password, (2) users have restricted storage capabilities and (3) photo annotations are non-transferable.
A. Motivation

Both image retrieval approaches have drawbacks. CBIR requires large amounts of image processing, which is time-consuming and resource inefficient. Photo annotation relies on human intervention, which can be error-prone and produce disjoint, unrelated tags. On the other hand, photo annotation is easy to use and requires no additional computing resources. The lack of portability and extensibility of photo annotations in photo sharing applications is a growing problem. Photo sharing users must either commit to a single photo sharing application or duplicate their work across multiple applications. Currently, the photo annotations are part of the photo sharing application, which is stored on remote servers. This reliance on remote servers for primary storage, indexing and retrieval limits user accessibility. The challenge lies in designing a photo storage, indexing and retrieval system that increases the flexibility of these photo sharing portals. We address this issue by attaching the metadata associated with any given photo to the photo itself rather than to the photo sharing application. We leverage user familiarity with social networking models. Since various social networking models can be represented as graphs [4], we propose a graph-based framework for categorizing photographs. We relate objects and the corresponding relationships amongst them. Each photograph is represented in our framework with five attributes: objects, place, occasion, time and associations. Through experimentation on the MIRFLICKR-25000 [8] image collection, we measure the popularity of objects within images and the closeness factor for a given pair of objects.
B. Contributions

We make several contributions in this work. We address the photo storage, indexing and retrieval problem. We create a social networking framework that considers objects, time, place, occasion and association. We use the graph properties of degree and closeness to identify objects that are popular (i.e., occur in a larger number of photographs), objects that are closely connected by a specific edge theme and objects that can appear alongside two or more other objects in a photograph. Through experimental evaluation, we examine the usefulness
of our degree and closeness properties in terms of tag reuse and photograph retrieval.

II. RELATED WORK

Modern online photo sharing applications like Flickr [5] and Picasa [7] allow users to tag their photographs with multiple keywords. These keywords allow the application to group photographs and, in turn, make it easier for a user to search for photographs [1], [12], [14], [19]. Sociality and function are two perspectives of tagging [1]. Sociality categorizes tags based on the tag's end user, who is either the individual or the social community around the individual. Function indicates the purpose of the tag, which can be either photograph organization or providing additional information to the end user. The study reports that tagging behavior is motivated by social influence. The social influence on tagging has been further studied by Nov et al. [14], who analyzed the number of tags and the people tagging these photographs on Flickr. The study supports the impact of the social circle on tagging behavior. A similar study [19] proposes Bayesian networks to model users' interpersonal relationships. The application provides the user an interactive way to exercise their collective memories by using interpersonal relationships and tags to select interrelated photos. Beyond traditional tag clustering lies the tag ranking research of Liu et al. [12], which shows that nearly half of all image tags are not closely related to the photographs. To assist the tagging process, various studies [2], [11], [13], [15], [17], [18] present methods to automate annotation. The AnnoSearch system [20] considers a hybrid approach of visual and text-based search in order to improve semantic content-based image retrieval (CBIR). The study analyzes the image annotation process through a data-driven approach, while traditional approaches focus on training concept models, supervised learning techniques and other machine learning techniques. These studies [2], [18], [20] are based on CBIR, which relies on extracting the visual features of an image such as color, texture and shape. These features are then used to generate unique characteristics of the objects in the image and then produce similar images with matching characteristics. Hwang et al. [9] use WordNet to form clusters of semantically similar keywords; they then use the clusters to retrieve the annotated images. The CBIR-based approaches attempt to identify objects in a photograph; however, they do not focus on the other attributes of an image, such as "where" the photograph was taken, "what" occasion it was taken at and the association between objects, i.e., "how" they are related [16], [21]. We use the social networking graph to identify the set of attributes (who, what, when, how, where) for a photograph while also analyzing combinations of these attributes for a suitable tag recommendation. The text-based approaches utilize the existing keywords of an image to form clusters of semantic keywords. If the existing keywords do not cover all the attributes of the set stated above, the resulting cluster does not contain that type of description for the photograph and hence is not collectively exhaustive over the photograph content.
To the best of our knowledge, we are the first to formally define an image collection architecture using a social networking model. The work closest to ours is Golder [3], which does not consider all of the attributes of time, place, event, association and objects; it also omits the degree and closeness properties and concentrates mainly on identifying persons with respect to an event or a place.

III. SOCIAL NETWORKING FRAMEWORK

On a personal computer, digital photographs are typically managed, browsed and retrieved through a hierarchical collection of semi-related directories. The date and event name are used consistently to organize photos [10]. However, users must manually create and rename folders and sub-folders to make future viewing or retrieval accessible. The social networking framework captures relationships amongst photos based on objects within the photos as well as metadata associated with a group of photos.

A. Photo Representation

We describe the content of a photo using the who, how, where, what and when attributes of a photograph. We classify these five major attributes as object, association, place, event/occasion and time, respectively. A photo can be described using a combination of any of these five attributes, as shown in Figure 1:
Fig. 1. Photo data and metadata attributes
We define a photo P as a 5-tuple:

    P = (t, b, p, o, a)    (1)

where timestamp t ≠ null. In older digital photographs, the date of a photo may not be available. If a photo in our experimental evaluation does not include a date, then we assign a default timestamp. These five attributes categorize images and identify possible themes in a photograph. The attributes are further described below:

1) Timestamp t. A digital camera image contains the date of "when" the photograph was taken.
2) Object b. An object is any entity, "who" or "thing", that can be identified in a photograph.
3) Place p. The location or landmark "where" the photograph was taken, e.g. city, state, country or region. With GPS-equipped digital cameras, the longitude and latitude information may be stored in this attribute.
4) Event o. Any image may describe a "what" event or occasion, such as a wedding, birthday or vacation. An event connects an activity involving people at a particular place at a specified time.
5) Association a. An association denotes the "how" attribute of the objects. This relationship may be personal, such as father, mother or brother, or may signify a group relationship such as family, friends or colleagues.
Using the five attributes defined in equation (1), a photograph is represented as an undirected social networking graph SNG(V, E), where V denotes the set of all objects and E denotes the set of all relationships between any two objects in the graph. Based on the 5-tuple, the date, place, event and association content related to any particular photograph is stored as an edge e(vi, vj) for a given pair of objects vi, vj. Since objects may appear in multiple photos, SNG can have multiple edges between the objects vi and vj of the form eχ(vi, vj), where χ ∈ {t, p, o, a}. Hence, for a given photograph, we may have a set of multiple objects, events, places and associations. A photograph Pi can be transformed into a clique, which constructs edges between every object pair. Hence, the social network graph is a collection of subgraphs for ease in browsing and retrieval, as represented in equations (2) and (3):

    subgraph G(P) = Clique(V′, E′)    s.t.    (2)

    V′ = {v1, . . . , vn} and E′ = {e1, . . . , em}    (3)
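To make this construction concrete, the following minimal Python sketch stores each photo's 5-tuple and expands it into a clique of attributed edges. The Photo and SocialNetworkGraph names, the per-edge record layout and the example values are our own illustrative assumptions, not part of the paper's formal definition:

    from collections import defaultdict
    from itertools import combinations

    class Photo:
        """A photo as the 5-tuple P = (t, b, p, o, a) from equation (1)."""
        def __init__(self, pid, t, objects, place=None, occasion=None, association=None):
            if t is None:
                raise ValueError("timestamp t must not be null")
            self.pid = pid                 # photo identifier
            self.t = t                     # "when": timestamp
            self.objects = set(objects)    # "who"/"thing": objects b
            self.p = place                 # "where": place
            self.o = occasion              # "what": event/occasion
            self.a = association           # "how": association

    class SocialNetworkGraph:
        """Undirected multigraph SNG(V, E): objects are vertices, and each
        photo contributes a clique whose edges carry that photo's metadata."""
        def __init__(self):
            # edges[(vi, vj)] holds one attribute record per photo, since two
            # objects may co-occur in many photographs (multiple e_chi edges)
            self.edges = defaultdict(list)

        def add_photo(self, photo):
            record = {"pid": photo.pid, "t": photo.t, "p": photo.p,
                      "o": photo.o, "a": photo.a}
            # Clique(V', E'): one edge per object pair, per equations (2)-(3);
            # note a single-object photo contributes no edges under this reading
            for vi, vj in combinations(sorted(photo.objects), 2):
                self.edges[(vi, vj)].append(record)

    # Figure 2's example photo: four objects, taken at home in July 2008 on vacation
    sng = SocialNetworkGraph()
    sng.add_photo(Photo(1, "2008-07", {"sky", "mountain", "plant", "cactus"},
                        place="home", occasion="vacation"))

Storing one attribute record per photo on each edge preserves the multigraph semantics of eχ(vi, vj): two objects that co-occur in several photos keep one record per photo.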
For each subgraph, the place, event and association contents may not be recorded initially, so the object and date information become the primary sources for search. However, our framework lends itself to semi-automatic labeling techniques, since commonly searched components are centralized. The semi-automatic labeling of event and association can occur by leveraging the existing date and object content, respectively; a sketch of one possible approach follows. Thus, repeatedly annotating photos with the same tag could be significantly reduced. Semi-automated labeling for photos is part of future work.
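Since the paper leaves semi-automatic labeling to future work, the following is only a speculative sketch of one way date content could drive event labeling; the month-granularity grouping and majority-vote rule are our assumptions, and the Photo class comes from the earlier sketch:

    from collections import Counter, defaultdict

    def suggest_event_labels(photos):
        # Hypothetical rule: photos taken in the same month inherit that
        # month's most common event tag when their own event is missing.
        by_month = defaultdict(list)
        for photo in photos:
            by_month[str(photo.t)[:7]].append(photo)  # assumes "YYYY-MM..." timestamps
        for group in by_month.values():
            events = Counter(p.o for p in group if p.o is not None)
            if events:
                common = events.most_common(1)[0][0]
                for p in group:
                    if p.o is None:
                        p.o = common                   # propagate the suggestion
        return photos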
We show in Figure 2 an example of how the social network graph is constructed for a single photo. Photograph Pi is represented with four objects (sky, mountain, plant, cactus). The number of objects associated with any given photograph must be at least 1, with no maximum. The edges contain date, place, event and association metadata; we provide sample edge information in Figure 3. Hence, the photo is stored as a collection of objects in which each edge in the clique contains the same date, place, event and association content. The photo content can be updated or modified according to the user's specifications.

Fig. 2. Example graph representation in the social networking graph model: (a) landscape picture of foreground plant life and a cactus, with sky and a mountain range in the background; (b) picture graph in which edge time = July 2008 and edge occasion = vacation

In the case of multiple photos Photo1, Photo2 and Photo3, we can view these photos' components in a network graph such as Figure 3. We have a series of three photos, with four objects represented in each photo. The notation Photo_y^χ refers to photograph y's value for attribute χ. Photo_2^p = Hawaii means the second photograph was taken in Hawaii, and Photo_1^p = home means the first photograph was taken at a place denoted home. Photo_1^o = vacation means the first photograph occurred during a vacation. Photo_1^t, Photo_2^t = July 2008 mean the first and second photographs were taken in July 2008, and Photo_3^t = Aug 2008 means the third photograph was taken in August 2008. Photo1 and Photo2 have disjoint objects; however, the objects Sky and John may be related with respect to time, place or event. Photo2 and Photo3 share one common object, John. A search for "pictures containing John" would return Photo2 and Photo3, whereas "pictures associated with John" would retrieve Photo1, Photo2 and Photo3.

Fig. 3. Social network graph representing the three photographs (objects Cactus, Mountain, Water, Plant, Sky, Jack, Sawyer, John, Garden, Kate and Christie): Photo^p refers to the place the photo was taken, Photo^o refers to a photo's event and Photo^t refers to a photo's date
B. Photo Properties
We apply our photograph organization framework to assist in discovering correlations amongst objects, dates, places, events and associations. We consider the impact of the degree and closeness measures [4] to facilitate more effective photo categorization. We motivate each property:
• Degree. What is the popularity of an object vi in the social networking framework?
• Closeness. What is the relatedness of objects vi, vj in our photo collection?
For the degree, we define "object popularity" by the number of photos containing a common object, which dictates the pool of candidate photos for retrieval. However, object popularity is a one-dimensional perspective of a photo collection. Since photos typically have more than one tag, we should also consider the bond between two (or more) objects,
which we handle through the closeness parameter. In the next two subsections, we define the degree and closeness parameters more formally in the context of our social networking framework.

1) Degree Property: The degree of an object refers to the number of distinct edges interacting with it. In our interpretation of the social networking graph, each edge may contain up to four attributes, i.e. time, place, event and/or association. Thus, we are concerned with the number of unique photographs associated with a particular object vi. In Figure 3, deg(John) = 3, deg(Sky) = 2 and the remaining objects have a degree of 1.

1: function deg(graph: SNG, object: vi)
2:   pid = empty                      // storage of unique photographs
3:   for each neighbor vj of vi do
4:     for edges eχ(vi, vj) ∈ E do
5:       if pid.get(eχ(vi, vj)) = null then
6:         pid.insert(eχ(vi, vj))
7:   return pid.length

In line 2, we use the pid data structure to record the photographs related to an object vi. In line 3, we begin by considering all edges connected to the target object vi. If an edge's contents have not been observed and stored previously, we add that edge, as shown in lines 4-6. Lastly, we return the number of distinct edges related to the object.
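A direct Python transcription of the deg pseudocode, assuming the SocialNetworkGraph sketch from Section III-A and treating each stored record as one photograph's edge (our reading of lines 4-6), might look as follows:

    def deg(sng, vi):
        # Count the distinct photographs on edges incident to object vi
        pid = set()                          # line 2: unique photographs seen
        for (a, b), records in sng.edges.items():
            if vi in (a, b):                 # line 3: edges touching vi
                for rec in records:          # line 4: each e_chi(vi, vj) edge
                    pid.add(rec["pid"])      # lines 5-6: insert if unseen
        return len(pid)                      # line 7

    print(deg(sng, "sky"))                   # 1 for the single-photo example graph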
2) Closeness Property: The closeness measure of two objects, close(vi, vj, χ), computes the distance between the objects based on a specific χ. Note that the χ parameter distinguishes the desired path in terms of time, place, event and/or association. We implement Dijkstra's algorithm (beginning at line 8) to find the shortest search path from source object vi to target object vj; we outline the pseudocode below. If a path is desired for multiple attributes, then the close algorithm extends the use of χ = t, p, o, a to any combination of attributes {{t, p, o}, . . . , {t, p}, {t, o}, . . .}.

 1: function close(graph: SNG, object: vi, object: vj, type: χ)
 2:   for object vy ∈ SNG do
 3:     eχ(vi, vy) = ∞              // unknown distance from source vi to vy
 4:     prev(vy) = null             // previous node in optimal path from source
 5:   eχ(vi, vi) = 0
 6:   Q = all vertices V ∈ SNG
 7:   vcurr = vi
 8:   while Q ≠ ∅ and vj not reached do
 9:     object vclose = smallest eχ(vcurr, vu)
10:     if eχ(vcurr, vclose) = ∞ then
11:       break
12:     remove vclose from Q
13:     for each neighbor vn of vclose do
14:       alt = eχ(vcurr, vclose) + eχ(vclose, vn)
15:       if alt < eχ(vclose, vn) then
16:         prev(vn) = vclose
17:         vcurr = vn
18:         break for-loop
19:   return prev
TABLE I
MIRFLICKR CATEGORY TESTBED

Category   #Images   #No Tags
animal       3216       261
bird          741        40
clouds       3700       223
dog           684        46
flower       1823       145
lake          791        41
night        2711       220
river         894        36
sea          1322        92
sky          7912       542
sunset       2135       169
tree         4683       342
water        3331       206
The main part of our closeness algorithm lies in lines 8-18, in which we compute the nearest neighbor. The nearest neighbor path is determined by matching edge information. We consider two closeness effectiveness measures: minimum object path (MOP) and maximum image distance (MID). The close function returns MOP, the set of objects that connects object vi to object vj. Given MOP, we can capture the number of distinct photographs, which we call MID. We compute both measures in the experimental section.
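The sketch below shows one way MOP and MID could be computed on the SocialNetworkGraph sketch. Because matching edges can be treated as unit length, breadth-first search stands in for Dijkstra's algorithm here, and restricting traversal to edges that record a value for attribute χ is our interpretation of "matching edge information":

    from collections import defaultdict, deque

    def close(sng, vi, vj, chi):
        """Shortest path from vi to vj over edges that record attribute chi.
        Returns (MOP, MID): the intermediate objects on a shortest path and
        the number of distinct photographs whose edges realize that path."""
        # Adjacency restricted to edges carrying a non-null chi attribute
        adj = defaultdict(list)
        for (a, b), records in sng.edges.items():
            pids = {r["pid"] for r in records if r[chi] is not None}
            if pids:
                adj[a].append((b, pids))
                adj[b].append((a, pids))

        # Unit edge weights make breadth-first search equivalent to Dijkstra
        prev = {vi: None}
        photos = {vi: set()}              # photos accumulated along the path
        queue = deque([vi])
        while queue:
            v = queue.popleft()
            if v == vj:
                break                     # target reached (line 8's exit)
            for n, pids in adj[v]:
                if n not in prev:
                    prev[n] = v
                    photos[n] = photos[v] | pids
                    queue.append(n)
        if vj not in prev:
            return None, 0                # vi and vj not connected via chi

        mop = []                          # walk prev back to list the
        v = prev[vj]                      # intermediate objects (MOP)
        while v is not None and v != vi:
            mop.append(v)
            v = prev[v]
        return list(reversed(mop)), len(photos[vj])   # (MOP, MID)

Under this reading, MID counts the distinct photographs whose edges realize the shortest path, so a pair of frequently co-photographed objects yields a small MOP but a potentially large MID.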
IV. EXPERIMENTAL EVALUATION

The experiments in this study were carried out using the MIRFLICKR-25000 [8] image collection. We only consider the image directories of the most popular tags, i.e. 13 of the 24 categories. We show in Table I the number of explicitly labeled and unlabeled images. We ignored the unlabeled images in our experiments. The number of images per category varies greatly; the 'dog' and 'bird' directories contain the fewest photos. We observe that the set of 9,675 images over the 13 categories contains 16,868 unique tags. Image tags containing special characters (e.g. méxico) or non-English character sets were discarded when filtering the tag information. We use the time attribute to form edges. For our tests, we consider a basic supervised learning model in which to examine the applicability of the social networking graph model for image organization. During the training phase, we vary the training set size to include 100, 200 and 500 photographs. We then conduct a testing phase with 100 photographs. The training and testing set sizes are determined by the small number of labeled photos in the 'dog' category. The training and testing images are disjoint labeled sets.

A. Degree Property

To address tag reuse, we compute the average degree for the 13 categories as shown in Figure 4. We notice a minimal increase in degree as the training set size increases. Since the training set size does not dramatically impact the degree, we only report results for a training set of 500 and a testing set of 100.
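As a rough illustration only, the pipeline below combines the earlier sketches to reproduce this kind of measurement; the keep_tag filtering rule and the list-slicing train split are our assumptions rather than the authors' procedure:

    import re

    def keep_tag(tag):
        # Hypothetical filter in the spirit of Section IV: drop tags with
        # special characters or non-English character sets
        return re.fullmatch(r"[a-z0-9]+", tag.lower()) is not None

    def average_degree(photos, train_size):
        """Build the SNG from a training slice and average deg over objects."""
        graph = SocialNetworkGraph()
        for photo in photos[:train_size]:
            photo.objects = {t for t in photo.objects if keep_tag(t)}
            if len(photo.objects) >= 2:   # need a pair of objects to form an edge
                graph.add_photo(photo)
        objects = {v for pair in graph.edges for v in pair}
        return sum(deg(graph, v) for v in objects) / max(len(objects), 1)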
Fig. 4. Average degree per category for training set sizes 100, 200 and 500 (average degree ranges from 0 to 2 across the 13 categories)
Our results indicate that, on average, the tags in every category were associated with more than one theme. We further investigate the number of tags associated with each image. Table II (column 2) displays the average number of tags per photo for each category during the training phase. In contrast to web search, we observe an average of 7-12 tags per photo. With many tags representing one image, an effective framework to organize the photo collection becomes necessary.

TABLE II
AVERAGE TAG COUNT AND PERCENT MATCH

Category   Avg. Tags   % Match
animal        8.15      46.01%
bird          7.75      45.03%
clouds       11.00      46.72%
dog           8.89      55.56%
flower        8.64      44.32%
lake         10.23      51.12%
night        10.21      43.48%
river         9.80      49.38%
sea          11.11      52.92%
sky          12.44      45.41%
sunset       11.07      49.49%
tree         10.99      44.94%
water        10.36      47.49%

In addition to counting tags, we compute the percent tag match during our testing phase. Table II (column 3) shows an approximately 50% tag match across the categories. The tags overlap across the categories; however, many unmatched tags remain. Our social networking framework can model both popular and infrequent tags to make photo retrieval more efficient. We notice that an image tag can be identical to the category name. During the training phase of 6,500 images, the average category degree is 161, with the category tag 'water' having a degree of 268 and the category tag 'animals' having a degree of 23. During the testing phase, about 19% of the images have the same tag as the category name, e.g. 'dog' appears in 64 images while the keyword 'animals' appears in only 2 images.
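One plausible reading of the percent tag match, sketched with the Photo objects from the earlier sketches (the exact matching rule used in the paper is not specified, so this is an assumption):

    def percent_tag_match(train_photos, test_photos):
        # Share of tags on test photos already seen anywhere in the training set
        train_tags = set()
        for p in train_photos:
            train_tags |= p.objects
        test_tags = [t for p in test_photos for t in p.objects]
        matched = sum(1 for t in test_tags if t in train_tags)
        return 100.0 * matched / max(len(test_tags), 1)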
B. Closeness Property

To find the relationship between two objects in our framework, we compute the minimum object path and maximum image distance measures. Due to the large number of images with low degrees, we only test our closeness property measures on objects whose degree is greater than 10 within each category. Higher degree values increase the likelihood that any two objects appear in different photographs.

TABLE III
CLOSENESS PROPERTY MEASURES

Category   Avg MOP   Avg MID
animal        1.23      1.90
bird          1.12      2.89
clouds        1.08      2.71
dog           1.05      5.83
flower        1.01      3.77
lake          1.06      3.32
night         1.14      2.47
river         1.07      2.75
sea           1.05      2.84
sky           1.11      2.28
sunset        1.10      2.36
tree          1.11      2.14
water         1.06      2.32
Table III reports our MOP and MID measure computations. Any pair of objects may be connected by more than one path; our MOP computation finds only the minimum number of intermediate objects. We see that, on average, only one object lies between any pair of objects. Hence, higher degree images tend to be annotated with similar tags. Our MID computation shows the average number of photos that intersect a pair of objects. The category testbed size impacts neither the MOP nor the MID measures, e.g. 'bird' has the fewest photos while 'sky' has the most. Thus, in photo retrieval, similar images with matching objects can be searched with greater ease.

V. CONCLUSION AND FUTURE WORK

We propose a novel photograph storage, indexing and retrieval tool by constructing a social networking framework. Our social networking graph represents photographs through objects (items and persons), time, place, occasion and association. We define two graph properties: degree and closeness. The degree property considers the popularity of an object in our social networking framework. The closeness property focuses on the strength of the connectivity between a pair of our framework attributes. We show through experiments that an average degree of less than 2 still yields closeness values of approximately 50% for each of our 13 categories. We outline our next steps in combining degree and closeness for tag clustering. In the future, we would like to leverage the degree and closeness properties to define another property, betweenness, which finds relationships among 3 or more social networking framework attribute parameters. We would also like to construct a tag recommendation algorithm that incorporates the degree, closeness and betweenness properties.
REFERENCES

[1] M. Ames and M. Naaman. Why we tag: motivations for annotation in mobile and online media. In CHI '07: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 971-980, New York, NY, USA, 2007. ACM.
[2] H.-M. Chen, M.-H. Chang, P.-C. Chang, M.-C. Tien, W. H. Hsu, and J.-L. Wu. SheepDog - group and tag recommendation for Flickr photos by automatic search-based learning. In ACM Multimedia, pages 737-740, 2008.
[3] S. Golder. Measuring social networks with digital photograph collections. In Proceedings of the ACM Conference on Hypertext and Hypermedia (HT), pages 43-48, 2008.
[4] R. A. Hanneman and M. Riddle. Introduction to social network methods. Online at http://www.faculty.ucr.edu/~hanneman/nettext/, 2005.
[5] http://www.flickr.com. Last accessed February 2010.
[6] http://www.flickr.com/photos/tags/. Retrieved October 2009.
[7] http://www.picasaweb.google.com. Last accessed February 2010.
[8] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval, New York, NY, USA, 2008. ACM.
[9] K. Hwang, J. Choi, and P. Kim. An efficient retrieval of annotated images based on WordNet. In The 9th International Conference on Advanced Communication Technology, volume 1, pages 368-372, Feb. 2007.
[10] D. S. Kirk, A. J. Sellen, C. Rother, and K. R. Wood. Understanding photowork. In Proc. CHI 2006, pages 761-770, 2006.
[11] S. Lindstaedt, V. Pammer, R. Mörzinger, R. Kern, H. Mülner, and C. Wagner. Recommending tags for pictures based on text, visual content and user context. In ICIW '08: Proceedings of the Third International Conference on Internet and Web Applications and Services, pages 506-511, Washington, DC, USA, 2008. IEEE Computer Society.
[12] D. Liu, X.-S. Hua, L. Yang, M. Wang, and H.-J. Zhang. Tag ranking. In Proceedings of ACM WWW, pages 351-360, 2009.
[13] G. Miao, J. Tatemura, W.-P. Hsiung, A. Sawires, and L. E. Moser. Extracting data records from the web using tag path clustering. In Proceedings of ACM WWW, pages 981-990, 2009.
[14] O. Nov, M. Naaman, and C. Ye. What drives content tagging: the case of photos on Flickr. In CHI '08: Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, pages 1097-1100, New York, NY, USA, 2008. ACM.
[15] D. Ramage, P. Heymann, C. D. Manning, and H. Garcia-Molina. Clustering the tagged web. In Proceedings of the ACM International Conference on Web Search and Data Mining, pages 54-63, 2009.
[16] T. Rattenbury, N. Good, and M. Naaman. Towards automatic extraction of event and place semantics from Flickr tags. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 103-110, 2007.
[17] X. Rui, M. Li, Z. Li, W.-Y. Ma, and N. Yu. Bipartite graph reinforcement model for web image annotation. In Proceedings of the 15th International Conference on Multimedia, pages 585-594, 2007.
[18] B. Sigurbjörnsson and R. van Zwol. Flickr tag recommendation based on collective knowledge. In Proceedings of ACM WWW, pages 327-336, 2008.
[19] D. Uriu, N. Shiratori, S. Hashimoto, S. Ishibashi, and N. Okude. CaraClock: an interactive photo viewer designed for family memories. In CHI EA '09: Proceedings of the 27th International Conference Extended Abstracts on Human Factors in Computing Systems, pages 3205-3210, New York, NY, USA, 2009. ACM.
[20] X.-J. Wang, L. Zhang, F. Jing, and W.-Y. Ma. AnnoSearch: image auto-annotation by search. In CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1483-1490, Washington, DC, USA, 2006. IEEE Computer Society.
[21] L. Wu, L. Yang, N. Yu, and X.-S. Hua. Learning to tag. In Proceedings of ACM WWW, pages 361-370, 2009.