
Faculty of Mathematics, Natural Sciences and Computer Science Institute of Computer Science

COMPUTER SCIENCE REPORTS Report 03/11 December 2011

AN EXTENSIBLE PERSONAL PHOTOGRAPH COLLECTION FOR GRADED RELEVANCE ASSESSMENTS AND USER SIMULATION DAVID ZELLHÖFER

Computer Science Reports Brandenburg University of Technology Cottbus ISSN: 1437-7969 Send requests to:

BTU Cottbus Institut für Informatik Postfach 10 13 44 D-03013 Cottbus

David Zellhöfer [email protected], http://dbis.informatik.tu-cottbus.de


Head of Institute: Prof. Dr. Hartmut König BTU Cottbus Institut für Informatik Postfach 10 13 44 D-03013 Cottbus

Research Groups: Computer Engineering Computer Network and Communication Systems Data Structures and Software Dependability Database and Information Systems Programming Languages and Compiler Construction Software and Systems Engineering Theoretical Computer Science Graphics Systems Systems Distributed Systems and Operating Systems Internet-Technology

CR Subject Classification (1998): H.3.4, H.3.3 Printing and Binding: BTU Cottbus


Headed by: Prof. Dr. H. Th. Vierhaus Prof. Dr. H. König Prof. Dr. M. Heiner Prof. Dr. I. Schmitt Prof. Dr. P. Hofstedt Prof. Dr. C. Lewerentz Prof. Dr. K. Meer Prof. Dr. D. Cunningham Prof. Dr. R. Kraemer Prof. Dr. J. Nolte Prof. Dr. G. Wagner

An Extensible Personal Photograph Collection for Graded Relevance Assessments and User Simulation Preprint David Zellhöfer Brandenburg Technical University Database and Information Systems Group Walther-Pauer-Str. 1, 03046 Cottbus


ABSTRACT

In this paper, we present a document collection with graded relevance assessments that has been sampled from real photographers. In order to reflect vagueness both in the commonly used retrieval calculations and in the user’s query, we argue that current document collections based on binary relevance judgments have drawbacks, in particular if user-centered or relevance feedback-related experiments are conducted. In addition, such system-centric collections are based on documents that do not necessarily reflect a layperson’s personal photo collection. To overcome this issue, we suggest a test set of documents that is based on a study of 19 real photographers, which distinguishes it from Flickr downloads or the like. The collection has been categorized on the basis of different criteria such as document quality or motif quality, in addition to the aforementioned graded relevance assessments. Reflecting the photograph-taking behavior of the investigated photographers, we also provide an event-based ground-truth in addition to a topic-based one. In total, 128 different topics are available for the collection. In order to provide means to address different photographer and user types, e.g. in user simulations or in usability engineering, we make the demographic information of both photographers and assessors available. Eventually, this links interactive information retrieval evaluation with persona-based interaction design, a factor that has been neglected in multimedia information retrieval so far.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Performance evaluation (efficiency and effectiveness)

General Terms
Human Factors, Experimentation, Measurement

1. INTRODUCTION

Current image collections rely mainly on binary relevance judgments regarding a topic, i.e., whether a document is relevant or not. It can be argued that such assessments, being based on the result of a similarity measurement such as the cosine measure, are too crisp and insensitive. This is due to two main reasons:

First, common approaches in multimedia information retrieval (MIR) that calculate the relevance of a document w.r.t. a query incorporate heuristics, vagueness or probability assessments, i.e., they cannot provide a definite judgment. They merely calculate a probability of relevance. Second, the relevance of a document w.r.t. a query is always subject to the user’s subjective notion of relevance. That is, a document that has been rated highly relevant by one user can be rated less relevant or irrelevant by another user. As a consequence, the whole retrieval process is characterized by imprecision and subjectivity.

Thus, we will present a new document collection that relies on graded relevance judgments reflecting the subjectivity in relevance judgments. As classical precision and recall as well as the mean average precision (MAP) measurements cannot cope with such assessments, we suggest the usage of the discounted cumulative gain (DCG) measurement [8]. DCG relies on graded relevance assessments and is increasingly used within the information retrieval (IR) community. This is reflected by a performance evaluation of different metrics presented at SIGIR ’11 showing that DCG “really is a useful user-centered measure of system effectiveness” [4]. Besides its capability of reflecting subjectivity, DCG also provides more appropriate means to evaluate relevance feedback (RF) or adaptive systems, as it can be used to measure slight changes or re-orderings of relevant documents with varying degrees of relevance within the result list.

Unfortunately, these findings are still neglected in typical MIR benchmarks, making them mostly valuable for system-centric evaluation. On the other hand, Voorhees [13] states that such “Cranfield test collections represent too little of the user to support adaptive information retrieval research” and “completely ignore[s] the user interface”. To address these issues, we contribute to establishing a more user-centered perspective on MIR system evaluation with this paper. Eventually, this links MIR evaluation to evaluation techniques from interaction design via a user-centric metric.
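To make the role of DCG more concrete, the following Python sketch computes DCG for a ranked list of graded relevance values. It assumes the widely used formulation in which the gain at rank i is divided by log_b(i) for all ranks i ≥ b; the base b = 2, the function names, and the sample grades are illustrative assumptions and are not taken from the collection or from [8].

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float], base: float = 2.0) -> float:
    """Discounted cumulative gain of a ranked list of graded relevance values.

    Assumption: ranks below `base` are not discounted, later ranks are
    divided by log_base(rank).
    """
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        if rank < base:
            total += gain
        else:
            total += gain / math.log(rank, base)
    return total


def normalized_dcg(gains: Sequence[float], base: float = 2.0) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), base)
    return dcg(gains, base) / ideal if ideal > 0 else 0.0


# Hypothetical result list with grades from 0 (irrelevant) to 3 (fully relevant).
ranking = [3, 2, 3, 0, 1]
score = dcg(ranking)
relative_score = normalized_dcg(ranking)
```

In contrast to a binary measure, swapping a grade-3 document with a grade-1 document changes the score, which is what makes the metric sensitive to the re-orderings produced by relevance feedback.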

In contrast to other document collections for MIR or object classification, the proposed collection focuses on retrieval tasks in a (layperson’s) personal photo collection and has been sampled from such collections (see Sec. 2). To emphasize this point, we are not aiming at providing a general-purpose Cranfield-based collection for the evaluation of MIR-related algorithms in terms of scalability or the like. The suggested collection is meant to reflect the actual content of a layperson’s hard-disk as found in real life (i.e. a user-centered viewpoint) with all duplicate images, variance in photographic quality, and noise. Hence, we rely on a study of real photographers with different demographics to provide insights into “real” photo collections. To our knowledge, this is the first collection that samples such collections and can therefore be used to check algorithms and methods of MIR on real-world data, i.e., data that has not undergone any processing steps other than shooting the photo and storing it on a hard-disk. Nevertheless, we see the main contribution of this paper in providing means to run user-centered experiments with adaptive MIR systems, as we will elaborate later.

In order to distinguish the presented document collection from other test sets, we will outline the characteristics of some representative collections from the system-centered field of MIR evaluation.

Caltech 101 [6]. This test collection consists of 9,197 documents that are divided into 101 categories. The categories include topics such as faces, dinosaurs, airplanes, gramophones, yin-yang symbols and various animals, to name some. The documents themselves are often rotated, scaled or modified in a way that their background has been replaced by an artificial one. Regarding the choice of categories, it must be said that not all of them mirror a daily-life retrieval task. More often, their choice is due to the collection’s origin in object recognition. This is reflected by the rotated images, which are apparently used to measure the sensitivity of feature extraction algorithms with regard to invariances. The same argument holds for the Caltech 256 collection [7], as it is a mere extension.

MIR Flickr. This document collection has been part of ImageCLEF 2010/2011 as the photo annotation task¹. The collection contains 200,000 documents from Flickr². These documents are used for a classification into categories regarding an image’s quality (e.g. sharpness) or a thematic concept. In order to support multimodal retrieval, EXIF data and tags are provided. The evaluation is based on MAP. Generally speaking, the collection mirrors the state of Flickr including laypersons’ photographs, graphics or professional work, all of which may or may not have been image processed (“photoshopped”). All images are distributed with the collection.

MSRA-MM. The collection³ contains 65,443 documents that have been retrieved from Microsoft Live Search including sample queries that are based on Microsoft’s search logs. In contrast to the test sets discussed before, it features graded relevance assessments. In principle, this enables MSRA-MM to be used with the DCG metric. Unfortunately, this collection provides only thumbnails of all images because of copyright issues. To compensate for this problem, 7 global low-level features accompany the thumbnails. Nevertheless, new features cannot be extracted from the original images, limiting the collection’s future utility.

Social Event Detection Task. Mirroring the recent discussion about events illustrated by photos (e.g. [1]), this task⁴ has been built from 70,000 Flickr images including metadata such as date and location information, using binary relevance judgements for event association. The events are subdivided into event classes such as concerts or games.

Regarding the content of the collections, MIR Flickr and MSRA-MM can be considered more realistic (in terms of real-world content) than Caltech. Nevertheless, both collections are an extract from images found on the web, i.e., a set of images that has been uploaded by a user, possibly after image processing and after some images have been discarded. The decision to discard an image can be based on different reasons: bad quality of the image, i.e., a blurred or backlit image; a duplicate motif; bad composition; or personal reasons. Such downloads do not necessarily mirror the contents of a personal photo collection, as the findings from our usage study show (see Sec. 2).

In contrast, the presented collection reflects an amalgamated personal image collection that has been taken by 19 photographers. Hence, it is best used as a test set for layperson retrieval tasks carried out ad-hoc on “raw” collections such as: “find all images with a street scene”, “find a beach similar to this”, or more event-based tasks like “show me more pictures from the last U2 concert”. These tasks can be used for topic sorting of the collection, to find duplicate images, or to discard unwanted images, e.g. if they are not sharp enough, which is likely to occur in a layperson’s collection as our study has shown. Although out of the collection’s primary scope, it might also be used for object recognition but may not perform very well here, as it includes a lot of snapshots and badly composed photographs that have intentionally not been corrected. This is especially true for blurred images or the like. Nevertheless, these “erroneous” images are part of real personal photo collections, as our study of 19 photographers shows (see Sec. 2).

To address the aforementioned drawbacks of the discussed document collections, the main contributions of this paper are as follows:

1. A document collection with graded relevance assessments regarding sample queries will be presented. The relevance assessments are solely based on the judgements of human assessors.

2. We provide a collection of personal photos that has been sampled from 19 real contributors and has not simply been downloaded from the web. The documents of the collection have not undergone any major image processing steps (see Sec. 2).

¹ www.imageclef.org
² www.flickr.com
³ research.microsoft.com/en-us/projects/msrammdata
⁴ www.multimediaeval.org

3. Duplicate images, blurred images, and other images with bad quality are still part of the collection. These duplicates are categorized on different levels.

4. To our knowledge, we propose the first image dataset including demographic details of both photographers and assessors, which are necessary for user simulations backed by real-world data or for persona construction⁵. Furthermore, it is the only collection reflecting the “raw” content of users’ hard-disks based on a study of real-world users.

5. In addition to traditional topic-based retrieval, the presented collection contains an event-based ground-truth. To retrieve events, EXIF (including GPS) and IPTC data for all images is made available.

6. The extracted low-level features are based on open-source software, which facilitates the extension of the collection and the inclusion into other collections.

Thus, we provide the only collection that relies on graded relevance assessments of real-world documents of a personal photo collection reflecting a photographer’s lifespan and that, unlike MSRA-MM, is not limited in terms of extractable features. Furthermore, the collection has been designed to be easily extensible (see Sec. 6) if additional characteristics have to be added for special evaluation tasks.

To conclude, the discussed collection provides new means to evaluate RF-extended MIR systems with the help of user simulations or the like. Although the user is the focus of interactive IR (IIR) research, there is still a gap between retrieval and usability evaluation. We believe that usability aspects are clearly linked to retrieval performance and should not be investigated separately. With this work, we contribute to bridging this gap by providing the first means to link these fields of evaluation, offering a data set that satisfies the needs of both worlds, as we will outline in Sec. 4.

2. CHARACTERISTICS OF THE COLLECTION AND PHOTOGRAPHERS

The Pythia Image Collection v1⁶ consists of 5,555 personal photos that have been contributed by 19 photographers. The documents for the collection have been picked randomly. To ensure variance in photographic motifs and style, the contributors have been chosen from different demographic groups, e.g. with years of birth ranging from 1944 to 1985. Thus, one can interpret the content of the collection as a mirror of a photographer’s lifespan with typically changing usage behaviors, cameras, topics, and places. Before handing in photos, all photographers had to complete a survey in order to evaluate their photograph-taking behavior etc., as this obviously affects the actual content and photographic style of the images contained in the presented collection. The results of this survey can be found in Tab. 1, while the survey incl. answer codes is listed in Tab. 9 at the end of this paper. Typical system-centric collections neglect this factor of subjectivity in favor of sheer collection size. Surely, it can be argued that 5,555 documents are few in direct comparison to other collections. However, we do not share this point of view because we are not aiming at running Cranfield-like experiments or at investigating the scaling behavior of algorithms. Instead, we are aiming at providing an appropriate collection for the evaluation of user-centered or interactive MIR systems. Hence, the presented collection is based on a user study, has been annotated manually with a more sophisticated graded relevance scale, and provides richer metadata such as demographic and event information in combination with reproducible low-level features (see Sec. 5). It is widely known that a direct comparison in terms of size is inappropriate, as collections aiming at evaluating adaptive MIR systems have to control more and different variables and therefore incur higher costs [13]. The collection discussed in this paper models a typical personal photo collection with all its noise, which renders it unique. In our opinion, there is a need for a collection that can be used for evaluating interactive MIR systems because the evaluation of adaptive systems is still treated as a stepchild in MIR evaluation.

⁵ A technique often used in interaction design (see Sec. 4).
⁶ Pythia was a priestess at the oracle at Delphi.

To sum up the demographics of our contributors, it can be said that 50% are female, most are fully employed, and they generally use their camera at special events. Regarding their studies, only few took research-related classes such as IR or MIR. However, half of the contributors state that they have little or more knowledge of the principles of content-based retrieval (Q2). It is important to note that this information is crucial if one wants to evaluate an MIR system from a layperson’s point of view. In this case, one might want to weight the overall influence of the “IR-savvy” assessors and contributors lower to compensate for the expert knowledge bias. In our opinion, there are far too many studies conducted by IR research groups alone or at least not clearly stating the level of expertise of their participants.
This is unfortunate because it is obvious that the expertise within a research field clearly affects the interaction with research-related systems, e.g., if a photographer chooses photos that will be retrieved particularly well because of her knowledge of the underlying algorithms, even if this happens on a subconscious level. To end the evaluation of the survey, it can be said that none of the photographers are colorblind, or at least none know of it (Q3). On average, they use the internet for 91 to 120 minutes (Q4), have visited Web 2.0 websites such as Facebook or Flickr (Q5), often have accounts (Q7) but seldom use such websites to share photographs (Q6). In particular, the last finding of our usage study is interesting. It clearly shows that one can expect to find different photos in the presented data set than on Flickr or the like, the usual origin of all other available collections. Fig. 1 illustrates the contribution of each photographer in percent.

As said before, the documents of the collection have not been image processed extensively. This includes color correction, sharpening, cropping, rotating, or the removal of flaws of any kind. It excludes scaling and proper anonymization to preserve privacy rights. Furthermore, effects built into the camera that alter the photo while saving (e.g. a sepia filter) or rendered dates are also present on some images. To conclude, no “photoshopping” has been done to create aesthetically pleasing photographs. Additionally, no photos have been removed from the distributed collection, as such removals could not be observed on the contributing photographers’ hard-disks.

Table 1: Demographic Details and Behavior of Photographers

ID       Gender  Year of Birth  Job Type  Field of Work                  Answer codes (Camera Usage, Q1-Q6, as extracted)
actor0   Images without clear contributor have been associated with actor0, i.e. 0.85 % of all images.
actor1   female  1983           4         Business Math.                 1 1 1 7 3 0
actor2   female  1985           4         Hotel Business                 2 0 1 5 4 1
actor3   female  1985           4         Computer Science               2 1 4 2 0
actor4   female  1984           4         Business Math.                 3 1 7 3 0
actor5   male    1985           4         Business Information Systems   3 1 7 4 1
actor6   male    1984           4         Information & Media Technol.   3 1 4 4 1
actor7   male    1985           4         Business Information Systems   2 0 7 3 0
actor8   male    1979           4         Computer Science               4 1 7 4 2
actor9   male    1944           8         Engineering                    2 0 1 3 2 0
actor10  female  1979           4         Media & Computing Sciences     1 0 1 7 4 1
actor11  male    1977           5         Media & Computing Sciences     2 0 1 2 2 0
actor12  female  1979           4         History of Art                 2 0 1 6 4 2
actor13  female  1982           4         Business Information Systems   2 3 1 7 3 1
actor14  female  1955           4         Travel Agency                  0 0 1 4 2 0
actor15  male    1959           4         Cybernetics                    1 0 1 7 3 1
actor16  male    1966           4         Mathematics                    1 2 1 4 0 0
actor17  female  1983           5         Social Work                    1 0 1 2 2 0
actor18  male    1983           4         n/a                            2 0 1 7 4 2

Cells whose row could not be recovered: numeric codes 0; 1 1 2; 1 1. Class entries: MIR; MIR; MIR IR. Q7 entries: Facebook, Fotocommunity; Facebook, others; Picasa; Facebook, others; Facebook, Fotocommunity; Facebook; others; Picasa; others.

Figure 1: Photographers’ Contribution in Percent
actor0 0.85 %, actor1 0.90 %, actor2 8.55 %, actor3 1.15 %, actor4 2.03 %, actor5 10.69 %, actor6 15.46 %, actor7 3.31 %, actor8 18.02 %, actor9 4.21 %, actor10 0.77 %, actor11 0.70 %, actor12 0.14 %, actor13 1.96 %, actor14 0.07 %, actor15 1.10 %, actor16 0.94 %, actor17 25.06 %, actor18 4.07 %

As a consequence, the collection includes blurred or flawed images and motif duplicates (see below) with a distribution similar to the one found in the studied photographers’ collections. Refer to Tab. 2 to see how these images are spread within the collection. This table also lists images with shooting dates rendered atop or special panorama images that could be found on the studied photographers’ hard-disks.

Motif Duplicates. During the collection of all images, it became obvious that all personal photo collections contained a certain number of duplicate images. A motif duplicate (MD) as defined for the scope of this paper is a photograph that has been taken twice or more in succession with the photographer’s intention to depict the same motif. These MD are characterized by the fact that the photographer mostly did not move but took the same motif again, correcting the rotation, translation, shutter speed, or the like of the camera because the composition did not look good. Fig. 2 illustrates a motif duplicate using the camera’s zoom. Other reasons are unsharp images or the choice of a wrong picture format (format switch). In other words, MD vary more in image quality than in motif. Tab. 2 provides a full list of possible categories. Note that the category association is not exclusive, i.e., a motif duplicate can be both rotated and zoomed or the like.

Table 2: Special Document Characteristics

Type                               #Docs.      %
1. Motif Duplicates (total)           379   6.82
   Unmodified                          15   8.38
   Translated                          71  39.66
   Format Switch                       21  11.73
   Zoomed                              75  41.90
   Rotated                             26  14.53
   Sharpened                           18  10.06
   Altered (Lighting or Effect)        12   6.70
2. Blurred Images                     231   4.16
3. Backlit Images                     204   3.67
4. Silhouettes (Shadows etc.)         119   2.14
4. Altered Images (e.g. sepia)         54   0.97
5. Rendered Date                     1106  19.91
6. Panorama Images                     14   0.25

Figure 2: A Motif Duplicate with Zooming

2.1 Collection Metadata

In addition to the aforementioned data, all documents have been enriched with embedded metadata to keep the images self-contained when distributed. Refer to Tab. 3 for an overview. All EXIF data has been preserved as retrieved directly from the camera. If GPS data was not available, it has been added manually with a precision ranging from street to city level. City or country names are also saved as IPTC keywords. The global distribution of images is depicted in Fig. 4. Amongst others, the IPTC keywords have also been used to tag images as indoor/outdoor, day/night, altered, blurred etc., and to associate them with a photographer ID as found in Tab. 1. For photos depicting people (1,536 documents, i.e., 27.65 % of the collection), the number of people is provided (see Fig. 3 for the group classification⁷). The naming scheme for other tags is self-explanatory and is omitted for the sake of brevity.

Table 3: Metadata Characteristics (Excerpt)

Characteristic                         %
EXIF (Date, Camera Info. etc.)    100.00
GPS Data                           81.85
Event Tags                         96.71
Outdoor Photographs                82.64
Indoor Photographs                 17.41

Table 4: Event Frequency

WordNet Event          %
A. Conference       0.65
B. Event            0.36
C. Excursion        7.23
D. Flight           1.78
E. Holiday         77.86
F. Jubilation       0.49
G. Party            1.33
H. Rock Concert     8.70
I. Scuba Diving     1.03
J. Soccer           0.04
K. Visit            0.54

Table 5: Topics, Sorted by Domain

Topic Domain                   Number of Topics
Photo quality (cf. Tab. 2)                   11
Number of persons                            >7
Events                                       61
“Pure” topics                                30
Detection of Photographers                   19
Total:                                      128

Figure 3: Number of People within Photos (histogram over the group sizes 1, [2,5], [6,10], [11,20], [21,50], [51,100], > 100; bar labels: 600, 344, 186, 165, 112, 108, 21)

2.2 Elicitation of the Event Ground-truth

As said before, an event ground-truth has been added manually to the collection. The ground-truth is based on interviews with the photographers because they are likely to know what their pictures depict. Event information is saved as an IPTC keyword as follows: wnet:"holiday". For semantic clarification, the term following wnet: equals a general term that can be found in WordNet [11], whereas the term within brackets contains optional information to specify a unique event. Tab. 4 lists the frequency of 11 general WordNet events found in the collection⁸. In total, 61 unique events are available. The usage of WordNet is a big advantage because it provides definitions of the event types in combination with means to relate the terms. As a result, other researchers are free to extend the ground-truth or to use it in other ways than suggested in Sec. 3.

⁷ For large groups, the amount has been estimated.
⁸ One might notice that holiday-related events are dominating all others. This reflects the findings on the studied real-world personal photo collections and is not a freely chosen bias.

Figure 4: Global Distribution of Photographs

2.3 Elicitation of the Topic Ground-Truth

In order to obtain graded relevance assessments, at least two assessors judged the documents w.r.t. different search tasks (see Tab. 7 for an overview of some of our assessors). Before the evaluation of the document collection, all assessors were introduced to the usage of the software system used for eliciting the relevance judgments. Furthermore, they had to complete the same survey as the photographers. For the actual graded-relevance assessment, we picked 30 different topics. The topics were chosen after a first analysis of the collected photographs in order to reflect their contents in addition to the event ground-truth mainly provided by the contributing photographers.

Please note that our definition of topics differs from the somewhat wider term “visual concept” used in the ImageCLEF annotation task. For the scope of this paper, a “pure” topic is more closely related to the actual depicted motif within an image. In contrast, ImageCLEF subsumes various topic domains such as moods communicated by images (“scary”, “calm”), temporal information (“day”, “autumn”), motif and photo qualities (“indoor”, “blurred”), number of persons, and events or the like under the term visual concept. Following this interpretation, we offer a total of 128 topics with our collection, of which some rely on a binary relevance scale, for instance whether the photo depicts one or more persons (see Fig. 3). Obviously, a graded relevance scale does not contribute much to such cases.

Before the assessments, all topics were explained to the assessors. Additionally, sample images with varying relevance and a narrative were accessible to the assessors during evaluation. Then, the participants had to assess each image’s relevance on a scale ranging from 0 (irrelevant) to 3 (fully relevant) regarding one or more of the topics from Tab. 6. After all judgments had been obtained, an ideal DCG (IDCG) [8] curve was calculated. To calculate this curve, all individual assessments for a document were considered. Then, we calculated the expectation value for a document’s graded relevance and rounded this value in order to associate a document with a relevance grade. Based on these grades, the IDCG could be determined for each topic.

Table 6: “Pure” Topics without Events, Motif Quality, Temporal Information etc.

 1. Beach and Seaside             16. Clouds
 2. Street Scene                  17. Still Life
 3. Statue and Figurine           18. Church (Christian)
 4. Asian Temple & Palace         19. Art Object
 5. Landscape                     20. Car
 6. Hotel Room                    21. Ship
 7. People                        22. Airplane
 8. Architecture (profane)        23. Temple
 9. Animals                       24. Squirrel
10. Asian Temple Interior         25. Sign
11. Flower/Botanic Details        26. Mountains
12. Market Scene                  27. Monkeys
13. Submarine Scene               28. Birds
14. Ceremony and Party            29. Trees
15. Theater                       30. Abstract Content

Table 7: Assessor Characteristics (Excerpt)

Assessor  Gender  Year of Birth  ···  Q2  ···
M         male    1986           ···  3   ···
B         female  1987           ···  2   ···
S         male    1984           ···  3   ···
T         male    1982           ···  3   ···
G         male    1944           ···  0   ···
K         female  1955           ···  0   ···
D         male    1979           ···  4   ···
S2        male    1978           ···  2   ···
S3        female  1978           ···  0   ···
N         female  1980           ···  0   ···
···       ···     ···            ···  ···  ···
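The derivation of relevance grades and of the IDCG curve described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the expectation value is taken as the arithmetic mean of the individual assessments, ties are rounded half up, the discount uses log base 2, and the document names and judgments are made up.

```python
import math
from typing import Dict, List, Sequence


def consensus_grade(assessments: Sequence[int]) -> int:
    """Mean of the individual grades, rounded half up (assumed tie rule)."""
    mean = sum(assessments) / len(assessments)
    return int(math.floor(mean + 0.5))


def ideal_dcg_curve(grades: Sequence[int], base: float = 2.0) -> List[float]:
    """Cumulated DCG values of the best possible ordering of the grades."""
    curve: List[float] = []
    total = 0.0
    for rank, gain in enumerate(sorted(grades, reverse=True), start=1):
        # Ranks below the base are not discounted (assumed DCG variant).
        total += gain if rank < base else gain / math.log(rank, base)
        curve.append(total)
    return curve


# Hypothetical judgments of several assessors for one topic (scale 0..3).
judgments: Dict[str, List[int]] = {
    "img_001": [3, 2],
    "img_002": [0, 1],
    "img_003": [2, 2, 3],
    "img_004": [0, 0],
}
grades = {doc: consensus_grade(votes) for doc, votes in judgments.items()}
idcg = ideal_dcg_curve(list(grades.values()))
```

The resulting curve is an upper bound for every ranking on this topic, so the DCG curve of a system can be plotted against it or divided by it rank by rank.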

3. WORKING WITH THE COLLECTION

The proposed collection can be used for the evaluation of retrieval systems targeting two main application areas. First, it can be used to annotate or associate photographs from a personal photo collection with different topics (see Tab. 6). Because motif duplicates or “erroneous” images have not been removed, it can also be used to filter or cluster such images. For instance, if a user is looking for pictures of a certain landmark, she might only be interested in the sharp images or the ones that vary visually. Duplicate images should be hidden or clustered at the user interface level to improve the usability of an IIR system.

As the second application area, we suggest event-based retrieval. The discussed image collection comes with all kinds of metadata including camera information, date and time of photo creation, and GPS data. Given an image from one of the 61 manually annotated events, one can evaluate how well other documents illustrating the same event can be retrieved. Retrieval can be carried out on the general level (e.g. images of rock concerts) or for a particular event (images of a U2 rock concert), as this information is available as ground-truth. We believe that the rich set of accompanying metadata will contribute to solving these tasks, for instance to determine similar events on the basis of spatial or temporal properties or to cluster images with bad quality based on exposure time in combination with color characteristics.
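As an illustration of how spatial and temporal metadata could be combined for event-based retrieval, the following Python sketch groups photos into candidate events whenever consecutive shots are close in both time and space. It is a simple heuristic baseline and not the method used to create the ground-truth; the `Photo` class, the thresholds of 12 hours and 50 km, and the sample data are hypothetical.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Photo:
    name: str
    taken: datetime  # e.g. from the EXIF creation date
    lat: float       # GPS latitude in degrees
    lon: float       # GPS longitude in degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two coordinates in kilometers."""
    radius = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def group_into_events(photos: List[Photo],
                      max_gap: timedelta = timedelta(hours=12),
                      max_km: float = 50.0) -> List[List[Photo]]:
    """Start a new candidate event when the time gap or distance is too large."""
    groups: List[List[Photo]] = []
    for photo in sorted(photos, key=lambda p: p.taken):
        if groups:
            last = groups[-1][-1]
            near_in_time = photo.taken - last.taken <= max_gap
            near_in_space = haversine_km(last.lat, last.lon, photo.lat, photo.lon) <= max_km
            if near_in_time and near_in_space:
                groups[-1].append(photo)
                continue
        groups.append([photo])
    return groups


sample = [
    Photo("a.jpg", datetime(2011, 5, 1, 10, 0), 51.76, 14.33),
    Photo("b.jpg", datetime(2011, 5, 1, 11, 0), 51.77, 14.34),
    Photo("c.jpg", datetime(2011, 5, 4, 9, 0), 13.75, 100.50),
]
events = group_into_events(sample)
```

The candidate groups can then be compared with the annotated events to measure how far metadata alone separates them.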

4. EVALUATING IIR SYSTEMS

Keeping IIR systems in mind, our collection provides means to conduct user simulation studies. As stated in [2], “Simulation has several advantages, including cost-effectiveness and rapid testing without learning effects as argued in the SIGIR SimInt 2010 Workshop”. Furthermore, it enables researchers to gain insights into the performance of relevance feedback (RF) enhanced IIR systems prior to real user testing. Unfortunately, writing good user simulations is a complex task. One has to decide to what degree a simulated user (SU) experiences a certain image as relevant or how fast the SU fatigues and provides “false” RF. [2] discusses the latter aspect on a theoretical basis. In our opinion, this collection contributes to modeling better user simulations that are backed up with real assessments and demographics. Hence, it becomes possible to model SU based on relevance assessments of, e.g., business information scientists born in the mid-80s. This was impossible before because this information was not provided by typical test collections.

Eventually, this links the evaluation of IIR systems to personas, a state-of-the-art technique commonly used in interaction design and usability engineering, and will help to improve the overall search experience for users. According to Cooper et al. [5, pp. 75-76] personas “are not real people, but they are based on the behaviors and motivations of real people we have observed and represent them throughout the design process. They are composite archetypes based on behavioral data gathered from many actual users encountered in ethnographic interviews.” In other words, personas are used to model user groups or behavioral patterns that have been observed in the real world, i.e., the prospective user’s context. Other sources of data that can be used for persona construction are surveys (as in this paper), interviews, or research literature.
By referring to the demographic data accompanying the actual assessments, one can model personas⁹ that link the evaluation of the interaction design with typical IR evaluation. Here, it becomes possible to address certain usage patterns during the design phase of the interactive system with related, subjective relevance assessments. Such persona-based approaches cannot replace a study with real end-users, but they are known to provide new insights into the interaction process [5]. On the other hand, such initiatives are known to be very cost-intensive [13]. By combining personas with SU experiments, we establish a connection between the investigation of retrieval performance and interaction design. In other words, design decisions affecting retrieval performance and usability can be validated against personas and SUs at an early stage of development. Unlike other SU approaches, the discussed dataset is backed up with real-world data that will provide valuable insights into the relevance assessments of user groups. Consequently, these specific needs can be addressed in both the retrieval system and the interaction design while keeping the evaluation costs low. Nevertheless, this cannot render experiments with users in their specific usage context superfluous. But it surely reveals first problems and opportunities that can be addressed before a prototype is released.

⁹ To be precise, the personas that can be constructed with the provided data are so-called ad-hoc personas. Such personas have been shown to improve the user interaction design process but should be extended with additional qualitative data (such as interviews or user observations) from the studied MIR usage context.
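As a rough illustration of how persona-specific simulated users could be derived from assessments with demographic data, the following Python sketch filters assessments by a persona predicate and lets an SU return feedback that degrades with fatigue. The record layout, the field names (`job`, `birth_year`), the 0-3 grade scale, and the linear fatigue model are assumptions made for this example; they are not taken from the supplements of the collection or from [2].

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Assessment:
    assessor: str
    job: str          # hypothetical demographic field
    birth_year: int   # hypothetical demographic field
    topic: str
    image: str
    grade: int        # assumed graded relevance: 0 (irrelevant) .. 3


def persona_grades(assessments, topic, predicate):
    """Average the grades of all assessors that match the persona predicate."""
    per_image = {}
    for a in assessments:
        if a.topic == topic and predicate(a):
            per_image.setdefault(a.image, []).append(a.grade)
    return {image: mean(grades) for image, grades in per_image.items()}


class SimulatedUser:
    """SU whose probability of giving "false" RF grows with every judgment."""

    def __init__(self, grades, fatigue_per_judgment=0.0, max_grade=3, seed=0):
        self.grades = grades
        self.fatigue = fatigue_per_judgment  # assumed linear fatigue model
        self.max_grade = max_grade
        self.judged = 0
        self.rng = random.Random(seed)

    def feedback(self, image):
        error_prob = min(1.0, self.fatigue * self.judged)
        self.judged += 1
        true_grade = self.grades.get(image, 0.0)  # unjudged images count as 0
        if self.rng.random() < error_prob:
            return self.max_grade - true_grade    # "false" RF: inverted grade
        return true_grade
```

A persona such as "students born in the mid-80s" would then be expressed as a predicate, e.g. `lambda a: a.job == "student" and 1984 <= a.birth_year <= 1987`, and the resulting SU can be plugged into an RF loop in place of a real user.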

5. FILE FORMAT AND LEGAL ISSUES

The document collection as well as all supplements can be found at www.8bit-inferno.de/pythia/. The archive contains a description of the metadata¹⁰, the survey, demographic information, relevance assessments in TREC QREL format, and the images in original size.

Every image is associated with an XML file of the same name containing extracted EXIF and IPTC metadata. The XML file format is an extension of the CoPhIR [3] file format with small modifications to support additional low-level features plus a proper stylesheet. Because of its strong relation to the CoPhIR format, it is capable of handling additional metadata such as owner, title, description, dates, comments, tags, etc., although these XML elements are often not set in the collection because we rely mostly on EXIF and IPTC.

The collection offers 17 low-level features that are partly MPEG-7 conformant. The features are listed in Tab. 8. In addition to our own implementation, 7 low-level features have been extracted with the help of the LIRE framework [10]. For these features, the contents of the VisualDescriptor node are the equivalent of LIRE’s textual feature representation and can therefore be used directly with LIRE. Regarding CoPhIR, it is important to note that the MPEG-7 features produced by LIRE are not directly compatible although they carry the same names. This is due to the parameters chosen during extraction. Unfortunately, the actual parameters that were used for CoPhIR are not published. In order to use documents from the presented collection with the CoPhIR set, it is necessary to re-extract all CoPhIR descriptors or to transform the values of the VisualDescriptor node appropriately. To avoid such problems, a tool for similarity calculations on the basis of the provided feature set is also part of the supplements.

To address legal issues, all documents from the collection can be distributed under the terms of a Creative Commons license¹¹. This is done to facilitate reprints of parts of the collection within scientific works.

Table 8: Available Features and Compatibility
Name: Auto Color Correlogram, BIC, BRIEF, CEDD, Color Histogram, Color Layout, Color Structure, Contour Shape, Dominant Color, Edge Histogram, Face Detection, FCTH, Gabor, Region Shape, Scalable Color, SURF, Tamura
LIRE: √   √ √ √ √   √ √ √ √
Pythia: √ for all 17 features

¹⁰ This includes the feature-relevant citations, which are skipped in this paper for the sake of brevity.
¹¹ creativecommons.org/licenses/by-nc/3.0/de/deed.en

6. EXTENDING THE COLLECTION

Fig. 4 clearly shows that the collection covers neither all continents nor all imaginable topics. Although this is likely to be found in real-world personal photo collections, we support the extension of the presented work with other image collections in case special characteristics have to be included for experimental evaluation. With our collection, we provide additional compatible QREL files for already published collections. In addition, a category mapping for overlapping topics between different collections is available. Eventually, this makes it possible to combine collections in an arbitrary manner. This was impossible before because the ground truths of the test sets were incompatible. Unfortunately, graded relevance assessments cannot be distributed for collections other than the Pythia collection.

For instance, to improve the geographic variety for evaluations, UCID v2 [12] could be added in order to introduce images from Las Vegas (including motif duplicates and blurred pictures), other parts of North America, or the UK. To add pictures from Polynesia and Africa, the Wang collection [14], a subset of the famous Corel database, could be used¹². In order to salt the collection with more than the present share of less than 1% clutter, Caltech 101’s BACKGROUND_Google documents can be included, yielding a total of 10.17% noise during evaluation. This category consists of photos, image-processed photos, graphics/clip art, and rendered texts. Its content is similar to documents that could also be found on our contributors’ hard disks during image collection. We decided not to include such documents in the proposed collection due to copyright considerations.
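The combination of collections via compatible QREL files and a category mapping can be sketched as follows. The snippet assumes the common four-column TREC QREL layout (topic, iteration, document ID, relevance). The topic names, file names, and the dictionary used as category mapping are purely illustrative, since the actual format of the mapping shipped with the supplements is not described here.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC QREL lines of the form: <topic> <iteration> <doc_id> <relevance>."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, relevance = parts
        qrels[topic][doc_id] = int(relevance)
    return dict(qrels)


def merge_qrels(base, other, topic_map):
    """Merge `other` into a copy of `base`, renaming overlapping topics.

    `topic_map` maps topic names of the other collection onto topic names of
    the base collection (hypothetical stand-in for the category mapping).
    Topics without a mapping entry are kept under their own name.
    """
    merged = {topic: dict(docs) for topic, docs in base.items()}
    for topic, docs in other.items():
        target = topic_map.get(topic, topic)
        merged.setdefault(target, {}).update(docs)
    return merged


# Illustrative data: graded judgments for the base collection,
# binary judgments for the added collection.
pythia = parse_qrels(["concert 0 img_001.jpg 2", "concert 0 img_002.jpg 0"])
other = parse_qrels(["music_event 0 ext_17.jpg 1", "beach 0 ext_18.jpg 1"])
combined = merge_qrels(pythia, other, {"music_event": "concert"})
```

Note that a merged file of this kind mixes graded and binary judgments, so metrics relying on graded relevance should only be interpreted for the documents of the base collection.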

¹² Please note that this may introduce copyright issues.

7. CONCLUSION

In this work, we have presented a personal photo collection with a comprehensive set of topics. The collection consists of contributions by 19 persons with different demographics. Hence, it can be interpreted either as an average of personal photo collections or as the chronological development of a photo collection over a lifetime. In our opinion, it is the only image collection providing demographic details of both photographers and assessors. This enables the construction of real-world user simulations for IIR and RF experiments. Furthermore, it relates this approach closely to the construction of personas in interaction design [5]. Thus, the suggested collection is a valuable resource for IIR evaluation. It is one of the few that has been designed with IIR in mind, a factor that is often neglected, especially in the field of MIR. In general, published collections, such as those of TREC or ImageCLEF, have been designed for system-centered evaluation [9].
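Because the assessments of the collection are graded, gain-based metrics such as DCG can be computed directly from a ranked result list. The following minimal Python sketch follows the cumulated gain formulation of [8], in which gains at ranks below the logarithm base are not discounted. The example gain vector and the normalization against the ideal ordering of the same list are illustrative assumptions, not values or procedures taken from the collection.

```python
import math


def dcg(gains, base=2):
    """Discounted cumulated gain of a ranked list of graded relevance values."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        if rank < base:
            total += gain                         # no discount before rank `base`
        else:
            total += gain / math.log(rank, base)  # discount by log_base(rank)
    return total


def ndcg(gains, base=2):
    """DCG normalized by the DCG of the ideal ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True), base)
    return dcg(gains, base) / ideal if ideal > 0 else 0.0


# Grades of the first six results returned for a hypothetical topic.
ranking = [3, 2, 3, 0, 1, 2]
score = dcg(ranking)
normalized = ndcg(ranking)
```

A larger base models a more patient user, because the discount only sets in at later ranks.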

With respect to the current discussion in the community, the collection can also be used to evaluate event retrieval approaches in addition to traditional annotation or classification tasks. To support such approaches, all pictures are accompanied by extensive metadata and low-level features.

Currently, we are investigating the usage of a crowd-sourcing platform to obtain a higher degree of variety in demographics, topics, and assessments. When the results are available, they will be added to the supplement without overriding the data that comes with this work. In the long run, we are aiming at extending the motif duplicate annotation with graded relevance assessments because it is, like most statements about images, highly subjective.

Another important contribution is the inclusion of graded relevance assessments, which enables the use of metrics other than precision, recall, or mean average precision. DCG, as one of the more sophisticated and user-oriented metrics, has been shown to be a useful measure of system effectiveness [4] and has been widely adopted in the text-based IR community. In our opinion, graded relevance assessments reflect the user’s subjective notion of relevance much better than the traditional binary scale.

Table 9: Contents of the Survey

ID | Question | Answers
Job Type | Choose your current type of work. | (1) Pupil, (2) In job training, (3) Student, (4) Fully employed, (5) Part-time, (7) Not employed, (8) Retired, (9) Other
Camera Usage | When do you usually take photos? | (0) Seldom, (1) At special events, (2) Often, (3) Virtually always
Q1 | Have you visited one or more of the following lectures? | IR, MIR
Q2 | Do you know the principles of content-based information retrieval? | (0) No, (1) A little, (2) Informed outsider, (3) Very much, (4) I am an expert
Q3 | Are you colorblind? | (0) I don’t know, (1) No, (2) Yes
Q4 | How many minutes do you use the internet per day? | (0) Not at all, (2) 1 - 30 minutes, (3) 31 - 60 minutes, (4) 61 - 90 minutes, (5) 91 - 120 minutes, (6) > 120 minutes, (7) > 240 minutes
Q5 | Do you know Web 2.0 services such as Flickr or Fotocommunity.de for sharing holiday, family or other photographs with friends? | (0) Never heard of it, (2) I know it by name, (3) I have visited such websites, (4) I do have an account
Q6 | How often do you use such Web 2.0 services to share photographs with friends? | (0) Never, (1) < 1 a month, (2) > 1 a month, (3) Weekly, (4) Daily
Q7 | Which of the following services do you use to upload and administrate holiday, family, or other photographs? | None, Picasa, Facebook, Flickr, Fotocommunity.de, Other

8. REFERENCES

[1] G. Amarnath and R. Jain. Managing Event Information: Modeling, Retrieval, and Applications. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011.
[2] F. Baskaya, H. Keskustalo, and K. Järvelin. Simulating simple and fallible relevance feedback. In Proc. of the 33rd ECIR’11, pages 593–604. Springer-Verlag, 2011.
[3] P. Bolettieri, A. Esuli, F. Falchi, C. Lucchese, R. Perego, T. Piccioli, and F. Rabitti. CoPhIR: a Test Collection for Content-Based Image Retrieval. CoRR, abs/0905.4627v2, 2009.
[4] B. Carterette. System effectiveness, user models, and user utility: a conceptual framework for investigation. In Proc. of the 34th SIGIR’11, pages 903–912. ACM, 2011.
[5] A. Cooper, R. Reimann, and D. Cronin. About face 3: The essentials of interaction design. Wiley, Indianapolis, Ind., 2007.
[6] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In Proc. of the Workshop on Generative-Model Based Vision, 2004.
[7] G. Griffin, A. Holub, and P. Perona. Caltech-256 Object Category Dataset, 2007.
[8] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446, 2002.
[9] D. Kelly. Methods for Evaluating Interactive Information Retrieval Systems with Users. Found. Trends Inf. Retr., 3:1–224, 2009.
[10] M. Lux and A. S. Chatzichristofis. Lire: Lucene Image Retrieval: An Extensible Java CBIR Library. In Proc. of the MM’08, pages 1085–1088. ACM, 2008.
[11] A. G. Miller. WordNet: a lexical database for English. Commun. ACM, 38:39–41, 1995.
[12] G. Schaefer and M. Stich. UCID - An Uncompressed Colour Image Database. In Proc. SPIE, pages 472–480. San Jose, USA, 2004.
[13] E. M. Voorhees. On test collections for adaptive information retrieval. Information Processing & Management, 44(6):1879–1885, 2008.
[14] Z. J. Wang, J. Li, and G. Wiederhold. SIMPLIcity: Semantics-sensitive Integrated Matching for Picture Libraries. In Proc. of the 4th VISUAL ’00, pages 360–371. Springer-Verlag, 2000.