Towards Integration of End-User Tags with Professional Annotations

Riste Gligorov, Vrije Universiteit Amsterdam, De Boelelaan 1105, The Netherlands, [email protected]
Lora Aroyo, Vrije Universiteit Amsterdam, De Boelelaan 1105, The Netherlands, [email protected]
Lotte Belice Baltussen, Netherlands Institute for Sound and Vision, Sumatralaan 45, Hilversum, The Netherlands, [email protected]
Maarten Brinkerink, Netherlands Institute for Sound and Vision, Sumatralaan 45, Hilversum, The Netherlands, [email protected]
Jacco van Ossenbruggen, Vrije Universiteit Amsterdam, De Boelelaan 1105, The Netherlands, [email protected]
Johan Oomen, Netherlands Institute for Sound and Vision, Sumatralaan 45, Hilversum, The Netherlands, [email protected]
Annelies van Ees, Vrije Universiteit Amsterdam, De Boelelaan 1105, The Netherlands, [email protected]

ABSTRACT

The goal of this paper is to assess the quality of end-user tags from a video labeling game as a first step in the process of integrating them with the annotations made by professionals. Tags lack precise meaning, whereas the terms and concepts professionals are used to have clearly defined semantics given by structured vocabularies. Hence, we explore the possibility of mapping user tags to their semantic counterparts in domain and lexical vocabularies. Furthermore, we report on an analysis performed by a senior cataloguer regarding the general usefulness of the user tags in terms of video and fragment retrieval. Finally, we investigate the distribution of tags with respect to the audio and visual portions of the video content.

Keywords: tagging, video, data analysis, professionals vs. end-users, games with a purpose

Copyright is held by the authors. Web Science Conf. 2010, April 26-27, 2010, Raleigh, NC, USA.

1. INTRODUCTION

Web 2.0 has dramatically increased the speed at which video content is produced and disseminated online. A prerequisite for successful information retrieval and collection management is quality metadata associated with collection items. Until recently, the task of annotating was strictly in the hands of professional cataloguers, who adhere to well-established guidelines and make use of controlled vocabularies in the process. However, it is becoming widely accepted that professional annotators alone cannot cope with the steadily increasing stream of video data. The success of Flickr (http://www.flickr.com/) and similar services in terms of self-organization of their digital collections suggests a possible solution to the problem of scarcity of video metadata: harness users' efforts to amass video content descriptions. To put things into perspective, in 200 years of existence the Library of Congress has applied its expert-maintained taxonomy to 20 million books (http://www.loc.gov/about/reports), whereas, in only four years, Flickr's users applied their ad-hoc tagging vocabulary to over 25 million photos [3, 4].

One way of engaging users in tagging videos is through so-called 'games with a purpose' [7]. To this end, in May 2009 the Netherlands Institute for Sound and Vision, in cooperation with KRO broadcasting, launched 'Waisda?' (see Figure 1), a multi-player video labeling game in which players describe streaming video by entering tags and score points based on various temporal tag agreements (a more detailed description of the game can be found in [1]). The underlying assumption is that tags are probably valid, i.e. trustworthily describe the video fragments, if they are entered independently by at least two players within a given time frame. From here on we refer to tags that are mutually agreed on as verified tags.

Figure 1: Screenshot of the Waisda? game interface: the central part of the screen is reserved for the video. Immediately below it are the tag input field and the list of tags entered so far. The coloration of the tags indicates the number of points the tags scored. On the right is the list of players currently in the game.

The first step towards integrating user tags and expert annotations is understanding the relationship between them and assessing the overall quality of the Waisda? tags. The goal of this paper is to take this first step. With respect to this, we look into the possibility of linking user tags to their semantic counterparts in domain vocabularies used by experts and in more general lexical resources. Moreover, we reflect on a qualitative analysis of the user tags performed by a professional cataloguer from the Netherlands Institute for Sound and Vision. Finally, we investigate the distribution of tags with respect to the audio and visual portions of the video content.

The rest of the paper is structured as follows. In section 2 we describe the Waisda? tagging game data and the domain and lexical vocabularies we used for this study. Section 3 presents the various analyses we performed on end-user tags to assess tag quality. Finally, section 4 draws conclusions and points to some further directions for research.

Related work

Games with a purpose (GWAPs) are computer games in which people, as a side effect of playing, perform tasks that computers are unable to perform. The first example of a GWAP was the ESP game (http://www.gwap.com/gwap/gamesPreview/espgame/), designed by Luis von Ahn, which harnesses human abilities to label images. The game randomly pairs up two players; the players do not know each other's identity and cannot communicate. The task in the game is to agree on a word that would be an appropriate label for the image. Once a word is entered by both partners, it becomes a label for the image. The idea of partnering two people to label images ensures that the entered words will be accurate: as the only thing the two partners have in common is that they both see the same image, they must enter reasonable labels to have any chance of agreeing on one. Although with some differences, this idea has been applied to video resources in at least three different video labeling games that we know of: the Yahoo! video tag game [6], the VideoTag tagging experiment (http://www.videotag.co.uk/), and Waisda?. To the best of our knowledge, there is no other work that assesses the quality of user descriptions obtained through a GWAP.

2. DATA SET

Presently, the Waisda? tagging game is hosted on a KRO server which continuously logs all game-related activity; even at the time of writing, new user tags are being accumulated. The subject of our analysis is the data collected in the period from the launch date until 6 January 2010. During this period, the game amassed over 46,000 unique tags ascribed to approximately 600 videos by roughly 2,000 different players (throughout this text we use the terms player and user interchangeably). The number of distinct tag entries exceeded 420,000. The logs contain information about players, games, videos, and tag entries. Each player's log record includes authentication data (username and password) and a binary attribute that labels the player either as active or inactive; an active player is one that has played Waisda? at least twice in a period of two weeks. Furthermore, each game is associated with a particular video and lasts as long as the video is being streamed, so a game's log record contains the video identifier, the game's starting time, and a boolean indicator of whether the game has finished. Each tag entry is represented by an instance of a ternary relation that relates the player who entered the tag, the game the tag was entered in (which indirectly determines the video the tag was attached to), and the tag itself. Additionally, a tag entry is associated with the point in time, relative to the beginning of the video, at which the tag was entered, as well as a score computed by taking into consideration agreement with other tag entries in the temporal neighborhood. Since almost all players originate from the Netherlands and all videos subjected to tagging are in Dutch, the language of the vast majority of tags (nearly 100%) is Dutch.
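To make the structure of these log records concrete, the sketch below models a tag entry and the temporal-agreement check behind verified tags. It is an illustration only: the class and field names are our own invention, and the 10-second agreement window is borrowed from the matching interval mentioned in section 3.4, not read from the actual Waisda? implementation.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class TagEntry:
    """One row of the ternary relation: player x game x tag (hypothetical field names)."""
    player_id: str
    game_id: str      # indirectly determines the video the tag was attached to
    tag: str
    time_secs: float  # time the tag was entered, relative to the start of the video

def verified_tags(entries, window_secs=10.0):
    """Return the set of (game_id, tag) pairs entered independently by at least
    two players within `window_secs` of each other, i.e. the paper's notion of
    a 'verified' tag.  The actual Waisda? scoring rules are more elaborate."""
    by_key = defaultdict(list)
    for e in entries:
        by_key[(e.game_id, e.tag.lower())].append(e)
    verified = set()
    for key, group in by_key.items():
        group.sort(key=lambda e: e.time_secs)
        # check consecutive entries for agreement between different players
        for a, b in zip(group, group[1:]):
            if a.player_id != b.player_id and b.time_secs - a.time_secs <= window_secs:
                verified.add(key)
                break
    return verified
```

A real scoring function would also award points for the degree of agreement; the sketch only records whether a tag was verified at all.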

Domain and lexical vocabularies

For this study we used two vocabularies: GTAA and Cornetto. While the former is a domain vocabulary, the latter is a general lexical resource that covers common lexical terms. GTAA (a Dutch acronym for Common Thesaurus Audiovisual Archives) is the thesaurus used by professional cataloguers in the Sound and Vision documentation process. It contains approximately 160,000 terms divided into six disjoint facets: subjects or keywords (≈ 3,800 terms), locations (≈ 17,000 terms), person names (≈ 97,000 terms), organization-group-other names (≈ 27,000 terms), maker names (≈ 18,000 terms) and genres (113 terms). GTAA terms are interlinked with each other and documented using four properties: Broader Term, Narrower Term, Related Term and Scope Note. While all GTAA terms may have related terms and scope notes, only terms from the subject and genre facets are allowed to have narrower and broader terms. Complementary to the narrower/broader term hierarchy, terms from the subject facet are classified by theme into 88 subcategories, which are organized into 16 top-level categories.

Cornetto is a lexical semantic database of Dutch that contains 40K entries, including the most generic and central part of the language. It is built by combining the Dutch Wordnet (DWN) with the Referentie Bestand Nederlands (RBN), which features FrameNet-like information for Dutch [8]. Cornetto organizes nouns, verbs, adjectives and adverbs into synonym sets called synsets. A synset is a set of words with the same part of speech that can be interchanged in a certain context. Synsets are related to each other by semantic relations, such as hyperonymy, hyponymy and meronymy, which may be used across parts of speech. Although Cornetto contains 59 different kinds of semantic relations, hyperonymy and hyponymy are by far the most frequent ones, accounting for almost 92% of all semantic relation instances.

3. TAG ANALYSIS

We analyzed the tags to get some insight into the quality of the tags, the nature of the tags the players of the game tend to enter, and how the tags compare with the GTAA-based annotations made by professional cataloguers. First, we wanted to know to what extent the tags overlap with the terms used in the GTAA vocabulary. Since we found only a very small overlap, we also computed the overlap with the terms in the more general Cornetto lexical resource. By studying the tags that could not be matched to any of these vocabularies, we learned that many of them were compound, multi-word tags that needed to be analyzed separately. To get more insight into the relation between the tags and the professionally made annotations, we asked a cataloguer of Sound and Vision to provide feedback on the tags of two typical videos. Finally, we looked at the relationship between the tags and the audio track by matching tags to the subtitles associated with the target video segment. In the remainder of this section we briefly discuss each step in our analysis.

3.1 Multi-word tags

When it comes to adding tags, tagging interfaces can be divided into two categories [5]: character-delimited (where tags are submitted in a group separated by a delimiter, as at Del.icio.us) and action-delimited (where tags are submitted individually, usually by pressing an Enter or Submit button, as at Amazon.com). Character-delimited tagging interfaces typically use the space character as a delimiter, which means tags are not allowed to contain spaces. The Waisda? tag-adding mechanism is action-delimited, i.e. players enter tags one after another by pressing Enter. As a result, any Unicode string is permissible as a tag, including multi-word strings. As it turns out, multi-word tags appear relatively frequently: over one third, or 34% (see Table 1), of all Waisda? tags are multi-word. The numbers are lower when looking at verified tags: about one in ten, or 11% (see Table 1), of verified tags is multi-word. This is hardly surprising, since the likelihood of reaching a tag agreement among players decreases as the length of the entered tags increases. Some multi-word tags are perfectly usable. These include names ("romy schneider"), idioms ("24 hours a day", "path of least resistance") and other common collocations. However, most are borderline useless: there are, for example, entire sentences, often grammatically incorrect, used as tags, such as "aan varkenslapjes die de varkens gaan word". From these findings we conclude that multi-word tags should be allowed, but that users should be discouraged from adding "useless" tags that are very unlikely to yield points.

                            all tags        verified tags
  # of tags                 46,792          7,372
  # of single-word tags     30,718 (66%)    6,547 (89% of verified tags)
  # of multi-word tags      16,429 (34%)    825 (11% of verified tags)

Table 1: Single-word vs. multi-word tag distribution.

3.2 Linking tags to vocabularies

Professional cataloguers make use of controlled vocabularies in the process of describing resources. A potential way of making the end-user tags interoperable with professional annotations is to link them (directly or indirectly) to the appropriate terms from the controlled vocabularies used by professionals. In this study, we compared all Waisda? tags against terms and synsets from the two vocabularies (Cornetto and GTAA) described in section 2.

We deemed a tag and a GTAA term to be a positive match only if they are the same string (ignoring case). A tag and a Cornetto synset were considered a positive match only if at least one of the words associated with the synset is equal (in a case-insensitive manner) to the tag. The results of mapping the Waisda? tags against Cornetto and GTAA are presented in Table 2 and Table 3, respectively. As we can see, 10,757 (or only 23.2%) of all tags and 3,858 (or 52%) of all verified tags can be matched to Cornetto synsets. The numbers are even lower for GTAA: 2,517 (or only 5.3%) of all tags and 822 (or only 11%) of all verified tags can be matched to GTAA terms. We attribute the low match with a general lexical source like Cornetto to the low level of tag literacy: typos, entire sentences as tags, etc. The even lower match with the GTAA thesaurus may stem from at least two reasons. The first is (again) the presence of sloppy (illiterate) tags. The second could be that end-user tags tend to complement those made by professionals: GTAA is supposed to contain all terms used by cataloguers at the Sound and Vision institute, which is usually not what a user would enter (we look into this more deeply in section 3.3, where we present the opinion of a professional cataloguer with respect to the usefulness of the user tags). In addition, we found that mapping the Waisda? tags to Cornetto and GTAA is a highly ambiguous process: 45% of the tags found in Cornetto can be matched to more than one synset, and 9% of the tags found in GTAA can be matched to more than one GTAA term (Table 2 and Table 3). Table 4 shows the distribution of user tags over Cornetto synsets and GTAA facets. Not surprisingly, a user is most likely to use a noun or a term from the GTAA subject facet (the facet that contains all keywords used to annotate programmes at Sound and Vision) as a tag. Other parts of speech and GTAA terms are less likely to appear as user tags.

                                       all tags        verified tags
  # of tags in Cornetto                10,757 (23%)    3,858 (52% of verified tags)
  # of ambiguous tags in Cornetto      4,718 (45%)     2,197 (56.9% of verified tags in Cornetto)
  avg. # of Cornetto synsets per tag   1.87599         2.29082

Table 2: Waisda? tags and Cornetto overlap, and the average ambiguity of mapping the Waisda? tags to Cornetto synsets.

                                       all tags        verified tags
  # of tags in GTAA                    2,517 (5.3%)    822 (11% of verified tags)
  # of ambiguous tags in GTAA          239 (9%)        83 (11% of verified tags in GTAA)
  avg. # of GTAA terms per tag         1.11839         1.14112

Table 3: Waisda? tags and GTAA overlap, and the average ambiguity of mapping the Waisda? tags to GTAA terms.

  GTAA facet   # of tags in facet      Cornetto synset   # of tags in synset
  Subject      1,199                   Noun              7,222
  Location     613                     Verb              2,090
  Genre        52                      Adjective         1,693
  Person       118                     Adverb            171
  Maker        4
  Name         673

Table 4: Waisda? tag distribution over Cornetto synsets and GTAA facets.

3.2.1 Dealing with word (tag) inflections

So far, we have used case-insensitive string comparison to match user tags and vocabulary terms. However, this approach has one serious drawback: it does not account for different inflections of a given word. For example, in English many nouns are inflected for number with the plural affix -s, as in "dog" → "dogs". While "dog" and "dogs" are two different strings, from a lexical standpoint they refer to the same root or stem, "dog". Thus, if a user enters the tag "dogs" and a controlled vocabulary contains the term "dog", we would like to be able to link them. This may be achieved through a linguistic procedure called stemming. For this part of the study, we considered the set of all tags that were not matched to a Cornetto synset using simple string comparison. Each tag was reduced to its stem, and the stem was then compared against all Cornetto synsets as described previously. For the stemming we used a Python implementation (http://pypi.python.org/pypi/PyStemmer/1.0.1) of the Snowball stemming algorithm for Dutch (http://snowball.tartarus.org/). Comparing the stems against Cornetto resulted in an additional 1,526 tags being successfully matched. Out of these 1,526 tags, we picked a sample of 212 tag-stem pairs for further analysis. It turned out that in 45 cases we got a false positive. We divided the false positives into two categories: the first contained misspelled or unknown words that were stemmed into a valid word, whereas the second consisted of valid words that were not reduced to the correct stem. The number of cases in the first and the second category was 38 and 7, respectively. Dealing with false positives in a robust way is a challenge and requires further research. One problem with stemming is that it is language-dependent: for each language a different stemming algorithm must be used, so before a tag can be stemmed, its language ought to be determined. In this particular case we did not have to face this difficulty, since almost all tags were in Dutch and we knew beforehand which stemmer to use. However, in a multilingual environment such as the Web, one can easily see how this may become a problem. A solution would be to predict the language of a tag based on profile data, in particular the language preferences of the user who entered it. Alternatively, one can look at the tag history of the user and pick the language of the majority of the tags created by that user.
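As a concrete illustration of the two matching passes described above (exact, case-insensitive comparison first, stemmed comparison for the leftovers), here is a minimal Python sketch. It assumes the vocabulary has already been flattened into a mapping from lowercased labels to term or synset identifiers; that mapping, the function names and the toy values are ours, not the actual GTAA or Cornetto data structures. PyStemmer is the package named above.

```python
import Stemmer  # PyStemmer: pip install PyStemmer

def exact_matches(tags, label_index):
    """label_index maps a lowercased vocabulary label to the set of ids of the
    GTAA terms or Cornetto synsets carrying that label.  Ambiguity shows up as
    more than one id per label."""
    return {t: label_index[t.lower()] for t in tags if t.lower() in label_index}

def stemmed_matches(tags, label_index, language="dutch"):
    """Second pass for tags without an exact match: reduce both the tag and
    the vocabulary labels to their Snowball stems and compare those.  As noted
    in section 3.2.1, stemming can introduce false positives."""
    stemmer = Stemmer.Stemmer(language)
    stem_index = {}
    for label, ids in label_index.items():
        stem_index.setdefault(stemmer.stemWord(label), set()).update(ids)
    return {t: stem_index[stemmer.stemWord(t.lower())]
            for t in tags if stemmer.stemWord(t.lower()) in stem_index}

# Toy usage:
# vocab = {"hond": {"gtaa:123"}, "fiets": {"gtaa:456", "cornetto:789"}}
# hits = exact_matches(["Hond", "honden"], vocab)   # matches "Hond" exactly
# rest = [t for t in ["Hond", "honden"] if t not in hits]
# more = stemmed_matches(rest, vocab)               # matches "honden" via stem "hond"
```

In this setup, ambiguity is simply a label that maps to more than one identifier, which mirrors the multiple-synset and multiple-term counts reported in Tables 2 and 3.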


3.3 Tags in the eye of the cataloguer

A professional senior cataloguer (an employee of the Netherlands Institute for Sound and Vision) judged the tags added to two episodes on their usefulness (parts of this section have been adapted from [1]). The selected episodes were the best-tagged episode (from the popular Dutch reality show Boer zoekt Vrouw, http://boerzoektvrouw.kro.nl/, with 25,965 tags added to it) and an episode tagged with an average number of tags (from a documentary series about a former news correspondent in the U.S. returning to the Netherlands, with 738 tags). The criterion for determining the usefulness of tags was as follows. For the tags attached to fragments, the cataloguer was asked to answer the question: "Would you be surprised if you found this fragment with this term in a search query?" (a similar approach towards tag evaluation was taken in [2]). The top 20 most added and least added tags of each episode were evaluated with the question: "Would you be surprised if you found this programme with this term in a search query?". The cataloguer was also asked to indicate the 'strength' of the tags, i.e. whether she found the useful tags to be highly specific or not. Usefulness was thus evaluated with respect to video and fragment retrieval and was based on the experience, but also the opinion, of the professional cataloguer.

The main findings were as follows.

• Looking at the best-tagged episode, 45% of the tags were deemed useful, with 27.45% having a low and 11.76% having a high specificity.

• The averagely tagged episode contained 72.69% tags deemed useful, with 26.39% having a low and 19.44% having a high specificity.

The senior cataloguer noted that, in general, the useful tags describe the material in a different way than the keywords that cataloguers add. Firstly, the tags focus on describing what is seen and heard within a programme, while the professional metadata for audiovisual content focuses on the subjects a programme refers to. Apart from that, the tags tend to describe short-lived things that appear in a few consecutive video frames, rather than the topic of a logical segment of the programme, such as a scene, or of the entire episode. To describe the episode as a whole, only two tags from the top 20 most added tags of the averagely tagged episode (the documentary) proved to be useful; for the best-tagged episode (the reality show) none of the top 20 tags were deemed useful for describing the complete episode. Tags added to the documentary series episode were notably more often useful than the tags added to the reality show: they were more defining and more specific. The reality show, on the other hand, contained tags that were more general and lacked descriptive power.

The results of the analysis performed by the cataloguer suggest two things. First, the content of the programme being tagged seems to influence the descriptive power and general usefulness of the tags. Second, there seems to be a semantic gap between professional descriptions, based on a closed vocabulary, and the tags entered by users: while the former operate on a higher semantic level, describing the subject of the programme, the latter are more low-level and generally refer to things that are heard or seen on screen.

3.4 What-do-you-see or what-do-you-hear?

As noted, when entering tags, players have a tendency to state the immediate. With respect to this, it is useful to learn whether they, in general, rely more on the visual or on the audio input. Therefore, we investigate the fraction of Waisda? tags that refers to the audio portion of the video content. To this end, we compare the tags associated with five videos against the respective video subtitles for the hearing impaired. The choice of videos includes the two best-tagged videos (videos 1 and 2 in Table 5), one averagely tagged video (video 3), and two low-tagged videos (videos 4 and 5). For each of these videos KRO provided us with television teletext subtitle files for the hearing impaired. The subtitle files contain textual versions of the dialog in the television programs. Each dialog excerpt is accompanied by time points, relative to the beginning of the video, at which the excerpt appears on and disappears from the screen. Prior to running the analysis, all dialog text was broken up into words and punctuation through a process known as tokenization; for this purpose we used the standard tokenization algorithm provided by the Natural Language Toolkit (NLTK, http://www.nltk.org/), an NLP library for Python. Subsequently, each tag associated with the aforementioned videos was compared, in a case-insensitive manner, against all words in the subtitles of the corresponding video that appear at most 10 seconds before the tag was entered. The time interval of 10 seconds was chosen as a reasonable amount of time needed by an average player to type in a tag; an identical time interval is used by the designers of Waisda? as the time frame for matching tags added by different players.

  video   # of tags in video   # of tags in subtitles   # of verified tags in video   # of verified tags in subtitles
  1       25,965               8,645 (33%)              5,837                         2,546 (43%)
  2       22,792               8,004 (35%)              6,153                         2,967 (48%)
  3       1,007                182 (18%)                274                           64 (23%)
  4       403                  91 (23%)                 73                            16 (22%)
  5       257                  59 (22%)                 45                            18 (40%)

Table 5: Waisda? tags and video subtitles overlap.

The results of the analysis are summarized in Table 5. As we can see, on average approximately one quarter (≈ 26%) of all tags associated with the videos can be traced in the subtitles. The number is slightly higher for verified tags, where on average one third (≈ 35%) are found in the subtitles. This is no surprise, since some players pick up on the fact that one is more likely to score by simply typing in what one hears: as long as there is another player in the game who does the same, the chances of scoring are higher. This practice may impair the overall quality and richness of the user tags. We hypothesize that a viable solution to this potential problem is to 'penalize' users by awarding them fewer points when an agreement is reached on a tag found in the subtitles.

Ideally, the remaining tags, excluding the ill-formed ones and those that are inflections of words found in the subtitles (our analysis in section 3.2.1 gives an estimate of what their number might be), either refer to objects in scenes or are abstract concepts pertaining to the topic. Either way, they form a class of tags that is not trivial to obtain automatically and hence may provide added value in terms of video and frame retrieval. However, these matters require further research before drawing decisive conclusions.
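For illustration, the sketch below mirrors the subtitle comparison described above: teletext lines are tokenized with NLTK and each tag is checked against the words shown in the ten seconds before it was entered. The SubtitleLine structure and the attribute names are assumptions made for this example; only the tokenizer and the 10-second window come from the text.

```python
from dataclasses import dataclass
from nltk.tokenize import word_tokenize  # requires the NLTK 'punkt' data to be installed

@dataclass
class SubtitleLine:
    start_secs: float   # when the dialog excerpt appears on screen
    end_secs: float     # when it disappears
    text: str

def tags_found_in_subtitles(tag_entries, subtitles, lookback_secs=10.0):
    """Return the tag entries whose (lowercased) tag occurs among the subtitle
    words shown in the `lookback_secs` before the tag was entered.
    `tag_entries` is assumed to be an iterable of objects with `tag` and
    `time_secs` attributes (see the earlier TagEntry sketch); the field names
    are ours, not the Waisda? log schema."""
    hits = []
    for entry in tag_entries:
        window_words = set()
        for line in subtitles:
            if entry.time_secs - lookback_secs <= line.start_secs <= entry.time_secs:
                window_words.update(w.lower() for w in word_tokenize(line.text))
        if entry.tag.lower() in window_words:
            hits.append(entry)
    return hits
```

Dividing the number of hits by the total number of tag entries per video gives percentages of the kind reported in Table 5.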

4. CONCLUSIONS

The results of the tag analysis led us to three main conclusions. First, the level of tag illiteracy is relatively high: a substantial part of all Waisda? tags are ill-formed. We believe that this problem can be alleviated by providing the players with general guidelines aimed at avoiding the common pitfalls that render user tags useless. Also, improving the UI (e.g. auto-completion suggestions originating from structured vocabularies) could reduce the likelihood of players entering bad tags. Second, the overlap between user tags and domain and lexical vocabularies is rather low. We attribute the low overlap with a general lexical source like Cornetto to the relatively high number of sloppy tags. However, from the even lower overlap with the GTAA thesaurus we conclude that the tags end-users enter tend to complement those made by professionals. Furthermore, the analysis results show that mapping the Waisda? tags to domain and lexical vocabularies is a highly ambiguous process, which stands in the way of successful integration of user tags with annotations made by professionals. Finally, we found that players are more prone to entering tags that pertain to the visual portion of the video content. We find this highly advantageous, as these are not the kind of tags a professional annotator would make, but they could be very helpful for fine-grained video and frame retrieval. In conclusion, end-user tags in this context have high potential; however, to fully integrate them into the archiving process, the issues around quality control and tag-sense disambiguation need to be solved.

Acknowledgements

We would like to thank the Sound and Vision cataloguer for her feedback on the tags, KRO and Q42 for their invaluable contributions to the Waisda? game, and all users who played the game and provided us with the unique opportunity to analyse this dataset. This research was funded by the PrestoPrime project under the 7th Framework Programme of the European Commission.

5. REFERENCES

[1] L. B. Baltussen. Waisda? video labeling game: Evaluation report, 2009. [Online; accessed 20-January-2010] http://research.imagesforthefuture.org/index.php/waisda-video-labeling-game-evaluation-report/
[2] T. Leason. Steve: The art museum social tagging project: A report on the tag contributor experience. Paper presented at the Museums and the Web 2009 Conference, Indianapolis, Indiana, United States, 15-18 April 2009. [Online; accessed 16-March-2010] http://www.archimuse.com/mw2009/papers/leason/leason.html
[3] P. Schmitz. Inducing ontology from Flickr tags. In Proceedings of the Workshop on Collaborative Tagging at WWW2006, Edinburgh, Scotland, May 2006.
[4] S. Sen, F. M. Harper, A. LaPitz, and J. Riedl. The quest for quality tags. In GROUP '07: Proceedings of the 2007 International ACM Conference on Supporting Group Work, pages 361-370, New York, NY, USA, 2007. ACM.
[5] G. Smith. Tagging: People-Powered Metadata for the Social Web. New Riders Publishing, Thousand Oaks, CA, USA, 2007.
[6] R. van Zwol, L. Garcia, G. Ramirez, B. Sigurbjornsson, and M. Labad. Video tag game. In Proceedings of the 17th International World Wide Web Conference (WWW 2008), Beijing, China, April 2008.
[7] L. von Ahn and L. Dabbish. Designing games with a purpose. Commun. ACM, 51(8):58-67, 2008.
[8] P. Vossen, I. Maks, R. Segers, and H. van der Vliet. Integrating lexical units, synsets and ontology in the Cornetto database. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), 2008.
