Centering Information Retrieval to the User

Bernd Ludwig* — Stefan Mandl*

* Chair for Artificial Intelligence, University of Erlangen-Nürnberg

Haberstraße 2, 91058 Erlangen (Germany) {Bernd.Ludwig,Stefan.Mandl}@informatik.uni-erlangen.de

ABSTRACT. In this paper, we present a novel approach to text mining that helps to build intelligent user interfaces for recommender and information retrieval systems. The main problem for the user in information retrieval is that he must have almost perfect knowledge of the domain and the domain terminology. Our approach eases this burden by showing how to encode domain knowledge so that an information retrieval system can translate the user's way of talking about the domain into the expert's way. After that transformation, the system can search its databases for appropriate information. We demonstrate the practicability of our approach in a case study on a TV recommender system.

RÉSUMÉ. Le présent article introduit une nouvelle approche dans le domaine de la fouille textuelle, dans le but de faciliter le développement d'intelligentes interfaces aux utilisateurs pour systèmes de recommandation ainsi que de recherche documentaire. En recherche documentaire, le principal problème pour les utilisateurs consiste à devoir disposer de connaissances quasiment parfaites du domaine d'application et de sa terminologie. Notre approche vient atténuer cette réquisition en montrant une façon d'encoder les connaissances de domaines d'application de manière à ce que les systèmes de recherche documentaire puissent transformer la terminologie (relative aux domaines) des utilisateurs en celle des experts des domaines respectifs. Cette transformation effectuée, les systèmes peuvent consulter leurs bases de données pour trouver les informations recherchées. La faisabilité de notre approche est démontrée par l'étude de cas d'un système de recommandation d'émissions de télévision.

KEYWORDS: text mining, information retrieval, intelligent user interfaces, recommender systems

MOTS-CLÉS: fouille textuelle, recherche documentaire, intelligentes interfaces aux utilisateurs

RSTI – RIA. Volume 24 – n◦ 1/2010, pages 95 à 118


1. Introduction

State-of-the-art information retrieval (IR) systems are intended to search large databases of unstructured documents, mostly written in natural language. Recent advances in IR technology allow users to enter free-text queries to search the documents.

1.1. The User Perspective in Information Retrieval

Complete automatic comprehension of natural language text is intractable with current technology. No IR system is able to understand either the free-text query or the documents in its database. Therefore, search algorithms are based on crude approximations of text comprehension: typically, a document is considered relevant to a query if the terms in the query and the terms in the document share some similarity.

To the best of the authors' knowledge, despite a huge amount of literature on user models and user adaptation, there seems to be almost no research on the relationship between (the style of) the user's query language and (the style of) the language in which the text documents forming the data basis for information retrieval are written. As typing in natural language queries is the only access a user has to a search engine, one can conclude that users are not really taken into account in information retrieval from texts: how can a user retrieve a document without knowing the terms in the document in order to enter a good query? For laypersons searching for precise technical information, constructing a good query is hard to achieve. Therefore, user-friendly information retrieval is concerned with two orthogonal aspects:

– search algorithms that perform well for a particular retrieval task,
– search criteria that are well suited to precisely describe a user's query and – at the same time – allow the search algorithm to compute optimal results (with respect to an evaluation function).

Often, research focuses on the algorithms, while the criteria are considered easy to find. For example, in (Satzger et al., 2006) an algorithm is presented for preference-based search in databases, but the discussion does not address how the preferences for a particular user are entered, learned, or computed. In our approach, we build a user model in terms of a vocabulary that is typical for a layperson trying to perform an information retrieval task. Thereby, the content of any document is implicitly reformulated in a user-centered way: technical terms are "re-expressed" in the user's vocabulary before IR is performed. In this way, the search task becomes much easier for a layperson.

The paper is structured as follows: in the next section, we discuss the concept of a switch from an expert-centered perspective on the available information to a user-centered one. We illustrate the concept in a case study on a TV recommender system. The system is implemented in two versions: one is available on the Internet using a standard web browser; the second version is built into a commercial TV set (see Fig. 1 for a screen shot). Next, we present a mathematical model of how the discussed switch can be made effective for retrieval or recommender systems relying on text mining approaches for data search. After that, we illustrate how this model has been implemented in our TV recommender system. Finally, we present a user evaluation of the implemented system.

1.2. Information Retrieval from Text Corpora

For instance, given the user task to lay out a document with a picture in the upper right corner, two columns of text floating around the picture, a footer, and a header, the usual experience with any online help system is that the user has to know by himself how a particular text processing system can be used for building complex layouts. In particular, he has to know the terminology of the system developer in order to enter the correct keywords for the explanations he is looking for.

The problem of inexperienced users not knowing the expert vocabulary is crucial for all language-oriented user interfaces. If the user were to formulate queries in a "non-expert" manner, the user interface would have to perform complex reasoning tasks in order to infer the steps necessary to solve the user's problem. In fact, a user interface without these reasoning capabilities requires the user to have expert knowledge about the domain. An important part of the domain knowledge consists in the vocabulary and language used for domain-relevant issues. The same observation holds for information retrieval systems: domain experts express themselves differently from laymen. In concrete applications, there are many variations in the meaning of "user centered" and "expert centered". They depend mainly on the domain knowledge required to solve tasks in the domain.

1.3. Analysis of the Design

In order to set up a case study for an information retrieval system that allows the user to use his/her own vocabulary as far as possible, we analyzed the context of a TV recommender application, the requirements for its implementation, and the design of the information retrieval approach that helps to implement an easy-to-use recommender system.

For many of the programmes, TV broadcasting stations publish short summaries in the Electronic Programme Guide (EPG). TV users can receive them via teletext or via particular EPG functions implemented in recent DVD recorders or satellite tuners in TV sets. To give an example, here is such a summary for a Finnish movie that was broadcast on German television last spring:

Das Ehepaar Ilona und Lauri befindet sich in einer finanziell keineswegs rosigen Situation, als beide auch noch ihren Job verlieren. Ungeachtet der aussichtslos erscheinenden Lage bleiben sie beide optimistisch ...


The married couple Ilona and Lauri is by no means in a financially rosy situation when both also lose their jobs. Regardless of this situation that seemingly offers no prospects, both retain their optimism ...

Reading this summary, one may note that several topics are addressed which are related to negative emotions:

– by no means in a financially rosy situation: difficulty, uncertainty
– lose their job: disadvantage, failure
– offers no prospects: anxiety, scepticism

The text also contains some more positive formulations, in particular retain their optimism. So, all in all, the summary promises that there will be some tension in the plot of the movie. If the user is searching for recommendations for movies with action and chill, this one is probably a good choice. In order to verify the hypothesis that this is a viable way to select a programme from the hundreds of options available, we analyzed the design of an appropriate user interface along three dimensions:

– Context of Use. In a Wizard-of-Oz study, we collected data on how users select a TV programme when they can interact in natural language. Test persons were not subject to any restrictions such as certain TV channels, and there were no limitations in terms of language for the interaction with the wizard. The test persons should feel as comfortable as in a living room while choosing a programme to watch.

– Requirements. The studies revealed that test persons always "navigated" between the several options they liked best. They asked questions about these options to get detailed information for their final decision. Furthermore, the study revealed that test persons talked about some distinct classes of mood and topics in order to filter out nice programmes. It quickly became clear that titles, genres, and broadcasting times played a minor role.

– Design. The analysis of the study resulted in a design that allows users to enter keywords for a content-based search and to set sliders for the most important moods, as they are common also in TV guides.

1.4. Other Approaches for TV Recommenders

TV recommenders have been built before without analyzing textual descriptions of programmes. In (Blanco Fernández et al., 2005; Blanco-Fernandez et al., 2004), ontologies are applied to structure meta-information about programmes. For a particular recommendation, the system solves logical queries over the meta-information. However, the kind of meta-information used there tells much less about a programme's content than a natural language summary does. In (Pigeau et al., 2003), fuzzy reasoning is applied, again on meta data. Approaches incorporating reasoning about user preferences and collaborative filtering of such preferences also rely on meta data. Therefore, the contribution of our approach is the analysis of natural language summaries of TV programmes. The features extracted in this step can be used directly for classification or for deriving additional meta data.

Figure 1. A screen shot of the user interface implemented on a standard TV set. The reader can see sliders for the four moods "fun", "action", "erotic", and "chill". In the last line of the GUI dialog, the user may enter keywords that characterize the topics he is interested in

2. Case Study: TV Recommender Systems and the Real User

Via satellite, the Electronic Programme Guide (EPG) provides an enormous amount of information about TV programmes, with natural language descriptions of the content of programmes. Viewers are overwhelmed by the huge number of channels and programmes when they select a programme to watch. For the design and implementation of TV recommendation systems, sophisticated user models such as (Ardissono et al., 2004) are used. In order to allow for default reasoning, stereotypes for users are applied which are based on the analysis of the average user's lifestyle (see (Gena, 2001)). Much attention is paid to the issue of designing an attractive, functional, and easy-to-use graphical interface between users and the recommendation system (see (van Barneveld et al., 2004)). In order to increase the user's confidence in the system proposals, the generation of trustworthy suggestions that take programmes watched earlier into account has been studied in detail (see (Buczak et al., 2002)). For the implementation of the search, different approaches and theories of reasoning have been applied: (statistical) classifiers such as neural nets (Zhang et al., 2002), fuzzy logic (Yager, 2003), and similarity-based reasoning (McSherry, 2002) – to name the most prominent ones. We use a text mining based approach as the necessary computations can be carried out quickly. Performance is an important factor as the system runs in real time on an embedded platform used in a standard commercial TV set (see Fig. 1 for an example of what the GUI looks like).

In a user study (Nitschke et al., 2003) conducted as part of the research project EMBASSI (see (Herfet et al., 2001)), candidates were seated in front of a computer display that suggested an automatic recommendation system was at work. The users were allowed to ask arbitrary questions about available TV programmes; a Wizard-of-Oz provided the responses. The experiments showed that users express emotional attitudes they desire the programme to have, or even their own emotions, hoping the system would come up with proposals that match their mood: Liebe, Romantik (love, romance); Entspannen (relax); Show, Witz (show, fun).

In the domain of recommending TV programmes, domain knowledge consists of know-how about the user's preferences, his current interests, the programmes currently available, and an algorithm that computes matches between the programmes and the user's interests and preferences. In the application of our case study, all knowledge is represented in terms of natural language descriptions. There is no classification system at hand which the system could use to restrict the search to formally given filter criteria. On the other hand, the user's interest is represented in natural language as well, or in the setting of the four sliders that are used to characterize the mood of programmes the user is interested in.

3. Standard Text Mining

In the literature (see for example the excellent text by (Manning et al., 1999)), text mining is the accepted standard approach to information retrieval from unstructured text. The basic idea is taken from pattern recognition: two documents with similar content use similar words. So, documents that are relevant for a given query contain patterns of keywords similar to the pattern defined by the query. Any syntactic structure or discourse context is ignored. Instead, a document is viewed as a bag of words (a multiset in the language of mathematicians) in which the order of the individual words does not matter. To compare two documents, each is transformed into a vector. The dimension of the vector is as high as the number of different words that occur in any document. Therefore, the vector's dimension is the same for each document.

On the basis of this formal representation for documents, we view each document as the result of a random process that generates the bag of words (this stochastic model substitutes for syntactic and semantic structure as well as for the discourse context of the document). Therefore, there is a certain probability for each word to be used in the document. This probability is usually estimated by the relative frequency of the word, i.e. by the number of times the word occurs in the document compared to the number of all words in the document. This magnitude is called the term frequency of the word w:

tf(w) = count(w) / N

(N is the total number of words in the document.)


Figure 2. The relative position of two vectors can be compared by computing the angle α between them

With this definition, we can define the value of each dimension i in a document vector: it is the term frequency of the i-th word in the corpus. All words not occurring in the current document have the value 0. Formally, if we have 1 ≤ j ≤ D documents and 1 ≤ k ≤ W different words in all documents, the vector for document j is:

f_j = (tf_j(w_1), tf_j(w_2), ..., tf_j(w_W))^T

As a vector is a geometrical object, the relative position of two vectors can be compared geometrically, as shown in the two-dimensional example in Fig. 2. As two vectors are the more similar the smaller the angle between them is, two documents are the more similar the more similar their term frequency vectors are. As relative term frequencies are probabilities for the occurrence of a word (of the whole corpus) in a document, the geometric notion of similarity also has a probabilistic interpretation: two documents are the more similar the more identical the probabilities of each word occurring in one of the documents are. Two documents are identical if ∀i : tf_j(w_i) = tf_k(w_i).

Experience with this model shows that it is not sufficiently sensitive to interesting and exceptional cases: tf(w) is particularly high if w is a function word that occurs frequently in a text; examples are the, and, is, and many others. So we have to modify the value for each token in a document. We must weigh tokens according to how special they are: the aim is to highlight tokens that occur often in only a few documents. This means the weight we are looking for must be inverse to the proportion of documents containing w among all documents. Assume we have N_w documents containing w and M documents in total. With these values, we have

P(w occurs in a document) = N_w / M

One could suppose now that a good weight for the term frequency of w is

idf_h(w) = 1 / P(w occurs in a document) = M / N_w

as this formula results in a high weight if w occurs in only a few documents and in a small weight otherwise. Unfortunately, this function emphasizes rare words too much: as you can see from Fig. 3(a), if w occurs in just one document, w's value is weighted with M (and 10,000 is not a big number for M). On the other hand, if w occurs in 10 per cent of all documents, the weight is only 10. So we have to smooth idf_h. This is usually achieved by taking the logarithm:

idf(w) = log idf_h(w) = log(M / N_w)

Fig. 3(b) shows that the disproportion between words occurring very rarely and words occurring more often is not as dramatic as it was for idf_h. Finally, the value of a term is computed as tf(w) × idf(w).

This mathematical model of standard text mining has a severe implication for information retrieval: if one of the two documents is considered to be the user's query (as is usually done in text mining), the approach requires user-centered knowledge and expert-centered knowledge to be almost identical. However, in our case study, this approach does not work:

1) If the rankings are used as input, user queries and descriptions have different dimensions, and the value of a dimension in the query vector has a different meaning than that in the description vector.

2) If natural language is used as input, the vocabulary of user queries is so different from that in the descriptions that matches will be found rarely, if at all.

This insight calls for a modified approach to text mining if one wants to allow the user's language to differ significantly from the expert language (used for writing the documents from which information shall be retrieved).

Figure 3. Comparison between the two different formulae, (a) idf_h(w) and (b) idf(w), for computing the document-frequency weight of a word in a corpus


finden (to find): 11.18, 20.7
Kai (proper name): 1.17, 8.14, 15.1
offenbar (obviously): 5.6, 7.1, 12.3, 12.33
offenbaren (to make obvious): 12.3, 12.5, 12.48, 14.22
opfer (victim): 2.41, 5.42, 9.76, 10.13, 15.14, 15.53, 15.58, 20.14, 21.27, 22.12, 22.15
tot (dead): 1.22, 2.22, 2.40, 3.4, 7.32, 9.19, 9.39, 10.25
wald (forest): 2.1

According to this assignment of Dornseiff groups to the words in the document, the matrix M for the first two words consists of the following elements:

group    finden (to find)   Kai
1.17     m_0,0              m_0,1
8.14     m_1,0              m_1,1
11.18    m_2,0              m_2,1
15.1     m_3,0              m_3,1
20.7     m_4,0              m_4,1

Figure 4. Example for the mapping of words to topics

4. User Centered versus Expert Centered Information

How can "user centered" and "expert centered" knowledge be distinguished? For TV programmes, it is common to rate programmes along several dimensions. User centered knowledge can be represented as a four-dimensional ranking vector

v = (ranking for action, ranking for chill, ranking for fun, ranking for erotic)

Expert knowledge entailed in the programme descriptions is hard to analyze completely by automatic means, as the natural language understanding problem is not solved in general. However, text mining algorithms are a well-known technique for such tasks. In order to answer the question whether a text mining approach could be successful, we tried to cluster German TV programme descriptions using a large sample of descriptions from the EPG data stream and the WEKA (see (Witten et al., 2005)) implementation of the k-means clustering algorithm. We used 2,270 summaries containing 14,575 different terms; on the average, one summary had 7.06 terms. The experiments showed that the clustering could indeed separate programmes that were also considered different by a human labeler:

Cluster   Topics
1         news, entertainment, movies
2         regional news
3         magazines


Consequently, this study leads to the hypothesis that the descriptions can be used as input for a text mining algorithm, so that they constitute the "expert centered" knowledge about the domain.

4.1. How to Switch the Perspective

The question, therefore, is how to come up with some sort of inference that gives good results in terms of precision and does not require computational resources beyond the limits of the embedded platform. If one analyzes the sample corpus of user queries, it becomes obvious that users talk about topics and moods, not about concrete content. They ask for "something chilling and frightening" and not for a "movie in which a father kills his family one after another during the summer holiday". This observation leads to the conclusion that in our case study, inference means finding the topics (and emotions) a movie is about and finding the topics (and emotions) the user talks about. After this generalization of concrete content, which results in a list of topics, matches can then be computed by comparing feature vectors on this higher level of abstraction.

4.2. Case Study: Is Switching Practicable?

In order to validate this last hypothesis, we repeated the study of whether programmes with different content can be separated by k-means clustering, this time after assigning a list of topics to each description in the sample corpus (see Sect. "Finding Topics in Words"). The clusters were even better than in the first study:

Cluster   Topics
1         travel, entertainment
2         documentation, information
3         movies
4         infotainment
5         magazines

How can this approach be incorporated into the mathematical model of text mining? The mathematician's view is that of a mapping from one vector space (word frequencies in documents) to another one (topic frequencies in a text). Assuming a linear relationship, a matrix can be computed for this mapping.

4.3. Performing the Switch: Generalizing Text Mining by Abstraction

What should such a matrix look like? With n different words and m different topics, we would like to have a matrix M of dimension m × n with:

(t_1 ... t_i ... t_m)^T = M · (w_1 ... w_i ... w_n)^T


For an example of what the matrix M can look like, see Fig. 4. There are n different German base forms occurring in the description of a movie and m different topics encoded in the form x.y (the explanation of these codes follows in Section "Finding Topics in Words" below). Each of these codes constitutes one of the m rows of M.

What is the benefit of M? If one multiplies M with a TF/IDF vector for the n different base forms, the result is an m-dimensional vector. Each dimension stands for one topic, and the value of each dimension is a weighted sum of TF/IDF values. The weights are stored in the matrix M and relate words and topics on a quantitative basis. If we simply want to count the frequency of topics after counting the frequency of words, M contains as elements:

m_i,j = 1 iff word j is in topic i; m_i,j = 0 otherwise

For the example in Fig. 4, the computation would be the following:

( tf-idf(1.17)  )   ( 0 1 )
( tf-idf(8.14)  )   ( 0 1 )   ( tf-idf(finden) )
( tf-idf(11.18) ) = ( 1 0 ) · ( tf-idf(Kai)    )
( tf-idf(15.1)  )   ( 0 1 )
( tf-idf(20.7)  )   ( 1 0 )

Now we are ready to compute the similarity s of two vectors v and w:

s = (M · (v_1 ... v_n)^T) · (M · (w_1 ... w_n)^T) = (t_1^v ... t_m^v) · (t_1^w ... t_m^w) = t^v · t^w

t^v and t^w are the generalized feature vectors (topic vectors) of the two texts under consideration. This is the standard cosine similarity measure used in many classification tasks. The matrix M encodes the switch from user to expert knowledge: the entries of M relate words and topics statistically on the basis of a lexicon for German. For other applications, M could be changed, and its entries can even be determined using machine learning techniques.
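The worked example above can be reproduced in a few lines. This is only an illustrative sketch (NumPy is assumed; topic and word order follow Fig. 4, the TF/IDF values are made up):

import numpy as np

# Rows: topics 1.17, 8.14, 11.18, 15.1, 20.7; columns: words (finden, Kai).
# m_ij = 1 iff word j belongs to topic i (the binary variant of M).
M = np.array([
    [0, 1],   # 1.17  Ufer        <- Kai
    [0, 1],   # 8.14  Schiff      <- Kai
    [1, 0],   # 11.18 Entdeckung  <- finden
    [0, 1],   # 15.1  Vorname     <- Kai
    [1, 0],   # 20.7  Erwerb      <- finden
])

w = np.array([0.4, 0.7])   # hypothetical TF/IDF values for (finden, Kai)
t = M @ w                  # topic vector: weighted sums of the word scores
print(t)                   # -> [0.7 0.7 0.4 0.7 0.4]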

4.4. Iteration of the Abstraction Process

From the mathematical point of view, t^v and t^w are again vectors and can be interpreted as feature vectors in an m-dimensional vector space. Of course, one may apply another linear transformation to such a feature vector and repeat the abstraction process, i.e. generalize once again. What could this be good for? Remember the case study above: in our recommender system, users can also change the values of four sliders for action, chill, fun, and erotic. Formally, this results in a ranking vector

r = (r_action, r_chill, r_fun, r_erotic)^T

which has dimension 4. One can interpret each of these four dimensions as a cluster of topics. Consequently, we can compute a feature vector in this space of moods from a feature vector in the space of topics by applying another matrix N of dimension 4 × m. An element of this matrix is:

n_i,j = r_j iff topic j addresses mood i; n_i,j = 0 otherwise

So, on this level of abstraction, the similarity s of two texts is:

s = (N · t^v) · (N · t^w) = m^v · m^w

As you can see, the same computations have to be carried out as for the mapping from words to topics. Again, N can be constructed using any knowledge acquisition technique that computes a quantitative correlation between the dimensions of the input vector and those of the output vector.
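Continuing the sketch from Section 4.3, the second abstraction step looks identical in code; again an illustrative sketch only, with an invented binary N for m = 5 topics and the four moods:

import numpy as np

# Hypothetical 4 x m matrix N: each row is one mood (action, chill, fun, erotic);
# n_ij > 0 iff topic j addresses mood i.
N = np.array([
    [0, 0, 1, 0, 0],   # action
    [1, 1, 0, 0, 0],   # chill
    [0, 0, 0, 1, 0],   # fun
    [0, 0, 0, 0, 1],   # erotic
])

t_v = np.array([0.7, 0.7, 0.4, 0.7, 0.4])   # topic vector of text v (Sect. 4.3)
t_w = np.array([0.1, 0.9, 0.2, 0.3, 0.0])   # topic vector of text w (invented)

m_v, m_w = N @ t_v, N @ t_w                 # mood vectors
s = m_v @ m_w                               # similarity on the mood level
print(m_v, m_w, s)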

5. The Implemented Recommender System This section describes the system that has been implemented during our case study on generalized text mining. It also explains details on the knowledge sources used for the construction of the transformation matrices M and N described above.

5.1. Finding Topics in Words

The linguistic knowledge we exploit for finding the list of topics a German word is associated with is based on the DORNSEIFF lexicon for German (cf. http://wortschatz.uni-leipzig.de). Like a thesaurus, it groups words according to certain topics, i.e. each group contains words (even of different word categories) that describe a particular aspect of a certain topic. The DORNSEIFF lexicon is not a synonym lexicon, but a "topic" lexicon. Structured in a two-level hierarchy, the lexicon organizes topics in chapters (e.g. chapter 15 contains subtopics of social life) and sub chapters (e.g. sub chapter 15.39 is the topic reward). If the meaning of a word is ambiguous, it is listed in more than one sub chapter. For the German words finden (to find) and Kai (name of a boy, or quay), the lexicon lists the following topics (cf. Fig. 4):

group id   description
1.17       Ufer (the coast)
8.14       Schiff (the ship)
11.18      Entdeckung (the detection)
15.1       Vorname (first name)
20.7       Erwerb (the acquisition)

For the construction of the matrix M this means:

m_1.17,Kai = 1    m_8.14,Kai = 1    m_15.1,Kai = 1
m_11.18,finden = 1    m_20.7,finden = 1

Fig. 5 illustrates how topics are used for finding good matches. In the example, the natural language user input can be generalized to a list of topics – here, there is only one topic: 14.18. To this topic a list of phrases is assigned. Each of them serves as a search term for which we compute the TF/IDF value in each available document. As explained in Sect. 4.3, the TF/IDF value for topic 14.18 is computed as a weighted sum of the TF/IDF values of each search term assigned to 14.18.

Figure 5. Relation between words and topics in a query and in a free text description: the user query contains "show", which evokes topic 14.18; the phrases in this topic (show, light music, vinyl, pop hit, ...) are used as search terms against the documents
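A minimal sketch of how such a lexicon lookup can be turned into the matrix M; the lexicon dictionary below is a hypothetical stand-in for the Dornseiff data, which we cannot reproduce here:

import numpy as np

# Hypothetical excerpt of a Dornseiff-style lexicon: base form -> group ids.
lexicon = {
    "finden": ["11.18", "20.7"],
    "kai":    ["1.17", "8.14", "15.1"],
    "wald":   ["2.1"],
}

words = sorted(lexicon)                                      # column order
topics = sorted({g for gs in lexicon.values() for g in gs})  # row order

M = np.zeros((len(topics), len(words)))
for j, w in enumerate(words):
    for g in lexicon[w]:
        M[topics.index(g), j] = 1.0   # m_ij = 1 iff word j is in topic i

print(topics)
print(M)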

5.2. Computing the Distance between Topics

For clarity of presentation, one point has not been mentioned so far: in the implemented recommender system, the matrix M is actually not binary. The values for m_i,j are computed by applying another heuristic. In the field of word sense disambiguation, several metrics (see (Pedersen et al., 2005)) are known to measure the semantic distance between two words. Some of these metrics use the WordNet dictionary (see Fig. 6 for an example entry) for the distance dist(s, t) between the words s and t. To compute the distance between two words, one can count the number of hyperonym/hyponym steps necessary to move from s to t in the WordNet hierarchy. Fig. 6 shows the first steps starting with the word love; the steps are indicated by the links direct hyperonym and direct hyponym, respectively. We applied such a metric to compute the distance between each pair of topics in the Dornseiff lexicon. This results in a quadratic matrix S of dimension m × m:

(s_i,j)_{1 ≤ i,j ≤ m} with s_i,j = dist(i, j)

where dist(i, j) is computed as the distance between the topic names s and t of the pair of Dornseiff topics (s, t). The matrix S allows us to rank those topics high that occur in the same text and close to each other in terms of their semantic distance in WordNet. So, our recommender system uses a modified version of the matrix M, which is calculated as M̃ = S · M.
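The paper does not name the exact WordNet metric, so the sketch below uses NLTK's hyperonym/hyponym path length as one plausible instantiation; the English topic names are hand-translated stand-ins for Dornseiff topic names:

from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")
import numpy as np

# Hypothetical English translations of Dornseiff topic names.
topic_names = ["coast", "ship", "detection", "acquisition"]

def dist(a, b):
    # hyperonym/hyponym path length between the first synsets of a and b
    sa, sb = wn.synsets(a), wn.synsets(b)
    if not sa or not sb:
        return 0
    d = sa[0].shortest_path_distance(sb[0])
    return d if d is not None else 0

m = len(topic_names)
S = np.array([[dist(topic_names[i], topic_names[j]) for j in range(m)]
              for i in range(m)])
# The implemented system then works with the modified matrix M~ = S . M.
print(S)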

5.3. Detecting Emotions in Texts

In order to allow the emotional aspect as a recommendation criterion, we use the valence/arousal model by WHISSEL (described in (Cowie et al., 2001), pp. 39), which assigns a fixed position in a two-dimensional coordinate system to each emotion represented by an adjective. Valence indicates emotion quality, i.e. whether it is considered a positive or negative emotion. Arousal stands for the activation level an emotion possesses, "i.e. the strength of the person's disposition to take some action rather than none" ((Cowie et al., 2001), p. 39). The table in Fig. 7 shows some English adjectives along with their German translation and their position in the two-dimensional valence/arousal model (see the columns activ and eval).

Figure 6. A WORDNET entry

Figure 7. Some emotion words with V/A coordinates

Fig. 8 illustrates the results of WHISSEL's approach graphically: all emotions are plotted in a two-dimensional space. It also illustrates that a list of topics is associated with each emotion identified by WHISSEL. This relation between topics and emotions is computed in the following way: a topic belongs to an emotion if, in the Dornseiff lexicon, a word of the word family for the translation of an English emotional adjective is in this topic. Furthermore, the valence/arousal model makes it possible to localize the moods fun, sadness, chill, or action (see the four sliders in Fig. 1) in certain regions of the valence/arousal space. This (indirect) relationship between moods and vocabulary leads to the construction of the elements n_i,j of matrix N and can be exploited for finding recommendations: if the user wants to see something chilling (r_chill is high), then the topics assigned to the emotions in the region for chill are good features to search for (as they deliver a high contribution to the computation of the scalar product if the appropriate values w_i are high).

Figure 8. From emotions to topics: The emotion selected in the valence-arousal "space" is characterized by the topics 9.36: enthusiasm, eagerness; 9.40: carefulness, accuracy; 10.39: caution; 11.3: deliberation, thought, consideration; 20.12: canniness, parsimony; 9.4: readiness; 10.51: good will, courtesy, benevolence
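To make the mood-to-topic path concrete, here is a sketch of the lookup just described; the coordinates, region boundaries, and emotion-to-topic assignments are all invented for illustration, only the group labels (e.g. 10.39 caution) follow the figures above:

# Hypothetical valence/arousal coordinates per emotion adjective (Whissell-style).
emotions = {"afraid": (-0.5, 0.6), "gleeful": (0.8, 0.7), "serene": (0.6, -0.4)}

# Hypothetical regions of the V/A plane for two of the four moods.
mood_regions = {
    "chill": lambda v, a: v < 0.0 and a > 0.3,   # negative valence, aroused
    "fun":   lambda v, a: v > 0.5 and a > 0.3,   # positive valence, aroused
}

# Hypothetical emotion -> Dornseiff group assignment (via word families).
emotion_topics = {"afraid": ["10.39"], "gleeful": ["10.21", "10.22"],
                  "serene": ["9.40"]}

def topics_for_mood(mood):
    # collect the topics of all emotions lying in the mood's V/A region
    return sorted({t for e, (v, a) in emotions.items()
                   if mood_regions[mood](v, a)
                   for t in emotion_topics[e]})

print(topics_for_mood("chill"))   # -> ['10.39']
print(topics_for_mood("fun"))     # -> ['10.21', '10.22']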

5.4. Computing Recommendations

In order to compute proposals matching a user query, the recommender system computes the similarity value

s_q,k = (N · M̃ · q) · (N · M̃ · k)

for a given query q and each programme description k which is broadcast in the time interval selected by the user. The descriptions are ranked in descending order of s_q,k, as a high value of the scalar product means that q and k are close together in the vector space, and a small angle between them indicates a good match. Fig. 9 presents an overview of our approach and sketches the processing pipeline of the implemented recommender system.
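Putting the pieces together, ranking reduces to a couple of matrix products. A minimal sketch with random toy data (all dimensions and values invented):

import numpy as np

def score(q, k, N, M_tilde):
    # s_qk = (N . M~ . q) . (N . M~ . k), cf. the formula above
    return (N @ M_tilde @ q) @ (N @ M_tilde @ k)

def recommend(query_vec, descriptions, N, M_tilde):
    # rank programme description vectors by descending similarity to the query
    scores = [score(query_vec, k, N, M_tilde) for k in descriptions]
    return np.argsort(scores)[::-1]        # indices, best match first

# Example: 10 descriptions, 50 words, 20 topics, 4 moods.
rng = np.random.default_rng(0)
M_tilde = rng.random((20, 50))
N = rng.random((4, 20))
docs = rng.random((10, 50))
q = rng.random(50)
print(recommend(q, docs, N, M_tilde))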

6. Evaluation

Whenever different users rated the same set of recommendations for a given query, there was a high variance among their judgments. Hence, due to the low inter-labeler correspondence, a typical evaluation of the system in terms of precision and recall turned out to be hard to achieve. Instead, we focused our first evaluation on general user satisfaction and acceptance ratings: in a public presentation of the demonstrator system, people of different sex, age, education, and interests tested the system. They used the NL interface by typing in queries or playing with the sliders.


Figure 9. The switch from the user perspective to the expert's view is performed in a sequence of processing steps: from the user's slider settings, the system determines which emotions evoke the selected moods (interesting points in the valence/arousal space); it then determines how these emotions are verbalized (the relevant Dornseiff groups) and which vocabulary is characteristic for them; this vocabulary is used as search terms to construct a query, and text mining based on the vector space model retrieves the recommendations (i.e. the documents matching the query)

answer        share    answer         share
very useful   36%      useless        7%
useful        26%      very useless   1%
fair          20%      no idea        10%
(a) Answers to Question 1

answer      share    answer     share
very good   20%      bad        10%
good        28%      very bad   5%
fair        22%      no idea    15%
(b) Answers to Question 2

answer   share    answer        share
yes      84%      do not know   7%
no       2%       no idea       7%
(c) Answers to Question 3

Figure 10. Results of the user evaluation

People tested the system for about 15 minutes and entered around five queries on the average. The system responded by presenting a list of the programmes on air at the time of the query, sorted according to how well each programme matched the query. Afterwards, the users filled in a questionnaire; 60 questionnaires were evaluated.

The first question was: How useful do you consider the text input-and-search function of the system? It was answered as shown in Fig. 10(a). Test persons used the standard remote control of the TV set with one special key which invokes the GUI dialog in Fig. 1. In this way, they had nothing new to learn for the new feature of the TV set and could concentrate completely on the recommendations.

The second question was: How appropriate do you consider the proposals you got from the system? The answers in this case are shown in Fig. 10(b). For deciding about the appropriateness of the proposals, the candidates could read the description along with a graphical explanation (see Fig. 11 for an example) of what the system had computed. The explanation shows all base forms in the EPG summary of a programme. It displays the words used for the recommendation in green and links them to the associated topics – for the user's query (left side) as well as for the summary (right side). The thickness of the lines between the topics on the left and on the right indicates the (relative) importance of each topic. The explanation and the full text supported the users in judging the quality of the system's suggestions.

The last question was: Would you recommend the system to a friend? The answers in Fig. 10(c) demonstrate that users were content with the system.

In order to assess the discriminative power of the Dornseiff groups as a vehicle for reducing the dimensionality of the feature vector for the text mining algorithm on a semantic and pragmatic basis, we analyzed a representative corpus with the Latent Dirichlet Allocation (LDA) algorithm (see (Blei et al., 2003) for mathematical details). This machine learning approach views topics as random variables and the selection of a word in a document as the result of a random process which is conditioned on the topic. As a document may address more than one topic, it is modeled formally as a mixture of topics whose number in a document depends on drawing a random number from a probability distribution. Thus, LDA topics can be interpreted as a solution for the switch of perspective that is different from the Dornseiff groups and can therefore be used as a baseline for measuring the quality of the Dornseiff groups.

For the comparison, we built histograms of the distribution of the Dornseiff groups for each topic computed by LDA. The bars in Fig. 12 result from applying the following formula:

count(d, t) = Σ_{w ∈ t} count(w | d is a Dornseiff group of w)

So, for each word w in topic t, we count how often it occurs in the corpus and add this number to each Dornseiff group d to which w belongs. To make the histogram more intuitive, we exploit the hierarchical structure of the Dornseiff groups and add the counts for the sub groups of each (super) group. For example, the peak around group 400 is produced by the sub groups 10.20 (hilarity), 10.21 (enjoyment, laughing), 10.22 (joking), and 10.23 (ridiculous), which all belong to the super group 10 (feelings, affects). The sums are shown by the green line in the chart. The black lines show the expected value of the frequency of each super group according to the probability distribution computed by LDA. Intuitively, the green line is most of the time higher than the black line. This indicates that the Dornseiff groups are more sensitive features than the LDA topics; we are currently verifying this hypothesis with statistical tests on different levels of significance.

Beyond the technical analysis, we are currently setting up an online evaluation framework for our recommender system in order to perform long-term user studies. This framework will enable us to compare recommendations based on several algorithms: at the moment, we are comparing our approach to a naive Bayes classifier that uses standard TF/IDF vectors. Finally, the framework allows us to compare our recommender with existing tools available, e.g., on the Internet.
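For the LDA baseline, a minimal sketch of how such topics can be fitted; the choice of gensim is our assumption (the paper only cites (Blei et al., 2003)), and the corpus is a toy stand-in for the lemmatized EPG summaries:

from gensim import corpora, models

# Toy stand-in for the lemmatized EPG summaries.
texts = [["paar", "job", "verlieren", "optimistisch"],
         ["show", "musik", "pop", "hit"],
         ["opfer", "tot", "wald", "finden"]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=3,
                      random_state=0)

# For each LDA topic, the top words could then be mapped to their Dornseiff
# groups to build the histograms compared in Fig. 12.
for topic_id in range(3):
    print(topic_id, lda.show_topic(topic_id, topn=4))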

7. IntelliZap – A First Application of the Approach

The IntelliZap remote control for programme recommendations is based on the approach described in the last section. On the basis of our evaluation, we extended the system to allow for more interactivity. The user can still set slider values and interesting topics and choose among the recommended programmes. Beyond that, IntelliZap allows the user to zap to another channel according to the analysis of the summaries and the slider settings, instead of simply switching to the channel stored in the next position in the TV set's memory.

Figure 11. Explanation of a recommendation (screen shot of the prototype system; the visible fragments show links to the topics 18.29 velocity, 15.48 fight, and 15.51 war, and the words civilisation and war)

Figure 12. Comparison of the indicative power of LDA groups and Dornseiff groups for a certain topic (here: comedy)

7.1. Modes of Usage

When the user asks for a recommendation, the system ranks all available programmes according to their distance to the user settings (as we will explain later, we experimented with three different distance measures). This ordering serves as a short-time ordering for zapping and is stored until the user asks for a new recommendation. IntelliZap provides four modes of usage that define how zapping is performed given a fixed short-time ordering (a sketch of the traversal logic follows the list below):

– similar: in this mode, when the user presses the button next programme, the system proposes the next programme in the short-time ordering that has not been recommended yet. This means that recommendations become increasingly different from the original recommendation until a threshold is reached; in this case, the user is alerted that no further recommendations can be made.

– dissimilar: traverses the fixed short-time ordering in the opposite direction of similar, starting at the end of the list. With this function, the user can select a programme with different emotional attributes compared to the original recommendation.


– random: ignores the ordering completely and selects an arbitrary programme. This mode serves as a baseline to show that the other modes on the average compute better suggestions than a mode that randomly draws recommendations from a uniformly distributed set of programmes.

– newest: helps the user in updating the short-time ordering. When the user invokes this function, the system reorders the ranked programmes according to their distance to the current time. The result is that programmes that started some time ago or will start in the far future get a much lower priority than those that started just a few minutes ago or will start in a few minutes.

The new zapping function described above uses exactly the same algorithm that was explained in Section 5. Its main characteristic is that it compares single programmes.
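The sketch announced above: the data structures are invented, and the newest mode (a re-sort by temporal distance) is omitted; only the traversal logic mirrors the modes just described:

# Hypothetical short-time ordering: programme ids ranked best-first.
ordering = ["p7", "p3", "p9", "p1", "p5"]

def next_programme(mode, already_seen, threshold=4):
    # pick the next programme for one press of the "next programme" button
    if mode == "similar":                     # walk the ranking front to back
        candidates = ordering
    elif mode == "dissimilar":                # walk it back to front
        candidates = list(reversed(ordering))
    elif mode == "random":                    # baseline: ignore the ordering
        import random
        return random.choice(ordering)
    else:
        raise ValueError(mode)
    for rank, prog in enumerate(candidates):
        if prog not in already_seen:
            if mode == "similar" and rank >= threshold:
                return None                   # alert: no further recommendations
            return prog
    return None

seen = {"p7"}
print(next_programme("similar", seen))        # -> "p3"
print(next_programme("dissimilar", seen))     # -> "p5"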

7.2. Clustering Broadcasting Stations

However, it may often be of interest to make the transition from programmes to broadcasting stations. Knowing which stations offer similar content helps in grouping channels in the TV set's memory: stations with similar content can be stored in positions not far from each other, while stations with different content may have locations that are far apart.

7.2.1. Applications

Such a feature can be particularly helpful when setting up the TV set for the first time. Normally, new TV sets scan all frequency bands automatically and store channels in the same order in which they found them. This is very unintuitive as it does not take user behavior into account. Bootstrapping the TV set with information about which channels are similar from the average user's point of view would help to decrease the time needed for configuration. Clustering programmes and broadcasting stations is also useful for learning user preferences from implicit and explicit feedback to the recommendations.

7.2.2. Implementation and Results

EPG data is updated continuously, so a fixed sample is necessary for the comparison of results. In order to construct one, we collected enough data to cover two weeks and drew 1,000 programmes randomly. In the EPG data, many programmes lack a summary or only have a very short one, in particular regular programmes such as news or daily soaps. To avoid noise from missing summaries, all programmes with a summary shorter than a threshold of 255 characters were eliminated from the sample; 700 programmes whose summaries had between 255 and 1,024 characters remained. We applied several different cluster algorithms from the k-means family (with k between 6 and 24, as indicated in the table) and also tested three different distance measures: scalar product, weighted Euclidean distance, and unweighted Euclidean distance. The results of the experiments can be discussed in two ways: first, by looking at statistical magnitudes such as variance and, second, by looking into the clusters and comparing them with expert knowledge in the TV domain (a small clustering sketch follows; the technical analysis is shown in Figure 13).
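As an illustration of these clustering experiments, a minimal sketch with scikit-learn's k-means; this library choice is ours for the sketch (the paper used WEKA and further k-means variants), and the feature vectors are invented:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((700, 4))          # invented: one mood/topic vector per programme

for k in (6, 12, 24):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # within-cluster variance, comparable to the var(k) columns of Fig. 13
    print(k, km.inertia_ / len(X))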

Scalar Product
Algorithm        var(6)   t(6)     var(12)  t(12)    var(24)  t(24)
K-means          0.9249   0:16.7   0.7874   0:31.4   0.5924   0:56.9
Farthest first   0.9191   0:21.8   0.7663   0:53.5   0.5832   2:30.4
Kmeans++         0.9252   0:20.5   0.7847   0:40.3   0.5834   1:16.5
Outlier km       0.9252   0:16.5   0.7830   0:31.6   0.5358   1:06.5

Unweighted Euclidean Distance
Algorithm        var(6)   t(6)     var(12)  t(12)    var(24)  t(24)
K-means          0.9530   0:14.2   0.8495   0:31.2   0.6864   1:04.0
Farthest first   0.8984   0:16.5   0.7438   0:51.8   0.6914   2:42.3
Kmeans++         0.9352   0:18.0   0.8224   0:37.6   0.6092   1:20.2
Outlier km       0.8965   0:15.6   0.6706   0:37.1   0.4313   1:03.9

Weighted Euclidean Distance
Algorithm        var(6)   t(6)     var(12)  t(12)    var(24)  t(24)
K-means          0.9029   0:26.6   0.7542   0:52.9   0.5517   1:23.4
Farthest first   0.9129   0:35.9   0.7248   1:42.4   0.5611   4:32.9
Kmeans++         0.8912   0:34.7   0.7073   1:13.5   0.4477   2:04.4
Outlier km       0.8984   0:27.3   0.7333   0:57.7   0.4886   1:47.3

Figure 13. Variance within clusters applying different distances (var(k) is the variance for k ∈ {6, 12, 24} clusters; t(k) is the time in min:sec needed for clustering)

The numbers indicate no general tendency in favor of a certain cluster algorithm. However, it is obvious that the weighted Euclidean distance is the best of the three distance measures for minimizing the variance – at the cost of increased processing time.

Expert knowledge for this clustering task is hard to obtain on an objective basis, as the decision to claim two broadcasting stations to be close to each other involves a lot of personal preferences which vary greatly between "experts" (i.e. people watching TV regularly). However, applying expert knowledge is necessary, as the variance may be low while the clusters do not have any meaning at all in terms of the domain. Therefore, we built a "jury" from members of the research group who judged the clusters according to their knowledge of the broadcasting stations. The main observation was that clusters built with the weighted Euclidean distance were the most homogeneous ones. Often, clusters were filled with programmes and their repetitions on other channels or at other times – a hint towards the precision of the implemented algorithm.

To validate the clustering computed on the basis of topics and emotions, we clustered in parallel according to the metadata available for each programme: broadcasting stations assign genre codes to programmes following an international standard (see (EN 300 468, 1997)). In almost all cases, one code is assigned to one programme. Therefore, the expectation was that clustering according to genre codes has to struggle with less noise than clustering according to topics and emotions.


Clustering according to topics and emotions proved to be very stable in separating thematic channels (for sports, for children, etc.) from the general ones that broadcast a wide range of programmes. All major German broadcasting stations were put into this cluster in almost all experiments. The channels that changed their cluster most frequently were those with short summaries. Clustering according to meta data found no clusters that contradicted the previous findings. However, this approach produced fewer empty clusters; consequently, the populated clusters were more fine-grained: one cluster contained channels that most of the time broadcast soaps, which are normally assigned a particular genre code.

8. Conclusions

8.1. Theoretical Approach

We presented a mathematical formulation of a generalized approach to text mining. "Generalized" refers to the notion that not only words (and derived quantitative measures such as TF/IDF) serve as features, but that semantic abstractions can be made which cluster the vocabulary into several categories that encode domain-relevant knowledge. We argued that this approach is suited to building recommender systems that analyze natural language documents as their data basis for computing proposals. From the user's point of view, the main advantage of our approach is that the described process of abstraction can be used to switch from expert-centered vocabulary to user-centered vocabulary. This enables the user to get good results for his queries even without knowing the expert terminology of an application domain.

8.2. A Working Prototype In a case study of a TV recommender system, we show that our approach is practicable and accepted by the users who took part in an evaluation of our system. From the computational point of view, the implemented approach is fast enough to perform in real-time even on embedded systems with low memory and CPU performance.

8.3. Further Work: User Adaptivity

In a real application, the approach presented here depends heavily on the quality of the matrices described above. The individual values in the cells of matrix N tell the recommender how important each emotion is for each of the moods action, chill, fun, and erotic. A value in matrix M tells which topics are relevant for an emotion. Obviously, these values may vary from user to user. To be user-adaptive, the system must therefore learn the values in the matrix cells from user feedback. For this reason, we are currently extending the user interface so that the user can rate a chosen programme. If the feedback is positive, the association between words, topics, and emotions is reinforced. In this way, the matrices M and N are adapted incrementally to the user's personal preferences. In our evaluation framework, we are currently collecting data and experimenting with interactive clustering and feedback algorithms to understand in more detail how user opinions influence the parameters on which our system has to be trained in order to be user-adaptive.

9. References

Ardissono L., Gena C., Torasso P., Bellifemmine F., Difino A., Negro B., "User Modelling and Recommendation Techniques for Personalized Electronic Program Guides", in L. Ardissono, A. Kobsa, M. T. Maybury (eds), Personalized Digital Television – Targeting Programs to Individual Viewers, vol. 6 of Human-Computer Interaction Series, Springer, chapter 1, p. 3-26, 2004.

Blanco-Fernandez Y., Pazos-Arias J. J., Gil-Solla A., Ramos-Cabrer M., Barragans-Martinez B., Lopez-Nores M., "A Multi-Agent Open Architecture for a TV Recommender System: A Case Study Using a Bayesian Strategy", ISMSE '04: Proceedings of the IEEE Sixth International Symposium on Multimedia Software Engineering, IEEE Computer Society, Washington, DC, USA, p. 178-185, 2004.

Blanco Fernández Y., Pazos Arias J. J., Gil Solla A., Ramos Cabrer M., López Nores M., Barragáns Martínez A. B., "AVATAR: A Multi-Agent TV Recommender System Using MHP Applications", IEEE International Conference on E-Technology, E-Commerce and E-Service (EEE), IEEE Computer Society Press, p. 660-665, March 2005.

Blei D. M., Ng A. Y., Jordan M. I., "Latent Dirichlet Allocation", Journal of Machine Learning Research, vol. 3, p. 993-1022, 2003.

Buczak A. L., Zimmerman J., Kurapati K., "Personalization: Improving Ease-of-Use, Trust, and Accuracy of a TV Show Recommender", Proceedings of the TV'02 Workshop on Personalization in TV, Malaga (Spain), 2002.

Cowie R., Douglas-Cowie E., Tsapatsoulis N., Votsis G., Kollias S., Fellenz W., Taylor J., "Emotion Recognition in Human-Computer Interaction", IEEE Signal Processing Magazine, vol. 18, p. 32-80, January 2001.

Gena C., "Designing TV Viewer Stereotypes for an Electronic Program Guide", Proc. UM2001 Workshop on Personalization in Future TV (TV01), Sonthofen, July 2001.

EN 300 468, Digital Video Broadcasting (DVB); Specification for Service Information (SI) in DVB Systems, Technical report, European Telecommunications Standards Institute, January 1997.

Herfet T., Kirste T., Schnaider M., "EMBASSI – Multimodal Assistance for Infotainment and Service Infrastructures", Computers and Graphics, vol. 25, n° 4, p. 581-592, August 2001.

Manning C. D., Schuetze H., Foundations of Statistical Natural Language Processing, MIT Press, 1999.

McSherry D., "A Generalised Approach to Similarity-Based Retrieval in Recommender Systems", Artificial Intelligence Review, vol. 18, n° 3-4, p. 309-341, 2002.

Nitschke J., Hellenschmidt M., "Design and Evaluation of Adaptive Assistance for the Selection of Movies", Proceedings of IMC 2003 "Assistance, Mobility, Applications", Rostock, June 2003.

Pedersen T., Patwardhan S., Michelizzi J., "WordNet::Similarity – Measuring the Relatedness of Concepts", Proceedings of the Nineteenth National Conference on Artificial Intelligence, p. 1024-1025, 2005.

Pigeau A., Raschia G., Gelgon M., Mouaddib N., Saint-Paul R., "A Fuzzy Linguistic Summarization Technique for TV Recommender Systems", The 12th IEEE International Conference on Fuzzy Systems, p. 743-748, 2003.

Satzger B., Endres M., Kießling W., "A Preference-Based Recommender System", 7th International Conference on Electronic Commerce and Web Technologies (EC-Web '06/DEXA 2006), Lecture Notes in Computer Science, Cracow (Poland), September 2006. ISSN: 0302-9743.

van Barneveld J., van Setten M., "Designing Usable Interfaces for TV Recommender Systems", in L. Ardissono, A. Kobsa, M. T. Maybury (eds), Personalized Digital Television – Targeting Programs to Individual Viewers, vol. 6 of Human-Computer Interaction Series, Springer, chapter 1, 2004.

Witten I. H., Frank E., Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn, Morgan Kaufmann, San Francisco, 2005.

Yager R. R., "Fuzzy Logic Methods in Recommender Systems", Fuzzy Sets and Systems, vol. 136, n° 2, p. 133-149, 2003.

Zhang T., Iyengar V. S., "Recommender Systems Using Linear Classifiers", Journal of Machine Learning Research, vol. 2, p. 313-334, 2002.