Towards a Customization of Rating Scales in Adaptive Systems

Federica Cena, Fabiana Vernero, and Cristina Gena

Dipartimento di Informatica, Università di Torino
Corso Svizzera 185, 10149 Torino, Italy
{cena, vernerof, gena}@di.unito.it

Abstract. In web-based adaptive systems, the same rating scales are usually provided to all users for expressing their preferences with respect to various items. However, a user experiment that we recently carried out showed that different users prefer different rating scales in the interface of an adaptive system, depending on the particular topic they are evaluating. Starting from this finding, we propose to allow users to choose the kind of rating scale they prefer. This approach raises various issues, the most important being how an adaptation algorithm can properly deal with values coming from heterogeneous rating scales. We conducted an experiment to investigate how users rate the same object on different rating scales. On the basis of our interpretation of the results, and as an example of one possible solution, we propose a three-phase normalization process for mapping preferences expressed with different rating scales onto a unique system representation.

1 Introduction

In modern adaptive web-based systems, users are active: they constantly interact with the system (e.g., expressing preferences regarding privacy configuration settings, friends, and interests). This active participation also extends to personalization: in an increasing number of systems, users are allowed to inspect and modify their user model or to express some kind of preference towards the presented items. For expressing their preferences, users are provided with some kind of rating scale, which is usually the same for all users. However, we found in an experiment on preference evolution, conducted in collaboration with the Prevolution research unit¹, that users hold, and maintain over time, different opinions about the best rating scale for performing either a certain task or a certain kind of task². No single rating scale can therefore be expected to satisfy the specific needs and preferences of all users. Given such heterogeneity in user preferences, offering all users the same rating scale could be a risky option: some users might find their experience with the system unsatisfactory and, in the extreme case, decide not to interact with it at all. We therefore argue that adaptive systems should either offer customizable rating scales (thus allowing users to choose their favourite one for performing the various tasks in the system) or adapt them automatically, in order to improve user satisfaction.

¹ The research described here was conducted in collaboration with activities carried out by the targeted research unit Prevolution at FBK-irst, funded by the Autonomous Province of Trento, Italy. We thank Anthony Jameson and Silvia Gabrielli, who worked with us on the design of the research project to which this study belongs.
² For the specific results of this study, see http://www.di.unito.it/~vernerof/experiment09.html.

This flexibility, however, might come at a high price, since ratings expressed by means of different rating scales must somehow be normalized before the system can use them in its adaptation algorithm. The mapping may be complicated by the fact that rating scales can influence the ratings users give because of their specific features; in particular, we consider their granularity (i.e., their level of precision) and their emotional connotation (i.e., the emotions that they evoke). As for granularity, notice, for example, that a rating of "3" on a scale from 1 to 3 does not necessarily correspond to the highest rating on another scale with more than 3 positions. As for emotional connotation, users who perceive some ratings on a given scale as rude may avoid using them. Moreover, users may attribute different meanings to the same rating made with the same rating scale, and they may exhibit idiosyncratic patterns of rating behaviour (e.g., some users may tend to give only positive ratings). Starting from these insights, we carried out a further experiment to investigate users' rating behaviour and, in particular, the relationships among their ratings on different rating scales.
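Formally, mapping a rating s from a scale with range [s_min, s_max] onto a scale with range [s'_min, s'_max] by strict mathematical proportion corresponds to the linear rescaling (the notation is ours, added for reference)

    s' = s'_min + ((s - s_min) / (s_max - s_min)) * (s'_max - s'_min)

under which, for instance, the top rating on one scale always corresponds to the top rating on any other. The experiment below examines how far actual user mappings deviate from this baseline.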

2 Experiment

The experiment was performed in the context of iCITY³, a social recommender system in the cultural events domain which integrates adaptivity principles with Web 2.0 social features. Participants were asked to express their preferences for five different topics, evaluating each one with each of the three rating scales provided. We selected thumbs, stars and sliders as rating scales, taking inspiration from interfaces often used in social websites and because they differ with respect to the features we wanted to consider (i.e., emotional connotation and granularity). Each rating scale conveys a different metaphor, which influences its emotional connotation: for the thumbs, the metaphor is related to human behaviour; for the stars, it relies heavily on cultural conventions (as with hotel ratings); for the sliders, it is technological (e.g., evoking measuring tools). Specific connotations were identified from the oral comments provided by users in our previous experiment: thumbs are "friendly" and young, but also "impolite" and too simple; stars are classical, familiar, cool; sliders are precise, cold, "detached" and boring. Regarding granularity, the thumbs are the coarsest scale, with only three possible ratings: negative, neutral/intermediate and positive. The stars provide a finer granularity, with six possible ratings (from zero to five stars) and no explicitly negative rating, the minimum being zero stars. The sliders provide the finest granularity, ranging from a minimum of zero to a maximum of ten.

Hypothesis. We hypothesized that users do not always follow strict mathematical proportion when they map their assessments onto various rating scales (i.e., they sometimes map their assessments in an unexpected manner), and that this may be due to the differing granularity and emotional connotations of the rating scales, as well as to users' partly idiosyncratic rating behaviours.

³ http://icity.di.unito.it/dsa-en

Experimental Design. We employed a within-subject design in which each participant used each of the three rating scales.

Subjects. We selected 16 participants, aged 19-55, from colleagues and friends, according to an availability sampling strategy.

Measures and Material. A series of fifteen web pages was prepared, each one presenting the topic to be evaluated and containing one of the rating scales. For each topic, three web pages were devised, one for each rating scale. The participants' ratings were recorded and stored by the system.

Experimental Task. Participants read written instructions autonomously. They were asked to express their preferences for five different topics (corresponding to the categories of events in the taxonomy of iCITY, i.e. Art, Cinema, Theater, Literature, Music) using the three rating scales. Each participant used every rating scale for every topic. The order of presentation of the topics was randomized for each participant, and, for a given topic, the order of presentation of the rating scales was also randomized.

Results. We found that 20% of the ratings expressed with the thumb rating scale were "thumbs down", 51% were "thumbs up", and the remaining 29% were in the intermediate position. As for the stars, 6% of the ratings were "0", 11% were "1", 18% were "2", 20% were "3", 20% were "4" and 24% were "5". As for the sliders, 1.5% of the ratings were "0", 6% were "1", 8% were "2", 8% were "3", 3% were "4", 14% were "5", 8% were "6", 12% were "7", 12% were "8", 9% were "9" and 15% were "10".

Regarding users' mappings of ratings among the different scales, we found the following. The "thumbs down" rating was mapped by participants to a 0 (29%), a 1 (50%) or a 2 (21%) on the star rating scale, and to a 0 (7%), a 1 (29.5%), a 2 (21.4%), a 3 (21.4%) or a 5 (21.4%) on the slider rating scale. The intermediate rating on the thumb rating scale was mapped to 2 (45%), 3 (45%) or 4 (10%) on the star rating scale, and to 2 (10%), 3 (10%), 4 (10%), 5 (30%), 6 (15%), 7 (10%), 8 (10%) or 9 (5%) on the slider rating scale.

We then compared users' mappings among the different scales with the corresponding mathematical mapping. Considering the ratings expressed on all three rating scales, 60% of the ratings were examples of mathematical proportion: in particular, 24% were mapped with perfect mathematical proportion (e.g. a rating of 10 on the slider rating scale mapped to a rating of 5 on the star rating scale and to a "thumbs up" on the thumb rating scale), while 36% can be considered good approximations of mathematical proportion (e.g. a rating of 9 on the slider rating scale mapped to a rating of 5 on the star rating scale). It is worth noting that the remaining 40% of the ratings depart considerably from mathematical proportion, showing that mathematical proportion is not sufficient for a mapping that captures the actual meaning of user ratings. To give a better idea of what we mean, two different users made quite opposite rating choices: the first assigned a certain topic a rating of "thumbs down" on the thumb rating scale, of 2 on the star rating scale and of 5 on the slider rating scale; the second, by contrast, mapped a rating of "thumbs up" on the thumb rating scale to 2 on the star rating scale and 5 on the slider rating scale. Thus, the same two ratings on the two finest scales have opposite meanings for these two users.
This example confirms our hypothesis that different users may attribute different meanings to the same rating made with a certain rating scale. With mathematical proportion alone, we would not be able to map these ratings according to the meaning that they really have for users. We also note that only one user showed perfect mathematical proportionality with respect to the lowest ratings, assigning the minimum rating on all three scales; none of the other users ever used the lowest rating on the sliders. This rating is apparently perceived as more negative than the mathematically equivalent lowest ratings on the other scales. Although conducted with a small number of participants, this experiment confirmed our idea that mathematical proportion alone is not sufficient to translate ratings from one scale to another, and consequently to an internal representation.
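To make the comparison with strict proportion concrete, the following Python sketch classifies a triple of ratings in the way just described; the numeric encoding of the thumb positions and the tolerance threshold are illustrative assumptions, not parameters fixed by the study:

# Sketch: classify a (thumbs, stars, slider) rating triple by how closely
# it follows strict mathematical proportion.

SCALES = {
    "thumbs": (0, 2),    # 0 = thumbs down, 1 = intermediate, 2 = thumbs up
    "stars": (0, 5),
    "sliders": (0, 10),
}

def fraction(scale, rating):
    # Rescale a rating to [0, 1] by strict proportion.
    lo, hi = SCALES[scale]
    return (rating - lo) / (hi - lo)

def proportion_class(thumbs, stars, slider, tol=0.1):
    # 'perfect' if all three rescaled ratings coincide, 'approximate' if
    # they lie within tol of one another, 'deviating' otherwise.
    fs = [fraction("thumbs", thumbs),
          fraction("stars", stars),
          fraction("sliders", slider)]
    spread = max(fs) - min(fs)
    if spread == 0:
        return "perfect"
    return "approximate" if spread <= tol else "deviating"

# The two users discussed above gave (0, 2, 5) and (2, 2, 5), respectively:
print(proportion_class(0, 2, 5))   # -> 'deviating'
print(proportion_class(2, 2, 5))   # -> 'deviating'

With these encodings, the perfect-proportion example from the text (thumbs up, 5 stars, slider 10) yields identical rescaled values of 1.0, while both users' triples deviate by at least 0.5, despite agreeing on two of the three scales.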

3 Discussion of a possible normalization process

If ratings could be exactly mapped from one scale to another by means of a simple mathematical proportion, they could also be mapped effortlessly to a uniform representation corresponding to the user model representation of a given system. Unfortunately, our experiment shows that other factors have to be taken into account, such as emotional connotations and user features. We discuss here an algorithm for normalizing user ratings to a uniform representation which considers emotional connotations and user features. It comprises the following steps: i) mathematical normalization; ii) connotation-based normalization; iii) user model-based adjustment. As explained in the following, the first two steps are mutually exclusive. We again refer to iCITY as a use case, giving example rules for the mapping of user ratings to the internal representation of such ratings, where user interests are coded on a scale in the range [0, 1].

Mathematical normalization. Mathematical proportion can still be used as a basis for converting ratings from the input rating scale to the uniform internal representation when the observed mappings from one scale to another do not deviate much from strict proportion. According to our data, proportionality can be used for i) extreme positive ratings and ii) intermediate ratings. As for the first case, a rating of 5 on the star rating scale is usually mapped to a 10 on the sliders (63%). Thus, such ratings can be mapped to a normalized rating of 1 in the internal representation [0, 1], according to strict proportion (rule 1):

if ((user rating = 5 and rating scale = 'stars') or
    (user rating = 10 and rating scale = 'sliders'))
then { normalized rating = 1; }

As for the second case, intermediate ratings on the three rating scales (the intermediate position for the thumbs, "2" or "3" for the stars, "5" for the sliders) are mapped according to strict proportion in 59% of the cases. Thus, they can be mapped to a normalized rating of 0.5 in the internal representation (rule 2).

Connotation-based normalization. The emotional connotation of the input rating scale should be considered for ratings whose mappings deviate from mathematical proportionality while still showing recognizable relationships among the scales. In our experiment, we observed some tendencies relating to extremely low ratings, consistent with the idea that such ratings are commonly perceived as less negative when given with a thumb or star scale than when given with a slider scale. First, the lowest rating on the thumb rating scale is often mapped (71%) to ratings higher than strict mathematical proportion on the star and slider rating scales. Second, the lowest rating on the star rating scale tends to be mapped to ratings higher than strict mathematical proportion on the slider rating scale (75%). Third, comparing how the lowest ratings on the thumb and star rating scales are mapped onto the slider scale, the lowest thumb rating tends to be mapped to higher ratings than the lowest star rating. Thus, the lowest ratings on the thumb and star rating scales will be mapped to ratings slightly higher than 0 in the internal representation (rule 3). An example rule is:

if (user rating = 0 and rating scale = 'thumbs')
then { normalized rating = 0.2; }
else if (user rating = 0 and rating scale = 'stars')
then { normalized rating = 0.1; }
else if (user rating = 0 and rating scale = 'sliders')
then { normalized rating = 0; }
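The first two phases can be summarized in a small Python sketch. The fallback to strict proportion for ratings not covered by rules 1-3 is our assumption, since the rules above are specified only for the extreme positive, intermediate and lowest ratings:

# Sketch of phases 1 and 2: connotation-based normalization for the lowest
# ratings (rule 3), mathematical normalization otherwise (rules 1 and 2).

SCALE_RANGES = {"thumbs": (0, 2), "stars": (0, 5), "sliders": (0, 10)}

def normalize(rating, scale):
    # Rule 3: the lowest thumb and star ratings are perceived as less
    # negative than the lowest slider rating, so they map slightly above 0.
    if rating == 0:
        return {"thumbs": 0.2, "stars": 0.1, "sliders": 0.0}[scale]
    # Rule 1: extreme positive ratings map to 1 by strict proportion.
    if (scale == "stars" and rating == 5) or \
       (scale == "sliders" and rating == 10):
        return 1.0
    # Rule 2: intermediate ratings map to 0.5.
    if (scale == "thumbs" and rating == 1) or \
       (scale == "stars" and rating in (2, 3)) or \
       (scale == "sliders" and rating == 5):
        return 0.5
    # Fallback (our assumption): strict proportion for all other ratings.
    lo, hi = SCALE_RANGES[scale]
    return (rating - lo) / (hi - lo)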

User model-based adjustment. Normalized ratings can be further adjusted by considering some features of the "user as rater", which can be inferred from her behaviour in the recommender system. In our experiment, we observed that users differ in their level of accuracy in rating and in their general attitude toward the evaluated topics (critical vs. enthusiastic users). Moreover, we assumed that users may also differ in their reliability, that is, in their tendency to give ratings that actually correspond to their opinions. Accordingly, we can formulate the following heuristics. First, if the user is always very precise in rating (accuracy), we can take her rating as it is; if she is not very precise, her rating should be adjusted, for example by merging it with the user interest inferred by the system from her behaviour, if available (rule 4). Second, if the user is reliable, her rating can be taken as it is; otherwise, it should be adjusted, for example by merging it with the inferred user interest, as suggested before (rule 5). Third, if the user is critical, her rating should be increased a little, while if she is enthusiastic, it should be slightly decreased (rule 6). Here is an example rule for the adjustment of normalized ratings from the previous steps, for a user who is both accurate and reliable:

if (user accurate = true and user reliable = true)
then { if (user attitude = 'critical')
       then { normalized rating = normalized rating + 0.05; }
       else if (user attitude = 'enthusiastic')
       then { normalized rating = normalized rating - 0.05; } }
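A matching Python sketch for this third phase follows; the simple 50/50 merge with the inferred interest and the clamping to [0, 1] are illustrative choices, not prescribed by the heuristics above:

# Sketch of phase 3: user model-based adjustment (rules 4-6). `user` is a
# dict with boolean 'accurate' and 'reliable' entries and an 'attitude'
# among 'critical', 'enthusiastic' and 'neutral' (our encoding).

def adjust(normalized, user, inferred_interest=None):
    # Rules 4 and 5: for inaccurate or unreliable users, merge the rating
    # with the interest inferred from their behaviour, when available.
    if not (user["accurate"] and user["reliable"]) and inferred_interest is not None:
        normalized = (normalized + inferred_interest) / 2
    # Rule 6: compensate for the rater's general attitude.
    if user["attitude"] == "critical":
        normalized = min(1.0, normalized + 0.05)
    elif user["attitude"] == "enthusiastic":
        normalized = max(0.0, normalized - 0.05)
    return normalized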

Let us look at an example to clarify these concepts. Consider a user (very precise, reliable, and enthusiastic) who rates an iCITY item with "2" on a star rating scale, and another one (very precise, reliable, but critical) who rates the same item with "0" on a thumb rating scale. For the first user, we perform a mathematical normalization, which transforms the rating of 2 on the star rating scale into 0.5 (rule 2). For the second user, we instead perform the connotation-based normalization, and the rating of 0 is normalized to 0.2 (rule 3). Finally, specific user features must be considered: since the first user is considered enthusiastic, the system lowers her rating to 0.45 (rule 6); by contrast, the rating of the second user is slightly increased to 0.25, since this user is considered very critical (rule 6).
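Traced through the sketches above (with the user encodings assumed there), the example reads as follows:

first = {"accurate": True, "reliable": True, "attitude": "enthusiastic"}
second = {"accurate": True, "reliable": True, "attitude": "critical"}

r1 = adjust(normalize(2, "stars"), first)    # 0.5 by rule 2, then 0.45 by rule 6
r2 = adjust(normalize(0, "thumbs"), second)  # 0.2 by rule 3, then 0.25 by rule 6
print(r1, r2)                                # -> 0.45 0.25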

4 Conclusion and Related work

With this paper, we have taken a first step towards the customization of rating scales in adaptive systems. The main contributions of this paper are i) raising the question of when it is desirable to use different rating scales for different people, ii) describing the normalization problem that needs to be solved if this approach is taken, iii) discussing a possible solution based on a normalization that is not strictly mathematical, and iv) providing some example rules of thumb based on the experiment we performed. The benefit is that designers of adaptive systems will now be aware of this problem and can take it into account in subsequent research.

Our results should be regarded as initial heuristics that need to be investigated further. We are working on an experimental evaluation of the proposed approach. Furthermore, we will have to consider that inferring whether a user is critical, enthusiastic or reliable is not simple. As a consequence, we may face a cold-start problem, requiring users to interact with the system for quite a long time before the user model-based part of the discussed normalization can be applied. Other approaches to the mapping of ratings should also be considered, such as probabilistic approaches. Machine learning may also prove useful, since it could allow a system to learn a model for normalizing ratings automatically from training data. Notice that, in this case, no explicit heuristics about emotional connotations and user attitudes in rating might need to be applied, though it would still be important to have some understanding of the learned models, so as to be able to recognize the conditions under which they can be applied.

Considering similar or related research, another paper focusing on rating scales is [5], which defined the main elements that determine the design of rating scales aimed at collecting explicit user feedback. Its authors also found that user preferences for scales were in poor agreement, in accordance with our findings. The study of rating scales can be framed within the larger domain of option-setting interfaces: since option setting is often considered a boring and time-consuming task [3, 4], it is particularly desirable that rating scales actually match user preferences. Notice that a first translation is needed whenever users have to express their preferences by means of some rating scale: [2] pointed out that the granularity of true user preferences, that is, the number of levels among which users wish to distinguish, may differ from the range and granularity of the available rating scales. [1] explicitly investigated how different rating scales affect user ratings, comparing a binary scale providing only thumbs up or down, a no-zero scale ranging from -3 to +3, and a 0.5-to-5 star scale allowing half-star increments with the original MovieLens five-position rating scale. They found that ratings on all three scales correlate strongly with the original ratings on the five-position scale; however, users tended to give higher mean ratings on the binary and no-zero scales.

References

1. Dan Cosley, Shyong K. Lam, Istvan Albert, Joseph A. Konstan, and John Riedl. Is seeing believing? How recommender system interfaces affect users' opinions. In Gilbert Cockton and Panu Korhonen, editors, Proceedings of CHI 2003, pages 585–592. ACM, 2003.
2. Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5–53, 2004.
3. S. R. Page, T. J. Johnsgard, U. Albert, and C. D. Allen. User customization of a word processor. In Proceedings of CHI 1996, pages 340–346, 1996.
4. S. Trewin. Configuration agents, control and privacy. In Proceedings of the ACM Conference on Universal Usability, pages 9–16, 2000.
5. J. van Barneveld and M. van Setten. Designing usable interfaces for TV recommender systems. In L. Ardissono, A. Kobsa, and M. Maybury, editors, Personalized Digital Television: Targeting Programs to Individual Viewers. Kluwer Academic Publishers, 2004.
