Proceedings of the 39th Hawaii International Conference on System Sciences - 2006

Does Familiarity Breed Content? Taking Account of Familiarity with a Topic in Personalizing Information Retrieval

Gheorghe Muresan, Michael Cole, Catherine L. Smith, Lu Liu and Nicholas J. Belkin
School of Communication, Information and Library Studies, Rutgers University
4 Huntington St., New Brunswick, NJ 08901, USA
{muresan, mcole, csmith, luliu, belkin}@scils.rutgers.edu

Abstract

We report on an evaluation of the effectiveness of considering a user's familiarity with a topic in improving information retrieval performance. This approach to personalization is based on previous results indicating differences in users' search behavior and judgments according to their familiarity with the topic explored, and on research on using implicit sources of evidence to determine the user's context and preferences. Our attempt was to relate a topic-dependent concept and measure, familiarity with the topic, to topic-independent measures of documents such as readability, concreteness / abstractness, and specificity / generality. Contrary to our expectations, a user's familiarity with a topic has no effect on the utility of readability or concrete/abstract scoring. We are encouraged, however, to find that high readability had a positive effect on search results, regardless of a user's familiarity with a topic.

1. Introduction

Two key issues in current information retrieval (IR) research are how to take account of the searcher's profile and context in order to personalize the list of documents estimated to be relevant, and which aspects of context affect relevance judgments [7]. There has been substantial research indicating that a person's familiarity with the topic of a search influences the person's information behavior in a variety of ways, including, among others, the criteria applied when making relevance judgments [8, 15]. The research reported in this paper was motivated by our interest in investigating different ideas about the relationships between a person's degree of familiarity with a topic and the relevance of documents with particular identifiable characteristics.

We found the experimental design of the TREC HARD track¹ [1] quite appropriate for our research interests. For the past two years, participants in the HARD (High Accuracy Retrieval from Documents) track have investigated, inter alia, how to take account of various aspects of an information seeker's context and preferences in order to improve IR performance. This has been done through a specific experimental procedure, as follows. People called assessors were asked to construct TREC-style search topics of personal interest to them². When constructing these topics, the assessors were asked to specify particular categories of metadata associated with the topic and themselves, which are indicative of various aspects of the context within which that topic was situated. The categories and values of topic metadata that were specified for the HARD track in 2004 were:
- desired genre of documents, with values: news-reports, opinion-editorial, other, any;
- desired geographic coverage, with values: US, other, any;
- assessor's familiarity with the topic, with values: high, low;
- desired granularity of response, with values: passage, document.

Groups participating in the HARD track were initially given the set of 50 topics without metadata and asked to conduct searches without any knowledge of the searchers' preferences. After submitting these baseline runs, the participants were given the metadata corresponding to each topic, reflecting the searchers' preferences. The participants were also allowed to submit a clarification form for each topic, to which the topic's assessor was to respond. While the metadata simulated aspects of the user profile that could, in principle, be obtained via implicit sources of evidence, the clarification forms were a means for explicit elicitation of evidence. Based on the new information related to the searchers' preferences, the participants were expected to generate search results that better fit the user profiles and context, and which consequently would produce better search effectiveness than the baseline.

The assessors had three options when judging relevance: not relevant, soft relevant, and hard relevant. Not relevant has the standard definition (not on topic); soft relevant means that the document is on topic, but does not satisfy one or more of the metadata categories (the assessors had to specify which ones); hard relevant means on topic and satisfying all of the metadata categories. Success in the HARD task was measured by an improvement in hard effectiveness compared to the baseline, when taking the user profiles into account.

Note that in order to tune their personalization techniques, the participants were given a set of 20 training topics with associated metadata, a list of 100 documents which had been retrieved for each topic, and relevance judgments associated with these documents. In principle, the learning process was envisaged to work as follows: if a document was judged hard relevant for a topic, it could be considered a positive exemplar of documents that match the metadata associated with the topic; if a document was judged soft relevant, it could be viewed as a negative exemplar of the specified metadata; no inference related to metadata could be made when a document was judged non-relevant. Unfortunately, the training data was highly skewed towards certain metadata values (e.g., there were 241 positive examples of news-reports, and only 2 of opinion-editorials), and negative exemplars were extremely scarce (e.g., only two documents were judged as inappropriate for the searcher's level of familiarity with the topic). Therefore, no useful training was reported by any of the participants. We ourselves were unable to test our hypotheses or to tune our formulae on the training data. The results reported here are based entirely on 45 of the 50 test topics³ and on our official TREC submissions.

Our general approach to the issue of taking account of context is one of personalization, based upon knowledge of the user and the user's context which could, at least in principle, be gained through implicit sources of evidence, such as past or present searching behavior [9]. Familiarity with a topic is one characteristic of a person's context which clearly can be inferred in this way. In this paper, we report on research which takes advantage of the TREC HARD track context in order to test the usefulness of various methods for taking account of familiarity with a topic in order to improve retrieval results. Therefore, other metadata such as genre, geography and granularity will be ignored. Since our interest is in implicit sources of evidence, we did not use clarification forms.

We understood that there would be, in general, two ways in which to take account of the metadata. One would be to modify the initial query from the (presumed) searcher, before re-submitting it to the search engine; the other would be to search with the initial query, and then to modify (i.e. re-rank) the results before showing them to the searcher. While we used both strategies in our TREC work, the results reported here are based on the latter approach.

¹ TREC experiments (http://trec.nist.gov) are run as tracks, in which groups of participating researchers address a single area of research interest, defined by its own set of objectives and research questions. Very generally, all participants in a track receive a dataset of the same corpus and test topics (topics are information requests). Participants then produce a single baseline retrieval run, attempting to optimize the effectiveness of their system over the dataset; these runs are submitted to the event organizers, the National Institute of Standards and Technology (NIST). Following the production of the baseline run, participants receive some form of additional information with which to improve the effectiveness of their run. Participants then produce experimental retrieval runs, attempting to optimize the effectiveness of their experimental system over the enhanced dataset. The results of these runs are also returned to NIST, where a sample of the retrieved documents is assessed for relevance. Relevance assessments are then used to measure the effectiveness of each experimental run relative to its baseline, and relative to all other runs in the track [3, 14, 16].
² Because the assessors both proposed the topics and assessed the relevance of the documents returned by the participating sites to a set of queries, assessors are also called searchers and IR system users in this paper.
³ For five of the topics no relevant documents were retrieved by any of the participating groups, so NIST removed these topics from the test.

2. Familiarity, its implications, and document characteristics

The basic problem with taking account of familiarity with a topic in modifying search results is that, unlike metadata such as genre, which specifies a particular type of document to be retrieved, level of familiarity has no such direct relationship with documents. Therefore, to take account of familiarity, it is necessary to hypothesize about what a particular level of familiarity implies about a user's desired search results. We generated three different hypotheses about general properties of documents that may plausibly be correlated with a user's expressed familiarity with a topic. A discussion of the rationale and implementation of each hypothesis follows.

2.1. Familiarity and readability

Our first speculation with respect to familiarity is that people who are familiar with a topic will want to see documents which are detailed and terminologically specific, and people who are unfamiliar with a topic will want to see general and relatively simple documents. This was operationalized in two ways. One was that readability, as measured by the Flesch Reading Ease Score⁴, would approximate terminological specificity or generality, with documents of low readability being suitable for people with high familiarity with a topic, and documents with high readability being suitable for people with low familiarity with a topic. Terminological specificity can be related to the Flesch Reading Ease Score by focusing on one component of the score, the mean number of syllables per word in the document. There is good evidence that the number of syllables in a word is strongly anti-correlated with the word's frequency in English, which we take as a measure of terminological specificity. Our second operationalization was to use this value alone to characterize a document, rather than the readability score as a whole. These arguments lead us to the hypotheses:

H1: Assessors with low familiarity with a topic are more likely to find that documents on topic that have high readability are relevant, and those with low readability are not relevant; assessors with high familiarity, vice versa.

H2: Assessors with low familiarity with a topic are more likely to find that documents on topic with a low average number of syllables per word are relevant, and those with a high average number of syllables per word are not relevant; assessors with high familiarity, vice versa.

⁴ Flesch is a standard readability score based on the average number of syllables per word and words per sentence. It is widely used in U.S. high schools to assess the appropriateness of texts assigned for reading, and for automatically grading the readability of essays. See http://www.foulger.info/davis/papers/SimplifiedFleschReadingEaseFormula.htm. We computed Flesch scores with Perl scripts available at http://aspn.activestate.com/ASPN/CodeDoc/Lingua-EN-Fathom/Fathom.html#SYNOPSIS.
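To make the two operationalizations concrete, the following is a minimal Python sketch of both measures. The Flesch Reading Ease formula is standard; the tokenizer and the vowel-group syllable counter are our own rough simplifications (the paper itself used the Lingua::EN::Fathom Perl module, see the footnote above), so the numbers will only approximate those used in the experiments.

```python
import re

def count_syllables(word):
    """Very rough heuristic: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability_features(text):
    """Return Flesch Reading Ease and mean syllables per word for a text."""
    words = re.findall(r"[A-Za-z']+", text)
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return {"flesch": 0.0, "syllables_per_word": 0.0}
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    # Standard Flesch Reading Ease formula; higher scores mean easier text.
    flesch = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    return {"flesch": flesch, "syllables_per_word": syllables_per_word}

if __name__ == "__main__":
    print(readability_features("The cat sat on the mat. It was warm."))
    print(readability_features("Epistemological considerations predominate in contemporary historiography."))
```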

2.2. Familiarity and abstractness or concreteness

The second idea concerning familiarity is based on research indicating differences in the processing of concrete and abstract words in texts [2, 4, 6, 12, 13]. One research finding is that people are more easily able to provide distinct contexts for concrete words as compared to abstract words, and that comprehension of concrete words takes place more quickly. This led us to hypothesize that people who are unfamiliar with a topic will have difficulty understanding documents that treat a topic abstractly, and will prefer documents that treat the topic with concrete terminology; hence, people with low familiarity with a topic will prefer documents which have a high proportion of concrete terms. Similarly, we hypothesize that people with high familiarity with a topic will prefer documents that have a high proportion of abstract terms. Martindale's Regressive Imagery Dictionary (RID) was used as our model of concrete and abstract expression [11].

The RID is a taxonomy of words and word stems arranged in accordance with a psychological theory of consciousness and expression. The theory entails a definition of states of consciousness that lie along a continuum, from cognition as regressive, analogical, and concrete (Primary) to cognition as analytical, logical, and abstract (Secondary). For our purposes, we utilized only the sub-segments of the RID related to expression characterized specifically as concrete and abstract. The concrete terms are in the part of the dictionary related to "deep regression," and specifically connote spatial references, such as at, where, over, out, and long. Abstraction is in the part of the dictionary termed "secondary process," which is "an inverse indicator of regression"; the list includes terms such as know, may, thought, and why [10].

Our application of the RID was relatively straightforward. Texts were analyzed for term frequency using the known concrete / abstract word lists. We investigated various formulae for generating concreteness / abstractness scores based on frequencies of concrete / abstract terms. We adopted the following formulae:

concreteness = log((1 + nb_of_concrete_terms) / (1 + nb_of_abstract_terms))

and

abstractness = log((1 + nb_of_abstract_terms) / (1 + nb_of_concrete_terms)),

which have the advantages that (i) the two scores are perfectly (negatively) correlated, since abstractness = −concreteness; (ii) scores are positive or negative, according to whether a document is measured as more abstract or more concrete; and (iii) a value of 0 indicates a neutral document. This line of reasoning yields our third hypothesis:

H3: Assessors with low familiarity with a topic are more likely to find that documents on topic which have a high concreteness score are relevant, and those with a high abstractness score are not relevant; assessors with high familiarity, vice versa.
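A minimal sketch of the scoring formulae above. The two word lists here are tiny illustrative stand-ins for the RID concrete and abstract sub-dictionaries, not the actual RID content:

```python
import math
import re

# Tiny illustrative stand-ins for the RID concrete / abstract word lists.
CONCRETE_WORDS = {"at", "where", "over", "out", "long"}
ABSTRACT_WORDS = {"know", "may", "thought", "why"}

def concreteness_scores(text):
    """Return (concreteness, abstractness) as log ratios of term counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    nb_concrete = sum(1 for t in tokens if t in CONCRETE_WORDS)
    nb_abstract = sum(1 for t in tokens if t in ABSTRACT_WORDS)
    concreteness = math.log((1 + nb_concrete) / (1 + nb_abstract))
    abstractness = -concreteness  # log of the inverse ratio
    return concreteness, abstractness

# Positive concreteness: the text leans on the concrete (spatial) vocabulary.
print(concreteness_scores("Go out over the long road and wait where it ends."))
```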

3. Evaluation methodology

3.1. Computing scores

In order to conduct the searches, we used the Lemur IR toolkit⁵ with BM25 weights to generate the baseline results, using both title and description fields to generate the queries for each topic. For re-ranking the search results based on familiarity, we used a simple normalized weighted average of the baseline score for a document and "metadata scores" determined by, respectively, the document's readability, terminological specificity, and abstractness/concreteness:

run_score = baseline_score + Σ_i w_i × metadata_score_i,

with baseline scores and metadata scores normalized:

normalized_score = (score − min_score) / (max_score − min_score).

⁵ http://www-2.cs.cmu.edu/~lemur/

Our focus was on investigating research hypotheses rather than on just improving effectiveness, so each set of metadata scores was separately combined with the baseline. In all cases, w_i was set to values lower than 1.0 (0.05, 0.1, 0.2, 0.4), in order to give priority to topicality over context. For H1 and H2, we computed the Flesch Reading Ease Score and the mean number of syllables per word for each document in the baseline. For H3 we computed abstractness and concreteness scores as described in section 2.2. We normalized each set of metadata scores and combined it with the baseline scores, according to the formulae described above.
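A sketch of the combination and re-ranking step implied by the formulae above; the function and variable names are ours, and the scores in the usage example are invented:

```python
def minmax(scores):
    """Min-max normalize a list of scores to the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def rerank(doc_ids, baseline_scores, metadata_scores, w=0.1):
    """Combine normalized baseline and metadata scores, then re-rank documents."""
    base = minmax(baseline_scores)
    meta = minmax(metadata_scores)
    combined = [b + w * m for b, m in zip(base, meta)]
    order = sorted(range(len(doc_ids)), key=lambda i: combined[i], reverse=True)
    return [(doc_ids[i], combined[i]) for i in order]

# Example: boost more readable documents with a small weight, keeping topicality dominant.
docs = ["d1", "d2", "d3"]
print(rerank(docs, baseline_scores=[12.1, 11.8, 9.5], metadata_scores=[40.0, 75.0, 60.0], w=0.1))
```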

3.2. Evaluation procedure

Our approach to personalizing search results has two steps:
1. Compute readability and concreteness/abstractness scores for each document in the baseline. The underlying assumption of our hypotheses is that these scores can predict whether a document is suitable for someone with high, or with low, familiarity with the search topic.
2. Combine these metadata scores with the baseline scores in order to re-rank the baseline and produce a list of hits more suitable for someone with a known level of familiarity with the topic.

Therefore, in order to evaluate our hypotheses, a two-stage approach is logical:
1. Evaluate the quality of the metadata scores. This can be done by applying a t-test (or its non-parametric equivalent if the data is not normally distributed) to verify that documents that are hard-relevant tend to have better scores than documents that are soft-relevant. Alternatively, the problem can be viewed as classification into relevant and non-relevant documents, so Receiver Operating Characteristic (ROC) curves can be used to verify how well the metadata scores can predict the metadata category for each document. The metadata category for documents can be deduced from the relevance judgments. If an assessor judges a document as hard relevant, then the document is appropriate for the searcher's familiarity level and can be labeled as "high familiarity" or "low familiarity", according to the searcher's declared familiarity. If a document is judged as soft relevant, with familiarity as the reason for the mismatch, then the document's label will be the opposite of the searcher's level of familiarity. For documents judged non-relevant, no inference can be made regarding the familiarity metadata (a sketch of this labeling rule is given below).
2. Combine each of the metadata scores separately with the baseline scores and check whether the new scores provide better rankings. The quality of a ranking can be estimated based on the assessors' relevance judgments, (i) by using the standard trec_eval program, provided by NIST for TREC experiments⁶, and looking at various measures of retrieval effectiveness, in particular R-precision, which pays particular attention to the documents at the top of the ranked list; or (ii) by using a measure somewhat more sensitive to effects on individual topics, such as the difference in the sum of the ranks of the relevant documents in the original list and in the re-ranked list, or, in order to give more prominence to documents at the top of the result list, the mean reciprocal rank (MRR).

It is important to separate the two stages of evaluation, in order to distinguish between (1) the quality of the metadata scores and (2) the formulae for combining baseline and metadata scores in order to obtain final rankings. Of the two, the former test is obviously more important: if the metadata scores are not good, then the latter test is rather pointless; if the metadata scores are good, then further work can be conducted to optimize the weights in the score-combination formula.
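A sketch of the familiarity-label inference used in the first evaluation stage, together with the reciprocal-rank measure mentioned in the second; the judgment encoding and field names are our own assumptions rather than the official qrel format:

```python
def infer_familiarity_label(judgment, searcher_familiarity, failed_metadata):
    """Infer a document's familiarity label ('high' or 'low') from a judgment.

    judgment: 'hard', 'soft' or 'nonrelevant'
    searcher_familiarity: the assessor's declared familiarity, 'high' or 'low'
    failed_metadata: set of metadata categories the assessor said the document failed
    """
    if judgment == "hard":
        # On topic and satisfying all metadata: matches the searcher's level.
        return searcher_familiarity
    if judgment == "soft" and "familiarity" in failed_metadata:
        # On topic but rejected on familiarity grounds: the opposite level.
        return "low" if searcher_familiarity == "high" else "high"
    # Non-relevant documents, or soft ones failing other metadata: no inference.
    return None

def reciprocal_rank(ranked_doc_ids, relevant_ids):
    """Reciprocal rank of the first relevant hit; MRR averages this over topics."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```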

3.3. ROC curves

As we are interested in evaluating the capability of the metadata scores to capture and predict a binary metadata value for each document (hard relevant or not), ROC curves are an excellent tool. They are more widely used in classification experiments than in general IR evaluation, so a brief explanation of their use follows⁷.

Imagine a simple burglar alarm made from several empty cans on top of each other, placed at the entrance door. If a burglar tries to enter, the cans are expected to fall and raise the alarm. The more cans are used, the more difficult it is to open the door without raising the alarm; in other words, the sensitivity of the alarm increases. On the other hand, what also increases is the probability that the alarm will be set off by a truck passing in front of the door; in other words, the specificity of the alarm decreases.

⁶ http://trec.nist.gov/trec_eval/
⁷ See http://www.cmh.edu/stats/ask/roc.asp or http://gim.unmc.edu/dxtests/ for additional background.


Let us say that one such alarm catches 4 out of 10 illegal accesses; its detection rate is 40%. Also, let us say that, for each 10 trucks that pass in front of the door, two set off the alarm; the false alarm rate is 20%. Adding more cans will probably increase the detection rate, but also the false alarm rate; removing some cans will probably have the opposite effect. Alternative terminology, used especially in the field of diagnostic testing, consists of the sensitivity of a test (the ratio of positive cases that test positive; it is the same as the detection rate) and its specificity (the ratio of negative cases that test negative; it is the complement of the false alarm rate).

The ROC curve is obtained by plotting the detection rate of a test (sensitivity) versus the false alarm rate (1 − specificity), as some decision threshold is changed. The closer the curve follows the left-hand border (i.e. it has a good detection rate before starting to throw false alarms, or the false alarm rate is negligible for acceptable detection rates) and then the top border of the ROC space (i.e. its detection rate is excellent for acceptable false alarm rates), the more accurate the test. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test. A curve that is above the diagonal indicates a test that is better than random over the range considered.

Compared to t-tests, ROC curves have the advantage of indicating the quality of the prediction at different thresholds of separation between the two outcomes. It is tempting to use the area under the ROC curve as a one-value summary of the curve, especially when tests are compared. A value close to 1 indicates an accurate test, while a value of 0.5 indicates a test that is no better than random guessing. However, when the curve is irregular, and especially when it crosses the diagonal, this value is hard to interpret. The curve itself is a better indication of how well a test performs.

3.4. ROC curves for evaluating classification performance

ROC curves are well suited to evaluating binary classification procedures where the instances are ranked by a scale variable. In this case, each object has been assigned a score that predicts the class membership of the object. If a score threshold is systematically decreased from high values, close to the maximum score (high specificity), to low values, close to the minimum score (high sensitivity), the classification effectiveness of the scored property can be observed as the number of objects accepted into the class over the range of scores. Measures that are ineffective classifiers will approach the performance of a random classifier. Assuming that the experimenter knows the true class of each object, the detection rate and false positive rate can be calculated for each threshold value and plotted in an ROC curve. The curve indicates how well the scores predict the class of each object.

If one knew a particular measure was an effective classifier, one could test the effectiveness of the procedures used to generate the scores using the ROC curve. For example, in order to test H1, we use the readability (Flesch) scores computed for each document, and evaluate whether these scores are effective predictors of assessor judgments of document relevance. If the ROC curve indicates that the prediction is better than a random guess, indicated by the diagonal line, we can deem the approach successful and expect that ranking documents using a combination of the readability score and the baseline score will improve retrieval effectiveness.
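A sketch of the threshold sweep behind these ROC curves, written directly from the definitions above (not from the authors' scripts); it assumes both positive and negative examples are present:

```python
def roc_points(scores, labels):
    """Compute (false alarm rate, detection rate) pairs by sweeping a threshold.

    scores: predictor values (e.g., Flesch readability) per document
    labels: True for positive cases (e.g., hard relevant), False otherwise
    """
    # Sort documents by decreasing score; lower the threshold one document at a time.
    # Ties are broken arbitrarily, which is acceptable for a sketch.
    ranked = sorted(zip(scores, labels), key=lambda x: x[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))  # (1 - specificity, sensitivity)
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule; 0.5 = random guessing."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Example: do higher scores tend to go with positive labels?
pts = roc_points([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [True, True, False, True, False, False])
print(pts, auc(pts))
```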

4. Initial results

4.1. Document readability and concrete/abstract word frequency

In order to verify the capacity of some metadata score (such as readability, in the form of Flesch Reading Ease) to predict some judgment outcome (such as document relevance or familiarity level), the documents are ranked based on the metadata scores, a threshold is chosen to separate high scores from low scores, and this separation is compared with the assessors' judgments. If the assessors judge a document with a high metadata score as relevant, the prediction is counted as a true positive; if the assessors judge the document as not relevant, then the highly scored document is counted as a false positive. As the separation threshold is moved from the minimum to the maximum value, the ROC curve plots the sensitivity of the test (the detection rate) vs. 1 − specificity (the false alarm rate).

Table 1: Relevance-labeled document pools

            HIGH FAMILIARITY   LOW FAMILIARITY
POSITIVE    1,891              1,633
NEGATIVE    151                84

Using the assessor’s familiarity with a topic, the test topics from the HARD2004 collection were partitioned into high familiarity topics (N=19) and low familiarity topics (N=26). For each document assessed by NIST, a familiarity label was inferred from the judgments of hard or soft relevance using the reasoning described above. The familiarity property could not be inferred for documents judged not relevant. Pooling of the documents yielded the collections in table 1.


Using the NIST specification, assessors indicated their familiarity with a topic as a binary value. A binary conceptualization of familiarity may be inadequate to capture its potentially complex dimensions; we nonetheless take its operationalization as defined by the HARD track as a basis for the inference of a set of corollary judgments. When a document was judged soft relevant rather than hard relevant only because it did not meet the familiarity requirement, for the assessor that document must have had the opposite familiarity value. Therefore, a negative judgment of hard relevance for readers with low familiarity allows the inference that the document is suitable for readers with high familiarity, and vice versa. A unified group of cases was formed of the original and inferred examples. In pooling the documents by high or low familiarity, we assumed that the suitability of a given document is not a function of the topic under which it is considered. This seems reasonable in view of the observation that relevant documents deemed suitable for readers with high topic familiarity are unlikely to be judged unsuitable on the basis of familiarity for another topic requiring high familiarity.

ROC curves were generated for each hypothesis. We used ROC curves to examine ranked documents using two classes of predictors: readability scores and concreteness / abstractness scores. As we reported in our TREC paper [5], all three hypotheses were rejected overall. However, it was interesting to observe that they were partially supported by our data. More precisely, high readability (expressed as a high Flesch score or a low average number of syllables per word) and high concreteness (expressed as a high concreteness score) appeared to predict relevance for subjects with low familiarity with the topic. On the other hand, low readability and high abstractness did worse than random guessing at predicting document relevance for subjects highly familiar with the topic.

5. Plan B – Alternative hypotheses

Our original hypotheses postulated specific effects of the level of topic familiarity on preferences for readability and terminological specificity. For H1, we found weak evidence that searchers with low familiarity with the topic preferred more readable documents, but we found no evidence that searchers with high familiarity preferred less readable documents. That caused us to consider the possibility that familiarity has no specific effect on users' document readability preference. Instead, it may be the case that users have a general preference for readable documents. Similarly, we were compelled to consider the possibility that all searchers experience readable documents as more familiar than less readable documents. Also, more concrete documents may be preferred by all searchers, no matter what their familiarity with the topic is. The rest of this section describes our investigation of these ideas.

5.1. Readability and relevance

Concerning readability and relevance we formulated the new hypotheses:
H4.1: (conceptual) Documents that are more readable are more likely to be relevant.
H4.2: (practical) Boosting the baseline scores for documents with higher readability will improve soft effectiveness (blind readability feedback).

We tested H4.1 by building ROC curves (figs. 1, 2, 3) that show how well readability scores classified the relevant documents meeting the appropriate familiarity criteria. It is apparent from figure 1 that H4.1 fails: readability does rather badly at predicting soft relevance, i.e. the judgment of topicality match. Figures 2 and 3 show failure for searchers both familiar and unfamiliar with the topics, with the latter doing slightly worse. Unfortunately, the implication is that we cannot rely on readability to help improve the retrieval effectiveness of the baseline by improving the ranks of highly readable documents. However, as we had expected a positive outcome for H4.1 and had the entire experimental setup prepared, we tested H4.2 by combining the baseline scores with readability scores at different weight levels.

Figure 1: Readability as predictor of relevance over all topics (ROC curve of sensitivity vs. 1 − specificity)


Figure 2: Readability as predictor of relevance for high familiarity topics (ROC curve of sensitivity vs. 1 − specificity)

The Wilcoxon test was applied for various standard measures of IR effectiveness (AvgP, P@10 and R-P, in the left column of table 2). The baseline run was compared against retrieval results using the weighted combination of evidence at several different weights (the w values in the top row). The results are displayed in table 2 in the standard T(p) format for reporting Wilcoxon results: T is an indication of the difference in ranks between the two samples, where higher numbers indicate greater differences, and p is the usual statistical significance measure. Rather surprisingly, we did obtain a significant retrieval performance increase, although the improvement only happened for average precision and for R-precision at certain weight levels, not for precision at 10 documents. This indicates that taking readability into account can in fact improve the overall ranking of the results, even if it does not affect the very top of the ranked list.
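A sketch of this significance test using SciPy's Wilcoxon signed-rank implementation on paired per-topic scores; the per-topic values below are invented for illustration, whereas in the experiment they would come from trec_eval output:

```python
from scipy.stats import wilcoxon

# Per-topic average precision for the baseline and for one readability-weighted run
# (illustrative numbers only; in the experiment these come from trec_eval).
baseline_ap = [0.21, 0.35, 0.10, 0.42, 0.27, 0.18, 0.33, 0.25]
reranked_ap = [0.24, 0.36, 0.12, 0.41, 0.30, 0.20, 0.35, 0.27]

stat, p_value = wilcoxon(baseline_ap, reranked_ap)
print(f"Wilcoxon T = {stat}, p = {p_value:.4f}")
```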

Table 2: System performance improvement for soft-relevance judgments using readability-weighted retrieval, reported as Wilcoxon T (p)

          w = 0.05       w = 0.1        w = 0.2        w = 0.4
AvgP      304 (0.0330)   256 (0.0048)   292 (0.0146)   277 (0.0034)
P@10        1 (0.9772)     1 (0.9773)   1.5 (0.6813)   8.5 (0.3749)
R-P         8 (0.6062)     9 (0.2234)     9 (0.0332)    16 (0.0039)

Figure 3: Readability as predictor of relevance for low familiarity topics (ROC curve of sensitivity vs. 1 − specificity)

5.2. Readability and familiarity

Concerning readability and familiarity, the new hypotheses were:
H5.1: (conceptual) Documents that are more readable are more likely to be accepted by the user on familiarity grounds.
H5.2: (practical) Boosting the baseline scores for documents with higher readability will improve hard effectiveness (blind readability feedback).

To investigate these hypotheses we generated the ROC curves in figs. 4, 5 and 6. The ROC curve in figure 4 weakly confirms H5.1: the more readable a document, the less likely it is to be rejected on familiarity grounds, i.e. the more likely the user is to judge the familiarity level of the document as appropriate, given their understanding of the topic area. As readability is an attribute of the document, and familiarity is a relationship between the reader and the topic of the document, we conclude that assessors are subjective: they are going to be more lenient when judging easier-to-read documents. It is interesting to look at the ROC curves for two distinct and disjoint sets of data: the curve in figure 5 concerns assessors that have indicated much familiarity with the topic, and that in figure 6 assessors that have indicated little familiarity with the topic. It is apparent that, although readability is shown to be a (somewhat weak) predictor of document acceptance on familiarity grounds for both classes of searchers, there are significant differences: the prediction for searchers familiar with the topic works better at high specificity, while for searchers unfamiliar with the topic it works better at high sensitivity.


Figure 4: Readability as predictor of familiarity acceptance (ROC curve of sensitivity vs. 1 − specificity)

Figure 5: Readability as predictor of familiarity acceptance for assessors familiar with the topic (ROC curve)

Figure 6: Readability as predictor of familiarity acceptance for assessors unfamiliar with the topic (ROC curve)

Intuitively, we could explain this difference in terms of specific behavior: searchers familiar with the topic are tougher judges, more likely to read or scan a document before making a decision, and more likely to reject documents. They are right to reject some inappropriate documents, but may be too tough and will reject some good documents. On the other hand, assessors unfamiliar with the topic will probably base their relevance judgments on a superficial match between query terms and the document content. They are less likely to read or scan the documents when making relevance judgments, and will base their judgment more on readability. They are right in accepting many appropriate documents, but are too lenient and will accept some inappropriate documents.

It would be interesting to interview or observe the NIST assessors in their evaluation process. Even a log of their actions, especially including the time spent on each document, might provide evidence for this interpretation of the ROC curves.

H5.2 was tested in the same way as H4.2, with the difference that hard relevance judgments were used to test effectiveness (table 3).

Table 3: System improvement for hard-relevance judgments using readability-weighted retrieval, reported as Wilcoxon T (p)

          w = 0.05       w = 0.1        w = 0.2        w = 0.4
AvgP      165 (0.0015)   152 (0.0008)   189 (0.0026)   201 (0.0015)
P@10        0 (0.5)        0 (0.5)        0 (0.1729)     3 (0.2919)
R-P         0 (0.0180)     0 (0.0295)     0 (0.0071)    13 (0.0126)

The increase in hard-relevance-based performance is significant for average precision and R-precision, but not for precision at 10 documents. Again, the very top of the ranked list was unaffected, but taking readability into account can improve the overall ranking of the results.

5.3. The effect of document concreteness

In a way similar to the investigation of readability, the results from H3 suggested that we should further investigate the effect of concreteness scores on the assessors' perception of document appropriateness for a certain familiarity level. We formulated the following hypotheses:
H6.1: Documents that are more concrete are more likely to be relevant.
H6.2: Documents that are more concrete are more likely to be accepted by the user on familiarity grounds.

Figure 7: Concreteness as predictor of relevance (ROC curve of sensitivity vs. 1 − specificity)

Figure 8: Concreteness as predictor of familiarity acceptance (ROC curve)

Figures 7 and 8 indicate weak support for these hypotheses, which suggests that concreteness scores may also be able to improve retrieval effectiveness when combined with the baseline scores. Intuitively, more concrete documents are expected to be easier to read and comprehend, so we expected a relatively high correlation between readability and concreteness scores. We were surprised to obtain a very low value of the Spearman correlation (rho = 0.021), even though it was statistically significant (p = 0.01). This result indicates that the ease, or conversely the difficulty, of reading a document can be expressed at various levels: scores such as Flesch may indicate the ease of reading words and sentences, but not the level of abstractness of a document. Moreover, neither of these scores adequately captures the level of familiarity required to really comprehend the content of a document. One can argue that some topics are more abstract than others, and therefore more abstract documents are needed in order to satisfy the searcher's information need. We plan to conduct further experiments to investigate this idea, by trying to match the abstractness level of documents and topics.
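The comparison reported above amounts to a rank correlation over per-document scores; a minimal sketch using SciPy, with placeholder score values:

```python
from scipy.stats import spearmanr

# Per-document scores (placeholders; in the experiment these are the Flesch and
# concreteness scores computed for every document in the baseline runs).
flesch_scores = [55.2, 61.0, 43.7, 70.1, 38.9, 65.4]
concreteness = [0.10, -0.22, 0.35, 0.05, 0.41, -0.08]

rho, p_value = spearmanr(flesch_scores, concreteness)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.4f}")
```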

6. Discussions and conclusions

We were surprised to find that none of our initial, seemingly reasonable hypotheses was confirmed by the experiments. Taking account of a user's familiarity with a topic by relating the readability or concreteness / abstractness score of a document to familiarity level, in an attempt to personalize the search output, had no effect on retrieval performance.

Despite the failure to confirm our intuition, we are extremely encouraged by the evidence that positively weighting high-readability documents improved search results independently of a user's familiarity with a topic. This novel result appears to be important and allows for practical application. Retrieval effectiveness may be improved by taking into account document readability, an objective and topic-independent document measure that can be computed offline, for example at indexing time. More experiments, including ones to test topic effects, are needed before we can safely generalize this result.

It is important to note that, in our experiments, the performance improvement was consistently significant only for average precision; R-precision was enhanced significantly only at some weight levels, and precision at the top ten documents was not enhanced significantly. We speculate that readability scoring may serve to demote documents that do not conform to the structural features of common text, such as a news story. Another possible explanation may be that the difference between baseline scores is larger at the top of the ranked list, and therefore the effect of the metadata in terms of re-ranking the baseline is reduced. In parallel work we are investigating this issue by looking at the distribution of relevant documents over score values and by exploring various weighting schemes and formulae for combining scores.

The observation that both soft and hard effectiveness can be improved by promoting documents with a high readability score to higher ranks is striking, but we are not sure of the cognitive process that causes assessors to estimate as more on topic the documents with higher readability. One possible explanation is that users are more likely to dismiss less readable documents quickly. It would seem that one consequence would be that searchers spend more time on readable documents, and are therefore able to make more accurate judgments about those documents. However, in a related experiment that was part of our HARD TREC work, we looked at agreement between annotators when assigning genre categories to documents. Surprisingly, there was more disagreement in the case of more readable documents. Another potential explanation is that the improvement may partially be due to the removal of "bad documents" (such as spam) that contain topical keywords, and only partially to the fact that users are more likely to pay attention to, and judge as relevant, documents that are easy to read.

In any case, this work provides support for a significant connection between an objective measure associated with a document, namely readability, and subjective measures such as the estimated relevance of the document to a topic, or the estimated familiarity of a searcher with the topic of the document. While not sufficiently strong for readability to predict relevance or familiarity level, the connection has potential to improve search result ranking through "blind readability feedback".

7. Acknowledgments

Prof. David J. Harper, of the Robert Gordon University, Aberdeen, Scotland, helped us with useful suggestions, Perl scripts, and many hours of fruitful discussions.

8. References

[1] Allan, J. HARD track overview in TREC 2004 – High Accuracy Retrieval from Documents. In: The Thirteenth Text REtrieval Conference (TREC 2004), E.M. Voorhees and L.P. Buckland [eds.], Gaithersburg, MD, Nov 2004.

[2] Audet, C. and Burgess, C. Using a high-dimensional memory model to evaluate the properties of abstract and concrete words. In Proceedings of the 20th Annual Conference of the Cognitive Science Society, Vancouver, BC, Canada, 1999, pp. 37-42.

[3] Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval. ACM Press, New York, 1999.

[4] Barsalou, L. and Wiemer-Hastings, K. Situating abstract concepts. In D. Pecher and R. Zwaan [eds.], Grounding Cognition: The Role of Perception and Action in Memory, Language, and Thought. Cambridge University Press, New York, 2004.

[5] Belkin, N.J., Chaleva, I., Cole, M., Li, Y.-L., Liu, L., Liu, Y.-H., Muresan, G., Smith, C.L., Sun, Y., Yuan, X.-J. and Zhang, X.-M. Rutgers' HARD track experiments at TREC 2004. In Proceedings of TREC 2004, Gaithersburg, MD, 2004.

[6] Burgess, C., Livesay, K. and Lund, K. Explorations in context space: words, sentences, discourse. Discourse Processes, 25, 1998, 211-257.

[7] Ingwersen, P. and Järvelin, K. Information retrieval in contexts. In: Information in Context: IRiX, ACM-SIGIR Workshop 2004 Proceedings, Ingwersen, P., van Rijsbergen, C.J., Belkin, N. and Larsen, B. [eds.]. Sheffield: Sheffield University, 2004, pp. 6-9. Available at http://ir.dcs.gla.ac.uk/context/

[8] Kelly, D. and Cool, C. The effects of topic familiarity on information search behavior. In JCDL 2002: Proceedings of the Second ACM/IEEE-CS Joint Conference on Digital Libraries, G. Marchionini and W. Hersh [eds.], pp. 74-75. ACM, New York, 2002.

[9] Kelly, D. and Teevan, J. Implicit feedback for inferring user preference: a bibliography. SIGIR Forum, 37(2), 2003.

[10] Martindale, C. Romantic Progression: The Psychology of Literary History. John Wiley & Sons, New York, 1975.

[11] Martindale, C. The Clockwork Muse: The Predictability of Artistic Change. Basic Books, 1990.

[12] Schwanenflugel, P. Why are abstract concepts hard to understand? In The Psychology of Word Meaning, P.J. Schwanenflugel [ed.], pp. 223-250. Erlbaum, Mahwah, NJ, 1991.

[13] Schwanenflugel, P., Harnishfeger, K. and Stowe, R. Context availability and lexical decisions for abstract and concrete words. Journal of Memory & Language, 27, 1988, 499-520.

[14] Sparck Jones, K. Further reflections on TREC. Information Processing & Management, 36(1), 2000, 37-85.

[15] Vakkari, P. Task-based information searching. Annual Review of Information Science and Technology, 37, 2003, 413-464.

[16] Voorhees, E.M. and Harman, D.K. [eds.]. TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.
