Maya Dimitrova (1), Nicholas Kushmerick (2), Ivan Terziev (3) and Alexander Gegov (4)

(1) Institute of Control and System Research, Bulgarian Academy of Sciences, Bulgaria
(2) Computer Science Department, University College Dublin, Ireland
(3) Institute of Water Problems, Bulgarian Academy of Sciences, Bulgaria
(4) Department of Computer Science and Software Engineering, University of Portsmouth, UK

Web Users and Web Document Classifiers: Emergent Cognitive Phenomena

Abstract: We are currently developing an approach that introduces heuristics based on facts and phenomena from cognitive science into the design of automatic Web document classifiers. The analysis of the results of the study has revealed new and emergent cognitive phenomena of users interacting with present-day Web systems. The interface to the user employs an automatic Web-document classifier inspired by the "word-length" and "word-frequency" effects. The result is faster and more efficient document search, classification and retrieval. The paper discusses the insights encountered, such as the least-effort strategy in user assessment of Web documents, the implicit user expectation of interaction with an "intelligent" interface, and the increasing demand for document summaries. Some interface design guidelines following from the current study are outlined.

1. Introduction

Cognitive science is appealing for the design of Web environments because it can reveal various subtle aspects of human cognition and behaviour from current user experience with the Web. The research presented in this paper applies intuitions and heuristics inspired by established facts and phenomena in cognitive science. It attempts to optimise user-oriented search and retrieval of Web documents so as to better assist the individual user in finding the appropriate document. "Appropriateness" is understood here, from a cognitive perspective, as a multifaceted attitude towards the retrieved documents: topic-relevant, but also written in the preferred style, or "genre", with the "right" amount of detail and at the "proper" technical or expert level of description. There is increasing user demand for improved "clarity" about the type as well as the content of the document URLs returned in response to user queries; however, work in this direction has shown that classifiers of this kind are still difficult to build. The first idea we explore is whether "smart" or cognitive heuristics can be helpful in this research. The second is that such classifiers can be meaningful and lead to a more individualised Web interface. In building the Web document classifier, the heuristic employed from cognitive science is the word-frequency effect and its differential processing by human perception and cognition (Kucera & Francis, 1967), complemented by the word-length effect in sentence processing (Hearle, 2001; Lee, 1999). The studied dimensions of Web documents are the technicality of the document (expert level) and the amount of detail in presenting its topic. We present the general idea for building "multifaceted" Web document classifiers, the implemented "cognitive science heuristics" and the most closely related work.
In figure 1, illustrating the idea of multifaceted user preferences, the "orthogonal" dimensions represent document features for which we assume there is cognitive/linguistic potential to be implemented in automatic Web document classifiers.

[Figure 1 diagram: orthogonal axes with the pole labels on-topic / off-topic, expert / popular, brief / detailed, synthesized / illustrative, fact / opinion, positive / negative]

Figure 1. A "multifaceted" conceptualisation of user Web search preferences with linguistic potential to be implemented in automatic Web document classifiers

We have hypothesised that users perceive these dimensions as independent, so that they can be meaningfully displayed as orthogonal projections. We conducted structured tests (a Web-based survey) and observational studies of the ways users perceive Web documents and their "independent" features: a brainstorming session, a post-hoc interview with a Web design expert and a subsequent case study in the orthopedic domain. The evaluation section summarises the current results, the insights from the analysis of the study and future research directions.

2. Automatic Classifier for Web Documents along Expert and Detail Dimensions

The classifier employs two simple formulas based on the ratio of the high-to-low word frequency count of long English words gathered from the Brown corpus (Kucera & Francis, 1967). Indices of "technical elements" (HTML tags) and the ratio of long words to the tokens encountered by the classifier are used as common-sense heuristics. The experimental corpus consists of 430 words of 9 or more characters with a natural-language frequency higher than 49 per million. The X dimension is computed as

P(D) = P(L) + P(W) + P(G)

where P(D) is the detail dimension, P(L) is an index of document length, P(W) is the ratio of long words to HTML tokens, and P(G) is an index of the presence of images. The Y dimension is computed as

P(E) = P(F) + P(T)

where P(E) is the expert dimension, P(F) is an index of the high-to-low word frequency ratio, and P(T) is an index of technical HTML elements. The motivation for the classifier comes from studies in both cognitive science and information retrieval (Hearle, 2001; Lee, 1999; Karlgren, 1999; Kessler, Nunberg, & Schütze, 2000). The expert level of the document is represented in larger part by the relative ratio of high-to-low frequency long words present in the text of a given Web document.
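As a rough illustration of how the two formulas might be computed, consider the following Python sketch. The concrete thresholds, caps and the choice of "technical" HTML tags are our assumptions for illustration only; the paper does not publish its exact normalisation constants.

```python
import re

# Minimal sketch of P(D) = P(L) + P(W) + P(G) and P(E) = P(F) + P(T).
# All cut-offs below are illustrative assumptions, not the paper's constants.
HIGH_FREQ_CUTOFF = 100  # assumed per-million count separating "high" from "low" frequency
LONG_WORD_LEN = 9       # long words: 9 or more characters, as in the experimental corpus

def detail_score(html: str, freq: dict) -> float:
    """P(D): index of document length + ratio of long words to tokens + image index."""
    tokens = re.findall(r"<[^>]+>|\w+", html)            # HTML tags and words as tokens
    words = [t for t in tokens if not t.startswith("<")]
    long_words = [w for w in words if len(w) >= LONG_WORD_LEN]
    p_l = min(len(words) / 1000.0, 1.0)                  # index of document length (capped)
    p_w = len(long_words) / max(len(tokens), 1)          # ratio of long words to tokens
    p_g = min(html.lower().count("<img") / 10.0, 1.0)    # index of the presence of images
    return p_l + p_w + p_g

def expert_score(html: str, freq: dict) -> float:
    """P(E): low-frequency share of long words + index of technical HTML elements."""
    text = re.sub(r"<[^>]+>", " ", html)                 # strip tags, keep the text
    long_words = [w.lower() for w in re.findall(r"\w+", text) if len(w) >= LONG_WORD_LEN]
    high = sum(1 for w in long_words if freq.get(w, 0) >= HIGH_FREQ_CUTOFF)
    low = sum(1 for w in long_words if 0 < freq.get(w, 0) < HIGH_FREQ_CUTOFF)
    p_f = low / max(high + low, 1)                       # low-frequency share: higher = more expert
    # Which tags count as "technical elements" is an assumption here:
    p_t = min(len(re.findall(r"<(?:table|pre|sub|sup|code)", html.lower())) / 5.0, 1.0)
    return p_f + p_t
```

Given a frequency dictionary, a document dominated by low-frequency long words and technical markup scores higher on the expert dimension than one dominated by high-frequency long words.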
The intuition is that long words of low frequency can be descriptive of expert-level scientific texts, whereas an abundance of high-frequency long words is more descriptive of popular texts intended for a more general audience. Examples of high vs. low natural-language frequency words used in building the Web document classifier are:

High-frequency words: Beautiful (127), Different (312), Difficult (161), Experience (276), Literature (133), Opportunity (121), Particular (179), Possible (373), Technical (120);

Low-frequency words: Laboratory (40), Institution (41), Investigation (59), Marginal (25), Medicine (30), Observation (27), Personality (48), Tendency (49), Technique (60).
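Using the example counts above, the index P(F) might be sketched as follows; turning the high-to-low ratio into a "share of low-frequency words" is our assumption about one plausible formulation, not the paper's published formula.

```python
# Brown-corpus counts per million, taken from the example word lists above.
high_freq = {"beautiful": 127, "different": 312, "difficult": 161, "experience": 276,
             "literature": 133, "opportunity": 121, "particular": 179,
             "possible": 373, "technical": 120}
low_freq = {"laboratory": 40, "institution": 41, "investigation": 59, "marginal": 25,
            "medicine": 30, "observation": 27, "personality": 48,
            "tendency": 49, "technique": 60}

def freq_index(text: str) -> float:
    """Hypothetical P(F): share of low-frequency long words among all listed
    long words found in the text (higher value = more 'expert')."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    high = sum(1 for w in words if w in high_freq)
    low = sum(1 for w in words if w in low_freq)
    return low / (high + low) if (high + low) else 0.0
```

A text drawing only on the low-frequency list scores 1.0, one drawing only on the high-frequency list scores 0.0, and mixed texts fall in between.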

?

..

Classifier, euristic methods for classification, BOW, POS, …

..

Lexical Database (LDB i)

Examples of classified documents on one criterion

∆∆



..



..



..



..





Ω Ω Ω ..

Examples of classified documents on two criteria

∆ ∆ ∆

∆ ∆ Legend ? - unclassified text - popular text - expert text ∆ - brief text Ω - detailed text

..



Ω Ω Ω .. Classified documents

Figure 2. Illustration of the process of classification of Web documents according to some text criterion (feature, parameter, etc.). The classifier employs a lexical database/thesaurus for text classification (e.g. high frequency long words, etc.)

The combination of the index of high-to-low frequency long words in the text plus some heuristics for technical elements from shallow natural-language processing is set to represent the expert-level dimension of the classifier. We are also investigating the idea of the continuum nature of the ratios of high-to-medium-to-low frequency words. The detail dimension is based on document length plus the ratio of words to tokens and an index of the presence of gif files. The resulting values are normalised and plotted along the expert and detail dimensions. The idea behind implementing an interface with enhanced perceptual cues for document features other than topic relevance is to plot documents along different dimensions in an orthogonal coordinate system. The classifier plots the URLs of the Web documents retrieved in response to the user's query in a rectangular coordinate system. A URL at the top denotes a technical document, a URL at the bottom denotes a popular document; summaries are plotted to the left and extended documents to the right (Dimitrova & Kushmerick, 2003; Dimitrova, 2003; Dimitrova et al., 2003).

5. Related Work on Building Multifaceted Web Document Classifiers

The most closely related work on building multifaceted Web document classifiers is the ongoing work on automatic Web document genre classification (Kushmerick, 2002; Finn, Kushmerick, & Smyth, 2002). In (Finn, 2002) an "orthogonal" classification system is investigated, dealing with the following dimensions: subjectivity (opinion vs. fact) and review (positive vs. negative). A comparison of different classification approaches shows that part-of-speech tagging (POS) outperforms the bag-of-words (BOW) approach for the subjectivity dimension. It also shows good domain transfer, for example from football to finance and vice versa. The review classifier, however, performed less satisfactorily, with POS even inferior to BOW (Finn & Kushmerick, 2003). In building our classifier we have applied the BOW approach in order to apply "smart" (i.e. cognitive science) heuristics.
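Returning to the interface described in section 2, the normalisation and orthogonal plotting of retrieved URLs might be sketched as follows; min-max scaling is an assumption, since the paper does not specify the normalisation used.

```python
def normalise(scores):
    """Min-max normalise raw dimension scores to [0, 1] (assumed scaling)."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0          # avoid division by zero for identical scores
    return [(s - lo) / span for s in scores]

def plot_positions(urls, expert_raw, detail_raw):
    """Place each URL in a rectangular coordinate system:
    x = detail (summaries left, extended documents right),
    y = expert (popular bottom, technical top)."""
    xs = normalise(detail_raw)
    ys = normalise(expert_raw)
    return {u: (x, y) for u, x, y in zip(urls, xs, ys)}
```

For example, the most detailed but least expert document among those retrieved would land at the bottom-right corner, and the briefest, most expert one at the top-left.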
We use a thesaurus of high-frequency long words, thus roughly halving the number of items to be processed; hence our feature vector is much smaller. We assume, however, that a combination with POS (e.g. adjectives vs. nouns) may improve classifier performance, for the same reason that there are far fewer items and the processing time can be optimised. These studies show that there is some generality in cognition, independent of the classifier architecture and processing principle, which requires some degree of "universality" of approach, as illustrated in the conceptualisation of user search preferences in figure 1.

3. User Evaluation of Classifier Performance: Emergent Cognitive Phenomena

The evaluation was performed indirectly, rather than by explicit evaluation of system performance, via a Web-based survey designed especially for this purpose. Participants were asked to rate Web documents on several successive pages, randomly generated from the experimental corpus. The instruction to rate either the amount of detail or the expert level of the document was randomly generated, too. The results of the study support the main ideas of the classifier. The user evaluation of the expert and detail dimensions showed that these can be assumed independent from a user's point of view, which was supported by the outcome of the brainstorming session. The implemented classifier works well for fragments of text inside a Web document, which can be exploited to highlight certain parts of it as more expert or more popular. This can be used as a new visualisation design heuristic. The classifier works better for long documents than for short summaries. The demand for summaries of various documents, especially in scientific domains, is of increasing importance on the present-day Web. A feasible way to deal with this problem is to implement automatic summarisation approaches built on "smart" heuristics in combination with, for example, the "computing with words" formalisation approach (Zadeh, 2004). This theory is newly emergent, yet ambitious enough to help formalise the cognitive/linguistic ways people deal with problems. Some of the insights gained from the research to this point are related to the specificity and potential of the Web. The emergent phenomena include:

"Perceptive" effect - users apply a perceptual "least-effort strategy" to the fast assessment of Web documents: first length, then technicality, then lexical features.

Learning effect - documents that seem of higher expert level at the beginning of a survey seem more popular towards the end.

Design influence - users rate the overall HTML design of the document, not just the text itself.

User expectations - present-day users expect some kind of final reward or feedback on how they performed, rather than just "Thank you for participation". Users generally seek interaction with an intelligent interface, which means that adaptive features have to be implemented early in system design even when the interface is assumed to be as simple as possible.

Personalisation - multiple modes are observed in the statistical analysis of the results, suggesting that users employ individual criteria in assessing Web document dimensions.

4. Emergent Cognitive Demands of Web Users: A Case Study

The follow-up case study was performed on the translation of a course in orthopedics from Bulgarian into English, consisting mainly in the correct translation of anatomical bone structure, diseases and medication. We interviewed the translator about the actual search on the Web and extracted ideas about the process of finding the relevant information. The user was acquainted with the idea and an example of our multifaceted graphical interface, and the question was how this could help in the actual search process.
The main difficulty in the search process was that general and specialised search engines give few clues about the nature of the retrieved documents: these turned out to be either university sites with course headings, or books from sites like amazon.com, or requirements for position candidates (all this in a "flood" of URLs). More helpful information was contained in the sites of clinics and medical centers, which gave more content about the disease and its treatment. An insightful "encounter" for our user was the discovery of the site of Wheeless' Textbook of Orthopedics (suggested by Google) (http://www.ortho-u.net/med.htm), where the bone structure of the skeleton is given in graphical format and, by clicking on a bone, the site displays all the relevant information. Also helpful were the specialised dictionaries on the Web with definitions and summaries. In general, this again confirms the current need both for interactive text-graphics interfaces and for more salient cues about the content and style of the returned document URLs.

5. Conclusions

The results of the experiments show that interface systems employing phenomena from cognitive science tend to be simple and efficient, as well as perceptually salient to users in their daily experience with the Web. Investigating user interaction with novel interfaces can reveal new and subtle emergent cognitive aspects such as opinions, interests, nuances, expectations and cognitive style within the specific Web context. Our results show that there is a level of "universality" in the solution to the problem of building multifaceted user interfaces that needs to be further investigated. By including memory and "autobiographical" knowledge we have also attempted to make our Web "agents" more "brain-like" in performance and better adapted to the user (Dimitrova et al., 2004). We see this as a step towards building more self-explanatory, perceptually salient and conceptually understandable Web interfaces. We are currently extending our approach to more practical domains, such as medical diagnosis, DSS system design, adaptive interface design and learning from the Web.

6. Acknowledgement

A substantial part of this research was carried out while the first author was at the Computer Science Department of University College Dublin, supported by grants SFI/01/F.1/C015 from Science Foundation Ireland and N00014-00-1-0021 from the US Office of Naval Research. Subsequent tests are being carried out while the first author is on an individual scholarship granted by the Universitat Autònoma de Barcelona, 2004.

7. References

Dimitrova, M. & Kushmerick, N. (2003). Dimensions of Web genre. Poster at WWW2003, Budapest, Hungary. [http://www2003.org/cdrom/papers/poster/p143/p143-dimitrova.htm]

Dimitrova, M. (2003). Cognitive modelling and Web search: Some heuristics and insights. Cognition, Brain, Behaviour, Vol. VII, No. 3, pp. 251-258.

Dimitrova, M., Barakova, E., Lorents, T. & Radeva, P. (2004). The Web as an "autobiographical agent". Submitted to the 11th International Conference on Artificial Intelligence: Methodology, Systems, Applications: "Semantic Web Challenges", AIMSA'2004, Sept. 2-4, 2004, Varna, Bulgaria.

Dimitrova, M., Kushmerick, N., Radeva, P. & Villanueva, J. J. (2003). User assessment of a visual Web genre classifier. Proc. 3rd IASTED International Conference on Visualization, Imaging and Image Processing (VIIP), Benalmadena, Spain, pp. 886-889.

Finn, A., Kushmerick, N. & Smyth, B. (2002). Genre classification and domain transfer for information filtering. Proc. European Colloquium on Information Retrieval Research, Glasgow, pp. 353-362.

Finn, A. (2002). Machine learning for genre classification. MSc thesis, University College Dublin.

Finn, A. & Kushmerick, N. (2003). Learning to classify documents according to genre. IJCAI-03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, pp. 35-45.

Hearle, N. (2001). Sentence and word length. [http://hearle.nahoo.net/Academic/Maths/Sentence.html]. Accessed 03/02/04.

Karlgren, J. (1999). Stylistic experiments in information retrieval. In Tomek Strzalkowski (ed.), Natural Language Information Retrieval, Kluwer.

Kessler, B., Nunberg, G. & Schütze, H. (2000). Automatic detection of text genre. Proc. 35th Meeting of the Assoc. for Computational Linguistics, pp. 32-38.

Kucera, H. & Francis, W. N. (1967). Computational Analysis of Present-Day American English. Brown University Press. [www.psy.uwa.edu.au/MRCDataBase/uwa_mrc.htm]

Kushmerick, N. (2002). Gleaning answers from the Web. Proc. AAAI Spring Symp. on Mining Answers from Texts and Knowledge Bases, Palo Alto, pp. 43-45.

Lee, C. H. (1999). A locus of the word-length effect on word recognition. Journal of Reading Psychology, 20, pp. 129-150.

Zadeh, L. (2004). Soft computing and computing with words in systems analysis, decision and design. [http://www.cs.berkeley.edu/~nikraves/zadeh/zadeh6.doc]. Accessed 03/11/03.
